Andreas Grabner About the Author

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within the Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi.

Diagnosing Obamacare Website Performance Issues with APM Tools

Many Americans are looking forward to the new healthcare website that allows them to select the health insurance plan best suited to their needs. As with any new website anticipated by a lot of people, it was not a big surprise that there were glitches when millions of citizens tried to use the new portal after its launch.

Now – there are many different reasons why websites that need to scale to that many users don’t deliver on the promise of good end user experience. A general “cultural” problem is that performance and scalability are pushed towards the end of a project in favor of more functionality, resulting in problems that don’t allow the end user to consume these great features. Changing this culture with the support of tools that integrate into your continuous delivery process is mandatory to avoid these types of problems. We have blogged about this in the past based on discussions we had with companies that made that transition.

Let’s put the spotlight back on this key Obamacare website. At the time of this posting, I’m not privy to data collected from within the site’s production datacenter environment. That kind of code-level visibility would help us find the root cause of problems immediately. However, we can still learn quite a bit by using our freely available tools that analyze how web pages and third-party objects are rendered within real end-user browsers. Our goal is to quickly find the top problem patterns that explain why so many people are complaining.

The Analysis

One of my colleagues walked through different use case scenarios on healthcare.gov and sent me his AJAX Edition session files for analysis. Here is an overview showing that most pages lack basic WPO (Web Performance Optimization) practices:

Bad WPO ranking on most pages, lagging behind the top US healthcare sites

Now let’s have a closer look and highlight the key observations from these sessions.

Observation #1: Homepage impacted by Initial HTML and 3rd Party Content

Looking at the Timeline View for loading healthcare.gov shows some very interesting things. It takes a long time to download the seemingly small 59k initial HTML document. This is probably caused by bandwidth constraints on its web servers, as most of this time is attributed to server response time and not the network. What is even more interesting is the use of multiple different 3rd party monitoring solutions such as Google, Chartbeat and Pingdom. All of these 3rd party components need to load JavaScript files that impact the initial page load time.

The good news is that the first visual impression comes in at under 2 seconds. But there is definitely room to optimize overall page load by rethinking the use of 3rd party components (a sketch follows below the screenshot) and making sure the initial HTML page can be served faster from the web servers.

Most of the impact comes from the slow download of the HTML page as well as lots of content from many 3rd party components

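To illustrate the idea (just a sketch; the script URL below is a placeholder, not the actual Chartbeat or Pingdom tag), such a 3rd party monitoring script could be injected only after the page’s own load event has fired:

    // Sketch: load a 3rd party monitoring tag only after the page itself has finished loading.
    // The URL is hypothetical; the real tags on healthcare.gov may need different handling.
    window.addEventListener('load', function () {
        var tag = document.createElement('script');
        tag.src = 'https://thirdparty.example.com/monitor.js'; // placeholder 3rd party script
        tag.async = true;
        document.body.appendChild(tag);
    });

This keeps the monitoring data flowing while taking the 3rd party JavaScript out of the critical path of the initial page load.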

Observation #2: NO CSS and JS Merging on Registration Page

The registration page – https://www.healthcare.gov/marketplace/global/en_US/registration – is actually a very bad example of some of the well-known Web Performance Optimization practices. It seems they forgot to merge CSS and JS files: the page currently loads about 55 individual JavaScript files and 11 individual CSS files! The following screenshot shows the list of individual resources that could all be merged, especially because many of them logically belong together anyway, e.g. jQuery and its plugins.

Loading too many small JS and CSS files instead of merging them results in too many roundtrips to the web server

Besides not merging files, they also serve non-minified versions of some of the jQuery plugins and rather large, uncompressed versions of their images.
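
To make the merging idea concrete, here is a minimal sketch of a pre-deployment build step (the file names are placeholders for illustration; any of the common build tools achieves the same result):

    // Sketch: concatenate individual plugin files into one bundle so the browser
    // issues a single request instead of ~55. File names are placeholders.
    var fs = require('fs');

    var files = ['jquery.js', 'jquery.validate.js', 'jquery.ui.widget.js' /* ...the rest... */];
    var bundle = files.map(function (name) {
        return fs.readFileSync('js/' + name, 'utf8');
    }).join(';\n');

    fs.writeFileSync('js/registration.bundle.js', bundle);

Running the resulting bundle through a minifier such as UglifyJS, and doing the same for the CSS files, would address the minification point as well.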

Observation #3: Server-Side issues with AJAX calls on Profile Page

The most obvious end user performance impact was seen on the My Profile page. We could see a 16.8 second server response time for an AJAX call that returned some basic user information for the logged-in user. The end user has to sit and wait until that AJAX request finally completes before the page is fully operational. The timeline shows how much impact this AJAX request really has:

While waiting on the AJAX response the web UI is “blocked” and the user has to wait

What’s even more interesting is that every interaction on the My Profile page re-sends this AJAX request, returning basically the same information again without caching it. Taking a closer look at the actual content that is returned, it seems that about 95% of it is always the same (e.g. name, phone number, …). There is only one field in the returned JSON object that actually changes. The question is whether this can be optimized to cache the static information (name, phone, …) and only return the dynamic values. This would also speed up server-side processing, reduce network payload size and improve end user response time (see the caching sketch below the screenshot):

This AJAX call takes up to 16.8s on the profile page and is executed for every user interaction. Optimizing this logic will improve end user response time and take pressure off the application servers

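Here is a minimal client-side sketch of what such caching could look like, assuming a hypothetical /api/myprofile endpoint (the actual URL and response format on healthcare.gov will differ):

    // Sketch: remember the mostly-static profile response so repeated interactions
    // on My Profile don't re-issue the same slow AJAX call.
    var profileCache = null;

    function getProfile(callback) {
        if (profileCache) {              // static part already fetched once
            callback(profileCache);
            return;
        }
        jQuery.ajax({
            url: '/api/myprofile',       // placeholder endpoint for illustration
            dataType: 'json',
            success: function (data) {
                profileCache = data;     // one round trip instead of one per click
                callback(data);
            }
        });
    }

The server-side counterpart – splitting the response into a cacheable static part and a small dynamic part – would also reduce the 16.8 second processing time itself.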

Observation #4: Heavy JavaScript Processing by Backbone, Underscore, … JS Libraries

Filling out an application uses a lot of JavaScript to validate input as well as to present the results. Using 3rd party JavaScript libraries is very common, as it takes a lot of work off UI developers. On the other hand, these frameworks might not always work perfectly for a particular application. Looking at the JavaScript hotspots on that particular page shows very significant hotspots in the click event handlers for things like Add Deductions, Annual Income Information, or simply clicking Accept Warning and Continue. Response times of up to 7 seconds were mainly caused by these JavaScript frameworks iterating through the whole DOM to identify which elements need to be modified. Depending on the browser’s JavaScript engine this can have a significant performance impact, as it did in my colleague’s case. A sketch of how to limit that traversal follows below the screenshot.

The more complex a page, the more effort for JS frameworks to dynamically iterate through the DOM. In this case we see several seconds of pure JavaScript execution time to modify the DOM

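As an illustration of the kind of change that cuts this JavaScript time down (the element IDs and markup below are invented for the sketch; the real handlers live inside the application’s JavaScript frameworks):

    // Sketch: scope DOM queries to a cached container instead of re-scanning the
    // whole document on every click. Selectors here are hypothetical.
    var $deductions = jQuery('#deductions-section'); // look the container up once

    jQuery('#add-deduction').on('click', function () {
        var count = $deductions.find('.deduction-row').length; // scoped query
        $deductions.append('<div class="deduction-row">Deduction #' + (count + 1) + '</div>');
    });

Keeping the DOM itself smaller, as noted in the ToDo list below, helps every one of these handlers as well.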

The ToDo List in order to “Get Well”

Looking at these observations, there are several points for a ToDo list to make the site faster and more responsive:

  • Rethink the 3rd party components. Delay loading these components where possible so they do not impact initial page load time
  • Use minified JavaScript and CSS files to save on bandwidth
  • Merge JS and CSS files to reduce the number of roundtrips
  • Compress images to save on network bandwidth
  • Optimize the server-side performance of these long-running AJAX requests
  • Make the pages “slimmer” (fewer DOM elements) to speed up JavaScript executions that iterate through the DOM

In general, it is important that you optimize your website for performance based on the anticipated load from its inception, through the website design, and all the way through the development phase and roll-out. This is not only important for application launches but any time you have higher peak-load events, which we call “Rush events”. You not only need the solutions, but also the framework and step-by-step best practices to plan for peak events or launches. Compuware has developed best-practice guides, customer success stories and expert advice around performance optimization in a free program called “The Compuware APM Rush Program.” Visit this site for access to the expert advice and methodologies to ensure your launch or rush event goes flawlessly.

More Things to Consider …

Those of you who know our blog are familiar with the problem patterns we have talked about in the past. This post just adds to that list – in fact, there are not a whole lot of new concepts here; we keep seeing the same problem patterns over and over again. If you are responsible for end user experience, performance or scalability of your web, mobile, desktop or any other type of application that needs to deliver value to the end user, have a look at our top blog posts such as Top 8 Performance Landmines, Balancing the Load or The DevOps way to Solving JVM Memory Issues.

Comments

  1. Awesome review of the site with simple steps to correct. I’ve forwarded the URL to some folks at CMS & will let you know if there is any feedback.

    Nice work!

    John Haughton MD, MS
    CMIO
    Covisint

  2. Received confirmation that the blog has been passed on to the web team at healthcare.gov. — John

  3. Charlie Thompson says:

    Nice to see that my colleagues at Compuware have lost neither their technical edge nor their sense of diplomacy. I myself was tempted to offer some Arbor Networks services on the assumption that the Tea Party was conducting a DDOS assault. All the best, Charlie

  4. Great analysis and write-up. Plus you were able to provide this while only using tools from outside the data center. As you said – “That kind of code-level visibility would help us find the root cause of problems immediately.”

    So let’s get DC-RUM and DynaTrace inside the data center and solve some problems.

    Any updates from the meeting with healthcare.gov?

    Brad Wilson
    QoS Networking, Inc.
    Compuware Business Partner

  5. Don Brumfield says:

    Great article and assessment – all with EXTERNAL tools! I am also heartened that CMS is willing to solicit input and help to resolve this ASAP.

    Can you share any feedback from your CMS call?

    Thanks & Keep up the great work,
    Don

    • Yeah – it is interesting what one can do by analyzing things like this from “the outside”. Imagine what you can do when you also sit IN the application. I will keep you posted on the progress – keep your eyes open for hopefully a couple of follow-up blog posts.

  6. Jeff Winston says:

    Another testimonial of why NOT to use hackerscript, and all the gelatinous bloated libs to develop a robust scalable RIA.

    Flex 4/AS3/J2EE == We’re not having this discussion.

  7. Xian Gnats says:

    Seriously Charlie, “on the assumption that the Tea Party was conducting a DDOS assault”? You really showed your lack of true intellect on that.

    Great detection guys. Curious to see how long this will take to resolve.

  8. “I myself was tempted to offer some Arbor Networks services on the assumption that the Tea Party was conducting a DDOS assault.”

    Go get ‘em, Charlie!

    *rolls eyes*

  9. 500 million and preschool web basics aren’t done. I can only imagine what the back-end where the real workload happens looks like. :( sad day. Also why are they even paying these jokers if they still need hand holding to figure this stuff out?

  10. Siamak. S. says:

    The major problem with every newly established and very well advertised service is the initial overcrowding. This is the same thing that happened to the BlackBerry messaging service for Apple and Android phones.

    You design a website for 100,000 concurrent users (which, based on the estimates, is a very good number) but at the initial launch 2 million concurrent users start hammering the website. After the initial registration etc. is finished, the demand returns to what you expected (i.e. 100,000).

    The fact is that we cannot design the service for the 2,000,000 concurrent users that show up only a few times a year. In exactly the same way, we cannot afford to build city roads for the burst traffic of morning and afternoon, since that would cost 10 times more (plus maintenance) and the excess capacity would remain dormant the rest of the time.

    BlackBerry applied a queuing method in which you would enter your email, your name would enter a queue, and users would be let in gradually (as the servers’ load would allow).

  11. @Siamak — sure you can build for burst loads; quite a few vendors have HARDWARE solutions for that. IBM POWER has dormant processors that can pick up extra load and be off for regular loads. You can also use an HP, IBM, Dell, or Cisco/NetApp FlexPod, or VCECloud that bursts out from a private solution to a public cloud such as Amazon, Rackspace, etc. when demand requires. So there isn’t an excuse for a lack of scaling of resources to support the needs of the burst. HOWEVER, it doesn’t excuse the really poorly written software. Obviously this code was developed by a very junior team which lacked technical leadership and/or a good Code Review/QA process. And of course no performance testing was done. Compuware tools would have been super helpful to find these issues. With only front-end access the tools found many significant issues; just think what lurks behind in the core systems code!
    It’s already been revealed by other blogs that the design is very flawed, with a single choke point for ALL external systems communications; again, running performance tools would have found that problem. I’ve been in software/systems work 32 yrs and I have yet to see an application so poorly written and designed. Tools are a great help but I suspect a much worse problem, which is a “take the money and run” implementation where the code is thrown together with very simple testing, deployed, and the contractors slink away in the night and someone else gets paid to fix the problems. We saw some of this in the “dot com” bubble but this is the most egregious example.

  12. sujatha lilly says:

    This site is built using the Curam framework. There are no Curam experts on the development team. They have some Curam-certified professionals (hourly rate $175), but they are of no use; basic knowledge is lacking. First they have to learn IBM Curam, then they can play around with Medicare. Instead they could have started the development from scratch using open-source, standard and stabilized frameworks.
