About the Author

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor to the Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi.

How Better Caching Helps Frankfurt Airport's Website Handle the Additional Load Caused by the Volcano

Along with so many others, I am stranded in Europe right now, waiting for my flight back to the United States. The volcano not only impacts flights across Europe but also the websites of airports, airlines and travel agencies around the world. Checking my flight status on Sunday was almost impossible: the website of Germany's largest airport, Frankfurt am Main, was hardly reachable. No wonder, as I assume their page simply got hammered by thousands of additional requests from frustrated travellers. Now it's Tuesday and the website is back to "almost acceptable" response times. Time for me to analyze the current website, as I've done with others such as vancouver2010, utah.travel.com or masters.com.

Status Quo: Too many resources and wrong cache settings

Using the FREE dynaTrace AJAX Edition and browsing to http://www.frankfurt-airport.com shows me what is going on on this home page. The Resource Graph shows the number of JavaScript, CSS and image files. On the home page we find 97 images, 40 JavaScript files and 22 style sheets. I had browsed to the home page before, which is why some of these resources show up as coming from the cache. However, as we will see in a bit, the current cache settings still require my browser to send a request for each of them.

Too many resources on the website (97 images, 40 js files, 22 css files, …)
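
As a rough cross-check of these numbers without any special tooling, you can count the resources referenced directly in the HTML. The following is only a sketch using the Python standard library: it sees just the initial markup, so it misses CSS background images and anything injected later via JavaScript (which the dynaTrace view does capture), and its counts will therefore differ somewhat.

```python
# Rough resource count from the static HTML only (a sketch, not a replacement
# for looking at the real network traffic).
from html.parser import HTMLParser
import urllib.request

class ResourceCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = {"images": 0, "scripts": 0, "stylesheets": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.counts["images"] += 1
        elif tag == "script" and attrs.get("src"):
            self.counts["scripts"] += 1
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            self.counts["stylesheets"] += 1

html = urllib.request.urlopen("http://www.frankfurt-airport.com").read()
counter = ResourceCounter()
counter.feed(html.decode("utf-8", errors="replace"))
print(counter.counts)   # will differ somewhat from the dynaTrace numbers
```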

Drilling into the TimeLine view shows where these resources are downloaded from and how they impact page load time. Like many similar websites, the content is delivered from many different domains. In this case we see 28 domains delivering ads and banners or providing services such as web user tracking. We see that it takes 11 seconds for the onLoad event to be triggered, which is when all initial content (HTML plus referenced objects) has been downloaded. Most of the download time is spent on content delivered by www.frankfurt-airport.com: most of the images, JavaScript and CSS files are downloaded from this domain.

Because of the browser's connection limit, only 2 physical connections per domain are used to download these resources, resulting in ~7 seconds of pure download time from this domain. Using multiple domains, a technique called Domain Sharding, allows the browser to use more physical connections and download these resources in parallel, which ultimately results in a faster page load. The other point worth noting is the number of files that are downloaded: 125 resources are downloaded from the main domain before the onLoad event is triggered. By merging JavaScript and CSS files and by spriting images (where possible) this number can be reduced significantly, resulting in fewer round trips and therefore a faster page load as well.
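
For illustration, here is a minimal sketch of how sharded image URLs are typically generated. The two shard hostnames are hypothetical (they would need their own DNS entries pointing at the same content); the important detail is that a given path always maps to the same shard, otherwise returning visitors would re-download resources they already have in their cache.

```python
# Sketch of domain sharding: the shard hostnames below are hypothetical.
import hashlib

SHARDS = ["img1.frankfurt-airport.com", "img2.frankfurt-airport.com"]

def shard_url(path: str) -> str:
    # Hash the path so the same resource always maps to the same shard;
    # otherwise a returning visitor would re-download already cached content.
    index = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16) % len(SHARDS)
    return "http://%s%s" % (SHARDS[index], path)

print(shard_url("/images/terminal1.jpg"))
print(shard_url("/images/logo.gif"))
```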

Content is delivered by 28 web domains. Most images are served from the frankfurt-airport.com domain, slowing down page load

My next step is to take a closer look at caching. Browsers can cache content such as static images, style sheets or JavaScript files. This makes perfect sense for content that doesn't change frequently. In order to verify correct cache settings I record another session by browsing to the home page a second time. If caching is configured correctly my browser should not retrieve certain resources from the server but just take them from the local browser cache. The Summary view looks good; it seems that most of the resources are actually retrieved from the cache:

Most of the images, CSS and JavaScript files are now taken from the browser cache

Looks good, but wait: let's not be deceived. We still see a very high value for Server and Transfer time. Based on my experience this means that, even though content is retrieved from the cache, the browser sent an HTTP request to the server for each individual resource to "ask" whether the content has been modified (IF-MODIFIED-SINCE) since the last time the resource was downloaded. This is OK if I haven't checked the website for weeks or months, but it is not OK if the last page visit was only minutes ago.
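
If you want to reproduce this behaviour outside the browser, a few lines of Python (standard library only) are enough. This is just a sketch: the URL is illustrative, and the resource needs to send a Last-Modified header for the conditional request to be meaningful.

```python
# A sketch of the revalidation round trip: replay the conditional GET a browser
# sends once a cached copy has expired.
import urllib.error
import urllib.request

URL = "http://www.frankfurt-airport.com/"   # illustrative; use any static resource

first = urllib.request.urlopen(URL)
first.read()
last_modified = first.headers.get("Last-Modified")   # may be absent on dynamic pages

headers = {"If-Modified-Since": last_modified} if last_modified else {}
request = urllib.request.Request(URL, headers=headers)
try:
    second = urllib.request.urlopen(request)
    print("Body was sent again, status", second.status)
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("304 Not Modified: no body, but still one full round trip per resource")
    else:
        raise
```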

A closer look at the Network Requests view reveals the problem. The Expires header is actually set "to the past". I recorded my session on April 20th 2010 at 09:38 GMT. The Expires header is set to April 19th, which was yesterday. That is the reason why my browser has to send an HTTP request to the server for every "cached" element to check whether there is a newer version of the resource on the server. The Server column shows us how much time is spent on the server for each request to determine whether the resource has changed. The Wait column tells us how long individual requests had to wait to be processed (again caused by the browser's connection limit: only 2 connections are available per domain, so all other requests have to wait).

Expires header in the past causes browser to send IF-MODIFIED-SINCE requests for every cached resource
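
A quick way to check any individual resource for this problem is to compare its Expires header against the current time. The following sketch uses only the Python standard library; the URL is illustrative and the check assumes a well-formed HTTP date.

```python
# A sketch: is the Expires header already in the past when the resource arrives?
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.request

URL = "http://www.frankfurt-airport.com/"   # illustrative; point it at any "cached" resource

response = urllib.request.urlopen(URL)
expires = response.headers.get("Expires")

if expires is None:
    print("No Expires header at all: freshness is left to browser heuristics")
else:
    # Assumes a well-formed HTTP date; some servers send "0" or "-1" instead.
    expires_at = parsedate_to_datetime(expires)
    if expires_at <= datetime.now(timezone.utc):
        print("Expires (%s) lies in the past: every repeat view triggers a revalidation request" % expires)
    else:
        print("Fresh until %s: repeat views need no request at all for this resource" % expires)
```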

The Network view shows us almost all HTTP headers. Due to the nature of the dynaTrace AJAX plugin in IE we do not get ALL HTTP headers, but we do get the most interesting ones. Our users have already requested this feature on our Community Wish List. For now I propose using a network sniffer or proxy such as Fiddler, HttpWatch or Charles in case you need more detail than the AJAX Edition provides.

How to improve the performance

In theory it is pretty simple to improve performance on sites like this. I say "in theory" because some of the proposed changes require work on the web server or in the web deployment. Here is a list of proposed changes and the estimated performance gain for each:

  • Use HTTP 1.1 or at least Connection: Keep-Alive: The web server runs on HTTP 1.0 and forces the browser to close the physical connection after each request. Use Connection: Keep-Alive to avoid unnecessary reconnect efforts (see the sketch after this list for the reconnect cost).
    • Estimated Gain: 100-200ms (check the Connect Column in the Network View)
  • Far Future Expires Header: for those elements that change very infrequently, use an Expires header far in the future
    • Estimated Gain for returning users: 4-6s (depending on how many objects can really be cached long time)
  • Merge CSS: Merging all 22 CSS files into a single CSS file would eliminate Wait Time and reduce Server and Transfer Time due to reduced HTTP Roundtrips
    • Estimated Gain: 1.3s in Wait Time, 1-2s in Server-Time and Transfer Time (assuming we can merge them)
  • Merging JavaScript: 21 JavaScript files come from the main domain. Merging these eliminates Wait Time and reduces Server and Transfer Time due to reduced HTTP Roundtrips
    • Estimated Gain: 300-500ms
  • Domain Sharding: Spreading the 75 images served from the main page across 2 additional image sub-domains allows the browser to download 4 images in parallel. It also allows other content from the main page, e.g. AJAX requests, to be downloaded without waiting for image downloads
    • Estimated Gain: 2-3s
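
To get a feeling for what the first item means in practice, here is a small sketch using only the Python standard library. The hostname is illustrative and a small static resource would give a fairer comparison; it simply issues the same request several times, first over a brand-new TCP connection per request and then over a single connection that it tries to reuse.

```python
# Sketch: the cost of re-connecting per request vs. reusing one connection.
import http.client
import time

HOST = "www.frankfurt-airport.com"   # illustrative host
PATH = "/"                           # ideally a small static resource
N = 10

# Variant 1: a brand-new TCP connection per request, which is what HTTP 1.0
# without Keep-Alive forces on every single resource of the page.
start = time.time()
for _ in range(N):
    conn = http.client.HTTPConnection(HOST, timeout=10)
    conn.request("GET", PATH)
    conn.getresponse().read()
    conn.close()
print("%d requests, new connection each time: %.2fs" % (N, time.time() - start))

# Variant 2: one connection for all requests (persistent by default with HTTP/1.1).
# Note: if the server insists on closing the connection, http.client quietly
# reconnects, so nearly identical timings for both variants are themselves a
# hint that Keep-Alive is not being honored.
start = time.time()
conn = http.client.HTTPConnection(HOST, timeout=10)
for _ in range(N):
    conn.request("GET", PATH)
    conn.getresponse().read()
conn.close()
print("%d requests over a reused connection:  %.2fs" % (N, time.time() - start))
```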

Conclusion

Small things that are often missed, like a wrong Expires header, make a huge difference in web site performance. If the website of Frankfurt Airport had followed some of the best practices from Google or Yahoo, or those that we give here on the dynaTrace blog, I am pretty sure many travellers would have been able to reach their website on Sunday (even though we would still have been stranded).

As always, here is a list of additional blogs and material that I encourage everybody to read:

  • Steve Souders Blog
  • How to Speed Up sites like vancouver2010.com by more than 50% in 5 minutes
  • How to analyze and speed up content rich web sites like www.utah.travel in minutes
  • Webinar with Monster.com on Best Practices to prevent AJAX/JavaScript performance problems

 

Comments

  1. Markus Leptien says:

    Seems they have now switched to HTTP 1.1, but Keep-Alive is still not configured, so it is still closing the connection after each asset.
    This actually makes things WORSE than before: with HTTP 1.0 it also closed the connection, but it was at least using 4 connections in parallel. Now with HTTP 1.1 it is down to the 2 connections you mentioned, and still closing the connection.
    Additionally, apart from 1 asset, NONE of the assets are gzipped. And on a side note: the page starts with 3 redirects.
    Finally, if they have more time on their hands, they could easily reduce the number of HTTP requests by something like 40 if they did spriting.

    Unbelievable that large corporations like Frankfurt Airport have a site like this…

    Kind regards,
    Markus

  2. @Markus: good points on the redirects and the connection close. I stopped being surprised about these things after seeing it every day when browsing the web :-)

  3. Good point. It's hard to believe Frankfurt is/was unaware of the points you mentioned.

    Regards,
    Rajan

  4. Thomas Falkenberg says:

    The newest Versions of all the common browsers use more than two connections (despite the fact that the HTTP 1.1 spec says you SHOULD not use more than two).

    My Firefox 3.6 (about:config – network.http.max-persistent-connections-per-server) uses 6. Opera 10 uses 8, IE 8 uses 6. IE 7 still uses only 2 connections.
    Was that the version you used for your test?

    Considering this, the total number of connections during high load increases (non-linearly with the newer browsers) with the number of elements to load, so the risk of the web servers running out of connections increases accordingly.

    If this were the case, using keep-alive might be a bad idea.

    What do you think?

  5. Thomas Falkenberg says:

    I forgot to add that of course the caching-issue needs to be resolved first. This would greatly reduce the number of connections.

  6. @Thomas: all good points. I used IE7 in my example (I know, an old browser, but unfortunately still a major player).
    Modern browsers that use more physical connections by default allow faster parallel downloads, but, as you correctly pointed out, they open more connections to the server. For that reason you should follow the best practices of merging files (CSS, JavaScript, images) or use a CDN (Content Delivery Network). A CDN serves mostly static content to the end user from a location as close as possible to the end user, thus taking pressure off the application server (as only dynamic requests make it to the app server), and it also speeds up delivery of static objects as CDN servers are usually closer to the end user than your app server.

    Keep in mind: not using keep-alive means that you always pay the penalty of re-establishing a connection FOR EVERY resource on the page. If you have >100 elements (like on this page) it means that you pay a high price in end-user performance.

    The best way to deal with a situation like the one analyzed here is really to bring down the number of resources; this solves most of the highlighted problems.

    Makes sense?

  7. Markus Leptien says:

    @Thomas:
    Actually it is still linear, though steeper.

    And there is a trade-off: you might have a higher number of open connections, but less processing (by a factor of ~40 in this case) for TCP handshakes and tear-downs, and the page loads faster, so each connection is open for a shorter time.

    And as long as they used HTTP1.0 it was:
    Browser      HTTP/1.1   HTTP/1.0
    IE 6,7       2          4
    IE 8         6          6
    Firefox 2    2          8
    Firefox 3    6          6
    Safari 3,4   4          4
    Chrome 1,2   6          ?
    Chrome 3     4          4

  8. Thomas Falkenberg says:

    @Andreas: of course a CDN (or a cloud ;-) is the way to go for a high-traffic, international website with a high volume of static elements. This usually doesn't take pressure off the application server though, but off the web server, but I guess that is what you meant.
    I totally agree that bringing down the number of resources is the key here. My point is that you have to know and monitor your connections (consuming handles, RAM) BEFORE switching on keep-alive. Otherwise you risk a complete outage. I've seen exactly this case before, which is why I wanted to mention it.
    @Markus: I was referring to a long-term trend, where newer browser versions gain a bigger market share. This will bring your total connections up in a non-linear manner, but I know what you meant.
    I don’t understand the numbers, what do they represent?

  9. Markus Leptien says:

    @Thomas: Yes, sorry, the blog software stripped the formatting. What that meant is that, for example,
    Firefox 3 uses 6 connections in parallel with HTTP 1.0 and 6 with HTTP 1.1. IE 8 is the same.
    Safari 3 and 4 use 4 connections with HTTP 1.0 and 4 with HTTP 1.1. Chrome is the same.
    So with the newer browsers, switching between HTTP 1.0 and HTTP 1.1 will not reduce the number of concurrent connections.

  11. Great analysis Andreas, hope you made it safely back after we met in Linz :) I also had a hard time digging for information below the ash. Most airlines and airports were either down or not providing updates…

    I am pretty sure that almost all sites that serve most of their content themselves would work much better with a 1-2 second keep-alive than without. Sure, you leave the last 1 or 2 connections open for 2 seconds longer than you would have needed. But that should not be the limiting factor if the server is already serving content for 5 or more seconds.

    If you do not serve content besides HTML, because you don't have it or it is served off a CDN, I would not turn it on :)

  13. History repeating. Berlin Airport almost shut down due to snow. Webpage often not reachable due to overload. Closer look reveals:
    1. No GZIP and no minification at all (Neither HTML/JS/CSS)
    2. No Keep-alives
    3. No Caching-Headers
    4. Suspicious ETAG-Headers
    5. More than 0.5 MB of page weight; one image alone is more than 25% of it.
    6. 4 blocking JavaScript files right at the beginning of the page
    7. No spriting

    http://www.berlin-airport.de

  14. Hi Markus, yeah, it is interesting that there are still so many sites out there that do not follow these best practices. I can't say that everything is easy to change, but most changes should be easy.

  15. Henry S says:

    Hi Andreas, I’m just now reading this much later than posted. What I find amazing and funny is that once you start looking at this type of data, you find it everywhere you go.

    Here you are, stuck at an airport, and you end up doing analysis of the sites that are providing you information, looking for clues as to why your usage experience is what it is. Then you end up creating a blog post about what you have found, and this is in the wild, not a lab testing example.

    One would wonder if this ever gets back to the companies that can actually do something about your experience this time, or if this ends up being information that the rest of us take back and then look at our own companies to see if we have created the same type of problems.

    For instance we were fine with the number of VPN connections we put into place until we had a snow storm and everyone wanted to work via VPN. Then we ran out big time. Yet prior to that, management did not want to provision that many VPN connections. Now of course, that is exactly what was approved. But after the fact of course. Ready for next time.

    Just like your experience with the airline website: it was fine until something drove tons more traffic to it. Then it fell flat and performed poorly.

    Life sure has a way of doing that to us.

  16. Hi Henry

    I actually have a little success story on one of my blogs. Two years back I posted a similar post about the Fifa World Cup page: http://blog.dynatrace.com/2010/06/04/hands-on-guide-verifying-fifa-world-cup-web-site-against-performance-best-practices/
    In this case the responsible engineering team of that page reached out to me directly, assuring me that they were already working on improvements. With the next FIFA tournament coming up in just 2 weeks, we will see if these improvements are still there :-)

    Andi
