About the Author

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor to the Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi.

Don’t let your Load Balancers ruin your Holiday Business

An eCommerce site that crashes seven times during the Christmas season and is down for up to five hours each time loses a lot of money and suffers serious reputation damage. That is exactly what happened to one of our customers before we started working with them. They shared their story, and what they learned from it, at our annual Perform conference earlier this month. Among the several reasons that led to these crashes, I want to share more details on one that I also see frequently on other websites: load balancers set to Round-Robin instead of Least-Busy can easily lead to App Server crashes caused by heap memory exhaustion. Let's dig into the details and see how to identify these problems and how to avoid them.

The Symptom: Crashing Tomcat Instances

The website is deployed on six Tomcats behind three frontend Apache Web Servers. During peak load hours, individual Tomcat instances started showing growing response times and a growing number of requests in the Tomcat processing queue. After a while these instances crashed with out-of-memory errors, which also brought down the rest of the site because the remaining servers could no longer handle the load. The following image shows the actual flow of transactions through the system, highlighting unevenly distributed response time in the Application Servers and functional errors being reported on all tiers (red-colored server icons):

Even with equally distributed load (Round Robin Load Balancer Setting) one of the Tomcats spiked in Response Time Contribution before crashing
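
If you are not using an APM solution, you can watch for both of these symptoms (a filling request queue and a shrinking heap) yourself through JMX. The following is only a minimal sketch, not the customer's setup: it assumes remote JMX is enabled on the Tomcat JVM, and the host, port, and connector name ("http-bio-8080") are placeholders you would need to adapt to your own server.xml.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.openmbean.CompositeData;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class TomcatHealthProbe {
        public static void main(String[] args) throws Exception {
            // Placeholder host and port - remote JMX must be enabled on the Tomcat JVM
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://app-server-1:9010/jmxrmi");

            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();

                // The connector name depends on server.xml (e.g. http-bio-8080, http-nio-8080)
                ObjectName pool = new ObjectName("Catalina:type=ThreadPool,name=\"http-bio-8080\"");
                ObjectName memory = new ObjectName("java.lang:type=Memory");

                int busy = (Integer) mbs.getAttribute(pool, "currentThreadsBusy");
                int max = (Integer) mbs.getAttribute(pool, "maxThreads");
                CompositeData heap = (CompositeData) mbs.getAttribute(memory, "HeapMemoryUsage");
                long usedMb = ((Long) heap.get("used")) >> 20;
                long maxMb = ((Long) heap.get("max")) >> 20;

                System.out.printf("busy request threads: %d/%d, heap: %d MB of %d MB%n",
                        busy, max, usedMb, maxMb);
            }
        }
    }

A busy-thread count pinned at maxThreads combined with heap usage close to its limit is exactly the state the crashing instances were in before they went down.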

Once the App Server started rejecting incoming connections, we could observe the first ripple effect of errors: a very high number of Exceptions in the Database Layer, Exceptions thrown between Application Tiers, and the Web App responding with HTTP 500s:

Over 30 minutes the application serves 43000 pages with an HTTP 500 Response correlating to Exceptions in the Database and Inter-Tier Communication

The Root Cause: Inefficient Database Statements and Connection Pool Usage

The Exceptions caught in the Database Layer (JDBC) were already a very good hint at the root cause of this problem. A closer look at the Exceptions shows that the connection pools were exhausted, which caused problems in several components of the application:

Exhausted Connection Pool causes Exceptions that impact Data Access Layer as well as Widget Rendering

Looking at the performance breakdown by application layer reveals how much impact connection pooling had on the overall transaction response time:

Due to the Connection Pool Problem a single request had to wait 3.8s on average to obtain a connection from the pool

It was not only the size of the pool that was the problem: several very inefficient database statements took a long time to execute for some of the application's business transactions, which caused the Application Server to hold on to its connections for much longer than normal. Because the load balancer was configured for Round-Robin, the affected App Server kept receiving additional requests. Eventually, just by the random nature of incoming requests, one App Server received several requests executing these inefficient database calls at once. Once its connection pool was exhausted, the application started throwing exceptions, which ultimately led to a crash of the JVM. And once the first App Server crashed, it didn't take long for the extra load to take the other App Servers down as well.
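
On the application side, the general countermeasure for this pattern is to hold a pooled connection for as short a time as possible and to make sure no single statement can block it indefinitely. The following sketch is illustrative only (the DataSource, the query, and the threshold are assumptions, not the customer's code): it measures how long the pool makes the caller wait, puts a timeout on the statement, and returns the connection via try-with-resources even when the query fails.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class ProductLookup {

        private final DataSource dataSource; // pool-backed DataSource (e.g. Tomcat JDBC, DBCP)

        public ProductLookup(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        public int countProducts(String category) throws SQLException {
            long start = System.nanoTime();

            // try-with-resources returns the connection to the pool even if the query fails
            try (Connection con = dataSource.getConnection();
                 PreparedStatement stmt = con.prepareStatement(
                         "SELECT COUNT(*) FROM products WHERE category = ?")) {

                long waitedMs = (System.nanoTime() - start) / 1_000_000;
                if (waitedMs > 500) {
                    // long waits here mean the pool is starving - the symptom described above
                    System.err.println("Waited " + waitedMs + " ms for a pooled connection");
                }

                stmt.setQueryTimeout(5); // a slow statement cannot hold the connection forever
                stmt.setString(1, category);

                try (ResultSet rs = stmt.executeQuery()) {
                    return rs.next() ? rs.getInt(1) : 0;
                }
            }
        }
    }

Logging long getConnection() waits makes the "3.8 seconds waiting for a connection" problem visible in your own logs well before the pool is completely exhausted.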

The Solution: Optimizing App and Load Balancer

The problem was fixed by identifying the slowest database statements and optimizing them, e.g., by adding indices on the database or rewriting the SQL statements to be more efficient. They also resized the connection pool to accommodate the expected load during peak hours.

They started by optimizing SQL Statements that took a long time to execute and those that got executed several times within the same transaction
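
The same idea applies to the pool configuration itself: size it for measured peak concurrency and fail fast rather than letting requests queue for seconds. Below is only a sketch of what such a configuration could look like with the Tomcat JDBC pool; all numbers, the JDBC URL, and the credentials are made-up placeholders, not the customer's values.

    import org.apache.tomcat.jdbc.pool.DataSource;
    import org.apache.tomcat.jdbc.pool.PoolProperties;

    public class PoolConfig {

        // Illustrative values only: size the pool from measured peak-hour concurrency,
        // not from guesses. URL, user, and password are placeholders.
        public static DataSource createDataSource() {
            PoolProperties p = new PoolProperties();
            p.setUrl("jdbc:mysql://db-host:3306/shop");
            p.setDriverClassName("com.mysql.jdbc.Driver");
            p.setUsername("shop_user");
            p.setPassword("secret");

            p.setMaxActive(100);            // enough connections for expected peak-hour load
            p.setMaxWait(2000);             // fail after 2s instead of queuing requests for 3.8s+
            p.setRemoveAbandoned(true);     // reclaim connections held by slow or leaking code paths
            p.setRemoveAbandonedTimeout(30);
            p.setLogAbandoned(true);        // log the stack trace of the code that held the connection

            DataSource ds = new DataSource();
            ds.setPoolProperties(p);
            return ds;
        }
    }

The abandoned-connection settings are particularly useful while the slow statements are still being fixed, because the log tells you exactly which code path held a connection for too long.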

They also changed the Load Balancer setting from Round-Robin to Least-Busy, which was the setting recommended by the LB vendor; this configuration had simply been forgotten in the production environment.
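
To make the difference between the two settings concrete, here is a small, purely illustrative sketch (the class and method names are invented and not any vendor's API): Round-Robin hands the next request to the next server regardless of its state, while Least-Busy picks the server with the fewest requests in flight, so an instance that is stuck on slow SQL automatically stops receiving new work.

    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    class Backend {
        final String name;
        final AtomicInteger activeRequests = new AtomicInteger(); // requests in flight on this node

        Backend(String name) {
            this.name = name;
        }
    }

    class BackendSelector {
        private final List<Backend> backends;
        private final AtomicInteger nextIndex = new AtomicInteger();

        BackendSelector(List<Backend> backends) {
            this.backends = backends;
        }

        // Round-Robin: the next server in line gets the request, no matter how loaded it is
        Backend roundRobin() {
            int i = Math.floorMod(nextIndex.getAndIncrement(), backends.size());
            return backends.get(i);
        }

        // Least-Busy: the server currently handling the fewest requests gets the next one,
        // so a node blocked on slow database calls stops accumulating additional work
        Backend leastBusy() {
            return backends.stream()
                    .min(Comparator.comparingInt(b -> b.activeRequests.get()))
                    .orElseThrow(IllegalStateException::new);
        }
    }

With Round-Robin, the selector keeps sending traffic to the Tomcat that is already drowning; with Least-Busy, that instance gets breathing room while the statements it is working on finish or time out.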

The Result: No Site Downtime Since

Since they made these changes to the application and the Load Balancer, the site has not gone down again. Now the next holiday season is coming up and they are ready for the seasonal spikes. Even though they are confident that everything will work without problems, they learned their lesson and are approaching performance proactively through proper load testing.

Next Steps: Proactive Performance Management

The lesson learned was that these problems could have been found before the holiday shopping season through proper load testing. They had done load testing before but never encountered this problem, for two reasons:

  1. They didn't test at expected peak volumes for long enough sessions, and
  2. They didn't use a tool that simulated realistic variations in customer behavior on their highly interactive web site (too few scripts, and the scripts were too simple).

Their strategy for proactive performance management is to:

  1. Perform load tests on the production system during low-traffic hours (2AM-6AM), accepting the risk of minor sales losses in case of a crash versus major sales losses during the holiday shopping season.
  2. Multiply the hourly load test volume by 2.5, compressing their 10-hour production peak into the shorter overnight test window.
  3. Use a load testing service that drives real browsers from different locations around the US.
  4. Use an APM solution that identifies problems within the application while the load test is running.

If you want to read more about common performance problems that are not found before moving to production, check out my recent series of blog posts: Supersized Content, Deployment Mistakes or Excessive Logging.

Comments

  1. Mario Guerrero says:

    Hi,

    Thanks for the post.

    Please, can you tell me which tools were used to produce the analysis shown in images 2, 3, 4 and 5?

    These images for monitoring the system are very good.

    Thank you.

  2. The best thing to address such performance issues is to enable the slow query log on MySQL and then optimize all the statements that get logged.

    This will dramatically improve speed. In my experience, over 90% of performance-related issues are directly related to the database.

    • I agree with you that a lot of performance problems are related to the database. But besides individual statements that are slow, we also often see applications that execute too many database statements, or that request the wrong data or simply too much of it. These are problems you then have to address in the application itself. If you are interested, check out some of my earlier blog posts, e.g.:

  3. J. O'Connor says:

    I avoid any “least busy, least connections, least response time” logic in load balancing methods, unless session persistence is required by the application. There are a couple of reasons why these “least-whatever” LB methods are potentially problematic.

    1.) These least-connection/least-busy LB methods typically implement session persistence as a byproduct, because the LB needs to keep track of “who is working where” to make its future routing decisions. In your example this may already have been expected/required, so it wasn’t much of a downside, but if you don’t need session persistence (or handle it at a tier below the LB), it’s a problem that can create hotspots.

    2.) Least-conn/least-busy algorithms also tend to use the source IP as the hash for the initial session allocation decision. If your customers hit your LB directly this may not be much of a concern, but it can be especially problematic if your traffic comes through an edge reverse proxy or CDN (all traffic with a limited number of source IPs). This can result in all traffic being distributed to a limited number of your hosts, so that some hosts are nearly idle while others are pegged.

    In my experience, the “dumbest” LB methods are the most scalable solution for MOST workloads (1st random, 2nd round robin). I would generally agree with the previous poster that you should NOT solve application and/or database/query performance issues with the LB method. And if you are absolutely convinced you need some “smart” LB method, test it in the most production-like manner possible.
