About the Author

Andreas Grabner has been helping companies improve their application performance for more than 15 years. He is a regular contributor to the Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi.

How to Triple Throughput and Improve Application Performance Through End-to-End Testing

Thanks to the great folks who help our customers with their application performance problems, we can share some of their stories on this blog. We hope that you, as the person responsible for application performance in your own organization, can leverage these findings to prevent the common problem patterns we see out there in the real world.

In this blog I want to highlight some typical problems in web applications that can easily be identified through load testing and that can lead to significant improvements in throughput and performance. In this case, 94% faster transaction performance was achieved and throughput was tripled, all by fixing deployment problems on the Web Server. Here is the story of how they did it!

Challenge: Is End User Response Time acceptable or not? If not – where is the problem?

Load Tests are great. They tell you whether your application can handle the simulated load while staying within acceptable response times for the tested transactions. But when you just look at the average response time as measured on the web servers, it is hard to tell (a short measurement sketch follows this list):

  • Do we have a performance problem at all?
  • How can we improve the performance?
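To answer either question you need per-transaction response time measurements rather than a single overall number. As a purely illustrative sketch (not the load testing tool used in this story), here is what collecting such measurements from a hypothetical test system could look like; the URL, transaction paths and load numbers are assumptions:

```python
# Minimal load-test sketch: fires a number of concurrent "transactions" against
# a hypothetical test system and records each response time per transaction type,
# so the analysis later on can go beyond a single overall average.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

BASE_URL = "http://test.example.com"                              # hypothetical test system
TRANSACTIONS = {"Login": "/login", "Search": "/search?q=phone"}   # illustrative only

def run_transaction(name, path):
    start = time.perf_counter()
    with urlopen(BASE_URL + path) as response:
        response.read()                              # consume the full response body
    return name, time.perf_counter() - start         # (transaction type, seconds)

def run_load_test(iterations=1000, concurrency=50):
    results = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(run_transaction, name, path)
                   for _ in range(iterations)
                   for name, path in TRANSACTIONS.items()]
        for future in futures:
            results.append(future.result())
    return results                                    # list of (name, response_time)
```

A real load test would of course also handle user ramp-up, think times and errors. The point here is only that every measurement keeps its transaction type, which is what the analysis below relies on.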

The following shows a typical graph you get from a load testing tool or by analyzing your web server logs. The test that was executed simulated constant load after a short warm-up period. The results show that Average Transaction Response Time increased slightly over time, with one outlier of up to 3 seconds. The throughput of the system (Transaction Count), on the other hand, went slightly down. This is to be expected when response time goes up. The question is – is this a problem? Is an average of 1.5 seconds a bad User Experience?

Declining Transaction Performance on both web servers also leads to less throughput
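Averages alone cannot answer that question; percentiles and outliers per time interval tell you far more about what individual users experienced. A small sketch, assuming each measurement is a (timestamp, response time) pair, e.g. parsed from the web server access log:

```python
# Sketch: summarize response times per minute. The average (as plotted above)
# hides outliers, so the 95th percentile and the maximum are reported as well.
from collections import defaultdict
from statistics import mean, quantiles

def summarize(measurements):
    """measurements: list of (timestamp_in_seconds, response_time_in_seconds) pairs."""
    per_minute = defaultdict(list)
    for timestamp, response_time in measurements:
        per_minute[int(timestamp // 60)].append(response_time)
    for minute in sorted(per_minute):
        times = per_minute[minute]
        p95 = quantiles(times, n=20)[18]   # 19 cut points -> index 18 is the 95th percentile
        print(f"minute {minute}: avg={mean(times):.2f}s  p95={p95:.2f}s  max={max(times):.2f}s")
```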

DO NOT TRUST Average Values: Focused analysis is required to identify problems!

One lesson that all of our customers have learned is that you do not want to analyze your performance by looking at the average execution time of ALL of your simulated transactions. This gives a distorted picture, as certain transactions will always be fast because they are optimized, while others are slow because there really is a problem. If you look at ALL of them at once – and then just at averages – it is very likely that you will never notice that you actually have a problem, as it will hide behind the statistically calculated values.

Therefore you need to focus your analysis on the individual transaction types that you test. The following screenshot shows a performance breakdown of the individual tested transactions. The chart on top shows that certain transactions have a significant increase in response time, while others only show a slight increase. On average the application is not performing too badly – but it is these individual transactions under load that are the real problem for the end users. It is even worse if these are transactions that are critical to your application:

Different transaction types perform differently. Looking at overall averages would not reveal these problems

The breakdown by tested transaction shows us that there are at least two transactions with spikes of up to 21s. One of them is the Login transaction, which is critical to the application. Now it is time to focus the next analysis step on these transactions in order to get rid of the “statistical noise” from the other transactions that actually ran fine.
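With per-transaction measurements like the ones collected above, this kind of breakdown is straightforward to compute yourself; a minimal sketch (the transaction names are illustrative):

```python
# Sketch: break the measurements down by transaction type so that a slow Login
# is no longer hidden behind the average of all transactions.
from collections import defaultdict
from statistics import mean

def breakdown_by_transaction(results, focus=None):
    """results: list of (transaction_name, response_time_in_seconds) tuples."""
    by_type = defaultdict(list)
    for name, response_time in results:
        if focus is None or name in focus:
            by_type[name].append(response_time)
    for name, times in sorted(by_type.items()):
        print(f"{name:20s} count={len(times):5d}  avg={mean(times):.2f}s  max={max(times):.2f}s")

# Example: zoom in on the suspicious transaction type only
# breakdown_by_transaction(results, focus={"Login"})
```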

LOOK AT THE End-to-End View: It shows you where your problems are

The next step in the problem analysis is to look beyond the response time measured on the web server. Analyzing the full End-to-End view reveals which component in the infrastructure contributes the most to the overall performance. This allows you to attack the problem where it happens instead of trying to improve components that may actually work really well. The following image shows the Transaction Flow Visualization of each individual request that was generated during the load test for the transaction type we are focusing on. Instead of just showing the response time as perceived by the end user (or the simulated virtual user), it shows how much each component along the transaction execution contributed to the response time. It is easy to spot that this problem is not related to the four Java Application Servers but can be found on the two load-balanced Web Servers, where 87% of the time is spent:

Analyzing the flow of the tested transaction reveals the component we need to focus our performance analysis on
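The same kind of attribution can be expressed in a few lines, assuming each traced request can be exported as a mapping from tier to time spent (the tier names and the export format are assumptions, not the monitoring product's API):

```python
# Sketch: attribute response time to infrastructure tiers and print each tier's
# share of the total, so the biggest contributor (here: the Web Servers) stands out.
from collections import defaultdict

def time_per_tier(traced_requests):
    """traced_requests: list of dicts, e.g. {"Web Server": 0.9, "App Server": 0.1}."""
    totals = defaultdict(float)
    for request in traced_requests:
        for tier, seconds in request.items():
            totals[tier] += seconds
    grand_total = sum(totals.values())
    for tier, seconds in sorted(totals.items(), key=lambda item: -item[1]):
        print(f"{tier:15s} {100 * seconds / grand_total:5.1f}% of total response time")
```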

Typical PROBLEM PATTERNS on the Web Server

I recently blogged about the typical deployment problems that happen when moving an application from test to production. In the case described in this blog it was a combination of misconfigured Web Server settings (Max Connections and misconfigured modules). Other problems we typically see are oversized web pages that put too much load on the web server just to deliver that content.
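As one example of how such a misconfiguration can be caught early, here is a small, purely illustrative sanity check, assuming an Apache httpd 2.4-style configuration; the file path and the expected-concurrency threshold are assumptions:

```python
# Sketch: flag a MaxRequestWorkers value that is lower than the concurrency we
# plan to test with. If it is too low, incoming requests queue up on the web
# server and response times climb under load.
import re

def check_max_request_workers(conf_path="/etc/httpd/conf/httpd.conf", expected_concurrency=500):
    with open(conf_path) as conf_file:
        conf = conf_file.read()
    match = re.search(r"^\s*MaxRequestWorkers\s+(\d+)", conf, re.MULTILINE)
    if match is None:
        print("MaxRequestWorkers not set explicitly - the compiled-in default applies.")
    elif int(match.group(1)) < expected_concurrency:
        print(f"MaxRequestWorkers={match.group(1)} is below the planned concurrency "
              f"of {expected_concurrency}; requests will queue on the web server.")
```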

IMPROVEMENT: 3x Throughput and 94% Performance Gain

After fixing the problem the customer can now run up to about 30,000 transactions per Web Server instead of 10,000. The average response time also went down from ~1.19s to ~68ms. Not only is this great for the end user experience, it also means that the existing hardware can be leveraged much better and supports many more users than originally anticipated. The following shows the final charts and the transaction flow visualization of a test that was re-run after all identified problems had been addressed:

Much Higher and Constant Throughput and Performance after fixing the identified performance problems
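A quick back-of-the-envelope check of the numbers quoted above, using only the figures from this story:

```python
# Response time dropped from roughly 1.19s to 68ms; throughput per web server
# rose from ~10,000 to ~30,000 transactions.
before_response_time, after_response_time = 1.19, 0.068    # seconds
improvement = 100 * (before_response_time - after_response_time) / before_response_time
print(f"Response time improvement: {improvement:.0f}%")    # ~94%
print(f"Throughput factor: {30_000 / 10_000:.0f}x")        # 3x
```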

THERE IS MORE: Browser, CDNs, Network, Web Servers, Application Servers, Databases, …

Obviously, problems cannot always be found in just one component. Typically, when you address one problem it shifts to the next component, e.g. too many database calls executed per transaction, too-heavy JavaScript libraries in the browser, or cross-application impact in your infrastructure. Here are some links with additional reading material and more stories from the real world:

If you have your own stories that you want to share, feel free to contact us.
