An Integrated Approach to Load Test Analysis, Part 2 – The Follow-up Test
This post was co-authored by Andreas Grabner, Team Lead for the Compuware APM Center of Excellence.
In a previous post, I demonstrated how to add more depth to the analysis of a Compuware APM Web Load Test by combining the external load results with the application and infrastructure data collected by the Compuware PureStack Technology™. But, now that we have tested the system once, what would happen if we tested it again after we identified and “resolved” the issues we found? Would running a test using the same parameters as in the initial test show a clear performance improvement? Would the system be able to achieve the desired load of 200 virtual users with little or no performance degradation?
This blog takes you through the steps you should follow in order to directly compare the results of 2 load tests and measure the performance improvement (or degradation) that occurred with the fixes put in place.
STEP 1: Identify issues and implement changes based on initial results
During the April 14 load test session, Andreas Grabner and I found that there were substantial performance concerns for the application under a load that is well in excess of what is currently seen even on the APM Community’s busiest days. The issue was that the load that caused the performance issues was well short of the goal of 200 virtual users (VUs) that the application team wanted to reach.
During the April 14 load test execution, a number of environment issues were identified. The critical ones addressed by the systems team included:
- Deployment of critical APM Community applications to different machines to prevent the application performance of one layer negatively affecting another layer
- Optimization of the way APM Community pages are built in the application layer to reduce CPU usage
- Optimization of cache settings in Confluence to reduce round trips to the database when loading commonly used objects
- Increase of the CPU power on the virtualized machines so that they can handle more load
STEP 2: Re-Run the test (With the same parameters!)
Once these steps were complete, a second test cycle was scheduled to determine if the updated environment would be able to reach the desired 200 VU target without encountering response time degradation. The follow-up load test was executed exactly 1 week later, on April 21, and used the same parameters as the initial test (see previous post for load ramping details). Using the same test parameters (load ramp, test scripts, testing locations, databanks, etc.) is critical in order to allow a like-for-like comparison to occur. Any deviation in the test configuration can skew the results and potentially lead to an unintended sense of confidence (or fear of implosion) regarding the application environment.
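One way to guard against accidental drift between test runs is to compare the two parameter sets before starting the follow-up test. The sketch below is only an illustration: the function name, the parameter names, and the values are all hypothetical, and the real settings live in whatever load testing tool is in use.

```python
def config_differences(baseline: dict, follow_up: dict) -> dict:
    """Return {parameter: (baseline_value, follow_up_value)} for every mismatch."""
    absent = "<absent>"  # marker for a parameter that exists on one side only
    differences = {}
    for key in sorted(baseline.keys() | follow_up.keys()):
        before = baseline.get(key, absent)
        after = follow_up.get(key, absent)
        if before != after:
            differences[key] = (before, after)
    return differences


# Hypothetical parameter sets, invented for this example.
april_14 = {"peak_vus": 200, "script": "community_browse", "locations": ("site_a", "site_b")}
april_21 = {"peak_vus": 200, "script": "community_browse", "locations": ("site_a", "site_b")}

# An empty dict means the two runs are configured like-for-like.
print(config_differences(april_14, april_21))
```

Any non-empty result is a reason to stop and reconcile the configurations before drawing conclusions from a comparison.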
When the April 21 round of load testing was complete and we began to analyze the results of the test, the initial data (higher throughput, faster response times, lower CPU utilization and a reduction in the amount of database load) suggested that this load test was substantially more successful than the previous test execution. This initial conclusion was based on the performance charts containing the same metrics we used to analyze the April 14 test, which showed a direct comparison of critical measures, demonstrating if the pattern of performance had dramatically changed between the two test executions.
STEP 3: Compare the Results
So, to start the comparative analysis, we took three key metrics of the April 14 and April 21 results and charted them together: External Web Load Test Average Response Time; Web Load Test Transactions per Minute; and percentage of CPU Utilization on the web server. Using just these three comparisons, it is clear that the two load tests had very different performance profiles.
Starting with the Web Load Test Average Response Times (the time required to completely download all of the content in the scripted synthetic transactions used in the load test), it is very clear that after 08:50 EDT, 40 minutes into both tests, the response times diverged and remained on different paths for the remainder of the comparative test run. From this point on, the April 21 load test averaged load times that were around 50% faster than the April 14 test (Note: the Moving Average of percentage change averages 5 minutes of response time change to produce a clearer trend line). It took nearly 20 more minutes for Average Transaction Response Time to reach 20 seconds on April 21, even with load being applied at the same volume as in the April 14 test.
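The percentage change and its 5-minute moving average can be sketched as follows. This is not the charting tool's actual calculation: it assumes one sample per minute from each test, aligned by elapsed test time, and a trailing window. The function names and the sample values are made up for illustration.

```python
def percent_change(baseline, follow_up):
    """Per-minute percentage change of follow_up relative to baseline."""
    return [(new - old) / old * 100.0 for old, new in zip(baseline, follow_up)]


def moving_average(values, window=5):
    """Trailing moving average; early points average the samples available so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up per-minute average response times, in seconds.
april_14 = [8.0, 12.0, 16.0, 20.0, 20.0, 24.0]
april_21 = [8.0, 9.0, 8.0, 10.0, 10.0, 12.0]

change = percent_change(april_14, april_21)   # negative values mean April 21 was faster
trend = moving_average(change, window=5)
print(trend)
```

Smoothing the percentage change, rather than the raw response times, keeps the trend line readable even when individual minutes are noisy.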
The Web Load Transactions per Minute (the number of WLT transactions executed in a minute at that point of the load test) showed a pattern where the April 21 test also diverged from the April 14 test at 08:50 EDT. With the faster WLT Average Response Times, the April 21 test saw the system process 40-50% more transactions per minute than the April 14 test from 08:50 EDT until the end of the test cycle.
Much of this improvement can be tracked to the third metric: CPU utilization on the Web Server (the percentage of CPU used by the system and applications for performing all necessary activities on the machine). With more hardware and optimized page rendering processes helping out, the web server CPU was less heavily stressed throughout the April 21 test, reaching 100% utilization much later than in the April 14 test.
These three metrics are directly tied to the Number of Web Requests per Minute recorded at the Confluence application layer for the April 21 test. This metric peaked at 125-140 per minute during the April 21 test, compared to the April 14 test where the peak was at approximately 100 Web Requests per minute.
Despite the seeming success of the second load test on April 21, there were still issues that appeared. Building an integrated results chart for the April 21 load test shows that multiple performance events occurred once the load test reached the 100% CPU Utilization boundary (red vertical line in chart below). This appears to indicate that despite the improvements to the environment discussed above, there is still a CPU bottleneck present at higher loads.
An area of extreme contrast between the two tests was recorded in the Database Results. Database stats were clearly visible in the data from the April 14 test (see the aggregated performance metric chart in STEP 1), including a large spike in the number and length of queries just before the application reached the CPU bottleneck. But in order to find the same metrics in the April 21 test, you have to break out your microscope and look very closely at the bottom of the chart.
The reduction in database load was the direct result of the optimized cache settings enabled after the April 14 load test. With more of the data being stored in the application cache, the number of calls to the database decreased, removing this layer as a potential bottleneck at this load volume.
STEP 4: Results and Next Steps
The lack of a sudden spike in Confluence/Atlassian processing time in the April 21 test (along with the accompanying database spike) was due to the removal of an application layer process that had been scheduled to run during the load test period. This process, and its effects on the systems and user experience, was quickly recognized once Andreas reviewed his data. Once the job that caused this issue was identified, it was removed in time for the April 21 test, completely eliminating a performance bottleneck that was encountered early in the April 14 test.
Lesson learned: Don’t schedule system intensive jobs to run during peak traffic periods; find a window with the lowest traffic volume to perform these tasks so that the fewest visitors possible are affected.
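Finding that low-traffic window can be as simple as scanning hourly request counts for the quietest stretch. The sketch below assumes a list of 24 hourly totals is available from access logs or monitoring data. The function name and the traffic figures are invented for the example.

```python
def quietest_window(hourly_requests, window_hours=1):
    """Return (start_hour, total_requests) for the quietest window.

    The window wraps past midnight, so a job may start at 23:00 and end at 01:00.
    """
    hours = len(hourly_requests)
    best_start, best_total = 0, None
    for start in range(hours):
        total = sum(hourly_requests[(start + offset) % hours]
                    for offset in range(window_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Invented hourly request counts, index 0 = 00:00-00:59.
hourly = [120, 80, 40, 30, 35, 60, 150, 400, 900, 1200, 1300, 1250,
          1100, 1150, 1200, 1000, 800, 600, 450, 350, 300, 250, 200, 150]

print(quietest_window(hourly, window_hours=2))
```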
As we noted at the start of this post, it appears on the surface that the April 21 load test was more successful than the April 14 test. Yet, despite the improved performance of the April 21 load test, the results show that there are still performance concerns that need to be addressed. These concerns center around a dramatic spike in response times between 09:40 and 09:50 EDT, occurring after the load test had been running for 90 minutes.
When the system began to show degraded performance, it could easily be tracked using the 3 key metrics: WLT Average Response Time; WLT Transactions per minute; and CPU Utilization. Running transactions began to take much longer to execute, which decreased both the number of incoming web requests to the application layer and the number of transactions per minute executed by the load generation system. The root cause can be seen in the chart below, which removes some of the data series.
The period of degradation that was detected during the load test started at 09:40 EDT and coincided with:
- The Web Load Test reaching 167 VUs
- CPU on the web server measuring 100%
- Web Load Test Transactions per Minute averaging 130
- Confluence Web Requests (the application layer of the APM Community Portal) measuring 135 per minute
Interestingly, after 10 minutes, this issue cleared up completely, except for transaction response times. The response times did not return to pre-spike values, but were now averaging almost 20 seconds higher than before the spike. With the system now peaked at 200 VUs and no additional load being generated, it was interesting to see that other metrics returned immediately to their pre-spike levels – notably Transactions per Minute and Web Requests per Minute. So, with 33 more VUs than before the spike, the system again appeared to be directly affected by a CPU bottleneck, as a higher load could not increase the number of requests processed at the application layer.
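A rough sanity check on this observation is possible with Little's Law in its closed-system form (the interactive response time law): virtual users = throughput × (response time + think time). The sketch below assumes each VU runs one scripted transaction at a time and that throughput settled back at roughly the pre-spike 130 transactions per minute. Both are simplifying assumptions, not measurements from the test.

```python
def cycle_time_seconds(virtual_users, transactions_per_minute):
    """Average time per VU per transaction (response time plus any think time).

    Follows from Little's Law for a closed system: N = X * (R + Z).
    """
    return virtual_users / transactions_per_minute * 60.0


# Figures taken from the text; constant throughput is an assumption.
before_spike = cycle_time_seconds(167, 130)
at_peak = cycle_time_seconds(200, 130)

print(round(at_peak - before_spike, 1), "seconds added per transaction cycle")
```

Under these assumptions the extra 33 VUs add roughly 15 seconds to each transaction cycle, which is in the same range as the increase in response times described above. When throughput is capped, additional load can only show up as waiting time.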
Out of this sea of metrics we determined that the performance of the April 21 load test saw a comparative improvement in the application when examined next to the April 14 load test, but the second test was still unable to reach the target of 200 VUs without suffering a bottleneck that caused performance to degrade dramatically.
Analyzing the degradation
To find the cause of the CPU bottleneck that prevented the April 21 test from reaching the goal of 200 VUs with little or no performance degradation, we have to dig deeper into the server-side metrics, especially those related to the health of the application server. The dip in transactions throughout the system is aligned with the issue captured when the system hit 167 VUs. The question is: Was the dip in transactions processed and the rise in transaction response times the result of this load volume or a symptom of the actual cause of the performance degradation?
When the system degrades, the server-side data shows that high Garbage Collection could be a problem, as this automated process happened at the same time. It is clear that executing a very intensive system process when the web server CPU was already exhausted can cause a very large performance degradation.
Looking at the application-server-specific transaction response times, it is easy to spot the potential problem. The following charts show that “yet another” background job is executed every hour, taking away CPU cycles from the already exhausted system.
Looking at these transactions reveals that the job is an hourly update job that synchronizes the cached user objects with the user directory database. This takes a considerable amount of time because we have 65k+ users on the APM Community system. This update job causes a lot of objects to be created and destroyed – hence the increased memory and GC activity.
As with the April 14 load test, the April 21 load test exposed issues with the system that prevented the achievement of the 200 VU goal. But now we have a clear culprit, so efforts can focus on reducing or eliminating the effect that this update process has on the system when it is under peak load.
In both tests, regardless of how you measure the “success” of a load test, something was learned about the system by aggregating metrics from inside and outside the infrastructure being tested. We now know that the optimizations that were performed after the April 14 load test allowed the system to process 40-50% more transactions per minute up to 167 VUs, at which point a scheduled system process caused a severe application degradation.
This data was only able to be turned into actionable information because we had a process in place that allowed results captured from inside the firewall to be easily aligned with the external results from the load test system. By doing this, the customer, albeit in a very controlled form, becomes a factor in the analysis of system performance.
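The alignment step itself can be sketched in a few lines. The example below is not how PureStack does it internally. It assumes both data sources can be exported as (timestamp, value) pairs with clocks in the same time zone, and it averages each source into one-minute buckets before joining them. The function names and the sample readings are invented.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def per_minute_average(samples):
    """Average (timestamp, value) samples into one-minute buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        minute = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").replace(second=0)
        buckets[minute].append(value)
    return {minute: mean(values) for minute, values in buckets.items()}


def align(external, internal):
    """Return (minute, external_avg, internal_avg) for minutes present in both sources."""
    outside = per_minute_average(external)
    inside = per_minute_average(internal)
    return [(m, outside[m], inside[m]) for m in sorted(outside.keys() & inside.keys())]


# Invented samples: external response times (seconds) and web server CPU (%).
external = [("2013-04-21 09:40:05", 20.0), ("2013-04-21 09:40:35", 30.0),
            ("2013-04-21 09:41:10", 40.0)]
internal = [("2013-04-21 09:40:00", 100.0), ("2013-04-21 09:41:00", 100.0),
            ("2013-04-21 09:42:00", 95.0)]

for minute, response_time, cpu in align(external, internal):
    print(minute, response_time, cpu)
```

Minutes that appear in only one source are dropped, so any clock skew between the load generators and the monitored servers needs to be corrected before the join.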
By creating a full performance perspective, PureStack delivers more than just deeper technical metrics on a system under load. PureStack places the experience of the visitor at the same level of importance as CPU, database, and web requests processed by the application layer when the results are analyzed. The importance of the user experience then dictates how infrastructure issues are prioritized and resolved, as the effect these issues have on end users provides real-world feedback into the true cost of performance issues that occur in your application during peak periods.
Using the data from this load test, it was realized that additional changes to the system were needed, especially in the area of page rendering, in order to further reduce CPU load and allow the system to reach and maintain a peak load of 200 virtual users. With the upgrade to the Confluence application software, deployed in early July 2013, it was expected that the desired goal would be reached. But assuming this is not sufficient, it is expected that an additional load test on the new Confluence system will occur in July 2013, once the system has completely stabilized. Using the same transaction paths as in the April 14 and 21 load tests, the system will be tested to confirm that the upgrade is delivering the expected performance.