It is common knowledge that large-scale testing should be done before releasing new software. Otherwise, how would you know whether the system works as expected under real user load? Today's story comes from one of our eCommerce customers that enhanced its Gift Card Balance Check with new features. Before going live, the company ran large-scale load tests to make sure everything worked correctly. The first tests run by the testing team highlighted:
- memory leaks that resulted in out-of-memory crashes
- performance impact due to incorrect AppServer deployment settings
- JVM overloads due to the wrong load balancing strategy
Let’s look a bit deeper into the actual problems the test team ran into and into the lessons learned.
Finding #1: Memory Leak Caused by Missing Cleanup Code
The balance check feature is typically used by customers calling in through an automated phone service. Additionally, there is a browser-based application that call center employees use to check the balance for the customers they have on the line.
For the load test, it was decided to test through the web interface – even though the majority of users will use the automated phone system. Every test the team ran quickly highlighted a memory leak in their frontend service JVMs that run in a cluster. This memory leak led to out-of-memory exceptions crashing their JVMs as shown in the following chart:
Analyzing the memory dump that was automatically triggered with every OOM Exception reveals the problematic objects on the heap. The hotspots are the Session objects created for every user session that queries the balance (whether via phone or web interface).
The vendor of the framework informed the test team that each of these session objects also holds a session to the backend application server. This session is closed when the end user drops the phone line. For the web interface there is a “manual” close connection button, which the load testing script called but which was not implemented correctly. That explained why these Session objects were still consuming so much memory: without a working close connection implementation, they were never properly cleaned up.
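The cleanup pattern that was missing can be sketched in Java as follows. This is only an illustration under assumptions: `BalanceSession`, `checkBalanceCents`, and `BalanceCheckDemo` are hypothetical names, not the framework's actual API, and the balance value is a dummy. The point is that the session releases its backend connection in `close()`, and that try-with-resources guarantees `close()` runs even when the query throws.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the framework's session object; the real class
// and its API are not named in this article.
final class BalanceSession implements AutoCloseable {
    // Counts sessions that still hold a backend connection.
    static final AtomicInteger OPEN_SESSIONS = new AtomicInteger();

    private boolean closed = false;

    BalanceSession() {
        OPEN_SESSIONS.incrementAndGet();
    }

    // Placeholder for the real balance query against the backend.
    long checkBalanceCents(String cardNumber) {
        if (closed) {
            throw new IllegalStateException("session already closed");
        }
        return 2500L; // fixed dummy value, for illustration only
    }

    // Releases the backend session; safe to call more than once.
    @Override
    public void close() {
        if (!closed) {
            closed = true;
            OPEN_SESSIONS.decrementAndGet();
        }
    }
}

final class BalanceCheckDemo {
    // try-with-resources calls close() even if the query throws.
    static long checkBalance(String cardNumber) {
        try (BalanceSession session = new BalanceSession()) {
            return session.checkBalanceCents(cardNumber);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            checkBalance("card-" + i);
        }
        System.out.println("open sessions: " + BalanceSession.OPEN_SESSIONS.get());
    }
}
```

Tracking a counter such as `OPEN_SESSIONS` during a load test is one simple way to notice that sessions are not being released long before the heap fills up.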
Lessons learned: The team members discovered a memory leak that would not have been easily spotted in the live system, because the majority of users query the balance via the phone, where the cleanup works correctly. However, because call center agents also run queries through the web interface, this problem would eventually have caused OOMs in production, where it would have been much harder to track down. On the other hand, the team also learned that it is not yet testing the real-life end user scenario. Simulating and testing “real” phone calls is the next step to work on.
Finding #2: What Deployment Settings are Required to Handle the Expected Load
With the new features in place, the company is expecting more people to access the balance check option. That’s why the test team ran its tests with a very high number of concurrent users. The team members saw that end user wait time was much higher than their targeted SLA (Service Level Agreement). A closer look at the WebSphere Active Thread Count revealed that their configuration wasn’t adapted to the higher user load.
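As a rough aid for reasoning about worker thread settings, Little's Law says that the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies that formula; the class name and the input numbers are hypothetical and are not taken from the customer's tests.

```java
// Minimal sketch: estimating how many worker threads are busy on average.
final class ThreadPoolSizing {
    // Little's Law: requests in flight = arrival rate x average time per request.
    static int estimateBusyThreads(double requestsPerSecond, double avgResponseSeconds) {
        return (int) Math.ceil(requestsPerSecond * avgResponseSeconds);
    }

    public static void main(String[] args) {
        double requestsPerSecond = 500.0;  // hypothetical peak arrival rate
        double avgResponseSeconds = 0.25;  // hypothetical average response time
        int busy = estimateBusyThreads(requestsPerSecond, avgResponseSeconds);
        System.out.println("estimated busy worker threads: " + busy);
    }
}
```

An estimate like this covers only the average. The pool limit needs headroom above it for bursts and slow requests, which is why the value should be validated under load rather than taken from the formula alone.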
Increasing Max Thread Count Brought Another Problem
So, they increased the max worker thread count to 160. Unfortunately, this change introduced a new problem: individual JVMs occasionally crashed. It seemed that the load balancer wasn't configured correctly and was sending too much load to servers that were already busy with processing-intensive requests.
Adjusting Load Balance Strategy
After changing the load balancing strategy from Round-Robin to Least-Busy, the team also solved the thread exhaustion problem because the load was distributed more evenly. The JVMs had enough headroom on the worker threads and all the crashes were gone.
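The difference between the two strategies can be illustrated with a small Java sketch. It is a simplified model, not the load balancer's actual implementation: `Backend`, `RoundRobinPicker`, `LeastBusyPicker`, and the request counts are all hypothetical. Round-robin hands out servers in a fixed order regardless of load, while least-busy picks the server with the fewest requests currently in flight.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of one JVM in the cluster.
final class Backend {
    final String name;
    final AtomicInteger activeRequests = new AtomicInteger();

    Backend(String name, int initialActive) {
        this.name = name;
        this.activeRequests.set(initialActive);
    }
}

final class RoundRobinPicker {
    private final List<Backend> backends;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinPicker(List<Backend> backends) {
        this.backends = backends;
    }

    // Cycles through the servers in order, ignoring their current load.
    Backend pick() {
        int index = Math.floorMod(next.getAndIncrement(), backends.size());
        return backends.get(index);
    }
}

final class LeastBusyPicker {
    private final List<Backend> backends;

    LeastBusyPicker(List<Backend> backends) {
        this.backends = backends;
    }

    // Picks the server with the fewest in-flight requests (first one wins ties).
    Backend pick() {
        Backend best = backends.get(0);
        for (Backend candidate : backends) {
            if (candidate.activeRequests.get() < best.activeRequests.get()) {
                best = candidate;
            }
        }
        return best;
    }
}

final class LoadBalanceDemo {
    public static void main(String[] args) {
        List<Backend> cluster = Arrays.asList(
                new Backend("jvm-1", 150),  // busy with expensive requests
                new Backend("jvm-2", 20),
                new Backend("jvm-3", 35));

        // Round-robin starts with jvm-1 even though it is the busiest server.
        System.out.println("round-robin: " + new RoundRobinPicker(cluster).pick().name);
        // Least-busy routes the same request to jvm-2.
        System.out.println("least-busy:  " + new LeastBusyPicker(cluster).pick().name);
    }
}
```

In this model the round-robin picker keeps sending every third request to `jvm-1` no matter how many worker threads it already has tied up, which matches the overload pattern described above.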
Ready to Deploy Just in Time for the Holiday Shopping Season
As you may have noticed from the timeframe of these charts, these tests were done just before the holiday shopping season of 2013. It was critical for this eCommerce customer to get this feature out and be confident that the system could handle the expected load. Testing not only revealed problems that could be fixed before release but also allowed the customer to find the correct deployment settings, which it can apply directly to production.
If you also want to analyze memory issues like these and their impact on application performance, you can download the 15-day free trial of dynaTrace.