Andreas Grabner About the Author

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi

Fix Memory Leaks in Java Production Applications

Adding more memory to your JVMs (Java Virtual Machines) might be a temporary solution to fixing memory leaks in Java applications, but it for sure won’t fix the root cause of the issue. Instead of crashing once per day it may just crash every other day. “Preventive” restarts are also just another desperate measure to minimize downtime – but – let’s be frank: this is not how production issues should be solved.

One of our customers – a large online retail store – ran into such an issue. They run one of their online gift card self-service interfaces on two JVMs. Especially during peak holiday seasons – when users are activating their gift cards or checking the balance – crashes due to OOM (Out Of Memory) were more frequent which caused bad user experience. The first “measure” they took was to double the JVM Heap Size. This didn’t solve the problem as JVMs were still crashing, so they followed the memory diagnostics approach for production as explained in Java Memory Leaks to identify and fix the root cause of the problem.

Before we walk through the individual steps, let’s look at the memory graph that shows the problems they had in December during the peak of the holiday season. The problem persisted even after increasing the memory. They could fix the problem after identifying the real root cause and applying specific configuration changes to a 3rd party software component:

After identifying the actual root cause and applying necessary configuration changes did the memory leak issue go away? Increasing Memory was not even a temporary solution that worked.

After identifying the actual root cause and applying necessary configuration changes did the memory leak issue go away? Increasing Memory was not even a temporary solution that worked.

Step 1: Identify a Java Memory Leak

The first step is to monitor the JVM/CLR Memory Metrics such as Heap Space. This will tell us whether there is a potential memory leak. In this case we see memory usage constantly growing resulting in an eventual runtime crash when the memory limit is reached.

Java Heap Size of both JVMs showed significant growth starting Dec 2nd and Dec 4th resulting in a crash on Dec 6th for both JVMs when the 512MB Max Heap Size was exceeded.

Java Heap Size of both JVMs showed significant growth starting Dec 2nd and Dec 4th resulting in a crash on Dec 6th for both JVMs when the 512MB Max Heap Size was exceeded.

Step 2: Identify problematic Java Objects

The out-of-memory exception automatically triggers a full memory dump that allows for analysis of which objects consumed the heap and are most likely to be the root cause of the out-of-memory crash. Looking at the objects that consumed most of the heap below indicates that they are related to a 3rd party logging API used by the application.

Sorting by GC (Garbage Collection) Size and focusing on custom classes (instead of system classes) shows that 80% of the heap is consumed by classes of a 3rd party logging framework

Sorting by GC (Garbage Collection) Size and focusing on custom classes (instead of system classes) shows that 80% of the heap is consumed by classes of a 3rd party logging framework

A closer look at an instance of the VPReportEntry4 shows that it contains 5 Strings – with one consuming 23KB (as compared to several bytes of other string objects).This also explains the high GC Size of the String class in the overall Heap Dump.

Individual very large String objects as part of the ReportEntry object

Individual very large String objects as part of the ReportEntry object

Following the referrer chain further up reveals the complete picture. The EventQueue keeps LogEvents in an Array which itself keeps VPReportEntrys in an Array. All of these objects seem to be kept in memory as the objects are being added to these arrays but never removed and therefore not garbage collected:

Following the referrer tree reveals that global EventQueue objects hold on to the LogEvent and VPReportEntry objects in array lists which are never removed from these arrays

Following the referrer tree reveals that global EventQueue objects hold on to the LogEvent and VPReportEntry objects in array lists which are never removed from these arrays

Step 3: Who allocates these objects?

Analyzing object allocation allows us to figure out which part of the code is creating these objects and adding them to the queue. Creating what is called a “Selective Memory Dump” when the application reached 75% Heap Utilization showed the customer that the ReportWriter.report method allocated these entries and that they have been “living” on the heap for quite a while.

It is the report method that allocates the VPReportEntry objects which stay on the heap for quite a while

It is the report method that allocates the VPReportEntry objects which stay on the heap for quite a while

Step 4: Why are these objects not removed from the Heap?

The premise of the 3rd party logging framework is that log entries will be created by the application and written in batches at certain times by sending these log entries to a remote logging service using JMS. The memory behavior indicates that – even though these log entries might be sent to the service, these objects are not always removed from the EventQueue leading to the out-of-memory exception.

Further analysis revealed that the background batch writer thread calls a logBatch method which loops through the event queue (calling EventQueue.next) to send current log events in the queue. The question is whether as many messages were taken out of the queue (using next) vs put into the queue (using add) and whether the batch job is really called frequently enough to keep up with the incoming event entries. The following chart shows the method executions of add,  as well as the call to logBatch highlighting that logBatch is actually not called frequently enough and therefore not calling next to remove messages from the queue:

The highlighted area shows that messages are put into the queue but not taken out because the background batch job is not executed. Once this leads to an OOM and the system restarts it goes back to normal operation but older log messages will be lost.

The highlighted area shows that messages are put into the queue but not taken out because the background batch job is not executed. Once this leads to an OOM and the system restarts it goes back to normal operation but older log messages will be lost.

Step 5: Fixing the Java Memory Leak problem

After providing this information to the 3rd party provider and discussing with them the number of log entries and their system environment the conclusion was that our customer used a special logging mode that was not supposed to be used in high-load production environments. It is like running with DEBUG log level in a high load or production environment. This overwhelmed the remote logging service and this is why the batch logging thread was stopped and log events remained in the EventQueue until the out of memory occurred.

After making the recommended changes the system could again run with the previous heap memory size without experiencing any out-of-memory exceptions.

The Memory Leak issue has been solved and the application now runs even with the initial 512MB Heap Space without any problem.

The Memory Leak issue has been solved and the application now runs even with the initial 512MB Heap Space without any problem.

They still use the same dashboards they have built to troubleshoot this issue, to monitor for any future excessive logging problems.

These dashboards allow them to verify that the logging framework can keep up with log messages after they applied the changes.

These dashboards allow them to verify that the logging framework can keep up with log messages after they applied the changes.

Conclusion

Adding additional memory to crashing JVMs is most often not a temporary fix. If you have a real Java memory leak it will just take longer until the Java runtime crashes. It will even incur more overhead due to garbage collection when using larger heaps. The real answer to this is to use the simple approach explained here. Look at the memory metrics to identify whether you have a leak or not. Then identify which objects are causing the issue and why they are not collected by the GC. Working with engineers or 3rd party providers (as in this case) will help you find a permanent solution that allows you to run the system without impacting end users and without additional resource requirements.

Next Steps

If you want to learn more about Java Memory Management or general Application Performance Best Practices check out our free online Java Enterprise Performance Book. Existing customers of our APM Solution may also want to check out additional best practices on our APM Community.

Comments

  1. Quite an interesting article. Would you like to know what tools where used to generate the graphs and figure out the issue?

    • They use Compuware/dynaTrace as their APM (Application Performance Management) solution. Memory Leak Detection is part of that solution. If you look at some of our other blogs you can see what other customers do with the same product.

  2. I was expecting a better example. The above is too easy to detect.

    • If it would be that easy than these things wouldn’t make it into production causing huge problems in big enterprise applications. I agree with you that these problems are easy to find – also thanks to the available tooling. What we find with our engagements is that a lot of companies always struggle with similar issues (some may call it easy ones to find or low hanging fruit). Still these problems are out there everywhere. Thats why we blog about it. We have some other interesting memory related best practices in some of or other blogs – or – on our online book that my colleague Michael Kopp wrote. Its the chapter “Memory Management” – starting with

  3. We were able to quickly identify a Permgen memory issue which is much harder to figure out than Heap memory. Able to see the classloading and identify the problem package, this all in matter of minutes if not hours, as we were not able to identify manually.

  4. This is very good step by step explanation for people like me who are just trying to tread the waters with Java GC.

Comments

*


+ eight = 17