100% Performance Overhead by WebSphere Activity Log: When Dev is Not Aware of Settings in Production
The potential overhead and performance impact of log outputs has been discussed in several of our previous blog posts. In this new case we came across a severe response time impact on WebSphere caused by incorrect usage of a global shared logging service. The following two graphs show average and maximum response time of one of our customer’s key online service pages. Spikes up to 500s (for one of their self-service portals) turned out to be caused by synchronization issues related to IBM WebSphere’s activity log service:
A team of performance engineers looked into this problem and identified the issue to be caused by two things: a) usage of the Root Logger as the general application logger and b) having a log level of INFO on the production environment.
After talking with the engineering team it turned out that the members used custom Logger class on purpose instead of a log framework such as Log4J. They never intended these info log messages to end up in a log file that’s why they never thought about any overhead in writing these log messages to disk. Due to the way WebSphere’s activity log service was setup all of these info log messages ended up in the synchronized log file. Due to the high number of info messages and the fact that the activity log file is synchronized across all JVMs in the cluster this “unintended log spam” resulted in 95% synchronization time and therefore negatively impacting end user response time.
After fixing the problem they reduced the execution time of their main method that generates log output by 50% on average. As this method is called several thousand times per minute they reduced Response Time by 8.3 hours per day as visualized in the following formula:
Let’s have a closer look on how they identified the root cause of this problem to be the WebSphere’s activity log, what its short and long term approach of fixing is and how this could have been avoided by applying performance and architectural best practices as discussed in our recent DevOps-related already during development.
Step 1: From Impacted Users to Root Cause
Monitoring Response Time of critical pages is something that this customer already had in place as we could already see from the response time chart in the intro section of this blog. In its case both the average and maximum response time increased drastically during the busy early morning hours. It is clear that this is a problem affecting all of their users. Focusing on the specific timeframe and the slow transactions at that time made it easy to identify that all of these transactions spent more than 95% of their time in synchronization:
Finding # 1: Log Spam
Analyzing which methods spent their time in synchronization shows that it is IBM’s SharedLogWriter.logMessage method. But who calls this method? Analyzing the caller chain (those methods that end up calling logMessage) allows us to identify the actual custom code that is responsible for the calls to logMessage and the high sync time. The following screenshot shows that all this time is spent in calls originating from two custom method implementations who call WsLogger.info – which means that they actually log INFO log messages in a production environment:
Finding #2: Shared Synchronized Log across all JVMs
Besides the Sync Time Hotspot they could also observe an unusual high number of SharedLogLockException’s thrown in those web requests that also have the calls to logMessage. The exception message indicate that the log file in question is the shared activity.log file used by the IBM Service Log. It is a shared log file across all JVMs on the same node. It seems that the custom application code logs a high number of INFO log messages that are written to the same shared activity.log file. As this log file is synchronized across all JVMs we have the root cause of the high sync time as well as the exceptions:
Finding #3: Application uses RootLogger
Analyzing a Java memory dump by focusing on Logger objects on the heap showed that there were only two logging objects (java.util.Logger) which both reference the RootLogger object as parent logger. This means that everything that is logged in the application will be handled by the RootLogger.
Finding #4: WebSphere uses RootLogger callback for activity log
Looking at the architecture diagram of WebSphere’s Logging implementation shows that their activity log registers a logging handler with the Root Logger. This means that EVERY LOG message that is logged through the Root Logger will be written to the synchronized activity.log file. Having a lot of custom application log output will therefore result in high synchronization overhead.
Finding #5: 99% Performance Contribution of Method that writes log output
Now that we know that every custom log message will end up in the synchronized activity log it is time to look at the actual performance impact on those methods that use the log. The following chart shows the time spent in printSPParams – one of the two custom methods that we identified earlier. The method contributes about 99% to the overall end user response time and – to not much surprise – most of this time is spent in sync:
Finding #6: Overhead of Lock Exceptions
Another interesting performance metric was the number of times this method gets called as well as the number of Lock Exceptions thrown. This shows the real extend of this problem over time:
You may also want to read up on the Performance Impact of Exceptions and why it is important to eliminate any exception that is not necessary. Now let’s see what they did to fix this problem.
Step 2: Short and Long Term Fix Approach
As pointed out in the introduction of this blog – the intention of the engineering team was not to have these info log messages being written to disk. The team members decided to go with a custom Logger instead of their otherwise used Log4J logging framework. Due to the way WebSphere’s activity log works their approach ended up causing this performance problem.
In collaboration, the teams came up with a short and long term solution for this problem. Short Term was to disable the activity log feature. This allowed them to proof that the activity log is really the root cause of this performance problem. The following charts proofs this as the number of thrown lock exceptions drops to zero and average execution time of printSPParam drops by 50% which results in calculated savings of 8.3 hours of response time per day:
The Long Term fix for this problem is to change the application code by using a different Logger and not the Root Logger as they also want to leverage activity log for its intended purpose.
Another possible change is to change the activity logs log level to something higher than INFO. In a production environment it is a best practice to focus on critical log output such as warning, severe, error or fatal – but not producing large log files with information that nobody needs anyway. The log level issue however only turned out to be a problem because of the side effects the logging strategy of the engineers had.
Step 3: Preventing by applying Performance Best Practices
Excessive logging is a common problem pattern we often see when applications are moved from test to production. Our blog Top Performance Problems when moving from Test to Production: Excessive Logging covers additional scenarios of things that are often forgotten, e.g: running with incorrect log level or enabling verbose log options of application servers.
Preventing these problems can be done looking at Key Performance Indicators (KPIs) in development and it test before deploying an application to production:
- Number of log entries written
- Size of log files
- Number of Exceptions
- Response Time of specific Methods
The following screenshot shows these KPIs analyzed per executed unit test in your continuous integration environment. Regressions on these metrics allow you to immediately address e.g: log spamming or performance degradations of key methods:
Performance Focus DevOps to the Rescue!
We strongly believe that closer collaboration and active data and tool sharing between Dev, Test and Ops help to prevent problems like this:
- Developers need insight on the impact of their log activities when it gets deployed to a large production environment. It is important that their continuous test environment uses production like settings which would show that they generate large log files for even small transactions.
- Testers need access to production like environments to test the impact of synchronized resources in large JVM clusters and making sure that apps are tested with production-ready settings (correct log levels, production service features such as activity log, …)
- Operations need to provide technical deep enough data for Dev and Test to analyze problems like this in case they make it in production.
Interested in more? Check out some of our ideas we have to level-up R&D organizations: I am not Scared of DevOps and You Shouldn’t be Either or DevOps is Changing the M in APM.