Why wait for the disaster to happen?
A couple of weeks ago we helped a customer get more visibility into his production environment with Compuware dynaTrace. He was already using it in other environments, so he knew what he was getting. Anyway, a couple of hours later I read a tweet from him: “Just hooked up dynaTrace. Can’t wait for tomorrows traffic”. That got me thinking. A large part of monitoring production systems, verifying deployments or verifying fixes is essentially to wait and see if a problem occurs. Why should we wait? Why not identify problems before disaster strikes?
Step #1: Identify Production Stability Issues or Slow Downs
The number one use case for Operations is of course to make sure the system is stable and available. Typically IT operators have various metrics across the whole application landscape, coupled with a myriad of alerts that should notify them of anything that might cause the system to become unstable. You might have noticed the very vague nature of this statement. It’s because that’s what it usually is. Some solutions provide fancy and complex event processing engines with the goal of identifying abnormal behavior. In reality, most problems are still caught in one of two ways:
- Crash: All of a sudden the proverbial sh?# hits the fan, transaction failures rise, JVMs run out of memory, exceptions fill up the log file, pages become unavailable… you know, your phone goes crazy and your lovely colleagues stand in your cubicle wanting to know what you intend to do
- Or even “better”, customers are calling to tell you that your service is slow or unavailable
All of our very sophisticated alerts do not seem to work (in many cases they spam our inbox and we ignore them). Now here is the good news: most instability issues lead to a slow down. To be more specific, they lead to:
- Higher volatility in response time
- More slow transactions than we normally see
- A shift of the response time median or the 75th percentile to the right (slower)
These changes in the response time distribution seem minor when looking at average response times only (See Why Averages Suck for more details), but these response time variations become statistically significant if you use a baseline approach that examines the overall distribution. It is important to understand that we would not simply react to slow requests, as there will always be some of those, but we are reacting to a behavioral change or overall degradation of the system. Quite often we can detect suspicious changes using this technique long before things really go south.
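To make the idea concrete, here is a minimal Python sketch of a percentile-based baseline check. It is not how any particular APM product implements baselining; the function names, the chosen percentiles and the 25% tolerance are assumptions made purely for illustration.

```python
from statistics import mean, quantiles

def distribution_summary(response_times_ms):
    """Return the median, 75th and 90th percentile of the response times."""
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    cuts = quantiles(response_times_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p75": cuts[74], "p90": cuts[89]}

def baseline_violations(baseline_ms, current_ms, tolerance=0.25):
    """List the percentiles that moved right by more than `tolerance`.

    `tolerance` is a relative threshold (0.25 = 25% slower than baseline);
    the value is an illustrative assumption, not a recommendation.
    """
    base = distribution_summary(baseline_ms)
    cur = distribution_summary(current_ms)
    return [name for name in base if cur[name] > base[name] * (1 + tolerance)]

# Hypothetical data: 90% of requests are fast, 10% were always slow.
baseline = [100] * 90 + [2000] * 10
# The bulk of the fast requests degraded from 100 ms to 150 ms.
current = [150] * 90 + [2000] * 10

print("average ratio:", mean(current) / mean(baseline))
print("violated percentiles:", baseline_violations(baseline, current))
```

In this made-up data set the average rises by only about 16%, which would slip under a 25% threshold, while the median and the 75th percentile both rise by 50% and are flagged.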
Step #2: Identify Root Cause of Instability
This question comes up again and again: What should we do once we have identified an instability or slow down?
If our system alerts us to baseline violations due to a change in the response time distribution or higher rate of failures — something has obviously changed! We can simply identify which component is responsible for that change. This is called Fault Domain Isolation. From an application view this will always be a particular application, service or communication point in the transaction flow; something slowed down, has an increased failure rate or changed its response time distribution.
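As a rough illustration of Fault Domain Isolation, the following Python sketch compares each component's response time contribution against its baseline and picks the one that changed the most. The component names, the data layout and the use of the median are assumptions for this example only.

```python
from statistics import median

def isolate_fault_domain(baseline, current):
    """Return (component, relative_change) for the component whose median
    response time contribution grew the most compared to its baseline.

    Both arguments map a component name to a list of per-transaction
    contributions in milliseconds (a hypothetical data layout).
    """
    changes = {}
    for component, base_samples in baseline.items():
        base = median(base_samples)
        cur = median(current[component])
        changes[component] = (cur - base) / base
    worst = max(changes, key=changes.get)
    return worst, changes[worst]

# Hypothetical transaction flow: web tier -> order service -> database.
baseline = {
    "web": [20, 22, 21],
    "order-service": [40, 42, 41],
    "database": [10, 11, 12],
}
current = {
    "web": [21, 22, 20],
    "order-service": [41, 43, 40],
    "database": [30, 35, 33],
}

component, change = isolate_fault_domain(baseline, current)
print(f"{component} changed by {change:.0%}")
```

A real system would also look at failure rates and the full distribution per component, but the principle is the same: find where in the transaction flow the change happened.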
Once we have identified that component, the next thing to do is identify the high-level root cause: network, infrastructure, JVM, application or middleware such as a database. The IT operator’s primary goal is not to understand what is happening, but, quite frankly, to bounce the respective component before the users or SLAs are affected and hope that that will solve the issue. That is one of the frustrating things for developers, but Operations needs to make sure that the system is stable again quickly; everything else comes a distant second.
In case of a network slow down or congestion, I will first check whether an unusual rise in transaction throughput is responsible or if the network interfaces on the specific hosts are maxed out.
In either case I can add another node to the cluster in question. If I can detect neither of the two potential root causes I will turn to my network guy.
In case of host/VM issues I turn it over to my infrastructure people to see if the VM is starving and needs adjustment.
If the JVM runs into garbage collection issues or the application gets erratic, it’s a little bit more complicated, but typically an Operations team would bounce the JVM (hopefully part of a cluster) to remedy the immediate issue and leave the analysis for later.
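The triage described above is essentially a small decision table. A sketch in Python might look like this; the enum, the function name and the returned action strings are invented for illustration and simply restate the paragraphs above.

```python
from enum import Enum

class Cause(Enum):
    NETWORK = "network"
    HOST = "host"
    JVM = "jvm"

def first_response(cause, throughput_spike=False, nic_saturated=False):
    """Return the operator's first action for a high-level root cause.

    `throughput_spike` and `nic_saturated` only matter for network issues.
    """
    if cause is Cause.NETWORK:
        if throughput_spike or nic_saturated:
            return "add another node to the cluster"
        return "escalate to the network team"
    if cause is Cause.HOST:
        return "hand over to the infrastructure team"
    if cause is Cause.JVM:
        return "bounce the JVM and analyze later"
    raise ValueError(f"unknown cause: {cause!r}")

print(first_response(Cause.NETWORK, nic_saturated=True))
print(first_response(Cause.JVM))
```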
From a developer or architect’s perspective, bouncing a JVM sounds quite harsh and something that generates impact (thus defeating the intent of preventing something bad from happening). From an operations perspective however, bouncing a JVM is fast, easy and, as long as the system is clustered and not 100% utilized, there is no end user impact.
The downside of this is of course that we cannot analyze the root cause in the actual live system. This is why we want to capture all transactions and correlated system metrics, 24×7. Otherwise we will never be able to understand what happened and will never be able to prevent the issue from happening again!
If the bounce doesn’t help, or is only a very short-lived solution, Operations would typically trigger an immediate analysis to be done by the owners of the troublesome component. Of course we should do that analysis even if the bounce seemed to help, but arguably it is then not an immediate necessity.
Step #3: Verify temporary “Fix/Workaround”
Assuming the DBA fixed the DB or the App Server guy bounced the JVM, we want to see whether the “fix” actually worked. Usually we would watch response time or other metrics and wait for the problem to occur again. This is where things can be optimized. Why wait? First of all, a distribution/percentile-based Baselining approach would know nearly immediately whether the response time distribution or error distribution goes back to normal.
The second step is to verify that the triggering service is behaving normally again. For that we can compare the transactional detail data from before the violation with the data during the violation to see the difference. Once we know that difference we can also immediately identify if those changes in behavior are gone after the fix. However, if the behavioral changes persist, then the fix did not work, even if we currently see no symptoms in the response time, and the problem will happen again. Now it is time for a more permanent fix.
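Here is a minimal Python sketch of that before/during/after comparison. The metric names, the single-number-per-metric data layout and the 20% threshold are assumptions for illustration; real transaction detail data would be far richer.

```python
def changed_metrics(before, other, threshold=0.2):
    """Names of metrics that moved by more than `threshold` (relative)
    compared to `before`. Metrics with a zero baseline are skipped."""
    return {
        name
        for name, base in before.items()
        if base and abs(other[name] - base) / base > threshold
    }

def fix_verified(before, during, after, threshold=0.2):
    """Return (ok, still_off).

    `ok` is True only if every metric that changed during the violation
    is back near its pre-violation value after the fix.
    """
    suspects = changed_metrics(before, during, threshold)
    still_off = changed_metrics(before, after, threshold) & suspects
    return not still_off, sorted(still_off)

# Hypothetical per-transaction averages for the triggering service.
before = {"sql_calls": 5, "cpu_ms": 40, "response_ms": 120}
during = {"sql_calls": 60, "cpu_ms": 42, "response_ms": 900}
after_bounce = {"sql_calls": 58, "cpu_ms": 41, "response_ms": 125}

print(fix_verified(before, during, after_bounce))
```

In this invented example the response time looks healthy again after the bounce, but the number of SQL calls per transaction is still far above its old value, so the check reports that the underlying behavioral change is still there.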
Step #4: “Permanently Fix Application Issues”
This is where it usually gets tricky. We had a production issue, and we assume or know it is the application’s fault and the operations guys bounced it. How can we analyze the issue in order to fix the problem?
This is one of the main reasons why you want to have an APM system that provides full coverage (visibility and transaction capture) 24×7 and not just for slow transactions. If you have detailed data, you can use this to analyze and fix the issue even if the system itself got bounced. No more reading through tons of log files and no more checking different obscure scenarios that might lead to the observed symptoms. Instead we work based on facts!
It is also worth noting that just looking at the problematic transactions is not good enough. First off, it is easier to compare normal against abnormal transaction executions and understand the difference than it is to analyze bad ones and figure out what makes them bad! Secondly, the “slow” transactions might not even be a significant problem! In the real world every application has slow transactions. I am not saying we shouldn’t “fix” those, but they were always slow and as such are not our immediate concern! These slow transactions did not trigger the baseline violation, in other words they did not change! So we want to look at the bulk of normal transactions that degraded or became erratic, and for that we need to analyze and compare those instead of the “slow” ones! This is the reason why performance fixes quite often do not have the desired effect. The fix improved something, but not the issue at hand.
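A small Python sketch shows the distinction between transactions that were always slow and transactions that degraded. The transaction names, the 1000 ms "slow" limit and the 25% tolerance are hypothetical values picked for the example.

```python
from statistics import median

def classify(baseline, current, slow_ms=1000, tolerance=0.25):
    """Label each transaction type as 'degraded', 'always slow' or 'normal'.

    A type is 'degraded' if its median moved right by more than `tolerance`
    relative to its own baseline, regardless of how fast it is in absolute
    terms. It is 'always slow' if it was already above `slow_ms` in the
    baseline and did not change.
    """
    labels = {}
    for name, base_samples in baseline.items():
        base = median(base_samples)
        cur = median(current[name])
        if cur > base * (1 + tolerance):
            labels[name] = "degraded"
        elif base >= slow_ms:
            labels[name] = "always slow"
        else:
            labels[name] = "normal"
    return labels

# Hypothetical response times in milliseconds per transaction type.
baseline = {
    "report": [3000, 3200, 3100],
    "search": [200, 210, 190],
    "login": [80, 90, 85],
}
current = {
    "report": [3100, 3000, 3200],
    "search": [400, 420, 390],
    "login": [82, 88, 85],
}

print(classify(baseline, current))
```

Sorting by absolute response time would point us at the report, which has not changed at all. Comparing each transaction type against its own baseline points us at the search instead, and that is the one that triggered the violation.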
By comparing what changed in user input, execution path and resource usage (CPU, database, I/O…), it is often quite easy to identify the root cause and subsequently fix the issue.
Step #5: “Deploy and Verify Fix In Production”
Finally, we deploy the fix or a new version in production. Consequently we need to verify that the fix worked as intended. In many cases organizations rely on QA and testing to do their work. While this is of course necessary, it should not be considered a blank check.
I am not advocating that you should skip QA, but in many cases it is not enough, because QA can only simulate production environments or more specifically what they think is reality. More and more organizations simply deploy too often, weekly or even daily, and do not have the luxury of day- or week-long QA cycles that would be needed to verify every possible impacted scenario. Even when thinking more traditionally, important patches cannot be delayed for days or weeks. Nevertheless we need to be sure that the application is stable in production. Traditionally operations and application owners would watch response time and other high-level aggregated averages for any obvious signs that the patch didn’t work.
The distribution-based baselining algorithm again works nicely and does a better job. We would immediately see the new distribution and get a better high-level understanding of whether the fix had the intended impact on the end user experience.
At the same time we capture all transactions involving the patched component and monitor its behavior. The architect and developers can analyze the captured transactions and see if the patch leads to the desired run-time behavior. This is a huge step forward compared to how things used to be done (or are still done in many organizations). By looking at the actual run-time behavior down to communication flows, SQLs and even the method level, we can verify that the application is behaving as expected. Watching aggregated metrics, by contrast, only tells us that we currently see no negative impact, with no further detail!
If we did fix specific things (e.g. race conditions, synchronizations, memory, pooling) we can define appropriate metrics and thresholds for long term trending. This way we are able to verify immediately that the patch works, or conversely that it doesn’t.
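As a sketch of what such a metric and threshold could look like, the following Python snippet checks a daily series both against a hard threshold and for a creeping upward trend. The metric (heap in use after garbage collection), the numbers and the function names are assumptions for illustration only.

```python
def slope(values):
    """Least-squares slope of `values` against their index (units per step)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def check_trend(daily_values, threshold, max_slope):
    """Report threshold breaches (by day index) and whether the metric
    grows faster than `max_slope` per day."""
    breaches = [day for day, v in enumerate(daily_values) if v > threshold]
    return {"breaches": breaches, "creeping": slope(daily_values) > max_slope}

# Hypothetical metric: heap in use after GC (MB), one value per day
# since a memory leak fix was deployed.
fixed = [510, 512, 509, 511, 510]
still_leaking = [500, 540, 580, 620, 660]

print(check_trend(fixed, threshold=800, max_slope=5))
print(check_trend(still_leaking, threshold=800, max_slope=5))
```

The second series never crosses the threshold during the observed days, yet the trend check already shows that the patch did not solve the problem.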
Modern monitoring and APM solutions enable Operations to take a much more proactive approach. By capturing deep transaction detail, for all transactions 24×7 with all the facts instead of just monitoring symptoms, an organization can react before the customer is impacted, understand what is going on without preventing Ops from implementing temporary solutions like bouncing, speed up fix time dramatically and, maybe most important of all, verify whether the investments were worth the effort. This not only improves the quality of the application, but also reduces maintenance costs. In the final analysis, being proactive like this will also reduce operations cost and prevent loss of revenue by helping you avoid many a disaster instead of waiting for it to happen.