Application Performance Monitoring in production – Step by Step Guide – Measuring a distributed system
Last time I explained logical and organizational prerequisites to a successful production level application performance monitoring. I originally wanted to look at the concrete metrics we need on every tier, but was asked how you can correlate data in a distributed environment, so this will be the first thing that we look into. So let’s take a look at the technical prerequisites of successful production monitoring.
Collecting data from distributed environment
The first problem that we have is the distributed nature of most applications. In order to isolate response time problems or errors we need to know which tier and component is responsible. The first step is to record response times on every entry and exit from a tier.
The problem with this is twofold. First the externalJira tier will host multiple different Services which will have different characteristics. This is why we need to measure the response time on that service level and not just on the tier level. We need to do this on both sides of the fence, otherwise we will run into an averaging problem. The second problem is that externalJira is called from different other tiers and not just one.
When we look at the picture we see that externalJira is called from three different tiers. These tiers sometimes call the same services on externalJira, but with vastly different parameters which leads to different response times of externalJira. We have a double averaging problem:
- different tiers calling different services on externalJira skewing the average
- different tiers calling the same service on externalJira with different parameters skewing the average
Let’s look at this in a little more detail with following example
|Tier Entry Point||Avg. Tier Response Time||Service 1 Response Time||Service 2 Response Time||Service 3 Response Time|
|100x Payment 1||1000ms||305ms||403ms||102ms|
|200x Payment 2||1500ms||607ms||151ms||202ms|
|300x Service 1||500ms||101ms|
|600x Service 2||166ms||52ms|
|400x Service 3||89ms|
In this table we see which Tier Entry Point calls which Services on other tiers. The Payment 1 Service calls Services 1-3 and measures the response time on its side. The Payment 2 Service calls the same three services but with very different response times. When we look at the times measured on Services 1-3 respectively we will see a completely different timing. We did measure the response time of Service 1-3 irrespective of their calling context and ended up with an average! Service 1 does not contribute 500ms to the response times of either Payment 1 or 2, but the overall average is 500 ms. This average becomes more and more useless the more tiers we add. One of our biggest customers hits 30 JVMs in every single transaction. In such complex environments quick root cause isolation is nearly impossible if you only measure on a tier by tier basis.
In order to correlate the response times in a complex system we need to retain the transaction context of the original caller. One way to solve this is to trace transactions, either by using a monitoring tool that can do that or by modifying code and build it into the application.
Correlating data in a distributed environment
HTTP uses a concept called referrer that enables a webpage to know from which other pages it was called. We can use something similar and leverage this to do our response time monitoring. Let’s assume for the moment that the calls done in our imaginary Application are all WebService HTTP calls. We can then either use the referrer tag or some custom URL query parameter to track the calling services. Once that is achieved we can track response time based on that custom property. Many Monitoring tools allow you to segregate response time based on referrer or query parameters. Another possibility, as always, is to report this yourself via your own JMX Bean. If we do that we will get response that is context aware.
|Tier Entry Point||Referrer||Avg. Tier Response Time||Service 1 Response Time||Service 2 Response Time||Service 3 Response Time|
|100x Payment 1||1000ms||305ms||403ms||102ms|
|200x Payment 2||1500ms||607ms||151ms||202ms|
|100x Service 1||Payment 1||300ms||30ms|
|200x Service 1||Payment 2||600ms||100ms|
|100x Service 2||Payment 1||400ms||52ms|
|200x Service 2||Payment 2||150ms|
|100x Service 2||Payment 1, Service 1||25 ms|
|200x Service 2||Payment 2, Service 1||90 ms|
|100x Service 3||Payment 1, Service 2||50 ms|
We now see that Service 2 only calls Service 3 when it is called directly from Payment 1, which means its contribution is far less than the original table suggested.We also still see a difference in the request and the response time of the services. This is due to the involved network communication. By measuring the the response time context aware we can now also see the time that we spend in the communication layer more clearly, which enables us to isolate network bottlenecks and their impact. The average response time table did not allow us to do that.
We can push this to any level we want. E.g. we can divide the Payment 1 WebService call into its three variants supported by our shop: Visa, MasterCard, AMEX. If we push this as a tag/referrer down the chain we get an even more detailed picture of where we spend time.
The problem with this approach is that it is not agnostic to your application or the remoting technology. It requires you to change your code base and monitoring the different response times becomes more complicated with every tier you add. Of course also need to maintain this alongside the actual application features, which increases cost and risk.
This is where professional APM Tools come in. Among other things they do transaction tracing and tagging transparently without code changes. They can also split measure response time in a way that is context aware; they can differentiate between an AMEX and a Visa CreditCard payment via Business Transactions. And finally they allow you to focus on the entry response time, in our case Payment 1 and Payment 2. In case you have a problem, you can drill down to the next level from there. So there is no need to keep an eye on all the deeper level response times.
Beyond response time
By having the response time distribution across services and tiers we can quickly isolate the offending tier/service in case we face a performance problem. I stated before that monitoring must not only allow us to detect a problem but also isolate the root cause. To do this we need to measure everything that can impact the response time either directly or indirectly. In generally I like to distinguish between usage, utilization and impact measures.
Usage and Utilization Measurement
A usage measure describes how much a particular application or transaction uses a particular resource. A usage metric can usually be counted and is not time based. An exception is the maybe best known usage measure of CPU time. But CPU time is not really time based, it is based on cpu cycles; and there is a limited number of cpu cycles that can be executed in a specific time. We can directly measure how much CPU time is consumed by our request, by looking at the threads consumed CPU time. In addition we can measure the CPU usage on the process and system level. Most of the time we are measuring a limited resource and as such we also have a utilization measure, e.g. the CPU utilization of a system. Other examples include the number of database calls of a transaction or the connection pool usage. What is important is that the usage is a characteristic of the transaction and does not increase if performance goes down. If the specific resource is fully utilized, we have to wait for it, but then will use the same amount we always do!
To illustrate that we can again think about the CPU time. An AMEX Credit Card payment Transaction will always use roughly the same amount of CPU time (unless there is an severe error in your code). If the CPU is utilized the response time will go up, because it has to wait for CPU, but the amount of CPU time consumed will stay the same.This is what is illustrated in the chart. The same should be true if you measure the amount of database statements executed per transaction, how many webservices were called or how many connections were used. If a usage measure has a high volatility then you either are not measuring on a granular enough business transaction level (e.g. AMEX and Visa Payment may very well have different usage measures) or it is an indicator for an architectural problem within the application. This in itself is, of course, useful information for the R&D department. The attentive reader will note that caches might also lead to volatile response times, but then they should only do so during the warmup phase. If we still have high volatility after that due to the cache, then the cache configuration is not optimal.
The bottom line is that a usage measure is ideally suited to measure which sort of transactions utilize your system and resources the most. If one of your resources reaches 100% utilization you can use this to easily identify which transactions or applications are the main contributors. With this information you can plan capacity properly or change the deployment to better distribute the load. Usage measures on a transaction level are also the starting point for every performance optimization activity and are therefore most important for the R&D department.
Unfortunately the very fact that makes a usage measure ideal for performance tuning, makes it unsuitable for troubleshooting in case of scalability or performance problems in production. If the connection pool is exhausted all the time, we can assume that it has a negative impact on performance, but we do not know which transactions are impacted. Turning this around means that if you have a performance problem with a particular transaction type you can not automatically assume that the performance would be better if the connection pool were not exhausted!
The response time will increase, but all your transaction usage measures will stay the same, so how do you isolate the root cause?
In contrast to usage measures, impact measures are always time based. We measure the time that we have to wait for a specific resource or a specific request. An example is the getConnection method in case of a database connection pool. If the connection pool is exhausted the getConnection call will wait until a connection is free. That means if we have a performance problem due to an exhaustion of the database connection pool, we can measure that impact by measuring the getConnection method. The important point is that we can measure this inside the transaction and therefore know that it negatively impacts the AMEX , but not the Visa transactions. Other examples are the execution time of a specific database statement. If the database slows down in a specific area we will see this impact on the AMEX transaction by measuring how long it has to wait for its database statements.
When we take that thought further we will see that every time a transaction calls a different service, we can measure the impact that the external service has by measuring its response time at the calling point. This closes the cycle to our tier response times and why we need to measure them in a context-aware fashion. If we would only measure the overall average response time of that external service we would never know the impact it has on our transaction.
This brings us to a problem that we have with impact measurement on the system level and in general.
Lack of transaction context
In an Application Server we can measure the utilization of the connection pool, which tells us if there is a resource shortage. We can measure average wait time and/or the average number of threads waiting on a connection from the pool, which similar to the Load Average tells us that the resource shortage does indeed have an impact on our application. But both the usage and the impact measure lack transaction context. We can correlate the measurements on time basis if we know which transactions use which resources, but we will have to live with a level of uncertainty. This forces us to do guesswork in case we have a performance problem. Instead of zeroing in on the root cause quickly and directly, this is the main reason that trouble shooting performance problems takes a long time and lots of experts. The only way to avoid that guesswork is to measure the impact directly in the transaction, by either modifying the code or use a motoring tool that leverages byte-code injection and provides a transaction context.
Of course there are some things that just cannot be measured directly. The CPU again is a good example. We cannot directly measure that we wait for a CPU to be assigned to us, at least not easily. So how do we tackle this? We measure the usage, utilization and indirect impact. In case of the CPU the indirect impact is measured via the Load Average. If the load average indicates that our process is indeed waiting for CPU we need to rule out any other cause for the performance problem. We do this by measuring the usage and impact of all other resources and services used by the application. To quote Sherlok Holmes: ” If all directly measured root causes can be ruled out than the only logical conclusion is that the performance problem is caused by whatever resource cannot be directly measured.” In other words if nothing else can explain the increased response time, you can be sufficiently certain that the CPU exhaustion is the root cause.
What about log files
As a last item I want to look at how to store and report the monitoring data. I was asked before whether log files are an option.The idea was to change the code and measure the application from inside (as hinted at several times by me) and write this to a log file. The answer is a definitive NO; log files are not a good option.
The first problem is the distributed nature of most applications. At the beginning I explained how to measure distributed transactions. It becomes clear that while you can write all this information to a log file periodically, it will be highly unusable, because you would have to retrieve all the log files, correlate the distributed transaction manually and on top of that correlate it with system and application server metrics taken during that time. While doable, it is a nightmare if we talk about more than two tiers.
The second problem is lack of context. If you only write averages to the log file you will quickly run into the averaging problem. One can of course refine this to no end, but it will take a long time to reach a satisfactory level of granularity and you will have to maintain this in addition to application functionality, which is what should really matter to you. On the other hand if you write the measured data for every transaction you will never be able to correlate the data without tremendous effort and will also have to face a third problem.
Both logging all of the measures and aggregating the measures before logging them will lead to overhead which will have a negative impact on performance. On the other hand if you only turn on this performance logging in case you already have a problem, we are not talking about monitoring anymore. You will not be able to isolate the cause of an already occurred problem until it happens again.
The same is true if you do that automatically and e.g. automatically start capturing data once you realize something is wrong. It sounds intuative, but it really means that you already missed the original root cause of why it is slow.
On the other hand log messages often provide valuable error or warning information that is needed to pinpoint problems quickly. The solution is to capture log messages the same way that we measure response times and execution counts. Within the transaction itself. This way we get the valuable log information within the transaction context and do not have to correlate dozens of log files manually.
In addition a viable Monitoring solution must store, aggregate and automatically correlate all retrieved measurements outside the monitored application. It must store it permanently, or at least for some time. This way you can analyze the data after the problem happened and do not have to actively wait until it happens again.
By now it should be clear why we need to measure everything that we can in the context of the calling transaction. By doing this we can create an accurate picture of what is going on. It enables us to rule out possible root causes for a problem and zero in on the real cause quickly. It also enables us to identify resource and capacity issues on an application and service level instead of just the server level. This is equally important for capacity planning and cost accounting.
As a next step we will look at the exact metrics we need to measure in each tier, how to interpret and correlate them to our transaction response time.