Michael Kopp About the Author

Michael is a Technical Product Manager at Compuware. Reach him at @mikopp

Why you really do Performance Management in production.

Often performance management is still confused with performance troubleshooting. Others think that performance management in production is simply about system and JVM level monitoring and that they are already doing APM.

The first perception assumes that APM is about speeding up some arbitrary method performance and the second assumes that performance management is just about discovering that something is slow. Neither of these two is what we at dynaTrace would consider prime drivers for APM in production. So what does it mean to have APM in production and why do you do it?

The reason our customers need APM in their production systems is to understand the impact that end-to-end performance has on their end users and therefore their business. They use this information to optimize and fix their application in a way that has direct and measurable ROI. This might sound easy but in environments that include literally thousands of JVMs and millions of transactions per hour, nothing is easy unless you have the right approach!

Therefore real APM in production answers the questions and solves problems such as the following:

  • How does performance affect the end users' buying behavior or the revenue of my tenants?
  • How is the performance of my search for a specific category?
  • Which of my 100 JVMs, 30 C++ business components and 3 databases are participating in my booking transaction, and which of them is responsible for my problem?
  • Enable Operations, Business and R&D to look at the same production performance data from their respective vantage points
  • Enable R&D to analyze production level data without requiring access to the production system

Gain End-to-end Visibility

The first thing that you realize when looking at any serious web application (pick any of the big e-commerce sites) is that much of the end user response time is spent outside the data center. Doing performance management on the server side only leaves you blind to all problems caused by JavaScript, CDNs, third parties or, in the case of mobile users, simply bandwidth.

Web Delivery Chain

As you are not even aware of these, you cannot fix them. Without knowing the effect that performance has on your users you do not know how performance affects your business. Without knowing that, how do you decide if your performance is ok?

This dashboard shows that there is a relationship between Performance and Conversion Rate

The primary metric on the end user level is the conversion rate. What End-to-End APM tells you is how application performance or non-performance impacts that rate. In other words, you can put a dollar number on response time and error rate!
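To illustrate how response time can be related to conversion rate, here is a minimal Python sketch. The visit records, the two-second bucket size and the function name `conversion_by_bucket` are assumptions made up for this example and are not taken from any product; a real APM solution would collect this data for you.

```python
from collections import defaultdict

# Hypothetical visit records: (page response time in seconds, did the visit convert?)
visits = [
    (0.8, True), (1.2, True), (1.9, False), (2.4, True),
    (2.7, False), (3.5, False), (4.1, False), (0.6, True),
]

def conversion_by_bucket(visits, bucket_size=2.0):
    """Group visits into response-time buckets and compute the conversion rate per bucket."""
    totals = defaultdict(int)
    converted = defaultdict(int)
    for response_time, did_convert in visits:
        bucket = int(response_time // bucket_size)
        totals[bucket] += 1
        if did_convert:
            converted[bucket] += 1
    # Key each result by the (lower, upper) bound of its bucket in seconds.
    return {
        (b * bucket_size, (b + 1) * bucket_size): converted[b] / totals[b]
        for b in sorted(totals)
    }

rates = conversion_by_bucket(visits)
for (low, high), rate in rates.items():
    print(f"{low:.0f}-{high:.0f}s: {rate:.0%} conversion")
```

Multiplying the drop in conversion rate between two buckets by the number of affected visits and the average order value gives a rough dollar figure for slow responses.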

Thus the first reason why you do APM in production is to understand the impact that performance and errors have on our users’ behavior.

Once you know the impact that some slow request has on your business, you want to zero in on the root cause, which can be anywhere in the web delivery chain. If your issue is on the browser side, the optimal thing to have is the exact click path of the affected users.

A visit's click path plus the PageAction PurePath of the first click

You can use this to figure out if the issue is in a specific server-side request, related to third-party requests or in the JavaScript code. Once you have the click path, plus some additional context information, a developer can easily use something like the AJAX Edition to analyze it.

If the issue is on the server side we need to isolate the root cause there. Many environments today encompass several hundred JVMs, CLRs and other components. They are big, distributed and heterogeneous. To isolate a root cause here you need to be able to extend the click path into the server itself.

From the Click path to Server Side

But before we look at that, we should look at the other main driver of performance management – the business itself.

Create Focus – It’s the Business that matters

One problem with older forms of performance management has been the disconnect from the business. It simply has no meaning for the business whether average CPU on 100 servers is at 70% (or whatever else). It does not mean anything to say that JBoss xyz has a response time of 1 second on webpage abc. Is that good or bad? Why should I invest money to improve that? On top of this we don't have one server but thousands, with thousands of different webpages and services all calling each other, so where should we start? How do we even know if we should do something?

The last question is actually crucial and is the second main reason why we do APM. We combine End User Monitoring with Business Transaction Management. We want to know the impact that performance has on our business, and as such we want to know if the business performance of our services is influenced by performance problems of our applications.

While End User Monitoring enables you to put a general dollar figure on your end user performance, business transactions go one step further. Let’s assume that the user can buy different products based on categories. If I have a performance issue I would want to know how it affects my best selling categories and would prioritize based on that. The different product categories trigger different services on the server side. This is important for performance management in itself as I would otherwise look at too much data and could not focus on what matters.

The Payment Transaction has a different path depending on the context

Business Transaction Management does not just label a specific web request with a name such as Booking, but really enables you to do performance management on a higher level. It is about knowing if and why the revenue of one tenant is affected by the response time of the booking transaction.

In this way Business Transactions create a twofold focus. They enable the business and management to set the right focus. That focus is always based on company success, revenue and ROI. At the same time Business Transactions enable the developer to exclude 90% of the noise from the investigation and immediately zero in on the real root cause. This is due to the additional context that Business Transactions bring. If only bookings via credit cards are affected, then diagnostics should focus on only these and not on all booking transactions. This brings me to the actual diagnosing of performance issues in production.
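The credit card example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transaction records, the field names (`name`, `payment`, `ms`) and the helper `avg_response_by_context` are hypothetical and do not reflect how any particular APM product stores its data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical captured transactions: business transaction name, a context
# attribute (payment method) and the response time in milliseconds.
transactions = [
    {"name": "booking", "payment": "credit_card", "ms": 4200},
    {"name": "booking", "payment": "credit_card", "ms": 3900},
    {"name": "booking", "payment": "invoice", "ms": 450},
    {"name": "booking", "payment": "invoice", "ms": 510},
    {"name": "search", "payment": None, "ms": 300},
]

def avg_response_by_context(transactions, name, context_key):
    """Average response time of one business transaction, split by a context attribute."""
    groups = defaultdict(list)
    for tx in transactions:
        if tx["name"] == name:
            groups[tx[context_key]].append(tx["ms"])
    return {context: mean(times) for context, times in groups.items()}

by_payment = avg_response_by_context(transactions, "booking", "payment")
for payment, avg_ms in by_payment.items():
    print(f"booking via {payment}: {avg_ms:.0f} ms on average")
```

Splitting by context shows at a glance that only credit card bookings are slow, so the search transactions and the invoice bookings can be excluded from the investigation.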

The Perfect Storm of Complexity

At dynaTrace we regularly see environments with several hundred or even over a thousand web servers, JVMs, CLRs and other components running as part of a single application environment. These environments are not homogeneous. They include native business components, integrations with, for example, Siebel or SAP, and of course the mainframe. These systems are here to stay and their impact on the complexity of today's environments should not be underestimated. Mastering this complexity is another reason for APM.

Today’s systems serve huge user bases and in some cases need to process millions of transactions per hour. Ironically most APM solutions and approaches will simply break down in such an environment, but the value that the right APM approach brings here is vital. The way to master such an environment is to look at it from an application and transaction point of view.

Monitoring of Service Level Agreements

SLA violations and errors need to be detected automatically and the data needed to investigate them must be captured, otherwise we will never be able to fix them. The first step is to isolate the offending tier and find out if the problem is due to the host, the database, the JVM, the mainframe, a third-party service or the application itself.
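As a rough sketch of these two steps, the following Python example first detects transactions that violate an SLA and then counts which tier contributed the most time to each violation. The SLA threshold, the tier names and the per-tier timings are invented for illustration; a real solution would derive them from captured transaction traces.

```python
from collections import Counter

SLA_MS = 2000  # hypothetical SLA: a transaction must finish within 2 seconds

# Hypothetical traces with the time in milliseconds spent in each tier.
traces = [
    {"id": "t1", "tiers": {"web": 120, "booking": 300, "credit_card": 2600, "database": 90}},
    {"id": "t2", "tiers": {"web": 110, "booking": 280, "credit_card": 150, "database": 95}},
    {"id": "t3", "tiers": {"web": 130, "booking": 310, "credit_card": 2900, "database": 100}},
]

def sla_violations(traces, sla_ms):
    """Return all traces whose total time across tiers exceeds the SLA."""
    return [t for t in traces if sum(t["tiers"].values()) > sla_ms]

def offending_tiers(violations):
    """Count, per tier, how often it was the largest contributor to a violation."""
    return Counter(max(t["tiers"], key=t["tiers"].get) for t in violations)

violations = sla_violations(traces, SLA_MS)
suspects = offending_tiers(violations)
print([t["id"] for t in violations])
print(suspects.most_common(1))
```

Picking the tier with the largest time share is a deliberately simple heuristic. It is enough to show the idea of narrowing many components down to one suspect, but it is not a full root cause analysis.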

Isolating the Credit Card tier as the root cause

Instead of seeing hundreds of servers and millions of data points, we can immediately isolate the one or two components that are responsible for the issue. Issues happening here cannot be reproduced in a test setup. This has nothing to do with a lack of technical ability; we simply do not have the time to figure out which circumstances lead to a problem. So we need to ensure that all the data we need for later analysis is available all the time. This is another reason why we do APM. It gives us the ability to diagnose and understand real-world issues.

Once we have identified the offending tier, we know whom to talk to and that brings me to my last point, collaboration.

Breaking the Language Barrier

Operations is looking at SLA violations and uptime of services, the business is looking at revenue statistics of sold products, and R&D is thinking in terms of response time, CPU cycles and garbage collection. It is a fact that these three teams speak completely different languages. APM is about presenting the same data in those different languages and thus breaking the barrier.

Another thing is that as a developer you rarely get access to the production environment, so you have a hard time analyzing the issues. Reproducing issues in a test setup is often not possible either. Even if you do have access, most issues cannot be analyzed in real time. In order to effectively share the performance data with R&D we first need to capture and persist it. It is important to capture all transactions and not just a subset. Some think that you only need to capture slow transactions, but there are several problems with this. Either you need to define what is slow or, if you have baselining, you will only get what is slower than before. The first is a lot of work and the second assumes that performance is fine right now. That is not good enough. In addition, such an approach ignores the fact that concurrency exists. Concurrently running transactions impact each other in numerous ways, and whoever diagnoses an issue at hand will need that additional context.

A typical Operations to Development conversation without APM

Once you have the data you need to share it with R&D, which most of the time means physically copying a scrubbed version of that data to the R&D team. While the scrubbed data must exclude things like credit card numbers, it must not lose its integrity. The developer needs to be able to look at exactly the same picture as operations. This enables better communication with operations while at the same time enabling deep dive diagnostics.
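One way to scrub data without losing its integrity is to replace each sensitive value with a stable token, so that the same card number always maps to the same placeholder and transactions can still be correlated. The sketch below is an assumption-laden illustration, not a description of how any APM product scrubs data: the function `scrub`, the salt and the simplified pattern (a plain run of 13 to 16 digits, with no spaces or dashes) are all made up for this example.

```python
import hashlib
import re

# Simplified assumption: a card number is a plain run of 13 to 16 digits.
CARD_PATTERN = re.compile(r"\b\d{13,16}\b")

def scrub(text, salt="hypothetical-salt"):
    """Replace card numbers with a stable token derived from a salted hash."""
    def replace(match):
        digest = hashlib.sha256((salt + match.group()).encode()).hexdigest()[:8]
        return f"<card:{digest}>"
    return CARD_PATTERN.sub(replace, text)

first = scrub("booking card=4111111111111111 amount=99")
second = scrub("refund card=4111111111111111 amount=20")
print(first)
print(second)
```

Because the token is derived from the card number, the booking and the refund above carry the same placeholder, which keeps the relationship between the two transactions visible to the developer while the number itself never leaves production.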

Once a fix has been supplied, operations needs to ensure that there are no negative side effects and will also want to verify that it has the desired positive effect. Modern APM solves this by automatically understanding the dynamic dependencies between applications and automatically monitoring new code for performance degradations.

Thus APM in production improves communication, speeds up deployment cycles and at the same time adds another layer of quality assurance. This is the final, but by far not least important reason we do APM.

Conclusion

The reason we do APM in production is not to fix a CPU hot spot, speed up a specific algorithm or improve garbage collection. Neither the business nor operations care about that. We do APM to understand the impact that the application's performance has on our customers and thus our business. This enables us to effectively invest precious development time where it has the most impact and thus further the success of the company. APM truly serves the business of a company and its customers by bringing focus to the performance management discipline.

My recommendation: If you do APM in production, and you should, do it for the right reasons.

Comments

  1. Very nice stuff.

    Thanks for sharing!

  2. Excellent article. It's a fact that most companies overlook APM in production due to budget concerns, lack of support and skills, lack of the right tools etc.
    RCA and resolution become easy if we engage a lightweight solution with maximum coverage.
