Evolving an APM Strategy for the 21st Century
I started in the web performance industry, well before Application Performance Management (APM) existed, at a time when external, single-page measurement ruled the land. In an ecosystem where no other solutions existed, it sat at the top of the data chain, supporting the rapidly evolving world of web applications. It was an effective approach, too, as most online applications were self-contained and, compared to the modern era, relatively simple in their design.
Soon, a new solution rose to the top of the ecosystem: the synthetic, multi-step business process, played back either in a browser or a browser simulator. By moving beyond single-page measurement, this more complex data collection methodology provided a view into the most critical business processes, delivering repeatable baseline and benchmark data that operations and business teams could use to track application health and identify when and where issues occurred.
These multi-step processes ruled the ecosystem for nearly a decade, evolving to include finer detail, deeper analytics, wider browser selection, and greater geographic coverage. But, like anything at the apex of an ecosystem, even this approach began to show that it couldn’t answer every question.
In the modern online application environment, companies deliver data to multiple browsers and mobile devices while creating increasingly sophisticated applications. These applications are built from a combination of in-house code, commercial and open source packages and servers, and outside services that extend the application beyond what the in-house team specializes in.
This growth and complexity mean that traditional, stand-alone tools are no longer complex and “smart” enough to help customers actually solve the problems they face in their online applications. A new approach, the next step in the evolution of APM, is needed to displace the current technologies at the top of the pyramid.
This ecosystem, with multiple, sometimes competing, data streams, makes it extremely difficult to answer the seemingly simple question, “What is happening?”, and sometimes nearly impossible to answer the important question, “And why does it matter to us?”
Let’s walk through a performance issue and show how APM has evolved to adapt to this complex ecosystem, and why turning the flood of data into a concentrated stream of actionable information requires a sophisticated, integrated approach.
Starting with synthetic data, we already have two unique perspectives that provide a broader scope of data than the traditional datacenter-only approach. By combining the Backbone (traditional datacenter synthetic monitoring) with data from the Last Mile (the same scripts run from end-user computers), clear differences in performance appear, showing companies that the datacenter-only approach needs to be extended with data collected from a source much closer to the customers who actually use the monitored application.
Figure 3: Outside-in data capture perspectives used to provide user experience data for online applications
Using a real-world scenario, let’s follow the diagnostic process for a detected issue from the initial synthetic errors down to the deepest level of impact, and see how a new, integrated APM solution can help resolve issues in an effective, efficient, and actionable way.
Starting with a 3-hour snapshot of synthetic data, it is apparent that there is an issue almost halfway through this period, primarily affecting the Backbone measurements.
Filtering out the blue Last Mile measurements shows that the clear cluster of errors (red squares in the scatter plot) around 17:30 affects the Backbone only. Zooming in confirms that the errors are concentrated on the Backbone measurement perspective.
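As a minimal sketch of that filtering step, assuming a hypothetical CSV export of measurements (the file name, column names, and values are illustrative, not a real product schema), the isolation might look like this in pandas:

```python
import pandas as pd

# Hypothetical export of synthetic measurements; the file name and column
# names are assumptions for illustration, not a real product schema.
df = pd.read_csv("synthetic_measurements.csv", parse_dates=["timestamp"])

# Drop the Last Mile perspective to isolate the Backbone measurements.
backbone = df[df["perspective"] == "Backbone"]

# Zoom in on the hour surrounding the 17:30 error cluster.
window = backbone.set_index("timestamp").between_time("17:00", "18:00")

# Keep only the errors: the red squares in the scatter plot.
errors = window[window["status"] == "error"]
print(errors[["node", "step", "error_code"]])
```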
Examining the data shows that they are all script playback failures caused by a missing element on the step, which prevented the next action in the script from being executed.
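To make the failure mode concrete, here is a hedged sketch of such a scripted step in Selenium-style Python; the URL and element locator are invented for illustration and are not the actual monitoring script:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get("https://example.com/reports")  # placeholder URL

try:
    # The scripted step waits for the element it needs to click next.
    # If the process that renders the element times out, it never appears
    # and the step fails: exactly the playback error seen above.
    chart = WebDriverWait(driver, 30).until(
        EC.element_to_be_clickable((By.ID, "chart"))  # hypothetical locator
    )
    chart.click()
except TimeoutException:
    # The synthetic agent records this as a script playback error.
    print("Step failed: 'chart' element not present; aborting script")
finally:
    driver.quit()
```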
But there are two questions that still need to be answered: Why? and Does this matter? What’s interesting is that, as good as the synthetic tool is, this is as far as it can go. Teams are forced to investigate the issue further and replicate it using other tools, wasting precious time.
But an evolved APM strategy doesn’t stop here. By isolating the time period and the error, the modern, integrated toolset can now ask and answer both of those questions, and extend the investigation to a third: Who else was affected?
In the above instance, we know that the issue occurred from Pennsylvania. By using a user-experience monitoring (UEM) tool that captures data from all incoming visitors, we can filter the data to examine just the synthetic test visit.
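A rough sketch of that filter, again against an assumed visit-level export rather than any product’s real API (the fields and values are hypothetical):

```python
import pandas as pd

# Hypothetical UEM visit export; field names are assumptions for illustration.
visits = pd.read_csv("uem_visits.csv", parse_dates=["start_time"])

# Synthetic agents usually announce themselves (user agent, known node IPs),
# so the test visit can be pulled out of the full visitor stream.
synthetic_visit = visits[
    (visits["client_region"] == "Pennsylvania")
    & (visits["user_agent"].str.contains("synthetic", case=False))
    & (visits["start_time"].dt.strftime("%H:%M").between("17:25", "17:35"))
]
print(synthetic_visit[["visit_id", "start_time", "failed_action"]])
```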
Already, we have extended the data provided by the synthetic measurement. Drilling down further makes it immediately clear what the issue was.
And then the final step: what was happening on the server side? Here it is clear that one layer of the application was causing the issue, and that the server eventually timed out.
So the element needed to move the script forward wasn’t there because the process generating that element timed out. And when the agent attempted the action, the missing element caused the script to fail.
This integrated approach has identified the Click on ‘Chart’ action as one of potential concern, and we can now go back and look at all instances of this action in the past 24 hours to see whether other visits encountered a similar issue. It is clear that this is a serious issue that needs to be investigated: the following screenshot shows all Click on ‘Chart’ actions that experienced this problem, including those from real users who were also impacted.
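A sketch of that 24-hour lookback, under the same caveat that the export format and field names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical action-level export covering the last 24 hours;
# column names are illustrative assumptions, not a real schema.
actions = pd.read_csv("user_actions_24h.csv", parse_dates=["timestamp"])

# All failed "Click on 'Chart'" actions, synthetic and real alike.
failed_charts = actions[
    (actions["action_name"] == "Click on 'Chart'")
    & (actions["outcome"] == "failed")
]

# Split the impact by visitor type to show real users are affected too.
print(failed_charts.groupby("visitor_type").size())
print(failed_charts[["timestamp", "visitor_type", "visit_id"]].head())
```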
So, from an error on a synthetic monitoring chart, we have quickly been able to move down to an issue that was repeated multiple times over the past 24 hours, affecting not only synthetic agents but also real users. Exporting all of this data and sending it to the QA and development teams will allow them to focus their efforts on the critical area.
This integrated approach demonstrates what has been proven in ecosystems throughout the world, whether in nature or in applications: a tightly integrated group that works together seamlessly is far more effective than any individual. With many eyes, perspectives, and complementary areas of expertise, the team approach has provided far more data to solve the problem than any one perspective could have on its own.