About the Author

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within the Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi.

Software Quality Metrics for your Continuous Delivery Pipeline – Part I

How often do you deploy new software? Once a month, once a week, or every hour? The more often you deploy, the smaller your changes will be. That’s good! Why? Because smaller changes tend to be less risky, since it’s easier to keep track of what has really changed. For developers, it’s certainly easier to fix something you worked on three days ago than something you wrote last summer. An analogy from a recent conference talk by AutoScout24 is to think of your release as a container ship, with every one of your changes being a container on that ship:

Your next software release en route to meet its iceberg

If all you know is that you have a problem in one of your containers, you’d have to unpack and check all of them. That doesn’t make sense for a ship, and neither does it for a release. But that’s still what happens quite frequently when a deployment fails and all you get is “it didn’t work.” In contrast, if you were shipping just a couple of containers you could replace your giant, slow-maneuvering vessel with something faster and more agile, and if you’re looking for a problem you’d only have to inspect a handful of containers. While adopting this practice in the shipping industry would be rather costly, this is exactly what continuous delivery allows us to do: deploy more often, get faster feedback, and fix problems faster.

A great example is Amazon, who shared their success metrics at Velocity:

Some impressive stats from Amazon showing the success of rapid continuous delivery

However, even small changes can have severe impacts. Some examples:

  1. Heavy DOM manipulations through JavaScript: introduced by a “harmless” new JavaScript library for tracking link clicks
  2. Memory leaks in production: introduced by a poorly tested remote logging framework downloaded from GitHub
  3. Performance impact of exceptions in operations: Ops and Dev did not follow the same deployment steps (due to a lack of automation scripts), resulting in thousands of exceptions that maxed out the CPU on all app servers

Extending your Delivery Pipeline

Even small changes need to be tracked, and their impact on overall software quality must be measured along the delivery pipeline, so that your quality gates can stop even the smallest change from causing a huge issue. The three examples above could have been avoided by automatically watching the following measures across the delivery pipeline and stopping the delivery when “architectural” regressions are detected:

  • The number of DOM manipulations
  • Memory usage or object churn rate per transaction
  • The number of exceptions, database queries or log entries

In this series of blog posts I will introduce the metrics you should measure along your pipeline as an additional quality gate to prevent problems like those listed above. It is important that:

  • Developers capture these measurements in the commit stage
  • Automation Engineers measure them for the automated unit and integration tests
  • Performance Engineers add them to the load testing reports produced in staging
  • Operations verifies how the real application behaves after a new deployment in production

For each metric I introduce, I’ll explain why it is important to monitor, which types of problems it can detect, and how Developers, Testers and Operations can track it.

Metric: # of Requests per End User Action

How many web requests does it take to load your homepage, execute your search, or perform another critical function of your application? You can use tools such as dynaTrace AJAX Edition, Firebug, Speed Tracer or network sniffing tools such as Fiddler to figure that out.
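If you just want the raw number without installing any of these tools, a few lines of scripting will do. The following is a minimal sketch, assuming Selenium WebDriver with Chrome is available; the URL is a placeholder. It reads the browser’s Resource Timing API to count the requests a single page load triggered:

```python
# Minimal sketch: count requests per page load via the Resource Timing API.
# Assumes Selenium WebDriver and Chrome; the URL is a placeholder.
from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/")
    resources = driver.execute_script(
        "return performance.getEntriesByType('resource')"
        ".map(r => ({name: r.name, bytes: r.transferSize}))")
    total_bytes = sum(r["bytes"] or 0 for r in resources)
    # +1 accounts for the HTML document itself, which has no 'resource' entry
    print(f"{len(resources) + 1} requests, {total_bytes} bytes transferred")
finally:
    driver.quit()
```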

But – why should you look at this number? Last year we planned to upgrade the software that powers our community portal, hoping that the latest version of Confluence (which powers our community) would be much faster (as promised in the release notes) and would let us leverage some of its new interactive features. We ran a test before and after the upgrade on our staging environment and looked at metrics such as the number of simulated users and the number of requests executed. The first test showed that about 200 requests were executed per user, where each user clicked through 4 main pages of our system:

Loadtest on the old version showed us that about 200 requests were executed for each user that we simulated

We had to abort the 2nd load test due to too many errors caused by overloaded servers. The most interesting observation was that the same 4 steps were now taking 400 requests per user – that’s TWICE the number:

Lots more JavaScript files and AJAX Requests introduced by new functionality caused the explosion of web requests

How did that happen? Because of all the new interactive functionality, many additional JavaScript files were loaded, which in turn made several more AJAX calls.

How to Measure on Dev Workstations

Developers can look at these metrics on their local workstations. They probably already know tools like dynaTrace AJAX Edition, Firebug or Speed Tracer. The following is a screenshot of one of these tools that highlights the key metrics for a single page – in this case it is the homepage of our community after the upgrade:

Key Metrics: 108 requests alone for the homepage, 14 XHR requests and a total size of 4MB

Web developers especially should be familiar with the best practices around Web Performance Optimization. If they see measures like these, they should think twice before checking in their code.

How to Measure in Continuous Integration

Automation Engineers can use the same tools listed above in combination with automated testing tools such as Selenium, Silk, QuickTest, etc.

The key is to capture these metrics for every test executed on every build and to automatically identify regressions, so that your CI build actually fails if the number of requests jumps. The following is a screenshot from dynaTrace that automatically captures and analyzes these metrics for you and alerts in case of a regression.

Automatically identify regressions from build to build by looking at metrics such as # of Requests, # of JavaScript files, …
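If you don’t have a commercial tool in place, a simple version of such a quality gate can be scripted. Here is a minimal sketch of a test that fails the build on a request-count regression; it assumes pytest and Selenium, and the baseline file name, URL and 20% tolerance are illustrative choices, not a prescription:

```python
# Sketch of a CI quality gate: fail the build if the request count regresses.
import json
from selenium import webdriver

TOLERANCE = 1.2  # allow 20% growth over the recorded baseline

def count_requests(driver, url):
    driver.get(url)
    # resource entries + 1 for the HTML document itself
    return driver.execute_script(
        "return performance.getEntriesByType('resource').length") + 1

def test_homepage_request_count():
    baseline = json.load(open("request_baseline.json"))["homepage"]
    driver = webdriver.Chrome()
    try:
        current = count_requests(driver, "https://staging.example.com/")
    finally:
        driver.quit()
    assert current <= baseline * TOLERANCE, (
        f"Request count regression: {current} requests vs. baseline {baseline}")
```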

How to Measure in Load Testing

Performance Engineers use load testing tools, which typically provide this metric out of the box. Even though they deliver metrics such as # of Transactions or # of Pages, they typically also provide the number of actual web requests. If your tool doesn’t provide that data, you can analyze the number of requests on the server side. Looking at the web server logs is one option, but it makes it a bit hard to figure out which requests came in through which page load. Application Performance Management solutions typically provide a “Page Context” or “User Action” context that allows assigning requests on your web or app server to an individual real or simulated user.
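If web server logs are all you have, a rough approximation is still possible. The following sketch assumes the common Apache/nginx “combined” log format and uses the Referer field to attribute resource requests to the page that triggered them; it is far less reliable than an APM solution’s page context, but it gives a first impression:

```python
# Rough sketch: approximate requests-per-page from a combined-format access log.
import re
from collections import Counter

# Matches the request line, status, size and Referer of a combined-format entry
LINE = re.compile(r'"(?:GET|POST) (?P<path>\S+)[^"]*" \d+ \S+ "(?P<referer>[^"]*)"')

def requests_per_page(log_path):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = LINE.search(line)
            if match:
                # attribute resource requests to the referring page;
                # requests without a Referer count against their own path
                counts[match.group("referer") or match.group("path")] += 1
    return counts

for page, hits in requests_per_page("access.log").most_common(10):
    print(f"{hits:6d}  {page}")
```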

The following is a screenshot from dynaTrace that provides this data for each individual load test step – making it easy to figure out how many web requests are actually processed in the different stages of the test.

Learn about the performance characteristics and hotspots of individual test steps under increasing load

Even more interesting than observing a single load test is comparing two tests with each other, e.g. two tests executed against two different builds, to identify what changed. The following table shows the difference in the number of requests executed between two tests with identical transactions. The current test shows a severe performance degradation and a much higher number of requests executed for the same steps:

A bad code change increases the number of web requests dramatically and also impacts response time

How to Measure in Production

So what does this mean for Operations? As we can’t test every page and every possible user interaction, it is important to measure the same metric in “the real world.” Having a Real User Monitoring solution in place gives you metrics such as Number of Visitors and Number of User Actions, e.g. “Loading a Page” or “Clicking on a Link”. Connecting the User Action with the actual web requests executed by the browser gives you the exact metric we’re looking for. The following screenshot shows what this measure looks like at an unnamed shopping site:

A jump in the number of web requests executed per visitor immediately tells us that something has changed in the deployment

The jump we see at 7:40 can now trigger a rollback, in case end-user response time is also impacted or people start complaining. It can also trigger a faster fix by providing this information to engineering, who can see what is different between production and development and get a new version deployed as fast as possible (= roll forward).
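Real User Monitoring products provide this kind of alerting out of the box, but the underlying idea is simple enough to illustrate. The following is a naive sketch of a jump detector over a requests-per-user-action time series; the window size, factor and sample values are invented for the example:

```python
# Naive sketch: flag samples that jump well above the recent trailing average.
def detect_jump(series, window=12, factor=1.5):
    """Return indices where a sample exceeds `factor` x the trailing mean."""
    alerts = []
    for i in range(window, len(series)):
        trailing_mean = sum(series[i - window:i]) / window
        if trailing_mean and series[i] > factor * trailing_mean:
            alerts.append(i)
    return alerts

# e.g. five-minute buckets of average requests per user action
samples = [48, 50, 49, 51, 47, 52, 50, 49, 51, 50, 48, 50, 95, 98, 97]
print(detect_jump(samples))  # -> [12, 13, 14]: the jump after the deployment
```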

Make sure to also monitor third parties: a jump in the number of requests can also mean that you switched your ad provider or that you have a problem with your CDN configuration. Two examples:

  1. You only control one third of your page load performance: Decided to add social plugins or to switch ad providers? The impact could be a significant increase in the JavaScript, CSS and images loaded, affecting your load times and the number of resources
  2. When your CDN doesn’t deliver what it promises: A misconfigured CDN may not deliver your images with the correct browser cache headers, or may forward too many requests to your data center instead of serving them from the cache (see the sketch below)
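The second problem is easy to spot-check yourself. Here is a quick sketch, assuming Python’s requests library; the asset URLs are placeholders for resources your CDN should be serving with long cache lifetimes:

```python
# Quick sanity check: do CDN-served assets carry proper browser cache headers?
import requests

ASSETS = [  # placeholder URLs - replace with assets from your own CDN
    "https://cdn.example.com/js/app.js",
    "https://cdn.example.com/img/logo.png",
]

for url in ASSETS:
    response = requests.head(url, allow_redirects=True, timeout=10)
    cache_control = response.headers.get("Cache-Control", "<missing>")
    print(f"{url}: Cache-Control={cache_control}")
    if "max-age" not in cache_control:
        print("  WARNING: no max-age, so browsers will re-request this asset")
```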

What Does This Mean for You?

This is the first metric – the number of requests per visitor – that is important along the delivery pipeline. Here are the key takeaways for your specific role:

  • Developers: Understand the impact that introducing a new JavaScript library or a new feature has on the actual number of requests executed by the browser. There are plenty of tools available to provide this data in an on-demand or automated scenario.
  • Performance Engineers: Make sure to test your applications from real browsers and from different locations. Only then can you be sure you are testing all involved components, including 3rd party providers.
  • Production and Business: Not every use case can be tested, so it is important to have this type of monitoring in place. Report this data back to your engineering team so they learn about the impact of implementation changes, and make sure to understand the impact of 3rd party components.

The next metrics we are going to look into are server-side architectural metrics such as # of Database Statements, # of Exceptions or # of Log Messages written. Stay tuned, and feel free to comment below with your own metrics.

Comments

  1. Often automated testing efforts focus too much on discrete tests that validate very specific expected outcomes. This article does a great job of reminding us how important it is to also measure “global” application metrics collected during static and repeated basic transactions across revisions. These tests can reveal unexpected regressions that indicate underlying problems which can be difficult or even impossible to test directly.
