About the Author: Klaus Enzenhofer

Klaus Enzenhofer has over 5 years of experience and expertise in the field of Web Performance Optimisation and User Experience Management. He works as Technical Strategist in the Centre of Excellence team at Compuware APM. In this role he influences the development of the dynaTrace Application Performance Management solution focusing on Real User Monitoring of web and mobile applications. He is a regular speaker at technology conferences on Real User Monitoring and Performance related topics and has also written many articles and blogs which have been published in print and online publications.

Why SLAs on Request Errors do not work – and what you should do instead

We often see request error rates used as an indicator for SLA compliance. In reality, however, this paints the wrong picture.

Let’s start with an example.

We had a meeting with a customer and talked about their SLA and what it is based on. As in many other cases, the request error rate was used, and the SLA they had agreed on was 0.5%. The operations team told us that their current request error rate was 0.1%, so they are far below the agreed value. The assumption drawn from the current rate is that every 1000th customer has a problem while using the website. That sounds good, but is this assumption true, or do more customers have problems?

Most people assume that a page load equals a single request. If you think about it, however, you quickly realize that this is not the case: a typical page consists of multiple resource requests. So from now on we focus on all resource requests.

Let’s take a look at a typical eCommerce example. A customer searches for a certain product and wants to buy it in our store. Typically he will have to walk through multiple pages. Each click leads either to a page load, which executes multiple resource requests, or to one or more AJAX requests. In our example the visitor has to go through at least seven steps/pages, starting at the product detail page and ending on the confirmation page.

Browser Performance Report from the AJAX Edition Premium Edition showing the resource requests per page of the buying process


The report shows the total Request Count per page. The shortest possible click path for a successful buy leads to 317 resource requests. To achieve a good user experience we need to deliver the resources fast and without any errors. However if we do the math for the reported error rate:

Customers with Errors = 317 requests * 0.1% = 31.7%

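Strictly speaking, this product is the expected number of failed requests per visit (0.317). If failures are independent, the share of visits with at least one failing request is 1 - 0.999^317, or about 27%. That still means that on average more than every fourth user will have at least one failing request, and it doesn’t even violate our SLA!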
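Multiplying the request count by the error rate gives the expected number of failed requests per visit rather than a probability. The following is a minimal sketch of the probability itself, assuming every request fails independently at the same rate (real traffic only approximates this). The function name is illustrative and not part of any tool mentioned here.

```python
def visit_failure_probability(requests_per_visit: int, request_error_rate: float) -> float:
    """Probability that a visit sees at least one failed request.

    Assumes independent request failures at a constant rate.
    """
    return 1.0 - (1.0 - request_error_rate) ** requests_per_visit


if __name__ == "__main__":
    # 317 requests on the shortest click path, 0.1% measured error rate
    measured = visit_failure_probability(317, 0.001)
    # the same click path if the error rate sat exactly at the 0.5% SLA limit
    at_sla_limit = visit_failure_probability(317, 0.005)
    print(f"measured rate: {measured:.1%}")
    print(f"SLA limit:     {at_sla_limit:.1%}")
```

Under this assumption the result always stays below 100%, however many requests a visit makes, and at the SLA limit of 0.5% roughly four out of five visits on this click path would see at least one failed request.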

The problem is that our error rate is independent of the number of requests per visiting customer. Therefore the SLA does not reflect any real-world impact. Instead of a request failure rate we need to think about failed visits. The rate of failed visits has a direct impact on the conversion rate and thus on the business. As such it is a much better KPI. If you ask your operations team for this number, most will not be able to give it to you. This is no surprise, as it is not easy to correlate independent web requests into a visit.

Another thing that needs to be taken into account is the importance of a single resource request for the user experience. A user will be frustrated if the photo of the product he wants to buy is missing or, even worse, if the page does not load at all. He might not care if the background image does not load and might even be happy if the ads do not pop up. This means we can define which missing resources are “just” errors and which constitute failed visits. Depending on the URI pattern we can distinguish between different resources and define a different severity for each rule. In our case we defined separate rules for CSS files, images used by CSS, product images, JavaScript resources and so forth.

Error rules for different resources within dynaTrace


This allows us to count errors and severe failures separately on a per page action or per visit basis. In our case a page action is either a page load (including all resource requests) or a user interaction (including all resource and AJAX requests). A failed page action means that the content displayed in the browser is incomplete or even unusable, and the user will not have a good experience.
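The idea of URI-based severity rules and per-page-action counting can be sketched as follows. This is not how dynaTrace implements its rules. The patterns, severities, default, and names are hypothetical and only illustrate the principle. Rules are checked in order, so the more specific ones come first.

```python
import re
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    IGNORE = 0   # failure is not counted at all
    ERROR = 1    # counted as an error, page action still usable
    SEVERE = 2   # page action counts as failed


@dataclass(frozen=True)
class ErrorRule:
    name: str
    uri_pattern: re.Pattern
    severity: Severity


# Hypothetical rules; URI patterns and severities are illustrative only.
RULES = [
    ErrorRule("Ads", re.compile(r"/ads/"), Severity.IGNORE),
    ErrorRule("Product images", re.compile(r"/images/products/"), Severity.SEVERE),
    ErrorRule("CSS files", re.compile(r"\.css(\?|$)"), Severity.SEVERE),
    ErrorRule("JavaScript resources", re.compile(r"\.js(\?|$)"), Severity.SEVERE),
    ErrorRule("Images used by CSS", re.compile(r"/css/.*\.(png|gif|jpg)(\?|$)"), Severity.ERROR),
]
DEFAULT_SEVERITY = Severity.ERROR  # assumption: unmatched failures are plain errors


def severity_for(uri: str) -> Severity:
    """Return the severity of the first rule whose pattern matches the URI."""
    for rule in RULES:
        if rule.uri_pattern.search(uri):
            return rule.severity
    return DEFAULT_SEVERITY


def summarize_page_action(failed_uris: list[str]) -> dict:
    """Count errors and severe failures among one page action's failed requests."""
    severities = [severity_for(uri) for uri in failed_uris]
    severe = sum(1 for s in severities if s is Severity.SEVERE)
    errors = sum(1 for s in severities if s is Severity.ERROR)
    # A page action is failed as soon as one severe failure occurred.
    return {"errors": errors, "severe": severe, "failed": severe > 0}
```

With these example rules, a page action whose only failed requests are a CSS background image and an ad script records one error but is not a failed page action, while a missing product image makes it one.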

Therefore instead of looking at failed requests it is much better to look at failed page actions.

The red portion of the bars represents failed page actions


When talking about User Experience we are however not only interested in single pages but in whole visits. We can tag visits that have errors as non-satisfied and visits that abandoned the page after an error as frustrated.
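A minimal sketch of such a visit classification, assuming each visit is already available as an ordered list of flags saying whether each page action had an error. The function names and the abandonment heuristic (the visit ends on a page action with an error) are assumptions made for illustration.

```python
from collections import Counter


def classify_visit(page_action_failed: list[bool]) -> str:
    """Classify one visit from per-page-action error flags, in click order.

    Assumption: a visit counts as abandoned after an error when its last
    page action had an error.
    """
    if not any(page_action_failed):
        return "satisfied"
    if page_action_failed[-1]:
        return "frustrated"
    return "non-satisfied"


def failed_visit_rate(visits: list[list[bool]]) -> float:
    """Share of visits classified as non-satisfied or frustrated."""
    if not visits:
        return 0.0
    counts = Counter(classify_visit(v) for v in visits)
    return (counts["non-satisfied"] + counts["frustrated"]) / len(visits)
```

The hard part in practice is the input: grouping independent web requests into page actions and visits has to happen before any such classification can run.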

Visits by User-Experience


Such a failed visit rate draws a more accurate picture of reality and of the impact on the business, and in the end tells us whether we need to investigate further or not.

Conclusion

SLAs on the request failure rate are not enough. One might even say they are worthless if you really want to find out how good or bad the user experience is for your customers. It is more important to know the failure rate per visit, and you should think about defining an SLA on this value. In addition we need to define which failed requests constitute a failed visit and are of high priority. This allows us to fix the problems with real impact and improve the user experience quickly.

Comments

  1. “Customers with Errors = 317 requests * 0.1% = 31.7%”
    - soooo, if there is 2000 requests than I have 200% chances of error? ;)
    I don’t disagree with the article, but please don’t put in bold something that suggests that you didn’t have any probability classes ;)

  2. Nice information. Thanks for sharing….

  3. Thanks, I know it’s not perfect statistics ;-) but I think it helps to point out what I want to say.

