Why SLAs on Request Errors do not work – and what you should do instead
We often see request error rates used as an indicator for SLA compliance. In practice, however, this paints the wrong picture.
Let’s start with an example.
We had a meeting with a customer and talked about their SLA and what it is based on. As in many other cases, the request error rate was used, and the SLA they had agreed on was 0.5%. The operations team told us that the current request error rate was 0.1%, so they are far below the agreed value. The assumption drawn from the current rate is that only every 1000th customer has a problem while using the website. That sounds good, but is this assumption true, or do more customers have problems?
Most people assume that a page load equals a single request. If you think about it, however, you quickly realize that this is not the case: a typical page consists of multiple resource requests. So from now on we focus on all resource requests.
Let’s take a look at a typical eCommerce example. A customer searches for a certain product and wants to buy it in our store. Typically they will have to walk through multiple pages. Each click leads either to a page load, which executes multiple resource requests, or to one or more AJAX requests. In our example the visitor has to go through at least seven steps/pages, starting at the product detail page and ending on the confirmation page.
The report shows the total Request Count per page. The shortest possible click path for a successful buy leads to 317 resource requests. To achieve a good user experience we need to deliver the resources fast and without any errors. However if we do the math for the reported error rate:
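Expected failed requests per visit = 317 requests * 0.1% = 0.317
That means that, on average, roughly every third to fourth user will have at least one failing request (about 27% of visits if failures are spread independently across requests), and that does not even violate our SLA!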
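To check this arithmetic, here is a minimal Python sketch. Multiplying 317 by 0.1% gives the expected number of failed requests per visit (0.317). The share of visits with at least one failed request is slightly lower, 1 - (1 - p)^n. The sketch assumes that failures are independent and evenly spread across requests, which real traffic rarely satisfies, so treat the result as a rough estimate. The function name is hypothetical; the 317 requests and the 0.1% and 0.5% rates come from the example above.

```python
def share_of_visits_with_error(request_error_rate: float, requests_per_visit: int) -> float:
    """Probability that at least one request of a visit fails.

    Assumption: every request fails independently with the same probability.
    """
    return 1.0 - (1.0 - request_error_rate) ** requests_per_visit


REQUESTS_PER_VISIT = 317  # shortest click path for a successful buy

# Currently measured request error rate (0.1%)
current = share_of_visits_with_error(0.001, REQUESTS_PER_VISIT)
# Request error rate at the agreed SLA limit (0.5%)
at_sla_limit = share_of_visits_with_error(0.005, REQUESTS_PER_VISIT)

print(f"at 0.1% request errors: {current:.1%} of visits affected")
print(f"at 0.5% request errors: {at_sla_limit:.1%} of visits affected")
```

Under these assumptions roughly a quarter of all visits see an error at the current rate, and most visits would see one if the error rate rose to the agreed SLA limit.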
The problem is that the request error rate is independent of the number of requests per visiting customer. Therefore the SLA does not reflect any real-world impact. Instead of a request failure rate we need to think about failed visits. The rate of failed visits has a direct impact on the conversion rate and thus on the business, which makes it a much better KPI. If you ask your operations team for this number, most will not be able to give you an exact figure. This is no surprise, as it is not easy to correlate independent web requests to a visit.
Once requests are correlated to page actions and visits, we can count errors and severe failures separately on a per page action or per visit basis. In our case a page action is either a page load (including all resource requests) or a user interaction (including all resource and AJAX requests). A failed page action means that the content displayed in the browser is incomplete or even unusable, and the user will not have a good experience.
Therefore instead of looking at failed requests it is much better to look at failed page actions.
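The difference between the two rates can be illustrated with a small Python sketch. It assumes that every request record already carries a visit ID and a page action ID, which is exactly the correlation that is hard to get in practice. The `Request` type, its field names, and the sample data are hypothetical, and the sketch treats any failed request as a failed page action without distinguishing errors from severe failures.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    visit_id: str    # hypothetical: ID of the visit the request belongs to
    action_id: int   # hypothetical: page load or user interaction within the visit
    url: str
    failed: bool


def failed_action_rate(requests: Iterable[Request]) -> float:
    """Share of page actions that contain at least one failed request."""
    action_failed = defaultdict(bool)
    for r in requests:
        key = (r.visit_id, r.action_id)
        action_failed[key] = action_failed[key] or r.failed
    if not action_failed:
        return 0.0
    return sum(action_failed.values()) / len(action_failed)


# Made-up sample data: three page actions, one of them with a failing script
sample = [
    Request("v1", 1, "/product", False),
    Request("v1", 1, "/product.css", False),
    Request("v1", 2, "/cart", False),
    Request("v1", 2, "/cart.js", True),
    Request("v2", 1, "/product", False),
    Request("v2", 1, "/product.css", False),
]

request_rate = sum(r.failed for r in sample) / len(sample)
action_rate = failed_action_rate(sample)
print(f"failed requests: {request_rate:.1%}, failed page actions: {action_rate:.1%}")
```

In this sample one failed request out of six becomes one failed page action out of three, so the same data looks twice as bad once it is measured per page action.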
When talking about user experience, however, we are interested not only in single pages but in whole visits. We can tag visits that have errors as non-satisfied and visits that were abandoned after an error as frustrated.
Such a failed visit rate draws a more accurate picture of reality, the impact on the business and in the end whether we need to investigate further or not.
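As a rough sketch of how such tagging could work, the Python example below classifies visits and derives a failed visit rate. The names and sample data are hypothetical. The label "satisfied" for error-free visits and the rule that a visit counts as abandoned when its last page action had an error are assumptions made for illustration, not a definition taken from any monitoring tool.

```python
from typing import List, NamedTuple


class PageAction(NamedTuple):
    name: str
    had_error: bool


def classify_visit(actions: List[PageAction]) -> str:
    """Tag a visit as 'satisfied', 'non-satisfied' or 'frustrated'."""
    if not any(a.had_error for a in actions):
        return "satisfied"
    # Assumption: if the last page action had an error, the visitor
    # abandoned the page right after that error.
    if actions[-1].had_error:
        return "frustrated"
    return "non-satisfied"


def failed_visit_rate(visits: List[List[PageAction]]) -> float:
    """Share of visits with at least one failed page action."""
    if not visits:
        return 0.0
    failed = sum(1 for v in visits if classify_visit(v) != "satisfied")
    return failed / len(visits)


# Made-up sample data: four visits through a shortened click path
visits = [
    [PageAction("product", False), PageAction("cart", False), PageAction("confirmation", False)],
    [PageAction("product", False), PageAction("cart", True), PageAction("confirmation", False)],
    [PageAction("product", False), PageAction("cart", True)],
    [PageAction("product", False), PageAction("cart", False)],
]

labels = [classify_visit(v) for v in visits]
rate = failed_visit_rate(visits)
print(labels, f"failed visit rate: {rate:.0%}")
```

A real implementation would need a better abandonment signal, for example whether the visit reached the confirmation page.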
An SLA on the request failure rate is not enough. One might even say it is worthless if you really want to find out how good or bad the user experience is for your customers. It is more important to know the failure rate per visit, and you should think about defining an SLA on this value. In addition, we need to define which failed requests constitute a failed visit and are of high priority. This allows us to fix the problems with real impact first and improve the user experience quickly.