Is sampling data as good as capturing all transactions in detail? At first glance, having all transactions is better than just some of them – but let’s dig a little deeper.
If we’re not capturing all transactions – how do we select those we do capture? Or, to be more precise, how do we select the transactions to be captured with full details vs. those with just high-level information?
One way of selecting the transactions you follow in depth would be to sample, meaning you select a certain percentage of transactions, either at random or at a fixed interval. For example, you could choose to only follow every 50th transaction, resulting in a sample rate of 2%. While this reduces application overhead and the load on your monitoring solution – how can you be sure this accurately represents your system? What if the slow request or error a user complained about is in the 98% you didn’t monitor?
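To make the 2% figure concrete, here is a minimal sketch of keeping every 50th transaction. The transaction IDs, the batch size of 6000, and the complaint ID are made up for illustration; this is not how any particular APM product implements sampling.

```python
def sample_every_nth(transactions, n=50):
    """Keep every n-th transaction (fixed-interval sampling)."""
    return transactions[n - 1::n]

# Hypothetical batch of 6000 transaction IDs, numbered 1..6000.
transactions = list(range(1, 6001))
kept = sample_every_nth(transactions, n=50)

coverage = len(kept) / len(transactions)
print(f"captured {len(kept)} of {len(transactions)} transactions ({coverage:.0%})")

# Hypothetical ID of the request a user complained about.
complaint_id = 1234
print("complaint captured in detail:", complaint_id in kept)
```

Only IDs that are multiples of 50 survive, so any specific request has a 1 in 50 chance of being available in full detail when someone asks about it.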
Since application performance is a key driver for any APM solution, picking those transactions with bad performance – meaning they are slow – is important. But what threshold are you using for “slow”? You could pick an arbitrary value, but which one? Especially for newly deployed applications or applications with varying load patterns, selecting such a value can be tricky.
You could use statistical measures to have the system baseline the slow threshold – e.g. using standard deviation. Those measures work best if the data follows a Gaussian distribution (also known as normal distribution) – which response times rarely do. Below is a snapshot of production traffic (around 6000 individual requests) showing a more commonly seen distribution pattern:
If your monitoring tool were to cover only transactions that are more than 3 standard deviations slower than the mean, you would monitor only 1.8% of your traffic:
Going to just two standard deviations would increase coverage to 4%, but you would still miss details on everything faster than 2480 ms.
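The effect of a standard-deviation threshold can be tried out with a short sketch. It assumes the threshold is the mean plus k standard deviations and uses synthetic, right-skewed response times (a log-normal distribution with arbitrarily chosen parameters), not the production data described above, so the printed percentages will differ from the 1.8% and 4% quoted here.

```python
import random
import statistics

def slow_threshold(response_times_ms, k):
    """Threshold at k standard deviations above the mean."""
    mean = statistics.fmean(response_times_ms)
    stdev = statistics.pstdev(response_times_ms)
    return mean + k * stdev

def coverage(response_times_ms, threshold_ms):
    """Share of transactions slower than the threshold."""
    slow = sum(1 for t in response_times_ms if t > threshold_ms)
    return slow / len(response_times_ms)

# Synthetic skewed data; the parameters 6.0 and 0.8 are arbitrary assumptions.
rng = random.Random(42)
response_times_ms = [rng.lognormvariate(6.0, 0.8) for _ in range(6000)]

for k in (2, 3):
    threshold = slow_threshold(response_times_ms, k)
    share = coverage(response_times_ms, threshold)
    print(f"k={k}: threshold {threshold:.0f} ms, {share:.1%} of traffic captured")
```

Whatever the exact numbers, the pattern is the same: a threshold derived this way keeps only a thin slice of the slow tail and says nothing in detail about the bulk of the requests.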
Why Are All Transactions Needed?
So why would I need all transactions? Why isn’t it enough if I combine random sampling, errors and slow transactions as a data set for APM?
After you have looked at the slowest transactions, identified the root cause and have a developer working on the fixes – what’s next? With an APM solution deployed, wouldn’t it be nice to work on, let’s say, improving the median response time, so that users overall get a faster, more responsive application?
What if your boss asks you for “the most bang for his buck”? He doesn’t care that 1.8% of your users are getting bad response times – he wants to make sure the large majority of the users feel the impact of his performance improvement efforts and spending. If we say the large majority is 90% of all requests, the data set you would need for your analysis looks like this (covering around 5500 transactions):
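As a rough illustration of what working on the majority means in practice, the sketch below computes the median and selects the fastest 90% of requests. The ten response times are invented for the example, and `fastest_share` is a hypothetical helper, not part of any APM product.

```python
import statistics

def fastest_share(response_times_ms, share=0.90):
    """Return the fastest `share` of requests, sorted ascending."""
    ordered = sorted(response_times_ms)
    cutoff = int(len(ordered) * share)
    return ordered[:cutoff]

# Invented response times in milliseconds, including two slow outliers.
response_times_ms = [120, 95, 3100, 210, 180, 150, 4800, 130, 160, 110]

majority = fastest_share(response_times_ms, 0.90)
print("median of all requests:", statistics.median(response_times_ms), "ms")
print("requests in the 90% set:", len(majority))
print("slowest request in that set:", majority[-1], "ms")
```

None of this is possible if only the slowest few percent were captured: the median and the 90% set are computed from exactly the transactions a slow-only approach leaves out.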
You could also use additional data points for selecting the transactions to analyze, like excluding internal IPs (we all know that internal users often have different usage patterns than real end users) or only looking at certain geographic regions or only certain types of requests or only requests that hit a certain server or…
The list could go on and on. Once you have analyzed the slowest requests, remember that APM is not just about the 1.8% – it means all users with all their requests.
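The selection criteria mentioned above can be sketched as a simple filter over captured transactions. The `Transaction` fields, the sample records, and the rule that "internal" means a private IP address range are all assumptions made for this example.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Transaction:
    client_ip: str
    country: str
    server: str
    response_time_ms: float

def is_internal(client_ip):
    """Assumption: internal users come from private address ranges."""
    return ipaddress.ip_address(client_ip).is_private

def select(transactions, country=None, server=None, exclude_internal=True):
    """Keep only the transactions that match every given criterion."""
    selected = []
    for t in transactions:
        if exclude_internal and is_internal(t.client_ip):
            continue
        if country is not None and t.country != country:
            continue
        if server is not None and t.server != server:
            continue
        selected.append(t)
    return selected

# Invented sample records.
transactions = [
    Transaction("10.0.0.12", "US", "web-1", 140.0),
    Transaction("8.8.8.8", "US", "web-1", 220.0),
    Transaction("1.1.1.1", "AU", "web-2", 310.0),
]

external_us = select(transactions, country="US")
print(len(external_us), "of", len(transactions), "transactions match")
```

A filter like this is only meaningful after the fact if every transaction was captured; with sampled data, the matching subset may contain too few requests to analyze, or none at all.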
Below is the production traffic from one of our customers – 30 minutes with over 760,000 transactions, all captured with full details:
We have seen time and again that capturing all transactions – in depth, all the time – is necessary to address technical as well as business questions. If you don’t, you might quickly find yourself looking for another APM solution once you have solved the most pressing issues and fixed the worst-performing transactions.
Read up on two examples where having all transactions was essential to make the right decision to fix technical problems as well as to see how business is impacted by performance:
- Safari Bug prevents users to add items to shopping cart
- Connecting Business Goals with IT Requirements
If you want to test this yourself – sign up for the 15 Day dynaTrace free Trial and let us know what you think.