About the Author

Roman Spitzbart has been focusing on APM for the last five years. Currently he is the EMEA sales engineering manager for dynaTrace, engaging prospects and customers to help them understand Compuware's APM solutions and unlock their value. Previously he held various roles within Compuware's enablement services team, working with customers around the globe on APM solution deployments. Prior to joining Compuware, Roman worked for GoldenSource in New York, leading large customer implementations along the US east coast.

APM Myth Busters: Sampling is as Good as Having all Transactions

Is sampling data as good as capturing all transactions in detail? At first glance having all transactions is better than just some of them – but let’s dig a little deeper.

If we’re not capturing all transactions – how do we select those we do capture? Or to be more precise, how do we select the transactions to be captured with full details vs. those with just high-level information?

Sampling

One way of selecting the transactions you follow in depth would be to sample, meaning you randomly select a certain percentage of transactions. For example, you could choose to only follow every 50th transaction, resulting in a sampling rate of 2%. While this reduces application overhead and the load on your monitoring solution – how can you be sure it accurately represents your system? What if the slow request or error a user complained about is in the 98% you didn’t monitor?
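As a rough illustration, such a 2% sampling decision could look like the sketch below (the helper name and the rate are purely illustrative, not any specific product's implementation):

    import random

    SAMPLE_RATE = 0.02  # roughly every 50th transaction

    def capture_full_details() -> bool:
        """Randomly decide whether this transaction gets a full-detail trace."""
        return random.random() < SAMPLE_RATE

    # Out of 10,000 transactions only ~200 would be traced in depth;
    # the slow request a user complains about may well be among the other ~9,800.
    traced = sum(1 for _ in range(10_000) if capture_full_details())
    print(f"{traced} of 10,000 transactions traced in full detail")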

Errors

So what if we also capture those transactions that had errors? For web applications we can use HTTP error codes to detect failures (e.g. 404 or 500 errors). For other types of applications, warning/error log messages and exceptions are a good indicator. If your APM solution includes end-user monitoring, you could even add client-side JavaScript errors. But we still aren’t guaranteed details for the slow transactions – just random samples and errors.
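A minimal sketch of such an error trigger might look like this (the status codes and log keywords are illustrative assumptions):

    ERROR_STATUS_CODES = {404, 500, 502, 503}  # illustrative HTTP failure codes

    def is_error_transaction(status_code: int, log_messages: list[str]) -> bool:
        """Flag a transaction for detailed capture if it appears to have failed."""
        if status_code in ERROR_STATUS_CODES:
            return True
        # For non-web applications, warning/error log entries or exceptions
        # are a good indicator.
        return any("ERROR" in msg or "Exception" in msg for msg in log_messages)

    print(is_error_transaction(500, []))                                  # True
    print(is_error_transaction(200, ["ERROR: timeout calling backend"]))  # True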

Slow Transactions

Since application performance is a key driver for any APM solution, picking the transactions with bad performance – meaning the slow ones – is important. But what threshold do you use for “slow”? You could pick an arbitrary value, but which one? Especially for newly deployed applications or applications with varying load patterns, selecting such a value can be tricky.
You could use statistical measures to have the system baseline the slow threshold – e.g. using standard deviation. Those measures work best if the data follows a Gaussian distribution (also known as normal distribution) – which response times rarely do. Below is a snapshot of production traffic (around 6,000 individual requests) showing a more commonly seen distribution pattern:

The statistical average (red) is 1050ms, the median or 50th percentile (orange) is between 800 and 850ms and the standard deviation is 711ms

If your monitoring tool were to capture only transactions slower than three standard deviations above the average, you would monitor just 1.8% of your traffic:

A common approach of looking at transactions slower than three times the standard deviation will only cover a very small percentage of transactions

Going to just twice the standard deviation would increase coverage to 4% but you would still miss details on everything faster than 2480ms.
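To see why such k-times-standard-deviation thresholds cover so little of a right-skewed response-time distribution, here is a small sketch on simulated data (the log-normal parameters are invented and merely chosen to resemble the figures above, not the actual customer traffic):

    import random
    import statistics

    # Simulate ~6,000 right-skewed (non-Gaussian) response times in milliseconds.
    random.seed(1)
    response_times = [random.lognormvariate(6.7, 0.6) for _ in range(6000)]

    mean = statistics.mean(response_times)
    stdev = statistics.stdev(response_times)
    print(f"average: {mean:.0f} ms, median: {statistics.median(response_times):.0f} ms, "
          f"stdev: {stdev:.0f} ms")

    for k in (2, 3):
        threshold = mean + k * stdev
        share = sum(t > threshold for t in response_times) / len(response_times)
        print(f"slower than mean + {k} x stdev ({threshold:.0f} ms): {share:.1%} of traffic")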

Why are all Transactions Needed?

So why would I need all transactions? Why isn’t it enough if I combine random sampling, errors and slow transactions as a data set for APM?

After you have looked at the slowest transactions, identified the root cause and put a developer to work on the fixes – what’s next? With an APM solution deployed, wouldn’t it be nice to, say, work on improving the median response time, so that users overall get a faster, more responsive application?

What if your boss asks you for “the most bang for his buck”? He doesn’t care that 1.8% of your users are getting bad response times – he wants to make sure the large majority of users feel the impact of his performance-improvement efforts and spending. If we say the large majority is 90% of all requests, the data set you would need for your analysis looks like this (covering around 5,500 transactions):

It is better to focus on and optimize those transactions that really impact the majority of end users

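Once all transactions are captured, selecting that majority data set is a simple percentile filter; a minimal sketch (the helper name and coverage parameter are illustrative):

    import statistics

    def majority_of_requests(response_times: list[float], coverage: int = 90) -> list[float]:
        """Keep the fastest `coverage` percent of requests – the transactions
        the large majority of users actually experience."""
        cutoff = statistics.quantiles(response_times, n=100)[coverage - 1]
        return [t for t in response_times if t <= cutoff]

    # On a set of captured response times this keeps roughly 90% of the requests
    # and drops the long tail that only a small minority of users ever sees.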

You could also use additional data points to select the transactions to analyze: excluding internal IPs (we all know that internal users often have different usage patterns than real end users), looking only at certain geographic units, only certain types of requests, only requests that hit a certain server, or…

The list could go on and on. Once you have analyzed the slowest requests, remember that APM is not just about the 1.8% – it is about all users and all their requests.
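As a rough illustration, several such criteria could be combined in a simple filter over the captured transactions (all field names and values here are hypothetical):

    INTERNAL_PREFIXES = ("10.", "192.168.")   # internal networks to exclude
    REGIONS_OF_INTEREST = {"EMEA"}

    def is_relevant(tx: dict) -> bool:
        """Pick the transactions worth analyzing from the full set of captured data."""
        if tx["client_ip"].startswith(INTERNAL_PREFIXES):
            return False          # internal users skew the picture
        if tx["region"] not in REGIONS_OF_INTEREST:
            return False          # focus on one geographic unit
        return tx["server"] == "web-03"  # or a certain request type, etc.

    sample = {"client_ip": "84.12.0.7", "region": "EMEA", "server": "web-03"}
    print(is_relevant(sample))  # True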

Below is the production traffic from one of our customers – 30 minutes with over 760,000 transactions, all captured with full details:

Having all transactions gives you the real picture on your end user experience

Myth Busted?

We have seen time and again that capturing all transactions – in depth – all the time is necessary to address technical as well as business questions. If you don’t, you might quickly find yourself looking for another APM solution once you have solved the most pressing issues and fixed the worst-performing transactions.

Read up on two examples where having all transactions was essential to make the right decision to fix technical problems as well as to see how business is impacted by performance:

If you want to test this yourself – sign up for the 15-day dynaTrace free trial and let us know what you think.

Comments

  1. Hi Roman,

Well-written article about sampling vs. measuring all the time. I can use this to convince my customers to choose the right tooling.

    Regards,

    Stephen

  2. Hi Roman,

I think this article is creating two new myths: that sampling means *randomly* picking the transactions to measure, and that covering everything is better than sampling.

In APM there is only one thing that is true for any solution: the number of measurements per second or minute is limited (aka physics). So in a highly frequented or distributed system you cannot measure *every* transaction in *full* detail. This is simply impossible. There are good solutions that can measure more per second and maybe bad solutions that capture less, and I am sure that dynaTrace is one of the better ones out there. In fact almost no APM vendor does open benchmarks – except maybe William: http://www.jinspired.org/satoris/benchmarks

So the question is what to do about this problem. One approach is to capture every transaction and reduce the detail level of the transaction, either by manually instrumenting the agent or by doing this in an automatic and intelligent way. The other way is to reduce the number of transactions you measure (sampling). While this can be done in a simple random way as you describe, it can also be done in an intelligent way, using modern machine learning algorithms etc. to decide whether a transaction is important or not. And as this is not black and white, you can combine both approaches to get the best result.

    So I would say that you are absolutely right that “Sampling is as Good as Having all Transactions” is a myth but it is also a myth that “Having all transactions is as good or better than sampling.”

The truth is that you have to be careful when you choose your APM tool. You have to know your application and requirements, and you have to understand the technology you are evaluating in more detail than the marketing slides. The bigger your system is, the more complicated the evaluation.

    My favorite paper about this topic is the Dapper paper from Google: http://research.google.com/pubs/pub36356.html

It even shows that in a very big system you *have to* sample (otherwise the overhead would be much too big), and as you still have a lot of data it is even good enough to sample randomly – aka statistics :-)

    Mirko

• Roman Spitzbart says:

      Hi Mirko,

      Thanks for the comment!

When you say automatically and intelligently deciding which transactions to follow in detail – what triggers other than following slow transactions or those with errors did you have in mind? That is at least what some of our competitors use, combined with a few randomly selected “normal” transactions. However intelligently the system might decide – it can never know what you need the data for. Actually, when implementing your APM solution you might not even know whether tomorrow you will be asked to focus on slow transactions only, or maybe to improve the already fast login transaction even further. So if an APM solution can provide all transactions with low overhead – I would still argue that is better than just selecting some.

I fully agree with your comment on carefully choosing your APM solution. Think about your requirements: what you need the solution to do now (often fixing the issue that prompted the buying decision), but also what you need from it in the next 6, 12 or 18 months (often going beyond the low-hanging fruit of the slowest transactions). Don’t believe marketing materials – use the solution and evaluate it for your use cases!

      Roman

  3. Hi Roman,

With “intelligent” I mean that in the best case you see all transactions that you need. This could mean applying any kind of statistics and machine learning algorithms to the transactions – K-means, fuzzy clustering, linear regression, naive Bayes, etc. – to find groups of transactions that are interesting, transactions that are not “normal”, or things that just do not behave as expected or learned. We are currently experimenting with this kind of agent and data in an internal project at codecentric and it works quite well – especially if you also feed statistics from the servers, network etc. into the analysis (classical Big Data approaches).
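    To make the idea concrete, here is a toy sketch of clustering-based selection – the features, cluster count and 95% cut-off are invented purely for illustration and are not codecentric's actual project:

        import numpy as np
        from sklearn.cluster import KMeans

        # Toy feature matrix: e.g. response time, DB call count, payload size per transaction.
        rng = np.random.default_rng(0)
        features = rng.lognormal(mean=2.0, sigma=0.7, size=(5000, 3))

        # Group transactions and treat those far from every cluster centre as
        # "not normal" and therefore worth capturing in detail.
        model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(features)
        distance_to_centre = np.linalg.norm(
            features - model.cluster_centers_[model.labels_], axis=1)
        cutoff = np.percentile(distance_to_centre, 95)
        interesting = features[distance_to_centre > cutoff]
        print(f"{len(interesting)} of {len(features)} transactions flagged as interesting")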

Recording all transactions also has tradeoffs, e.g. you cannot trace everything inside the transaction as the number of “measures per second” is limited (simply physics). So you have to reduce instrumentation etc. in an “intelligent” way as well.

There is another thing you should keep in mind: recording every transaction can also produce a lot of data. We have a customer in Germany that has 2,000 JVMs controlled by *one* APM controller. They have 8 billion distributed transactions per day. So you say that you can and would record these transactions with “full details”!? Let’s assume that you have 5,000 data points (method calls, parameters, SQLs, etc.) per transaction (which is not too much in this case) and each data point has 20 bytes. Then you would record and store 8,000,000,000 × 5,000 × 20 bytes per day – which is about 727 TB per day!
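    The back-of-the-envelope arithmetic behind that figure, using the 5,000 data points per transaction assumed above:

        transactions_per_day = 8_000_000_000
        data_points_per_transaction = 5_000
        bytes_per_data_point = 20

        total_bytes = transactions_per_day * data_points_per_transaction * bytes_per_data_point
        print(f"{total_bytes / 2**40:.1f} TB per day")  # ~727 TB per day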

Beyond that, you would also need a lot of intelligence (as described above) to find the transactions you are interested in among the 8,000,000,000 PurePaths you have recorded. How should a user do this? He needs some “data mining” tools for this…

So a simple approach could be not to record duplicated data. E.g. you have a request that is slow – you record it 5-10 times and you realize that recordings number 11 through 1,000,000 of the same request do not give you any more insight. Why should I record them, create overhead and send them over the wire? Why should I overwhelm my user with all this data if I already have all the data I need? So for me it is a myth that “having all transactions is better than sampling”, as 1) you cannot capture all transactions in full detail on big, distributed systems and 2) you would have too much data to analyze.
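    A minimal sketch of that deduplication idea (the signature choice and the limit of 10 recordings are illustrative):

        from collections import Counter

        MAX_RECORDINGS_PER_PATTERN = 10
        recordings = Counter()

        def should_record(signature: str) -> bool:
            """Record a transaction pattern only until it has been seen often enough.
            The signature (e.g. URL plus response-time bucket) is an illustrative choice."""
            recordings[signature] += 1
            return recordings[signature] <= MAX_RECORDINGS_PER_PATTERN

        # The first 10 slow "/checkout" requests are recorded in detail; later
        # duplicates are skipped to save overhead and wire traffic.
        print(sum(should_record("/checkout|slow") for _ in range(1000)))  # 10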

As Google states in its Dapper paper (so it’s Google saying this, not me): for them it was enough to record (even randomly) every 1,024th transaction to resolve all their problems over two years – as statistically they had all the data they needed!

    Hope this clarified what I meant.

    Mirko

• Roman Spitzbart says:

      Hi Mirko,

With all that decision logic – built into the agent of course, to avoid capturing too much – it sounds like you might be adding more overhead with fuzzy logic, statistical tests and so on :-)

      On the all data vs. some data approach – I guess we can only agree to disagree. I brought my arguments to the table on why all data gives you flexibility in the analysis, you brought yours on why you feel having the APM solution decide what to capture is a better approach. Will there be limits to both approaches – absolutely! Each and every system will only scale to a certain point … but so far our customers are happy with how far we can get.

I can only invite everyone who is looking for an APM solution to try it out, have your architects and developers (the main consumers of the in-depth data) work with the transaction traces, and speak to customer references in your region and vertical. You don’t have to believe me, you don’t have to believe Mirko – see for yourself!

      Best, Roman

  4. Hi Roman,

I personally think that today “Big Data Analytics” is the way to go – it’s hard to understand, but Amazon sends your order to you before you order it – so you will also be able to find out which transactions are important with much less overhead and in a highly scalable way. If we meet somewhere we can discuss this in detail over a cold beer. :-)

    But I think you said everything a customer should know about *any* APM system: “Will there be limits to both approaches – absolutely! Each and every system will only scale to a certain point .”

Everything depends on the system you have (distribution, number of systems, complexity, load, …), the requirements (overhead, details, etc.) and the users (as you said, a developer has different experience than an ops guy). So go and test it yourself.

    Thanks for the constructive discussion.

    Mirko

  5. Zuher Al Riahy says:

    Hi All,
I think we are talking different languages here. If we can capture all the transactions with reasonable overhead, that is great, as APM itself is all about users who perform transactions.
I have implemented dT in many places and have seen no problem at all with performance (the overhead was absolutely reasonable). But if the argument is whether it is going to work everywhere – then no, it is not?
Until now, I believe more than 99% of our clients in my region (ME) are satisfied and do not complain about any performance degradation after we deployed the solution. There is still 1% that we cannot be sure we can really cover, but I believe all the solutions in the world have limitations, pros and cons, and what dT is offering for my 99% is what I need to make my clients happy with the EUE they expect to see.

    Zuher

I’m a little late to the party, but on the subject of sampling and the Dapper paper I would recommend that you explore the many other Google research papers that have been published, which clearly show the importance of tracking all transactions (service interactions) to reduce variability.

    http://www.jinspired.com/site/google-engineering-on-performance-variability-the-tail-at-scale

Hopefully you will see that monitoring serves management – not in the sense of a dashboard, but in controlling in-flight request processing and patterns. It is control (or influence) that we seek, and only the machine can effectively achieve that. The good thing is that this can be done both locally and globally, depending on a number of factors related to reliability, performance and so on.
