Fact Finders: Sorting out the truth in Real User Monitoring
On my recent visits to Velocity, WebPerfDay and Apps World in London, Real User Monitoring (RUM) was the hot topic. That got me thinking about the differences between vendors. They all promise the same thing at widely varying prices, from free to a couple of thousand US dollars. What I found is that there IS a big difference and, depending on what you want to do with RUM, you need to understand the capabilities and limitations of the available solutions.
The false claim of 100% Coverage
What all vendors claim to do is capture data from 100% of your users. When looking closer you see that many of these solutions – especially the “Freemiums” – rely on the W3C Navigation Timings. So my question is: How can I cover ALL Users with W3C timings when these timings are NOT AVAILABLE on all browsers?
W3C timings are only available in newer browsers. So what about IE6, IE7, IE8, the whole Safari browser family, and older Firefox and Chrome versions? According to current statistics they add up to 35% of the overall market share (http://www.w3counter.com/globalstats.php). Vendors that rely on these timings cannot accurately claim to capture every user's experience.
The performance impact of monitoring
After finding that out I asked myself: “Are there any more deficiencies to be found?”
I first thought about the collection mechanism, which reminded me of the challenges all the Web Analytics tools have. Data collection relies on the browser's onUnload event. RUM tools have to collect data until the last moment of the page's lifecycle and then send it off. Most SaaS vendors use an image GET request to send the data to their collection instances. Modern browsers optimize this case: “Why should a browser download an image if the page is about to die?” Browsers like Chrome simply do not execute the request at all, or do not wait for the response once the data has been sent. So again, I am losing data from my real end users. The workaround some vendors have put in place is a timeout in the onUnload handler. I’ve seen timeouts of up to 500ms, which delay the next page that gets loaded. We want to improve the user experience and performance, but these tools force the user to wait longer to move to the next page.
So we are losing all the old browsers and additionally the modern ones that do not execute the data collection requests. We are now far away from 100% coverage.
Do the math
Another argument you always hear is that the RUM solution allows you to find out more about the end user environment’s impact on page performance. The geographical region of the end user, the browsers, the OS or device can result in slow page performance. But does this really work?
Let’s do some simple math and figure out what this means to a page with 1 000 000 visits a day:
- 1 000 000 overall visits/day
- 1 000 000 – 35% (visits with no W3C timing support in the browser) = 650 000
- 650 000 – 20% (visits where the data is not sent at all or arrives incomplete) = 520 000
- 520 000 captured visits per day
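The subtraction above can be checked with a few lines of Python. This is only a sketch: the function name `captured_visits` and its parameter names are made up for illustration, and the 35% and 20% loss rates are the estimates used in this post, not measured values.

```python
def captured_visits(total_visits: int,
                    no_timing_share: float,
                    send_failure_share: float) -> int:
    """Visits left after two successive losses.

    no_timing_share:    share of visits from browsers without W3C timings
    send_failure_share: share of the remaining visits whose data is never
                        sent or arrives incomplete
    """
    with_timing = total_visits * (1 - no_timing_share)
    captured = with_timing * (1 - send_failure_share)
    # Round to a whole number of visits to avoid floating-point noise.
    return round(captured)


if __name__ == "__main__":
    total = 1_000_000
    captured = captured_visits(total, 0.35, 0.20)
    print(f"{captured} of {total} visits captured ({captured / total:.0%})")
```

Note that the second loss applies to what is left after the first one, which is why the result is 520 000 and not 450 000.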
So we have reduced our base from 1 000 000 to 520 000. Let’s start with the breakdown into the different groupings:
- 520 000 broken down by 100 countries
- 520 000/100 = 5 200 visits/country/day
- 5 200 visits per country broken down by 20 browser versions
- 5 200/20 = 260 visits/country/browser version/day
Let’s break the 260 visits down further by 10 operating systems:
- 260/10 = 26 visits/country/browser version/operating system/day
We want to have data on an hourly basis:
- 26/24 ~ 1 visit/country/browser version/operating system/hour
**1 000 000 visits per day =~ 1 visit/country/browser version/operating system/hour! We have done no sampling, we have only country level data, we are looking at visits and not page views!**
To clarify: this calculation assumes that visits are evenly distributed over all countries. It does not take into account that most solutions sample at a rate of 1-20%, nor that we are counting visits with multiple page views rather than unique URIs. This seems to me to be a best-case scenario. In reality it can be even worse.
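To see how quickly sampling makes things worse, here is a small Python sketch of the same breakdown. The function name `visits_per_segment_per_hour` and its parameters are hypothetical, and the defaults (100 countries, 20 browser versions, 10 operating systems, even distribution) are simply the assumptions from the calculation above.

```python
HOURS_PER_DAY = 24


def visits_per_segment_per_hour(captured_per_day: float,
                                countries: int = 100,
                                browser_versions: int = 20,
                                operating_systems: int = 10,
                                sampling_rate: float = 1.0) -> float:
    """Average visits per country/browser version/OS combination per hour.

    Assumes visits are spread evenly over all combinations, which is the
    best case; real traffic is far more skewed.
    """
    segments = countries * browser_versions * operating_systems
    return captured_per_day * sampling_rate / segments / HOURS_PER_DAY


if __name__ == "__main__":
    # No sampling, then the two ends of the 1-20% range mentioned above.
    for rate in (1.0, 0.20, 0.01):
        per_hour = visits_per_segment_per_hour(520_000, sampling_rate=rate)
        print(f"sampling {rate:.0%}: {per_hour:.2f} visits/segment/hour")
```

Without sampling this gives roughly one visit per segment per hour, as calculated above. At any sampling rate in the 1-20% range the average drops well below one visit per hour, so most segments would have no data at all in a given hour.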
So then, why is Real User Monitoring so popular?…
…because it helps you to improve your users' experience! How can that work when we know that we might not capture data from all our end users? You only have to change your expectations of what you want to achieve with Real User Monitoring.
What you should expect from your RUM solution is:
- Support for all browsers – not only the new browsers
- A reliable data sending mechanism
- W3C timings support
- AJAX/XHR-requests timing – not only timings for page loads
- The click path of a whole visit – not only separate page views
- Support for desktop browsers, mobile browsers and mobile native applications in a combined view
- Landing and Exit page analysis
If your selected solution provides all these features, you can go a step further: instead of only monitoring your users, you can do real User Experience Management (UEM). Here are some short examples of what that allows you to do.
Example 2: Why are my customers leaving my web site?
With UEM you are able to see not only that your customers are leaving your web site, but also whether they had technical issues.
Example 3: What did my customer do on the application before he called our support center?
Having every visit and all actions available makes it easy for support center employees to look up the visit information as part of the triage process.
Example 4: Correlating Performance to Business
Analyzing the performance of every single visit and all actions not only allows us to pinpoint problems on individual pages, certain browsers or geographical regions. It also allows us to correlate problems in the application to business. Knowing how much revenue is lost due to declined performance gives application owners better arguments when discussing investments in the infrastructure or additional R&D resources. The following dashboard correlates Response Time with the number of Visitors by Continent and the generated Orders. Problems in the infrastructure that lead to performance problems of the application can then easily be correlated to lost revenue: