By Alois Reitbauer

Week 2 – The many faces of end-user experience monitoring

Inspired by a comment by Wim Leers on one of our other posts on web performance, I decided to switch plans and write this week about end-user experience monitoring. If you google "end-user experience monitoring" you will find a number of different approaches.

End-user experience – as I think we all agree – is the performance perceived by the actual user at a specific point in time. That sounds simple, but in reality it is not that easy to ascertain. End-user monitoring is also often referred to as the last mile of monitoring (which somehow reminds me of the movie with Tom Hanks). This is mainly because it is genuinely complex and difficult to determine: the closer you get to your users, the harder it gets. Additionally, you suddenly have to monitor thousands and thousands of users. This also means that you can only send a limited amount of data, as otherwise you would not be able to process the resulting volume of information. Furthermore, this information has to be sent over the wire, and sending lots of performance data from your end users won't make them happy either.

Data Collection

Before looking at the different approaches, let’s first discuss what information we really want to track.

Instead of looking at the data first, let's discuss which questions we want end-user monitoring to answer; these questions will lead us to the required data anyway. First, we need monitoring information about the general user experience:

  • How long did it take to load the page?
  • Were there any problems on the page?
  • How long did certain actions take on a page?

Secondly, in case of problems, we want additional information that helps us understand and diagnose them, such as:

  • What was the reason for a slow download – the network or the server?
  • What did the user do that caused a problem?
  • What browser, operating system and connection were used?

The first question can be answered by tracking resource load times. Here we want to measure the following metrics:

  • Time to First Byte – When was the first byte of the web page received?
  • Time to First Visual – When was the page content visible for the first time? This is the first time a drawing operation occurred.
  • Time to onLoad – When was the onLoad event of the page executed?
  • Time to Page Ready – When was all initial content ready so that JavaScript execution could safely start?

The network times should be split up into wait time (the delay until a browser connection is available), DNS lookup time, transfer time and server time. This is especially useful in diagnosing network-related problems. Furthermore, we want to see the HTTP headers, in order to find improper caching configurations or problems such as HTTPS-related connection issues.
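
To make the header check concrete, here is a minimal sketch that re-requests a same-origin resource via XHR and flags missing caching directives. The resource URL is an illustrative assumption, not a reference to any particular product.

```javascript
// Minimal sketch: fetch a same-origin resource and inspect its caching headers.
// The URL "/static/app.js" is an illustrative assumption.
var xhr = new XMLHttpRequest();
xhr.open("GET", "/static/app.js", true);
xhr.onreadystatechange = function () {
  if (xhr.readyState === 4) {
    var cacheControl = xhr.getResponseHeader("Cache-Control");
    var expires = xhr.getResponseHeader("Expires");
    if (!cacheControl && !expires) {
      // No caching directives at all - a candidate for improper cache configuration.
      window.console && console.log("No cache headers on /static/app.js");
    }
  }
};
xhr.send();
```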

So far, this information is very much what we need for any type of web page. For Web 2.0 applications we need much more information. We are additionally interested in the time it takes to execute end-user events like the click of a button, round-trip times for AJAX requests, rendering times and JavaScript errors. When talking about AJAX requests I mean not only XHR requests but also requests made using dynamically generated script blocks.
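
As a rough illustration of what such instrumentation can look like, here is a minimal sketch that wraps XMLHttpRequest to time round trips. The reporting helper is hypothetical, and dynamically generated script blocks would need separate hooks.

```javascript
// Minimal sketch of timing AJAX round trips by wrapping XMLHttpRequest.
// reportTiming is a hypothetical placeholder, not an API from this post.
(function () {
  var originalOpen = XMLHttpRequest.prototype.open;
  var originalSend = XMLHttpRequest.prototype.send;

  XMLHttpRequest.prototype.open = function (method, url, async) {
    this._monitorUrl = url;            // remember the target for later reporting
    return originalOpen.apply(this, arguments);
  };

  XMLHttpRequest.prototype.send = function () {
    var xhr = this;
    var start = new Date().getTime();
    var userHandler = xhr.onreadystatechange;   // assumes the handler is set before send()
    xhr.onreadystatechange = function () {
      if (xhr.readyState === 4) {
        // round-trip time from send() to the final readystatechange
        reportTiming(xhr._monitorUrl, new Date().getTime() - start);
      }
      if (userHandler) { return userHandler.apply(xhr, arguments); }
    };
    return originalSend.apply(this, arguments);
  };

  function reportTiming(url, millis) {
    // A real monitor would queue this and piggyback it on a beacon (see below).
    window.console && console.log("XHR " + url + " took " + millis + "ms");
  }
})();
```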

Having an understanding of the data we want to collect, let’s now look at the various technologies which are used to collect end-user experience metrics.

Synthetic Transactions

Synthetic transactions are based on the concept of emulating real users. They use pre-recorded scripts defining end-user behavior that are executed at defined intervals. Providers of these tools execute the scripts from up to a hundred or more locations worldwide. We at dynaTrace provide a script-based monitoring plug-in for the same purpose. While these tools do not really monitor the perception of real end users, they constantly monitor the performance of specific pre-defined transactions. Transactions, however, have to be "read-only", as they would otherwise kick off, for example, real purchase processes. This limits the usage to a certain subset of your business-critical transactions. If you create a kind of dummy user in your application whose transactions are filtered out later in the process, you might be able to monitor more transactions.

The key point here is that synthetic transaction monitoring is not about the performance perceived by real users of your application. Rather, it acts as a reference measurement that helps show performance degradation, detect networking problems or provide notifications in case of errors.

There are also numerous SaaS providers offering this service. Technically, there are two different approaches to generating requests: some solutions replay recorded HTTP traffic patterns, while others drive real browser instances. The advantage of the second approach is clearly that it does not need to emulate browser behavior. Especially for Web 2.0 applications, which rely on heavy usage of JavaScript and AJAX communication, only a browser-based approach is feasible.
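
For illustration, here is a minimal sketch of the first, traffic-replay style approach as a Node.js script. The host, path and interval are assumptions, and real products obviously do far more (scripted multi-step transactions, many locations, alerting).

```javascript
// Minimal sketch of a synthetic HTTP check (Node.js).
var http = require("http");

function checkUrl(host, path) {
  var start = Date.now();
  var firstByte = null;
  var req = http.get({ host: host, path: path }, function (res) {
    res.on("data", function () {
      if (firstByte === null) firstByte = Date.now() - start; // time to first byte
    });
    res.on("end", function () {
      console.log(host + path + " status=" + res.statusCode +
        " firstByte=" + firstByte + "ms total=" + (Date.now() - start) + "ms");
    });
  });
  req.on("error", function (err) {
    console.log("Error checking " + host + path + ": " + err.message);
  });
}

// Execute the pre-recorded "script" at a fixed interval, e.g. every 5 minutes.
setInterval(function () { checkUrl("www.example.com", "/"); }, 5 * 60 * 1000);
```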

Overview of the Synthetic Monitoring Architecture

Network Sniffing

Network sniffing is the second category of end-user monitoring tools. Unlike synthetic transactions, these tools rely on real end-user traffic. Special appliances are used which monitor all network traffic sent between clients and web servers. These appliances, however, sit within your own network, meaning they are farther away from your end users. On the other hand, you can use them to monitor real end-user traffic. They analyze end-user traffic in real time, verify response times against SLAs and also check content for correctness. These monitoring solutions also provide the actual HTML seen by your end users and enable you to follow their click paths.

The disadvantage of this approach is that no browser-level metrics are collected. Problems caused by massive DOM access, excessive JavaScript execution or rendering issues cannot be found at this level.

Overview of the Network Sniffing Architecture

Instrumentation at Browser Level

The closest place to the end user is the browser. Therefore, collecting metrics at the browser level is the most accurate way to monitor end-user perceived performance. Monitoring at the browser level is achieved by injecting JavaScript monitoring code into the page. The easiest way to do this is header and footer injection: small portions of JavaScript code are injected at the beginning and at the end of a page. This code collects data for certain browser timings like first byte received, page completed or onLoad. Additional instrumentation can be used to collect more metrics by injecting further JavaScript measurement code. An example is the Google logging API for Speed Tracer, which uses the console logging API of WebKit.
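
A minimal sketch of header and footer injection could look like the following. The variable names are illustrative, and the header timestamp only approximates the first-byte time, since no script can run earlier than that.

```html
<!-- Header injection: take a reference timestamp as early as possible -->
<script type="text/javascript">
  var __monitorStart = new Date().getTime();   // variable name is illustrative
</script>

<!-- ... original page content ... -->

<!-- Footer injection: measure relative to the reference timestamp -->
<script type="text/javascript">
  var timeToFooter = new Date().getTime() - __monitorStart; // roughly "page parsed"
  var previousOnLoad = window.onload;
  window.onload = function () {
    if (previousOnLoad) { previousOnLoad(); }
    var timeToOnLoad = new Date().getTime() - __monitorStart; // time to onLoad
    // both values would now be reported back via a beacon (see below)
  };
</script>
```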

How to efficiently get this information into the JavaScript is still an open issue, however. Ideally this is done automatically while the page is delivered. This means every web response must be modified, either in real time or by some pre-processing of the JavaScript resources. The alternative approach is to add the logging information directly into the code; this kind of source-level instrumentation requires developers to add these calls themselves.
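
As one possible illustration of the real-time variant, here is a sketch of a tiny Node.js reverse proxy that injects a monitoring script into HTML responses. Host, port and script URL are assumptions, and a production setup would more likely use a web server filter or module.

```javascript
// Sketch of on-the-fly injection in a minimal reverse proxy (Node.js).
// (GET-only sketch; request bodies are not forwarded.)
var http = require("http");

http.createServer(function (clientReq, clientRes) {
  var options = {
    host: "backend.internal",      // assumed origin server
    port: 8080,
    path: clientReq.url,
    method: clientReq.method,
    headers: clientReq.headers
  };
  http.request(options, function (backendRes) {
    var type = backendRes.headers["content-type"] || "";
    if (type.indexOf("text/html") !== 0) {
      // pass everything that is not HTML through untouched
      clientRes.writeHead(backendRes.statusCode, backendRes.headers);
      backendRes.pipe(clientRes);
      return;
    }
    var body = "";
    backendRes.setEncoding("utf8");
    backendRes.on("data", function (chunk) { body += chunk; });
    backendRes.on("end", function () {
      // inject the monitoring script right before the closing body tag
      body = body.replace("</body>", "<script src='/monitor.js'></script></body>");
      delete backendRes.headers["content-length"];   // the length has changed
      clientRes.writeHead(backendRes.statusCode, backendRes.headers);
      clientRes.end(body);
    });
  }).end();
}).listen(80);
```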

The challenge is then to deliver these results back to a centralized monitoring server. Episodes by Steve Souders suggests the use of beacons – small web requests with piggybacked monitoring information. Alternatively, XHR requests can be used to communicate with a monitoring server. Both approaches, however, use browser connections which are also required by the application itself. Communication should therefore be kept to a minimum, with small payloads only.
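
A beacon can be as simple as a dynamically created image whose URL carries the measurements as query parameters. In this sketch, the /beacon endpoint and the parameter names are assumptions.

```javascript
// Minimal beacon sketch: piggyback measurements on a tiny image request.
function sendBeacon(metrics) {
  var pairs = [];
  for (var key in metrics) {
    if (metrics.hasOwnProperty(key)) {
      pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(metrics[key]));
    }
  }
  // The browser fires the request as a side effect of loading the image;
  // the response body is irrelevant and typically a 1x1 pixel or HTTP 204.
  new Image().src = "/beacon?" + pairs.join("&");
}

// Example usage: report the onLoad time collected by the injected code.
sendBeacon({ page: location.pathname, onload: 1234 });
```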

On the server side, the challenge is to process this potentially huge amount of data. Thousands and thousands of clients sending small packets still add up to huge amounts of data. The specific requirements of Web 2.0 applications impose another challenge here: plain page timing metrics might not be enough for Google Mail and other single-page, highly interactive applications. Monitoring the response times of XHR requests is essential to understand the communication behavior of such an application. The ZK framework's performance monitor, for example, offers such capabilities.
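
One common way to keep the server side cheap is to acknowledge beacons immediately and aggregate them in batches. The following Node.js sketch (port and interval are arbitrary choices) illustrates the idea.

```javascript
// Sketch of a beacon endpoint that answers immediately and aggregates asynchronously.
var http = require("http");
var url = require("url");
var buffer = [];

http.createServer(function (req, res) {
  // Answer with 204 No Content right away so the client connection is freed quickly.
  res.writeHead(204);
  res.end();
  buffer.push(url.parse(req.url, true).query);   // queue for later processing
}).listen(8125);

// Flush the buffer periodically instead of processing each beacon individually.
setInterval(function () {
  if (buffer.length > 0) {
    console.log("aggregating " + buffer.length + " beacons");
    buffer = [];  // a real system would write these to a store or roll up statistics
  }
}, 10000);
```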

While we get many more metrics at this level, we still miss important details. We will not get any rendering information, as this is information we cannot query at the JavaScript level. Tracing network requests also represents a challenge, as we would have to inject code everywhere resources get loaded. Analyzing whether a resource was loaded from the cache is not possible either, as such information is not available at this level.

Browser Plug-Ins and Extensions

If we want even more information, there is no way around using a browser plug-in or extension. There is Firebug for Firefox, Speed Tracer for Chrome and dynaTrace AJAX Edition for Internet Explorer. The major disadvantage from an end-user monitoring point of view is that the user has to install the plug-in first. While this is no issue in development or test environments, it is an issue in production. Apart from that, these tools provide a lot of information – by far too much for end-user monitoring at a large scale. The major advantage, however, is that you get the most detail and can overcome the limitations of JavaScript injection in browsers.

Currently these tools are mostly used in test and development environments. In some cases they can also be used for troubleshooting end-user problems – if the user agrees to install the plug-in. They do, however, have great value for optimizing end-user experience up front. The detailed metrics of these tools allow optimizing end-user performance for specific browsers. Below you can see a screenshot of a dynaTrace AJAX Edition timeline view showing rendering, JavaScript and download behavior of the browser as well as a detailed trace of JavaScript execution.

Browser Diagnosis Showing Detailed Metrics on Rendering, Download, etc. and a Detailed JavaScript Trace

The Role of Server-Side Monitoring

Although this post is about end-user experience, I want to say a few words about server-side monitoring as well. One could argue that this approach is the furthest away from the end user. This is true; however, application performance monitoring on the server side also provides insight into end-user performance. Very often, server-side monitoring is combined with end-user monitoring. The image below shows an example of an integrated monitoring view combining active monitoring with server-side metrics. Problems that have their root cause on the server side can only be efficiently diagnosed using server-side monitoring and diagnosis data.

Integrated End-User and Client-Side Monitoring using Synthetic Transactions

What will the future bring us?

This is an interesting and difficult question to answer. It is best answered by looking at the challenges we are facing. The first challenge is the tool challenge: currently it is not possible to use a single tool to get all the information you need, because no single technology can collect all of it. JavaScript injection is easy to roll out; however, it has limitations regarding the data it can provide. Browser plug-ins enable the deepest insight into browser behavior but require explicit deployment – not to mention the roll-out challenge. Network sniffers – while being able to capture and correlate the total traffic of a user – have no insight into the browser. Synthetic transactions basically serve a slightly different purpose.

The future will first be about integrated tool chains that combine information from many sources in a seamless way. Monitoring at the browser level will be done using more and more sophisticated approaches. Hopefully browser vendors will open up the functionality they currently provide only via plug-ins and also allow the development of standardized extensions based on a unified API. A first draft of such a unified interface is the Web Timing Working Draft.
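
To give an impression of what such a unified API would enable, here is a hedged sketch against the draft interface. The property names follow the early working draft and may differ from what browsers eventually ship, so feature-test before relying on any of this.

```javascript
// Sketch against the (draft) Web Timing interface; read after onLoad has fired.
window.onload = function () {
  var timing = window.performance && window.performance.timing;
  if (!timing) { return; }   // browser does not expose the draft API
  var navStart = timing.navigationStart;
  var metrics = {
    firstByte: timing.responseStart - navStart,               // time to first byte
    domReady: timing.domContentLoadedEventStart - navStart,   // roughly "page ready"
    onLoad: timing.loadEventStart - navStart                  // time to onLoad
  };
  // These values come from the browser itself, so no manual header/footer
  // injection is required to obtain them.
  window.console && console.log(metrics);
};
```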

This post is part of our 2010 Application Performance Almanac.


Comments

  1. Thanks for the information. It's very useful for me.

  2. Thanks for the mention!

    This is the most comprehensive post regarding “end-user experience monitoring” (as you call it) yet, to my knowledge. Awesome!

    Also, I agree with all points in this article. You even managed to point me to some things I didn’t know of yet:

    - the network-sniffing style performance monitors such as Coradiant
    - the Speed Tracer Logging API

    I’ve only got a minor addition to make. Jiffy is a system that’s similar to Episodes, but predates Episodes by some time. It was developed by whitepages.com for their specific performance monitoring goals. One of the goals was to be able to monitor every single page load. So Jiffy did just that. It was even able to analyze the log results in near real-time. The most important component was how the log was kept: it simply used the Apache web server’s logging capabilities. They wrote a script to parse and process those log files, in Perl if I remember correctly. They didn’t even have to apply fancy tricks like sending the logging to multiple servers. And most importantly: they generate millions of page views per month.
    That strongly suggests that true real-time performance monitoring of actual end-users, and every actual end-user (!), is perfectly doable.

    I do have a general request though: together with the people active in this area of performance, I'd like to define a definitive set of terms. The difference in naming strongly decreases general understanding, discoverability of articles and, perhaps not least importantly, marketability. I.e.:
    - Steve Souders calls this “front-end performance”
    - I called it “page loading performance” because it’s more accurate than “front-end performance”. It also distinguishes more clearly among page loading versus page rendering. The monitoring I called “page loading performance monitoring” as the general term and “real-world page loading performance monitoring” when the monitoring runs from a real-world perspective, i.e., in the end-user’s browser.
    - You call “end-user experience monitoring” what I called “real-world page loading performance monitoring”.

    I think the entirety of all web site performance optimizations should be named “Web Performance Optimization” (WPO) — as Steve Souders already called it.
    An initial proposal toward a full-fledged taxonomy would then be:

    Web Performance Optimization

    - page loading performance (CSS/JS + loading of resources; page loading therefore always consists of both client and server; putting client-side performance separately makes no sense since loading of resources is always necessary)
      - synthetic transactions
      - organic/real-world transactions (i.e. monitoring these would equal what you called "end-user experience monitoring")
    - server-side performance (whatever generates output + HTTP server, possibly behind a reverse proxy and whatnot)
      - synthetic transactions
      - organic/real-world transactions

    This is most likely missing a lot of refinement, but roughly, it should be correct.

  3. Ben Rushlo says:

    Great information. I will say that if you have to choose one approach (most companies can't explore all in parallel), I would suggest synthetic browser-based measurements.

    I am biased (I work for Keynote Systems) but having done performance management for 10 years, using a synthetic transaction solves many of the variability/sample issues of other approaches.

    Performance management is best done in a controlled manner, and if you can guarantee that your sample is always the same (i.e. same hardware, software, locations, network etc.) then detecting even slight changes (that you can influence) is possible.

    The weakness of this approach is that you can't see every user and that some transactions cannot be done synthetically (think of applying for a credit card using a unique SSN). In those cases instrumentation of some sort (or a network sniffer – though with CDNs and other hybrid hosting I am not sure this approach is going to be valid for much longer) is the best approach.

    We have been working lately on combining our metrics from the browser with the Dynatrace Ajax Edition tool, with great success with customers. Now when there is a client/browser delay we can determine exactly what JS or web service call is causing the problem. The combination is very powerful. Adding a back-end view of application health is the logical next step. Doing this would provide the browser → network → application view which is critical.

    Ben Rushlo
    Director – Keynote Systems

  4. Aymen Touzi says:

    Very interesting post about techniques used for end-user experience monitoring.

    However, these tools may be dedicated to monitoring, assessing and optimizing performance on the browser side.

    Are there other tools (especially open-source/free ones) that offer the ability to continuously monitor end-user experience?
