About the Author: Andreas Grabner

Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi

How to Automate Google Analytics Analysis

This is a two-part article. Also read Combining Analytics with Performance Management Data.

I use Google Analytics every day to look at Key Web Performance Metrics to answer questions like:

  • How many people visit our blog?
    I look at the Visitor Overview which shows me how many Visitors and Unique Visitors we have
  • Did they find us through a search, a different web site that linked to the blog or are they regular readers?
    The Traffic Sources report analyzes the Referer headers and lists all traffic sources (e.g. links from dzone.com, theserverside.com, …) and all keywords that have been used in a search
  • Which posts keep users on the page and encourage them to read others?
    The Content View shows me the top pages on my site. I can then analyze Bounce Rate and see which posts have a lower bounce rate than others
  • Which posts drive traffic to other sites such as our Community Portal or the dynaTrace web site?
    Here I again use the Traffic Sources View and see how many people got to the Community Portal or dynaTrace Site from blog.dynatrace.com
  • From which regions of the world are our users visiting and what are the main topics that are interesting for them?
    The Map Overlay gives me a map of the world, highlighting which regions, countries or cities requests come from

Here is an illustration of one of the Google Analytics Reports (Traffic Sources) that I look at regularly:

Tells me from which sources our users came and which keywords they used in the month of June

Clicking through all these reports is great. They are visually appealing and let me answer all of the questions listed above. But we can do much better by automating these analysis steps and by combining this data with other data sources, such as infrastructure or application performance data, in order to answer many more questions, such as: Why is my bounce rate going up?

Why do I want to automate the data analysis?

Google Analytics helps me answer many questions. I only have one problem with it: I don’t want to launch my browser multiple times a day, log on to Analytics and then drill into multiple reports to get these answers. I want to look at a single report that has the answers to the questions I raised in the intro paragraph and avoid all the clicking.

Why do I need the Analytics Data outside of the Google Portal?

We run dynaTrace on the production servers that host our websites and the community portal. We monitor these sites and have a pretty good understanding of the performance of these applications as well as the underlying infrastructure. We display this data in dashboards that tell me right away if we have a problem in the application or infrastructure. What’s missing is Google Analytics data displayed and correlated with the dynaTrace data in the same dashboard or report. Why is this data important? Because I want to explain end-user behavior with Application Performance Management data.

This helps me to answer questions like:

  • Is the high Bounce Rate on the Community Portal a result of bad application performance?
  • Is the number of Pages/Visit low because the first page is so slow that it keeps people from clicking other pages?
  • How many requests actually make it to our application server vs. how many are served by cache proxies or CDNs?
  • What is the impact of New vs. Revisiting users on the Application Server?

How to work with Google Analytics and the Analytics API

The answer to both of my questions is the Google Analytics API. From here on I will discuss all the steps it takes to set up your site with Google Analytics and to query the data at an even finer granularity than what the Google Analytics Portal gives you.

Step 1: Getting Google Analytics on your Web Site

This is of course mandatory. Getting Google Analytics on your page is easy. All you need to do is sign up for a Google Account and add a piece of JavaScript to the pages that you want to monitor. If you did it right, you can log in to your Google Analytics Account and explore all the websites you are monitoring (you can monitor multiple sites with the same account):

Google Analytics Overview with list of all monitored Web Sites

Step 2: Overview of Google Analytics API

Google has great developer documentation on how to get started with the Google Analytics API. Check out the Data Export API for Java, which also explains how to set up your Eclipse environment, which libraries you need and how to query data. It is really straightforward. There are two things you need to know: your Google Account information (username and password) and the specific website you want to query data for. All you then need to do is figure out which dimensions and metrics you want to query; see the Dimension and Metrics Reference.

Step 3: Connect to Google Analytics and query some data

The AnalyticsService class is your door to the analytics data. You have to pass it an application name that identifies your local app and then set the user credentials:

AnalyticsService analyticsService = new AnalyticsService("dynaTrace-GAMonitor-v10");
analyticsService.setUserCredentials(USERNAME, PASSWORD);

The next task is to figure out which data table to query. Google organizes every registered web site in tables. So the first thing to do after connecting is to query the account information (which websites you have access to with your Google account) and figure out which site you want to query data from.

// Construct and execute the query to retrieve a max of 50 registered accounts
URL queryUrl = new URL(
  "https://www.google.com/analytics/feeds/accounts/default?max-results=50");
AccountFeed accountFeed = analyticsService.getFeed(queryUrl, AccountFeed.class);

// Now let's find the account that we identify by its name and get the internal table_id
String table_id = null; // stays null if no registered site matches WEBSITE
List<AccountEntry> entries = accountFeed.getEntries();
for (AccountEntry entry : entries) {
  if (entry.getTitle().getPlainText().equalsIgnoreCase(WEBSITE)) {
    table_id = entry.getTableId().getValue();
    break; // stop at the first matching site
  }
}

Now that we have the table_id (it is a string value) we can go ahead and query some data. Here is some code that shows how to query PageViews, Bounces, Visits and New Visits grouped by Date and Hour for the current day:

// Create a query using the DataQuery object.
DataQuery query = new DataQuery(new URL("https://www.google.com/analytics/feeds/data"));
// Set the current date, dimensions, metrics and table_id
Calendar cal = Calendar.getInstance();
query.setStartDate(String.format("%1$tY-%1$tm-%1$td", cal));
query.setEndDate(String.format("%1$tY-%1$tm-%1$td", cal));
query.setMetrics("ga:pageviews,ga:bounces,ga:visits,ga:newVisits");
query.setDimensions("ga:date,ga:hour");
query.setIds(table_id);
// Make a request to the service and get the DataEntries
DataFeed dataFeed = analyticsService.getFeed(query.getUrl(), DataFeed.class);
List<DataEntry> entries = dataFeed.getEntries();

The result is a list of DataEntries which in this case should contain 24 entries, one for each hour of today’s date. Entries for hours in the future will all contain zero values, but those up to the current hour will contain valid data. Here are some examples of how to read the values in these entries:

for(DataEntry entry : entries) {
  // values (dimensions and metrics) can be retrieved with the valueOf methods
  String hour = entry.stringValueOf("ga:hour");
  long pageViews = entry.longValueOf("ga:pageviews");
  // or by getting the actual Dimension or Metric Object
  hour = entry.getDimension("ga:hour").getValue();
  pageViews = entry.getMetric("ga:pageviews").longValue();
}
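
Once the raw counts have been read, ratios such as the bounce rate and Pages/Visit mentioned earlier can be derived locally. This is only a minimal sketch: it assumes the bounce rate is defined as bounces divided by visits, and the class and method names are hypothetical and not part of the Google library.

```java
/** Hypothetical helper: derives ratios from raw counts read out of a DataEntry. */
final class DerivedMetrics {
    private DerivedMetrics() {}

    /** Bounce rate as a fraction between 0 and 1; returns 0 when there were no visits. */
    static double bounceRate(long bounces, long visits) {
        return visits == 0 ? 0.0 : (double) bounces / visits;
    }

    /** Average number of pages viewed per visit; returns 0 when there were no visits. */
    static double pagesPerVisit(long pageViews, long visits) {
        return visits == 0 ? 0.0 : (double) pageViews / visits;
    }
}
```

The zero checks matter here because entries for hours that have not happened yet contain only zero values.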

Step 4: Query data at a finer granularity than you get in the Google Portal

I’ve noticed that Google returns data that came in only recently. If you log on to the Analytics Portal, the time range by default shows the last month (excluding today). If you change the time frame you can get the values for today, which gives you data that is “almost live”. By “almost live” I mean that we get data from today, but I can’t say for sure how current it is. The Analytics API provides the same feature: I can query data for today and get whatever has been collected so far. If I query the data at a regular interval, let’s say every 10 minutes, I can watch how the data comes in over the course of the day. If we then calculate the delta between two consecutive samples, we get analytics data at a much finer granularity than is displayed in the Google Portal. Again, there is no guarantee how often Google actually updates the data, but in my experience a 10-minute interval consistently yields new delta data.
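
The polling idea can be sketched with the standard ScheduledExecutorService. This is an illustration under assumptions, not the code of my plugin: DeltaPoller, totalSource and deltaSink are hypothetical names, and totalSource stands in for whatever code queries the API and returns the total collected so far.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

/** Polls a running total at a fixed interval and reports the change since the previous poll. */
final class DeltaPoller {
    private final LongSupplier totalSource; // hypothetical: returns the total collected so far
    private final LongConsumer deltaSink;   // hypothetical: receives each computed delta
    private long lastTotal;
    private boolean hasBaseline;

    DeltaPoller(LongSupplier totalSource, LongConsumer deltaSink) {
        this.totalSource = totalSource;
        this.deltaSink = deltaSink;
    }

    /** Takes one sample. The first call only records a baseline and reports nothing. */
    synchronized void sampleOnce() {
        long total = totalSource.getAsLong();
        if (hasBaseline) {
            deltaSink.accept(total - lastTotal);
        }
        lastTotal = total;
        hasBaseline = true;
    }

    /** Runs sampleOnce() immediately and then every intervalMinutes; the caller shuts the scheduler down. */
    ScheduledExecutorService start(long intervalMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::sampleOnce, 0, intervalMinutes, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

Calling new DeltaPoller(source, sink).start(10) would sample every 10 minutes. Note that a total covering only a single day resets when the day rolls over, so a real monitor has to handle the date boundary.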

Tips and Tricks on getting the Delta

I noticed two things:

  1. Even though my local time is still “today”, I am sometimes able to query data for “tomorrow”. How can that be explained? I assume that Google takes the user’s location into account, meaning that when it is already “tomorrow” in India, requests from that region may already be attributed to the next day.
  2. When I use hourly granularity, I have seen multiple hour entries (let’s say for 2, 3 and 4 PM) change between two measurement points. This could again be explained by the assumption from the previous point: Google takes the end user’s location into account.

In order to not miss out on any data I did the following. I set the date range to span from “yesterday” until “tomorrow”:

// set the date range to include yesterday, today and tomorrow
Calendar cal = Calendar.getInstance();
cal.add(Calendar.DAY_OF_MONTH, -1); // yesterday
query.setStartDate(String.format("%1$tY-%1$tm-%1$td", cal));
cal.add(Calendar.DAY_OF_MONTH, 2); // tomorrow
query.setEndDate(String.format("%1$tY-%1$tm-%1$td", cal));

When I have two samples (either at HOUR granularity or only by DATE) I always calculate the delta of each entry and then sum up the deltas. This gives me the number of requests that came in between my two samples, taking into account that some requests may have been attributed to “yesterday” (when somebody in a time zone behind mine just requested a page) and to “tomorrow” (when somebody in a time zone ahead of mine just requested a page).
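
The summing of per-entry deltas can be sketched as follows. It assumes each sample has already been read into a map from a "date hour" key to a count; the class name, the method name and the key format are illustrative assumptions, not taken from my plugin.

```java
import java.util.Map;

/** Sums the per-entry changes between two samples of the same date range. */
final class SampleDelta {
    private SampleDelta() {}

    /**
     * @param previous counts from the earlier sample, keyed e.g. by "20100615 14"
     * @param current  counts from the later sample, using the same key scheme
     * @return the sum of (current - previous) over all entries of the later sample
     */
    static long sumOfDeltas(Map<String, Long> previous, Map<String, Long> current) {
        long sum = 0;
        for (Map.Entry<String, Long> entry : current.entrySet()) {
            // An entry that is missing from the earlier sample counts as zero there.
            long before = previous.getOrDefault(entry.getKey(), 0L);
            sum += entry.getValue() - before;
        }
        return sum;
    }
}
```

Entries that exist only in the earlier sample are ignored. That is the case for the oldest day once the yesterday-to-tomorrow window moves forward, and those counts were already reported in earlier deltas.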

Step 5: Displaying the data

You can write the measured data to any data store you like and then use whatever tool you prefer to visualize the time-series data. It could be that you write it out to a CSV or XML file and then use Excel. It could be a database that you query with your own internal visualization tools. In my case I use dynaTrace. I packaged the code I wrote into a dynaTrace Monitor Plugin. It is an OSGi plugin that has setup, execute and teardown methods. dynaTrace takes care of initializing my plugin and executing it at a scheduled interval.
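
For the CSV route, one line per measured value is enough for Excel or most charting tools. The following is a minimal sketch with a hypothetical class name and column layout; it assumes that site and metric names contain no commas, so no quoting is done.

```java
import java.time.Instant;

/** Hypothetical helper that formats one measured value as a CSV line. */
final class CsvSampleWriter {
    /** Column names matching the order used by toCsvLine. */
    static final String HEADER = "timestamp,site,metric,value";

    private CsvSampleWriter() {}

    /** Formats a sample as "timestamp,site,metric,value" with an ISO-8601 UTC timestamp. */
    static String toCsvLine(Instant timestamp, String site, String metric, long value) {
        return String.join(",", timestamp.toString(), site, metric, Long.toString(value));
    }
}
```

Each call produces one row that can be appended to a file whose first line is HEADER.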

dynaTrace passes an object to the plugin to read configuration data (such as the username, password, website, whether I want to return the actual last value retrieved or the delta of the last two samples, …). The same object is used to return the monitored data that gets stored in the dynaTrace Repository.

The easiest way to visualize the data is by using a Chart. I’ve set up monitors to query the Google Analytics data for blog.dynatrace.com and community.dynatrace.com at a 10-minute interval. I’ve actually set up two monitors for each website: one returns the current value of the current hour (the same as if you go into the Google Analytics Portal, group by hour and look at the last value), and the other returns the delta between two samples (this tells me, for example, how many users visited in the last 10 minutes). The following illustration shows a screenshot of a dynaTrace Dashboard with a Chart where I display Page Views, Visitors and New Visitors in an overlay chart type. This tells me how many pages were viewed by how many visitors and how many of these were actually new visitors:

Many more PageViews/user on the Community Portal. The Delta View gives us a better understanding of user distribution

Let me explain a bit more what we see in this image. The top two graphs show the data that I capture in “DELTA” mode, meaning that every 10 minutes I only report the change since the previous sample. The bottom two show the actual total value of the last data sample for the last hour. In the bottom two we can see that a new hour always starts at around 30 minutes past the hour in Google Analytics: the value drops because I am returning the DataEntry values from the last valid hourly entry. The top two follow a similar pattern, but not exactly the same one. I assume that Google has internal intervals at which data is pushed out to the API. So it makes sense that even if we only look at deltas, the data that we collect is still somewhat affected by the way Google publishes the data.

Overall I am happier with the Delta view. My charts now allow me to zoom out and display the data on an hourly, daily or weekly basis, which makes it easy to spot trends such as high user activity on certain days (Monday morning when people get into the office, …).
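
Zooming out from 10-minute deltas to hourly values amounts to summing the deltas that fall into the same hour. dynaTrace does this for me in the chart, so the following is only a sketch of the idea for readers using their own storage; the class name is hypothetical and the buckets are aligned to UTC hours.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical rollup of fine-grained delta samples into hourly totals. */
final class DeltaRollup {
    private DeltaRollup() {}

    /**
     * @param deltas delta values keyed by the time at which each sample was taken
     * @return the sum of deltas per UTC hour, sorted by the start of the hour
     */
    static Map<Instant, Long> byHour(Map<Instant, Long> deltas) {
        Map<Instant, Long> hourly = new TreeMap<>();
        for (Map.Entry<Instant, Long> sample : deltas.entrySet()) {
            // Drop minutes and seconds to get the start of the hour the sample belongs to.
            Instant bucket = sample.getKey().truncatedTo(ChronoUnit.HOURS);
            hourly.merge(bucket, sample.getValue(), Long::sum);
        }
        return hourly;
    }
}
```

Truncating to ChronoUnit.DAYS instead would give daily totals in the same way.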

Step 6: Correlating with Application Performance Management Data

Now we have all this rich end-user analytics data in our own system (or in your own database, CSV or XML files). It’s time to compare and correlate it with other performance-relevant data from our other monitoring systems, such as our infrastructure monitors or application monitors. I wrote a separate article on Combining Analytics with Performance Management Data which discusses how to answer questions like:

  • What is the impact of an application problem on visitors, page views or bounce rates?
    If I have a known problem in my application, e.g.: certain pages are slow or don’t respond at all – how does that affect other users?
  • Is an application problem responsible for higher bounce rates?
    If I see bounce rate going up – is it because the content is not good or is it because people leave the site because certain landing pages are slow?
  • Is an application/infrastructure problem responsible for fewer page views per user?
    Is our application or network infrastructure not responding fast enough, causing users to leave the site sooner?
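
One simple first check for questions like these is the correlation between two time series sampled at the same times, for example bounce rate and average response time per interval. The sketch below computes the Pearson correlation coefficient. It is a generic statistic that I am using for illustration, not the method of the companion article, and the class name is hypothetical.

```java
/** Hypothetical helper: Pearson correlation between two equally long series. */
final class Correlation {
    private Correlation() {}

    /**
     * @return a value between -1 and 1; NaN if one of the series is constant
     * @throws IllegalArgumentException if the lengths differ or fewer than 2 points are given
     */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have the same length of at least 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

A value close to 1 for bounce rate versus response time would suggest that slow pages and bounces tend to occur together. It does not prove that one causes the other.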

What do you do with your Analytics data?

I really like Google Analytics, and it is great to have an API that is flexible enough to let me query the data for my own needs. I am interested in how you use your analytics data. Everybody has their own approach to reading data like this.

Comments

  1. Steve Thair says:

    And if you use the free version of Pion from Atomic Labs (www.atomiclabs.com) you don’t even have to tag your pages… you can just sniff the HTTP clickstream off the wire and fire the data over into Google Analytics, just as if you’d tagged the pages :-)

  2. Hi Steve. Thanks for the link – looks interesting.
    It would still however be necessary to place JS on the page as you don’t capture all traffic with sniffing, e.g.: traffic served by a CDN – correct?

  3. HI Andreas,

    No, you can only get the traffic within the hosted domain(s) that are accessible via the spanned port/network tap.

    That said, from a pure “web analytics” point of view i.e. visits, visitors, page impressions etc then you really only care about the “page” itself not necessarily the objects.

    From a “performance analytics” view you can still get the data on how fast the page was served (across different servers in your webfarm), plus the transfer times for all of the page objects hosted locally.

    It’s an interesting performance trade-off given that people often have multiple tags on their page (e.g. Google Analytics and one of Webtrends/Omniture/Unica etc). If you can save the page weight (and in some cases JavaScript blocking) by using a network sniffing approach AND save the hassle of rolling out the JS page tags but still get the same analytics data in the same vendor reporting portal then it’s worth considering!

    cheers,
    Steve

  4. Hey mate!
    Thanks for sharing it. Nice post, really I loved it and some of the topics on your blogs are much interesting. Keep updating it.

  5. This is epic, up until now we have been using a pdf imported from the automated reports you can get ganalytics to send you out – weekly in our case – The system was written more than a year ago but we are now rewriting to directly access using some of this code – Thanks for starting us in the right direction!

  6. Hi Andreas,

    This is a really nice blog/article. I found Google Analytics to be intrusive, in the sense that we need to add JavaScript to the pages we want to monitor. Is there any tool/API which can give similar data without having to modify the code?

    Vijay

  7. Vijay, Google Analytics basically adds an image request to the DOM which sends the analytics information to Google via parameter that gets passed to it. Check out http://code.google.com/apis/analytics/docs/concepts/gaConceptsOverview.html#howAnalyticsGetsData which explains how this image url gets generated by the javascript that you embed

    As the analytics data itself is really sent through this image, it is possible that you simply add a similar image URL to your page without using JavaScript. I haven’t done it personally but it should work, as long as you know which parameters you need to add to the image request. There is another blog that talks about this: http://www.vdgraaf.info/google-analytics-without-javascript.html

  8. Great article – thanks. I want to capture an image of the Google Analytics dashboard report monthly. Is there any way to do that without having to log in to Analytics and PDF the report?

  9. Oh, right – thanks!
