How to Monitor and Analyze Performance of the Windows Azure Storage Service
Distributing stored data across different geographical regions is not uncommon. Regional laws might even force you to store your application data in a certain geographical region. Windows Azure PaaS therefore offers storage services (BLOB, Table and Queue) distributed across its data centers. Distributing data, however, raises the question of whether you also need to distribute the application that accesses this data in order to avoid performance problems. One of our partners was considering exactly that. To get some answers, we ran a test to analyze the performance impact of different data distribution scenarios. Let me share our results in this blog.
For this test I created five storage accounts: two in the US, one in Europe and two in Asia. The storage test application (i.e. the Worker Role) and the European Storage Account are both located in the region West Europe. I’ve disabled the geo-replication of one account in the US and one in Asia, to see whether or not this option impacts performance.
For the storage test application, I used the sample available on http://azurestoragesamples.codeplex.com (downloaded April 27), running it five times on one single Worker Role, with five different configurations, i.e. credentials to the storage accounts from above.
First, let’s look at the Transaction Flow which visualizes the flow of every single transaction that is executed in our distributed application. It is easy to see how the different Worker Roles access the different Storage Services. This visualization also makes it easy to get an overview of the application architecture and allows you to validate the blueprint architecture of your application:
How Performance Differs When Accessing Remote Locations
The five test application instances named StorageSampleDotNet all run on the same host and access five different storage accounts, using all three available storage types (BLOB, Table and Queue). At this point, we see that the remote storage accounts (in the US and Asia) are significantly slower than the local European one. That should not surprise us: given the distance between the data centers, some latency is to be expected. The bottom line is that it really matters to place storage accounts close to the computing instances in global deployments!
Let’s have a closer look at the following BLOB dashboard to get more information on storage throughput and response times:
We instantly recognize the following:
- Europe is by far the fastest, with an average response time of approximately 1.1 ms and a maximum of roughly 2.9 ms. The US follows, about 16 times slower, and Asia is roughly twice as slow as the US, i.e. about 32 times slower than Europe.
- Performance-wise, it makes no difference whether geo-replication is active or not: compare the leftmost column (geo-replication active) with the second column from the left (geo-replication inactive). The same holds for the US in the two rightmost columns.
- Response times are narrowly scattered, i.e. the slowest and fastest response times do not differ vastly. This is most visible in the third row of charts, but also in the second row, where we can see average, 95th percentile and maximum response time.
- As expected, the number of requests (the load that can be handled) depends on the response time: approximately 11 requests per minute for the European account, 3.3 for the US account, and only 1.6 for the Asian account.
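To put the BLOB figures from the list above side by side, here is a small Python sketch. It only restates the approximate numbers quoted above. The function and variable names are mine and purely illustrative, and the implied US and Asia averages are derived from the rounded "16 times" and "32 times" factors rather than read off the dashboard.

```python
# Approximate BLOB figures quoted in the list above.
EUROPE_AVG_MS = 1.1
SLOWDOWN_VS_EUROPE = {"Europe": 1, "US": 16, "Asia": 32}
REQUESTS_PER_MIN = {"Europe": 11.0, "US": 3.3, "Asia": 1.6}


def implied_avg_ms(region):
    """Average response time implied by the rounded slowdown factor."""
    return EUROPE_AVG_MS * SLOWDOWN_VS_EUROPE[region]


def throughput_drop(region):
    """How many times fewer requests per minute than the European account."""
    return REQUESTS_PER_MIN["Europe"] / REQUESTS_PER_MIN[region]


for region in ("Europe", "US", "Asia"):
    print(
        f"{region}: ~{implied_avg_ms(region):.1f} ms average, "
        f"throughput lower by a factor of {throughput_drop(region):.1f}"
    )
```

With these inputs, throughput falls by a factor of roughly 3.3 for the US and 6.9 for Asia, which is noticeably less than the 16-fold and 32-fold increase in BLOB response time.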
Performance Differs across Storage Types
If we extend the dashboard from above to monitor all three different storage types side-by-side we get something like the following dashboard:
For further analysis, we need to know more about the test application. As we know from above, we have five different processes, one for each storage account. Thus, we can expect that calls to separate regions do not affect each other. However, calls to different storage types within a region do affect each other, since they are executed in sequence: a slow response from, say, a BLOB service call also reduces the throughput of Queue and Table service calls. And that's exactly what we can draw from the dashboard above:
The first row of charts shows us that we have the same load behavior across all three storage types. The slow response time of Table and BLOB storage (second row) equally affects the throughput of all storage types, e.g. for Asia.
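The coupling described above can be sketched with a simple model. This is my own simplification, not the actual logic of the sample application: it assumes one loop that calls each storage type once per iteration, strictly in sequence, and the response times used are hypothetical.

```python
def iterations_per_minute(response_ms, overhead_ms=0.0):
    """Loop iterations per minute when every service is called once per
    iteration, one after the other (simplified model, see assumptions above)."""
    cycle_ms = sum(response_ms.values()) + overhead_ms
    return 60_000.0 / cycle_ms


# Hypothetical response times in milliseconds, for illustration only.
baseline = {"blob": 10.0, "table": 20.0, "queue": 1.0}
slow_blob = {"blob": 100.0, "table": 20.0, "queue": 1.0}

# Each service is called once per iteration, so this single number is the
# throughput of BLOB, Table and Queue calls alike.
print(f"baseline:  {iterations_per_minute(baseline):.0f} iterations/min")
print(f"slow BLOB: {iterations_per_minute(slow_blob):.0f} iterations/min")
```

In this model only the BLOB response time changes, yet the Table and Queue call rates drop by the same factor, which matches the identical load curves in the first row of charts.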
The second row shows the following interesting points:
- The Queue service is evidently invoked asynchronously; otherwise we would see much longer response times, and the calls to the remote data centers would show latency.
- The Table service is the slowest, slower even than the BLOB service.
- The BLOB and Table services show latency for calls to the remote data centers.
The last row of charts teaches us that accessing a remote storage account degrades the response time of the BLOB service more than that of the Table service (compare the left chart to the right). We also notice that Europe is fastest, followed by the US and Asia.
Please note that the charts above only show combined read and write performance across all different access methods and do not provide a closer look at the amount of data that is transferred. Also, response times of synchronous and asynchronous calls are mixed together.
For this kind of analysis, it is important to capture every single transaction end-to-end (from the browser through your worker processes to the storage). Let's assume you have an application with 90% local and 10% remote Storage Account access. If the response time for the local account increases, you want to be aware of it, as it impacts individual end users. If this response time is averaged with the much higher response time for the remote account, you would only see a slight increase in the average and assess the situation wrongly. You would also not be able to split your storage requests as you wish; in the example above I split by Storage Account name and storage type, but any other splitting is possible.
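A quick calculation shows how much an overall average can hide. The 90/10 split is the one assumed above. The response times are illustrative: 1.1 ms is the European BLOB average from earlier, 35.2 ms is simply 32 times that value, and the doubling of the local response time is a made-up scenario.

```python
def blended_avg_ms(local_ms, remote_ms, local_share=0.9):
    """Average response time over a mix of local and remote requests."""
    return local_share * local_ms + (1.0 - local_share) * remote_ms


REMOTE_MS = 35.2  # illustrative: 32 x the 1.1 ms local average

before = blended_avg_ms(1.1, REMOTE_MS)  # local account healthy
after = blended_avg_ms(2.2, REMOTE_MS)   # local response time doubled

increase_pct = (after - before) / before * 100.0
print(f"blended average: {before:.2f} ms -> {after:.2f} ms (+{increase_pct:.0f}%)")
```

In this scenario the local response time increases by 100%, but the blended average only rises from about 4.5 ms to 5.5 ms, an increase of roughly 22%. Splitting the measurements by Storage Account makes the local regression visible immediately.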
The key message is that storage accounts should be as close as possible to their computing instances, especially in a global deployment. Windows Azure offers Affinity Groups for this purpose. Storage Accounts can be created either in a specific region (which I did for this test, e.g. West US) or in an Affinity Group. Choose an Affinity Group if you want the Storage Account to be in the same data center as your other Windows Azure resources. Affinity Groups also affect pricing. Read more on windowsazure.com.
The dashboards from above and much more are available for Compuware APM Community members via the Windows Azure Fastpack.
Also, I’d like to invite you to our third and final rerun of our successful webcast “How time cockpit made Azure work for them” on July 19. Please register here.
Finally, I am attending Microsoft’s WPC in Toronto next week and would be happy to meet you there and talk about your Windows Azure experience! Shoot me an email: email@example.com