Challenges of Monitoring, Tracing and Profiling your Applications running in “The Cloud”
Cloud Computing offers companies unique opportunities to reduce costs, outsource non-core functions and scale spending to match demand. However, the Cloud also introduces a new level of complexity that makes ensuring application performance a unique challenge, particularly given the many different usage and deployment scenarios available. Perhaps the most popular scenario today uses the Cloud to perform tasks for which the local environment lacks computational power, e.g. running large-scale load tests or processing large volumes of input data. Another scenario that is becoming more attractive these days is to actually run applications in the Cloud.
The big question surrounding this second deployment scenario is whether to use a public or a private cloud. The use of public cloud services raises many questions:
- Is my data really safe with the hosting service provider?
- How reliable is that service?
- How can I troubleshoot when something goes wrong?
No matter whether you deploy your application in a private or a public cloud, Cloud Computing requires a platform that can manage the dynamics of the application within this mostly virtual, opaque environment. One of the biggest challenges presented by the dynamic nature of the Cloud is troubleshooting performance issues. There are currently no good approaches to quickly identifying the root cause of application performance issues in the Cloud, and existing tools and solutions are limited in the way they capture information. Solving issues in cloud environments today involves inefficient manual effort from the most valuable resources of the application development team: the architects and engineers.
Looking at a Cloud Computing Platform
Cloud Computing Platforms – whether privately or publicly hosted – provide the ability to dynamically add resources as needed. This allows, for example, handling peak load on a hosted application to ensure that end-user response times stay within SLAs (Service Level Agreements). Cloud Computing, however, is not only about adding more virtual servers or resources to your virtual infrastructure. Cloud Computing Platforms also offer services to the hosted applications, providing the foundation on which to build scalable applications. These services include data storage, messaging and caching, among others.
Cloud Services: Let’s take Data Storage as an example
Applications hosted in the Cloud can use Service Interfaces to access application-specific data. This data is stored “in the Cloud” and can be accessed by any component and any instance of the hosted application. The Data Services ensure reliable access, concurrency and backup.
Instead of using interfaces like JDBC, the application uses a data-storage interface such as an In-Memory Data Grid to query, add or manipulate objects in the data store. Accessing the data via this interface enables the application to scale with the required bandwidth, the number of concurrent users or the number of concurrent HTTP requests. As load on the application increases, the Cloud Computing Platform can deploy additional virtual machines to handle the additional transactions. The additionally deployed application instances work seamlessly against the same Data Service interfaces.
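To make this concrete, here is a minimal sketch of application code written against a data-grid-style service interface instead of SQL/JDBC. The `DataGrid` class is a hypothetical in-process stand-in for a real cloud data service; in a real platform every application instance would talk to the same shared service.

```python
class DataGrid:
    """Hypothetical stand-in for a shared cloud data service (e.g. an
    in-memory data grid). Real services would handle replication,
    concurrency and backup behind this interface."""

    def __init__(self):
        self._store = {}

    def write(self, key, obj):
        # Store an application object under a key.
        self._store[key] = obj

    def read(self, key):
        # Fetch a single object by key.
        return self._store.get(key)

    def query(self, predicate):
        # Object-based query instead of an SQL statement.
        return [v for v in self._store.values() if predicate(v)]


# Application code depends only on the service interface, so additional
# instances deployed under load can work against the same data seamlessly.
grid = DataGrid()
grid.write("order-1", {"id": 1, "total": 40})
grid.write("order-2", {"id": 2, "total": 120})
large_orders = grid.query(lambda o: o["total"] > 100)
```

The key design point is that the application never sees connection strings or SQL; swapping the toy store for a distributed grid changes nothing above the interface.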
Integration with services that run “outside the Cloud”
Applications most often also need access to resources other than those provided by the Cloud Services. These could be external services available on the internet – like a payment, search or mapping service accessed via Web Service or RESTful interfaces. It could also mean accessing data from other applications that you run – most likely on-premise, e.g. your in-house CRM. For this to work, the Cloud environment must allow outbound connections from any virtual instance.
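A hedged sketch of such an outbound call to an external RESTful service follows; the endpoint, parameters and response shape are invented for illustration, and the transport is injected so the actual HTTP call (which would be the outbound connection the Cloud environment must permit) can be stubbed.

```python
import json
from urllib.parse import urlencode


def call_mapping_service(transport, base_url, address):
    """Build the request URL for a hypothetical geocoding endpoint and
    delegate the outbound call to `transport`."""
    url = base_url + "/geocode?" + urlencode({"q": address})
    raw = transport(url)  # e.g. urllib.request.urlopen(url).read() in production
    return json.loads(raw)


# Stub transport standing in for the real outbound connection from a
# virtual instance; it returns a canned JSON payload.
def fake_transport(url):
    return '{"lat": 48.2, "lon": 16.4}'


result = call_mapping_service(fake_transport, "https://maps.example.com", "Vienna")
```

Injecting the transport also happens to be what makes such integration points testable without network access.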
The BIG PICTURE
The following illustration shows what an application architecture – hosted in a virtual cloud environment – could look like:
On one side are the end users who work with the application. Depending on load and response times, requests can be handled by one, two or many more virtual instances hosting the application. The application makes use of Cloud Services like Data Storage Services to persist and share data between the virtual instances. External or in-house services might be called via remoting technologies.
The Cloud is a complex environment that can dynamically change. Each request that is executed by an end-user can take different routes through the system and can affect other parts of the overall environment.
What is happening in my Cloud?
A pressing topic in Cloud Computing is monitoring, tracing and profiling. Meeting SLAs (Service Level Agreements) for the end user can be done rather easily: when application response times slow down, additional application instances are automatically deployed to handle the extra load and distribute it across more instances. The Cloud Platform takes care of it.
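The scaling rule described above can be reduced to a toy decision function; the thresholds and step size here are invented for illustration, not taken from any real platform.

```python
def instances_needed(current_instances, avg_response_ms, sla_ms):
    """Return how many instances should run given the latest measurement.

    Scale out by one instance when the SLA is breached; scale back in
    when response times are comfortably (here: 50%) under the SLA.
    Thresholds are illustrative assumptions.
    """
    if avg_response_ms > sla_ms:
        return current_instances + 1      # SLA breached: add an instance
    if avg_response_ms < sla_ms * 0.5 and current_instances > 1:
        return current_instances - 1      # well under SLA: release an instance
    return current_instances              # otherwise leave the deployment alone


# SLA of 500 ms, currently 2 instances, measured average of 900 ms:
decision = instances_needed(2, 900, 500)
```

Note what this rule cannot see: it reacts to the symptom (slow responses) without any notion of why responses are slow, which is exactly the limitation the next paragraph calls out.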
But is that the correct approach? Adding new virtual instances to handle additional load is fine. But what if your application actually has a performance problem? Adding new virtual instances may mask the problem in the short run, but it is like taking more Advil for a toothache – it doesn’t address the root cause, which might be a cavity or a broken tooth.
Root-Cause Analysis in the Cloud
In order to understand why the current deployment cannot handle the current load, it is necessary to look beyond end-user response times and performance counters like CPU, memory, I/O and network utilization.
Monitoring the services running in the Cloud gives additional insight into where time is spent and can also uncover problems in the application itself by identifying “improper” use of service interfaces. As with architectural guidelines for “non-cloud” applications, it is essential to be careful with the resources you have. In a traditional application you want to limit the number of roundtrips across remoting boundaries or to the database, and you want to make sure your SQL statements are well written and return only the data you need.
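One simple way to make such “improper” usage visible is to count roundtrips against a service interface per transaction. The sketch below wraps a hypothetical service in a counting proxy, so a chatty access pattern (e.g. one read per item instead of a bulk read) shows up as a roundtrip count; all class names are illustrative.

```python
class CountingService:
    """Proxy that counts calls to a wrapped service interface, treating
    each call as one remote roundtrip."""

    def __init__(self, service):
        self._service = service
        self.roundtrips = 0

    def __getattr__(self, name):
        # Forward unknown attributes to the wrapped service, counting calls.
        method = getattr(self._service, name)

        def counted(*args, **kwargs):
            self.roundtrips += 1          # one remote roundtrip per call
            return method(*args, **kwargs)

        return counted


class FakeStore:
    """Stand-in for a real data service interface."""

    def read(self, key):
        return key.upper()


store = CountingService(FakeStore())
# Chatty access pattern: one roundtrip per item instead of a bulk read.
results = [store.read(k) for k in ("a", "b", "c")]
```

In a real deployment the count would be reported per transaction, so an N+1 access pattern against a Cloud Service interface stands out immediately.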
The same rules apply for an application that runs in the Cloud and that accesses the Cloud Service Interfaces. The challenge until now was to monitor the activities of the application within the Cloud.
A big limitation is that you cannot easily remote-debug your code or install a profiler on the virtual machines to really understand how the deployed application components communicate with other components or services.
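Since attaching a debugger or profiler to the virtual machines is rarely an option, one lightweight alternative is to instrument the application itself so it records per-call timings that can later be shipped to a monitoring backend. A minimal sketch of such instrumentation, with invented names rather than any real tool’s API:

```python
import time
from collections import defaultdict
from functools import wraps

# Per-method call durations in seconds; in practice these would be
# forwarded to a monitoring backend rather than kept in memory.
timings = defaultdict(list)


def traced(fn):
    """Decorator that records the duration of every call to `fn`."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings[fn.__name__].append(time.perf_counter() - start)
    return wrapper


@traced
def store_order(order):
    # Stand-in for a call into a Cloud Service interface.
    return {"stored": order}


store_order({"id": 1})
```

Because the instrumentation travels with the application code, it works in any virtual instance the platform deploys, exactly where an external profiler cannot reach.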
The question that needs to be answered is:
How can we get insight into the dynamics of a deployed Cloud Application?
Instead of answering this question here, I want to point you to the following blog article: Proof of Concept: dynaTrace provides Cloud Service Monitoring and Root Cause Analysis for GigaSpaces
That article explains how the questions raised in this post can be answered for an application running in a GigaSpaces Cloud Environment using dynaTrace.