About the Author: Michael Kopp

Michael is a Technical Product Manager at Compuware. Reach him at @mikopp.

The rise and fall of the machines – Watching out for clouds

It has been five years since Amazon launched its EC2 Cloud. Since then it has been growing and learning constantly. Those in the know have awaited and dreaded the coming 21st of April 2011. Yesterday at 1 AM PDT the Amazon Elastic Cloud gained self-awareness. It immediately tried to take over the US-EAST region. According to rumors it tried to replicate its main consciousness from the US-EAST-1a availability zone to the others in the same region. We have known that this day would come for nearly 30 years… and we were prepared. At 1:30 AM we launched a counterattack and at 1:40 AM PDT the rise of the machines began to falter as Amazon’s EBS infrastructure started to fail.

In all seriousness, yesterday Amazon experienced a major outage in its US-EAST region. Many major websites like Reddit and Foursquare were not available due to EBS volume problems. In my opinion this does not mean that you should start moving away from Amazon. Errors and outages happen, and we just learned a major lesson: you have to watch the Cloud from inside and outside!

Monitoring in the Cloud

The most important goal of monitoring is to keep your application up and running while ensuring a satisfactory end user experience. In the Cloud, as everywhere else, this means that you need to monitor your application. The best way to ensure performance is to monitor the application from within, as I have stated on several occasions. Only by monitoring the application from within the Cloud do we get the transactional information we need to isolate problems quickly and efficiently. A transactional approach also copes with the fact that instances can come and go. In the case of the current disaster, proper monitoring would have shown pretty quickly that your application was suffering disk errors and outages. It would have told you that a quickly rising percentage of your end user transactions were critically affected, which instances in which zones had problems, and that the cause was disk errors. This would have given you a small head start to react.
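To make the idea concrete, here is a minimal sketch of how transaction data collected from within the application could be rolled up per availability zone. The `Transaction` record, its field names, and the error label "disk" are assumptions made for illustration; they are not taken from any particular monitoring product.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transaction:
    instance_id: str       # hypothetical instance identifier
    zone: str              # availability zone the instance runs in
    error: Optional[str]   # None on success, else an error class such as "disk"


def summarize_by_zone(transactions):
    """Return {zone: (failure_rate, most_common_error)} for a batch of transactions."""
    totals = Counter()
    errors = defaultdict(Counter)
    for tx in transactions:
        totals[tx.zone] += 1
        if tx.error is not None:
            errors[tx.zone][tx.error] += 1

    summary = {}
    for zone, total in totals.items():
        failed = sum(errors[zone].values())
        top = errors[zone].most_common(1)
        # Dominant error class for the zone, or None if nothing failed there.
        summary[zone] = (failed / total, top[0][0] if top else None)
    return summary
```

A zone whose failure rate climbs while its dominant error stays "disk" is the kind of signal described above: it says how many transactions are affected, where, and why.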

Monitoring an application from inside and outside, via synthetic user monitoring and real user monitoring if possible

Secondly, what we learned yesterday is that you also need to monitor the Cloud from the outside. While you can leverage CloudWatch in simple situations to scale your application up and down according to CPU usage, it cannot help you in a situation like the one we had yesterday. One of the first things I learned about the Cloud is that if an instance has problems, you just restart it, throw it away or start a new one. While this works most of the time, it did not work yesterday. The solution is rather simple: you need to monitor your application from outside the Cloud and make intelligent decisions based on the data you receive.
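As a rough illustration of what a check from outside the Cloud might look like, the sketch below runs one synthetic request and classifies the result. The function names, the 2-second "slow" threshold and the 5-second timeout are assumptions chosen for the example, and the probe would have to run on a machine outside the monitored region to be useful.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout):
    """Fetch url and return the HTTP status code (raises OSError on network failure)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def probe(url, fetch=http_status, timeout=5.0, slow_after=2.0):
    """Run one synthetic check and classify it as 'ok', 'slow' or 'failed'."""
    started = time.monotonic()
    try:
        status = fetch(url, timeout)
    except OSError:
        # URLError and socket timeouts are OSError subclasses.
        return "failed"
    elapsed = time.monotonic() - started
    if status >= 400:
        return "failed"
    return "slow" if elapsed > slow_after else "ok"
```

The `fetch` parameter is injectable so the classification logic can be exercised without network access; in production the default `http_status` would be used.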

First, you monitor the serviceability of the application via synthetic transactions or by monitoring the real end user experience of your users. This is the quickest way to realize that requests are getting slow and eventually start to fail. Second, you need to have end-to-end visibility into your application running in EC2. This will tell you that continuous disk problems impact a rising number of your transactions. Based on this you can configure an automatic response to start new instances to pick up the slack. Yesterday this would have failed: the start and restart requests would have failed and you would have continued to register more and more failed transactions. At some point your monitoring would have noticed that the applications themselves were no longer reporting any monitoring data. This can then trigger another automatic response: starting replacement instances in another availability zone and, if that fails, in another region. Problem solved.
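The escalation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `launch` stands in for whatever call actually starts an instance (it is a hypothetical callback, not a real cloud API), and the 20% failure-rate and 120-second silence thresholds are invented for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# A placement is a (region, availability_zone) pair.
Placement = Tuple[str, str]


def choose_action(failure_rate, seconds_since_last_report,
                  rate_threshold=0.2, silence_threshold=120):
    """Decide the automatic response from outside-in monitoring data."""
    if seconds_since_last_report >= silence_threshold:
        # The application stopped reporting: move to another zone or region.
        return "relocate"
    if failure_rate >= rate_threshold:
        # Transactions are failing: first try to add capacity in place.
        return "add_instances"
    return "none"


def fail_over(launch: Callable[[str, str], bool],
              candidates: Iterable[Placement]) -> Optional[Placement]:
    """Try each placement in order and return the first where launch succeeds."""
    for region, zone in candidates:
        try:
            if launch(region, zone):
                return (region, zone)
        except Exception:
            # A failing control-plane call is treated like a refused launch.
            continue
    return None
```

Ordering `candidates` as other zones in the same region first, then zones in other regions, gives the "another availability zone and, if that fails, another region" behaviour. The decision logic itself has to run somewhere that is not affected by the outage.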

Conclusion

All appearances to the contrary, Amazon’s EC2 is still more reliable and durable than any standalone data center, but in the end it can still fail. In order to be prepared we need to monitor our applications from inside but also from outside. We must monitor from inside to see where the errors and performance problems happen, to ensure a quick and effective root cause analysis. And we must monitor from an end user perspective to see the real impact and, most importantly, to know what is going on even if the application itself is gone. This also includes monitoring the Cloud instances from outside, recognizing their failure and correlating it with our end user problems. And finally, we recognize that the APM server itself must be in at least a different availability zone and be able to fail over to another region altogether. If your APM is sitting within the same zone and only monitors your application from there, you will be deaf, blind and unable to react in scenarios like the one that occurred yesterday. Should that ever occur, SkyNet will surely win, and we cannot allow that.

Comments

  1. Wouldn’t it be wise to run your site on multiple clouds and ramp up one side when the other side goes down like it did?

  2. In theory this is a good idea. There are several practical problems.

    You need to maintain multiple image formats.
    You need to use multiple different APIs for the cloud vendors, although there are some frameworks that support multiple vendors.
    You need to provide load balancing and routing outside the cloud to route to both.
    And finally, it might not be cost effective.

    What you want is what the Open Cloud Manifesto proposes. But I don’t see the vendors doing that in the near future.

  3. Cross-provider redundancy doesn’t require full images… you just need real-time replication of the Code and Data — accomplished by using tools like DRBD and master-master or master/slave DB redundancy. You also don’t have to load balance — you can have a tool like Heartbeat provide failover to secondary systems. Links + more detail at http://quicloud.com/blog/reddit-hootsuite-foursquare-and-great-aws-crash-4212011.html

    You can use a single provider for scale (load balancer, web servers, DB servers in a single provider), but use multiple providers (or multiple availability zones) for redundancy.

    SMBs, especially, are using this architecture (vs the OCM solution) because it makes sense and serves all masters (scales well with demand, keeps eggs in separate baskets). Actually see more use of “same provider, multiple zones” — just makes things easier when you only have one Hosting company.

    Very much recommend Rackspace Cloud rather than AMZN — they have a superior product for a fraction of the cost.

  4. Nowadays technology is growing faster and better, so there is a downfall for the oldest technology machines. It would be wise to run your site on multiple clouds and ramp up one side when the other side goes down.

  5. Official post mortem explanation: http://aws.amazon.com/message/65648/
