22 Apr

EC2 outage reactions showcase widespread ignorance regarding the cloud

Amazon EC2's high-profile outage in the US East region has taught us a number of lessons. For many, the take-away has been a realization that cloud-based systems (like conventionally-hosted systems) can fail. Of course, we knew that, Amazon knew that, and serious companies who performed serious availability engineering before deploying to the cloud knew that. In cloud environments, as in conventionally-hosted environments, you must implement high availability if you want high availability. You can’t expect a system to magically be highly available just because it is “in the cloud.” Thorough and thoughtful high-availability engineering made it possible for EC2-based Netflix to experience no service interruptions through this event.

Only those companies that failed to perform rudimentary availability design on their EC2-based systems have experienced prolonged outages as a result of this week’s event. This is only to be expected: Amazon.com does not promise to make your application highly available. What Amazon EC2 provides is a rich set of tools that allows anyone who is serious about building a highly available application to do so on EC2.

This week’s EC2 failures have provided plenty of fodder for the cloud skeptics, as well they should. Cloud skeptics hold EC2 and other cloud services’ feet to the fire, forcing them to address real concerns with the paradigm. Far more alarming than the told-you-sos from the cloud skeptics, however, is the torrent of media ignorance regarding what cloud computing and EC2 fundamentally provide.

Take this article in the Wall Street Journal for example.  Quoting the authors:

A main issue at the center of this controversy is why Amazon hasn’t been able to re-route capacity between data centers that would have avoided this problem and ensured the websites of its users would still operate properly.

Here the authors seem to be referring to EC2 availability zones. As most who have worked even a little with EC2 know, when you run an instance or store volumes in one availability zone, there is no automatic mechanism available to “re-route capacity” between availability zones. If you want your application to survive the failure of an availability zone, you must implement a high-availability contingency. For instance, you can frequently back up your storage volumes (using the EBS snapshot feature) so that you can re-instantiate the system in a surviving availability zone. In this week’s outage, all but one of Amazon’s US East availability zones were functioning normally within about four hours. Only customers with systems in the one affected zone (of Amazon’s four US East zones) could not reliably access the data on their Elastic Block Store (EBS) volumes. If those customers had simply performed regular backups (snapshots) of their volumes, the outage would have been confined to a few hours, not 40+ hours.
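To make that concrete, here is a minimal sketch of such a snapshot-and-restore routine, using the boto3 Python SDK (which postdates this outage); the volume ID, zone names, and schedule are hypothetical:

```python
# Sketch only: volume ID, zone names, and scheduling are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Routine backup: snapshot the EBS volume (run on a schedule, e.g. nightly).
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # the volume to protect
    Description="scheduled backup",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# Recovery: if the volume's zone fails, rebuild it in a surviving zone.
restored = ec2.create_volume(
    SnapshotId=snapshot["SnapshotId"],
    AvailabilityZone="us-east-1b",      # a healthy zone
)
ec2.get_waiter("volume_available").wait(VolumeIds=[restored["VolumeId"]])
# The restored volume can now be attached to a replacement instance in that zone.
```

Snapshots are persisted to S3 and are therefore regional, which is what lets the restore target a different availability zone than the one that failed.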

It is hard to blame the media, though, when even EC2’s own customers showcase a complete misunderstanding of what they should expect from the infrastructure on which they have built the systems that support their very businesses. In the same article cited above, Simon Buckingham, CEO of Appitalism, is extensively quoted misunderstanding the fundamentals of EC2:

We’re past the point of this being a routine outage… Customers like myself have assumed that if part of Amazon’s data center goes down, then traffic will get transferred in an alternative capacity… The cloud is marketed as being limitless, but what this outage tells us is it’s not.

That is an interesting assumption indeed, Mr. Buckingham. I would assume that the CEO of a web-based company would spend at least enough time understanding his company’s own infrastructure to realize that he should be asking his own engineers why they failed to design a robust multi-zone backup solution on EC2, rather than imagining capabilities that EC2 does not have and that Amazon has never claimed to offer.

I hope the upshot of this event will be more comprehensive and careful engineering of solutions deployed to the cloud. I fear, however, that given the tenor of the media coverage and customer reactions, the onus will not be placed where it belongs: on the customers’ own engineers. Instead, the event will only result in undeserved bad press for Amazon.

For what it’s worth, we at Blue Gecko frequently help our customers deploy robust, highly available solutions on EC2 that would have recovered within the four-to-five-hour window, or experienced no outage at all, rather than suffering the 40+ hour nightmare affecting some EC2 customers.

5 Responses to “EC2 outage reactions showcase widespread ignorance regarding the cloud”

  1. John Doe 23. Apr, 2011 at 5:57 am #

    Unfortunately, it appears the writer of this article does not understand the situation either.

    The part of EC2 that failed was EBS (Elastic Block Store). These are storage volumes provided by Amazon for use with EC2 instances. What Amazon’s specifications guarantee is that all of your data in the EBS system is stored in two separate locations, a master and a slave copy. This prevents a single fault from disrupting both copies of the data.

    In the case of a failure in the master copy of the data, what should happen is that there is only a short down-time as The Client/Amazon reconfigure their site to use the slave copy. Amazon guarantee that (with the exception of a catastrophic failure) the slave copy will NEVER fail for the same reason as the master.

    What has happened is that Amazon have failed to meet their own specification. Where there should have been complete separation of the two systems there was a key overlap. This single part failed, causing both the Master and Slave EBS copies to become unavailable for the Client.

    The Writer claims that the Client should have foreseen this, and backed up all their data, so they could change to a different availability zone. However, this is in practice impossible, as each availability zone has a completely different interface, and would require at least a day to configure.

  2. jwilton 23. Apr, 2011 at 9:31 am #

    I very much appreciate the time the commenter took to respond to my post. However, the assertions in the above anonymous comment are untrue. EBS is not replicated between availability zones. From Amazon’s page on EBS:

    “Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component.”

    The same page goes on to specify EBS volume snapshots as the mechanism by which customers can protect and back up their EBS volumes:

    “Amazon EBS also provides the ability to create point-in-time snapshots of volumes, which are persisted to Amazon S3. These snapshots can be used as the starting point for new Amazon EBS volumes, and protect data for long-term durability….”
    “Amazon EBS provides the ability to back up point-in-time snapshots of your data to Amazon S3 for durable recovery…”
    “Snapshots can also be used to … move volumes across Availability Zones…”

    Finally, it is absolutely untrue that it would take a day to configure a switch to a new availability zone. A customer using EBS-backed AMIs who has regularly snapshotted their boot volumes would simply (a rough code sketch follows below):
    1. Start a new instance of the same type as the one lost in the failed zone
    2. Stop that instance
    3. Attach a new volume created from the snapshot as the boot device
    4. Start the instance
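    For illustration, here is what those four steps might look like scripted with the boto3 Python SDK; the AMI, snapshot, instance type, zone, and device names are all hypothetical:

```python
# Sketch only: AMI, snapshot, instance type, zone, and device names are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Start a replacement instance of the same type in a surviving zone.
run = ec2.run_instances(
    ImageId="ami-12345678",            # an EBS-backed AMI
    InstanceType="m1.large",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},
)
instance_id = run["Instances"][0]["InstanceId"]

# 2. Stop it so the root device can be swapped.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# 3. Swap the stock root volume for one built from the latest snapshot
#    (assumes the first block-device mapping is the root device).
info = ec2.describe_instances(InstanceIds=[instance_id])
root = info["Reservations"][0]["Instances"][0]["BlockDeviceMappings"][0]
old_volume_id = root["Ebs"]["VolumeId"]
ec2.detach_volume(VolumeId=old_volume_id)
ec2.get_waiter("volume_available").wait(VolumeIds=[old_volume_id])

new_vol = ec2.create_volume(SnapshotId="snap-0123456789abcdef0",
                            AvailabilityZone="us-east-1b")
ec2.get_waiter("volume_available").wait(VolumeIds=[new_vol["VolumeId"]])
ec2.attach_volume(VolumeId=new_vol["VolumeId"],
                  InstanceId=instance_id,
                  Device="/dev/sda1")

# 4. Start the instance from the restored boot volume.
ec2.start_instances(InstanceIds=[instance_id])
```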

  3. nate 23. Apr, 2011 at 10:43 am #

    I suspect Netflix didn’t have an outage because all of their video is served by Level 3 CDN.

  4. akpadhi 24. Apr, 2011 at 12:53 am #

    The author’s knowledge is commendable here. This clearly shows that putting data or services in the cloud does not by itself guarantee uptime. A design implemented today under the fallacy of assumed uptime will be exposed the way we have just seen with EC2. You know it when you face it. Spending quality time up front to understand business needs and to choose and implement the right design is crucial. I have had a tough time convincing short-sighted management who try to save a few bucks by trimming crucial aspects of the design on the grounds that “it will never happen to us.”

  5. Daniel Lieberman 26. Apr, 2011 at 11:12 am #

    I think the real issue here is that there is no parallel for this kind of outage in the non-cloud (i.e. conventional dedicated server) world. High availability isn’t something you either have or don’t have; you engineer for a certain set of failures. In the non-cloud world, if you’ve implemented disk and server redundancy and your application works correctly (and you don’t get DDoSed, etc.), usually the only reasons for failure are either network outages (which are usually relatively brief and require little or no recovery work) or power outages, which are theoretically engineered away, are in reality quite rare at high-end providers, and typically last no more than 3-4 hours (though they do often require quite a bit of recovery work). And neither of those is likely to involve data loss if you’re using good servers and appropriate storage redundancy.

    The same level of redundancy engineering that might have been considered adequate uptime protection in the non-cloud world would not have protected you in this AWS outage — it’s hard to know, but it seems that this scenario was a lot more likely than a power failure (which, of course, is still possible within an availability zone at Amazon too). In fact, it’s not clear what level of engineering would have sufficed, given that the problem affected multiple availability zones. But even more serious than the downtime was the data loss.

    A medium-to-large site might easily need 500 EBS volumes for sufficient I/O capacity. In the non-cloud world, disk failures are mostly uncorrelated, so the odds of multiple disk failures in a short period of time are very low, making RAID effective. According to Amazon, 0.07% of the EBS volumes in US East were unrecoverable, and since most or all of these were within one of the four availability zones, roughly 0.28% of the volumes in that zone failed, and these are supposed to be redundant volumes. That may still sound like a small percentage, but if you have 500 EBS volumes, it puts you at a more than 75% chance that one of your “redundant” EBS volumes lost data, and makes it reasonably likely that more than one did.
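    As a quick sanity check on that 75% figure, the arithmetic works out as follows (the per-volume loss rate is the estimate from the comment above):

```python
# Probability that at least one of n independent volumes lost data,
# using the loss rate estimated in the comment above.
p_loss = 0.0007 * 4          # ~0.07% region-wide, concentrated in one of four zones
n_volumes = 500              # volumes a medium-to-large site might run

p_at_least_one = 1 - (1 - p_loss) ** n_volumes
print(f"{p_at_least_one:.0%}")   # ~75%
```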

    Arguably, there are two problems here that are inherent to the current state of cloud computing altogether — i.e., inherent to highly scaled, highly interconnected environments. First, as I described for disk failures, what would normally be uncorrelated failures can become correlated, defeating normal availability engineering. And second, failures can cascade. From Amazon’s updates it appears that the root cause of the problem was one of those normally minor network failures that I mentioned can happen with any provider. But in Amazon’s case it caused re-mirroring of EBS volumes and flooded capacity on components which don’t even exist for non-cloud providers, but are vital to customers at AWS.

    In other words, the same level of engineering can buy you much more availability in the non-cloud world. This will improve at some point, but not yet.
