Which EC2 Availability Zone is Affected by an Outage?

Did you know that Amazon includes status messages about the health of availability zones in the output of the ec2-describe-availability-zones command, the associated API call, and the AWS console?

Right now, Amazon is restoring power to a “large number of instances” in one availability zone in the us-east-1 region due to “electrical storms in the area”.

Since the names used for specific availability zones differ between AWS accounts, Amazon can’t just say that the affected zone is us-east-1c, as the same zone might be named us-east-1e in another account.

During this outage, you can find out the name of the affected availability zone in your AWS account by running the ec2-describe-availability-zones command against the us-east-1 region and looking for the zone that carries a status message in the output.
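For readers who script against the API rather than the command line tools, the same status messages can be read programmatically. The following is only a sketch, assuming the boto3 SDK (which is not what the post used) and configured AWS credentials; the helper names zones_with_messages and fetch_zone_status are made up for illustration.

```python
def zones_with_messages(response):
    """Return {zone_name: [message, ...]} for zones that carry status messages.

    `response` is a dict shaped like the result of boto3's EC2
    describe_availability_zones call.
    """
    affected = {}
    for zone in response.get("AvailabilityZones", []):
        # Each zone may carry zero or more {"Message": "..."} entries.
        messages = [m["Message"] for m in zone.get("Messages", [])]
        if messages:
            affected[zone["ZoneName"]] = messages
    return affected


def fetch_zone_status(region="us-east-1"):
    """Call the EC2 API for one region (needs boto3 and AWS credentials)."""
    # Imported lazily so the parsing helper above works without boto3 installed.
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.describe_availability_zones()
```

Calling zones_with_messages(fetch_zone_status()) with your own credentials returns the zone names as your account sees them, which is the point: the name you get back may differ from the name another account sees for the same physical zone.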

My Experience With the EC2 Judgment Day Outage

Amazon designs availability zones so that it is extremely unlikely that a single failure will take out multiple zones at once. Unfortunately, whatever happened last night seemed to cause problems in all us-east-1 zones for one of my accounts.

Of the 20+ full-time EC2 instances that I am responsible for (including my startup and my personal servers), only one instance was affected. As it turns out, it was the one that hosts Alestic.com along with some other services that are important to me personally.

Here are some steps I took in response to the outage:

Opinion: EC2 Outage Was Not an Outage

The Twitter wires are aflame with cute quotes on how lightning from a “cloud” took down Amazon’s EC2 “cloud” service. Snarky snippets sell well on Twitter with no research or understanding of the facts behind the issues involved.

Since “the press” is now asking for my opinion, I figured I’d jot down a quick overview of my thoughts on this non-event, which has been blown out of proportion. Sorry, press: we’re all the press now (for better or for worse), but you’re welcome to extract quotes with proper attribution :)

I don’t consider lightning taking out some racks of EC2 servers to be an “outage” even though this took down some customers' running instances. EC2 and the rest of AWS were completely functional. If one or more EC2 instances fail for internal or external reasons, any customer who has built a reasonable elastic architecture on EC2 should be able to fire up new servers, automatically or even manually, and fail over with very little downtime, if any.

This was a “failure” or an “error” or a “fault”, not an outage. Architectures built on top of AWS should expect and plan for failures; that’s simply the way the service was designed. AWS provides extensive resources for detecting and dealing with big and small failures and for building highly redundant, fault-tolerant, distributed systems at a global level, instead of at an individual API call or EC2 instance level.

At a normal ISP, if your server goes down, it is a serious problem. You have to wait for the ISP to work to bring it up or drive over to the data center and work on it yourself. With EC2, servers are fairly disposable. When an EC2 server goes down (which is still rare) you have at your fingertips thousands of other servers in a half dozen data centers in multiple countries.

A well-designed architecture built on top of EC2 keeps important information (databases, log files, etc.) in easy-to-manage, persistent, and redundant data stores which can be snapshotted, duplicated, detached, and attached to new servers. EC2 provides advanced data center capabilities few companies can build on their own.
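As a rough illustration of the snapshot, detach, and attach workflow described above, here is a minimal sketch. It assumes the boto3 SDK and an EBS volume as the persistent store, neither of which the post names; move_volume and the /dev/sdf device default are hypothetical choices for the example.

```python
def move_volume(ec2, volume_id, new_instance_id, device="/dev/sdf"):
    """Snapshot an EBS volume, detach it, and attach it to a replacement instance.

    `ec2` is expected to behave like boto3.client("ec2").
    Returns the ID of the snapshot taken before the move.
    """
    # Take a snapshot first so the data also exists outside the volume itself.
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id, Description="pre-failover backup"
    )
    # Release the volume from the failed server.
    ec2.detach_volume(VolumeId=volume_id)
    # Block until the volume reports "available" before reattaching.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    # Attach the same volume to the replacement server.
    ec2.attach_volume(
        VolumeId=volume_id, InstanceId=new_instance_id, Device=device
    )
    return snapshot["SnapshotId"]
```

An EBS volume can only be attached to an instance in the same availability zone, so recovering into a different zone would mean creating a new volume from the snapshot in that zone instead of reattaching the original.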

Yes, it can take some time and effort to learn this new way of working with on-demand, self-service, pay-as-you-go hardware infrastructure and sometimes the lessons are learned the hard way, but you’ll be better off in the end.