My Experience With the EC2 Judgment Day Outage


Amazon designs availability zones so that it is extremely unlikely that a single failure will take out multiple zones at once. Unfortunately, whatever happened last night seemed to cause problems in all us-east-1 zones for one of my accounts.

Of the 20+ full-time EC2 instances that I am responsible for (including my startup and my personal servers), only one instance was affected. As it turns out, it was the one that hosts Alestic.com along with some other services that are important to me personally.

Here are some steps I took in response to the outage:

  1. My instance was responding to pings, but no other services were responding. I tried a reboot through the EC2 API, but that did not seem to take effect as the pings never stopped responding.

  2. I then tried to stop/start the instance as this is an easy way to move an EBS boot instance to new hardware. Unfortunately, the instance stayed in a “stopping” state (for what turned out to be at least 7 hours).

  3. After the stop did not take effect in 20 minutes, I posted on the EC2 forum and noticed that a slew of others were having similar issues.

  4. I was regularly checking the AWS status page and within a short time, Amazon posted that they were having issues.

  5. With an instance wedged, my next move was to bring up a new replacement instance. This is fairly easy for me with complete documented steps and automated scripts that I have used in the past in similar situations.

    Unfortunately, my attempts to bring up new EBS boot instances failed in all four availability zones in us-east-1. In three of the zones the new instances went from pending to terminated. In the fourth zone, I got the insufficient capacity error for the c1.medium size.

    At this point, I realized that though my disaster recovery strategy for this particular server did span multiple availability zones, it did not extend beyond a single EC2 region. My databases are backed up outside of EC2, so I won’t lose key data, but there are other files that I would have to recreate in order for all systems to operate smoothly.

  6. I decided to trust Amazon to bring things back faster than I could rebuild everything in a new region and went to sleep for a few hours.

  7. Shortly after I woke up, I was able to start a new instance in a different availability zone. To move the services from the wedged instance to the new instance, I:

    a. Created snapshots of the EBS volumes on the wedged instance.

    b. Created new volumes in the new availability zone based on those snapshots.

    c. Stopped the new EBS boot instance.

    d. Detached the EBS boot volume from the new instance.

    e. Attached the new EBS volumes (created from the snapshots) to the new instance.

    f. Started the new EBS boot instance.

  8. Sure ‘nuff. The new instance looked and ran exactly like the old, wedged instance and I was able to assign my Elastic IP Address to the new instance and move on with my day.
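The volume swap in step 7 (a through f) can be sketched in Python. This is a minimal sketch, not the scripts used during the outage: it assumes `ec2` is a client exposing the boto3 EC2 method and waiter names (for example `boto3.client("ec2", region_name="us-east-1")`), and the function name, its parameters, and every ID shown are hypothetical.

```python
def wait_for(ec2, waiter_name, **kwargs):
    """Block until the named EC2 waiter reports success."""
    ec2.get_waiter(waiter_name).wait(**kwargs)


def move_volumes_to_new_instance(ec2, wedged_volumes, new_instance_id,
                                 new_boot_volume_id, new_zone):
    """Replay steps 7a-7f against an EC2 client.

    wedged_volumes maps device name -> volume id on the wedged instance,
    e.g. {"/dev/sda1": "vol-aaaa"} (hypothetical ids).
    Returns a dict of device name -> newly created volume id.
    """
    # a. Snapshot every EBS volume on the wedged instance.
    snapshots = {
        device: ec2.create_snapshot(
            VolumeId=volume_id, Description="recovery of " + volume_id
        )["SnapshotId"]
        for device, volume_id in wedged_volumes.items()
    }
    wait_for(ec2, "snapshot_completed", SnapshotIds=list(snapshots.values()))

    # b. Create volumes from those snapshots in the new availability zone.
    new_volumes = {
        device: ec2.create_volume(
            SnapshotId=snapshot_id, AvailabilityZone=new_zone
        )["VolumeId"]
        for device, snapshot_id in snapshots.items()
    }
    wait_for(ec2, "volume_available", VolumeIds=list(new_volumes.values()))

    # c. Stop the replacement EBS boot instance.
    ec2.stop_instances(InstanceIds=[new_instance_id])
    wait_for(ec2, "instance_stopped", InstanceIds=[new_instance_id])

    # d. Detach the boot volume the replacement instance came with.
    ec2.detach_volume(VolumeId=new_boot_volume_id, InstanceId=new_instance_id)
    wait_for(ec2, "volume_available", VolumeIds=[new_boot_volume_id])

    # e. Attach the restored volumes at their original device names.
    for device, volume_id in new_volumes.items():
        ec2.attach_volume(
            VolumeId=volume_id, InstanceId=new_instance_id, Device=device
        )
    wait_for(ec2, "volume_in_use", VolumeIds=list(new_volumes.values()))

    # f. Start the instance, which now boots from the restored volumes.
    ec2.start_instances(InstanceIds=[new_instance_id])
    wait_for(ec2, "instance_running", InstanceIds=[new_instance_id])
    return new_volumes
```

boto3 waiters stop polling after a bounded number of attempts and raise an error, so a snapshot that hangs the way they did during this outage would surface as an exception rather than block forever.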

Notes

  • Since AWS does not post their status updates to Twitter (that I’ve found), I follow ylastic. Ylastic sends out updates on Twitter as soon as Amazon updates the status page. In fact, Ylastic often sends out problem alerts before Amazon updates their status page!

  • Though I got many errors in trying to run API commands, I kept trying and some requests got through so that I was able to accomplish my recovery well before Amazon announced that things were getting better.

  • Judgment Day = 2011-04-21 (in some timelines)
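The "kept trying" approach from the second note can be automated. The sketch below is one way to do it, under stated assumptions: the post only says that requests were repeated until some got through, so the exponential backoff, the jitter, and the helper name `call_with_retries` are illustrative additions, not a description of what was actually run.

```python
import random
import time


def call_with_retries(request, attempts=8, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call request() until it succeeds or attempts are exhausted.

    request is any zero-argument callable wrapping one API call.
    The wait doubles after each failure (capped at max_delay) and is
    randomized so that many clients do not retry in lockstep.
    """
    for attempt in range(attempts):
        try:
            return request()
        except Exception:
            # Sketch only: real code should catch just the transient
            # API errors and let permanent ones propagate immediately.
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

A call would then be wrapped as, for example, `call_with_retries(lambda: ec2.start_instances(InstanceIds=[instance_id]))`, where `ec2` and `instance_id` are placeholders for whatever client and ID are in use.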

Lessons learned

I’m happy with AWS and my current strategies except for one weakness: I need to keep backups of everything required to bring up my servers in a different EC2 region, not just a different availability zone. This would also work for moving to a different provider if that became necessary.

At my company, Campus Explorer, we have implemented this higher level of disaster recovery and have even tested it with a fire drill. We were able to recreate our entire service on a different hosting provider using only off-AWS backups in about 24 hours of work. Fortunately for my sleep last night, none of our company services were affected in today’s EC2 outage.

12 Comments

Fantastic article Eric! Thanks for sharing the info :)

I am the CTO of a startup using AWS and we had to suffer from this outage this morning.

About 2 weeks ago we moved one of the services to AWS following an outage on a dedicated server. We had no hot backup and were down for almost 20 hours although our (well-known) provider had told us via email that we should expect 1 hour recovery time just a few weeks prior.

For our new setup I provisioned 2 servers with MySQL Master-Slave replication, one on the East coast, the other one on the West coast. We had to switch over before the setup was finalized for monetary reasons (that's the life of a startup)...

Early this morning our master database hosted on EBS in us-east-1 was not responding. There was no way to recover the service and I decided to switch over to the slave server in us-west-1.

It was much harder than anticipated because our setup and disaster recovery plan were not fully in place, but none of our customers have reported their service down. We were supposed to do our first drill over the weekend; instead we had a live opportunity to test this new setup, and it was a lot of stress due to the nature of this service.

We had proper documentation that was crucial for the fast and full recovery.

The AWS management console worked flawlessly during the incident, although we could not take snapshots on the East coast because the EBS volumes were not reachable.

Overall I am very happy with the end result as it rewarded a lot of hard work. Cloud or not, AWS is neither better nor worse than other solutions. At least AWS makes it somewhat easy to set up across different regions, whereas the other provider required that we have both our servers in the same rack! That's how they lost me and our startup as a customer.

Startups have to build for failures: hardware, network, power, and data centers, but also human error (a drop table replicated to the slave server, oops) and vandalism (delete snapshots and everything else, #@!t).

Thank you Eric, for this post.

My experience has been exactly the same as yours right up to "Created snapshots of the EBS volumes on the wedged instance" which I've been unable to accomplish.

By good luck (not planning) I had an AMI backup from this past weekend which I was able to launch for the time being. If necessary I do have daily incrementals saved to S3 that I can restore.

Meanwhile I will give AWS some time to get the remaining issues resolved and go from there. I don't know what to expect from the wedged instance or the permanently pending snapshots.

Lessons learned for us are the same: Ensure that we have copies of everything in multiple regions and a plan to rapidly move to a different provider if needed.

Eric,

Good article. My company was also affected by today's outage and my first thought was to bring up volumes in us-west from a snapshot. I found that when I went into us-west, though, none of my us-east snapshots were available in my control panel. Is the ability to access snapshots across regions an API-only thing?

Thanks

Jeremy

Jeremy:

Each EC2 region is independent. They do not share AMIs, snapshots, ssh keys, or much of anything.

Let's hope this will push Amazon towards providing an easy way to transfer copies of snapshots to other regions without having to copy everything manually.

Jeremy, the fact that they don't share anything between regions turned out to be a good thing. In the east region, the multi-zone failure is probably related to some shared infrastructure, namely snapshots, AMIs, ssh keys, and personnel. We still don't know the cause; it could have been vandalism.

So in order to build the multi-region redundancy you need to do it yourself. This is why we have built our MySQL replication across the two regions that you mentioned.

This is also why I chose not to go with RDS, because it does not provide multi-region redundancy. RDS is one level above the hardware, which is why it can and should span regions.

You can then do snapshots in one or more regions.

Ideally, replication should run between two independent cloud providers, because bugs can and will ripple across the world at a speed close to that of light.

This will eventually require standardized APIs between cloud providers, which would also allow moving capacity based on demand, availability and pricing (e.g. spot instances).

Thanks for the post (and all the other general goodness on the site).

I too got stuck at the same point ianfhood did -- creating snapshots of the wedged instance(s).

tbochud:

I don't find it that difficult to copy stuff between regions, but an API solution for volumes, snapshots, AMIs, buckets without the need for running instances would be cool. Existing API calls are directed at and handled by a single region, so this cross-region interaction would be a new thing, but I'm sure Amazon's up to the challenge.

fmdan:

Yep, my "live" snapshot was stuck for hours and eventually got unwedged by the time I was able to start new instances in other availability zones. HOWEVER, this demonstrates the importance of creating regular snapshots. I could have restored my instance with a recent snapshot and the loss of a short period of data. I was just a bit luckier in that my new snapshot worked.

From what I've seen and heard of the outage so far, AWS users got back on their feet quickly as long as they were performing regular snapshots and architected so that they could move easily between availability zones.

If you can't suffer any short period of data loss, then it's important to keep streaming backups (e.g., replication) in multiple regions or service providers.

Without following these principles, this outage is going to hurt and you'll need to wait for AWS to restore the EBS volumes, which it sounds like they're doing steadily.

We've got a script that takes a new snapshot every 4 hours, and deletes an old one. But it does it all on the west coast.

I couldn't tell from any of the reports: was S3 available during the outage? Would it have been possible to copy the snapshot over? Also, how long does it take to copy an 8 GB snapshot between the coasts?

drormata:

As I understand it and based on my experience, S3 was not affected by the EC2 troubles, but attempting snapshots in the affected availability zone would hang.

To copy a snapshot between regions, you need to create a volume from the snapshot, attach and mount it on an instance, copy its contents to an instance in the remote region, save them to an EBS volume there, and snapshot that volume. Speed depends on your compression algorithm and general Internet weather, as regions are connected over public lines.
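The manual cross-region copy described above could be scripted roughly as follows. This is a sketch under assumptions: `src_ec2` and `dst_ec2` stand for two clients exposing the boto3 EC2 method and waiter names, one per region; the function and parameter names are hypothetical; and `transfer` is a caller-supplied placeholder for the actual data copy between the two helper instances (for example a compressed stream over ssh), which the API does not perform.

```python
def copy_snapshot_to_region(src_ec2, dst_ec2, snapshot_id, size_gib,
                            src_instance_id, src_zone,
                            dst_instance_id, dst_zone,
                            transfer, device="/dev/sdf"):
    """Copy a snapshot's contents to another region via two instances.

    Returns the id of the new snapshot in the destination region.
    """
    def attach(ec2, volume_id, instance_id):
        # A volume must be "available" before it can be attached.
        ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
        ec2.attach_volume(
            VolumeId=volume_id, InstanceId=instance_id, Device=device
        )
        ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])

    # Source region: materialize the snapshot as a volume and attach it.
    src_volume = src_ec2.create_volume(
        SnapshotId=snapshot_id, AvailabilityZone=src_zone
    )["VolumeId"]
    attach(src_ec2, src_volume, src_instance_id)

    # Destination region: create an empty volume of the same size.
    dst_volume = dst_ec2.create_volume(
        Size=size_gib, AvailabilityZone=dst_zone
    )["VolumeId"]
    attach(dst_ec2, dst_volume, dst_instance_id)

    # Hypothetical callback: mounts both volumes, copies the data across
    # the public Internet, and flushes writes before returning.
    transfer(src_instance_id, dst_instance_id, device)

    # Snapshot the populated volume in the destination region.
    new_snapshot = dst_ec2.create_snapshot(
        VolumeId=dst_volume, Description="copy of " + snapshot_id
    )["SnapshotId"]
    dst_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[new_snapshot])
    return new_snapshot
```

The temporary volumes are left attached here; a fuller script would also detach and delete them once the new snapshot completes.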
