My Experience With the EC2 Judgment Day Outage

Amazon designs availability zones so that it is extremely unlikely that a single failure will take out multiple zones at once. Unfortunately, whatever happened last night seemed to cause problems in all us-east-1 zones for one of my accounts.

Of the 20+ full time EC2 instances that I am responsible for (including my startup and my personal servers) only one instance was affected. As it turns out it was the one that hosts Alestic.com along with some other services that are important to me personally.

Here are some steps I took in response to the outage:

  1. My instance was responding to pings, but no other services were responding. I tried a reboot through the EC2 API, but that did not seem to take effect as the pings never stopped responding.

  2. I then tried to stop/start the instance as this is an easy way to move an EBS boot instance to new hardware. Unfortunately, the instance stayed in a “stopping” state (for what turned out to be at least 7 hours).

  3. After the stop did not take effect in 20 minutes, I posted on the EC2 forum and noticed that a slew of others were having similar issues.

  4. I was regularly checking the AWS status page and within a short time, Amazon posted that they were having issues.

  5. With an instance wedged, my next move was to bring up a new replacement instance. This is fairly easy for me with complete documented steps and automated scripts that I have used in the past in similar situations.

Unfortunately, my attempts to bring up new EBS boot instances failed in all four availability zones in us-east-1. In three of the zones the new instances went from pending to terminated. In the fourth zone, I got the insufficient capacity error for the c1.medium size.

At this point, I realized that though my disaster recovery strategy for this particular server did span multiple availability zones, it did not extend beyond a single EC2 region. My databases are backed up outside of EC2, so I won’t lose key data, but there are other files that I would have to recreate in order for all systems to operate smoothly.

  1. I decided to trust Amazon to bring things back faster than I could rebuild everything in a new region and went to sleep for a few hours.

  2. Shortly after I woke up, I was able to start a new instance in a different availability zone. To move the services from the wedged instance to the new instance, I:

a. Created snapshots of the EBS volumes on the wedged instance.

b. Created new volumes in the new availability zone based on those snapshots.

c. Stopped the new EBS boot instance.

d. Detached the EBS boot volume from the new instance.

e. Attached the new EBS volumes (created from the snapshots) to the new instance.

f. Started the new EBS boot instance.

  1. Sure ‘nuff. The new instance looked and ran exactly like the old, wedged instance and I was able to assign my Elastic IP Address to the new instance and move on with my day.

Notes

  • Since AWS does not post their status updates to Twitter (that I’ve found), I follow ylastic. Ylastic sends out updates on Twitter as soon as Amazon updates the status page. In fact, Ylastic often sends out problem alerts before Amazon updates their status page!

  • Though I got many errors in trying to run API commands, I kept trying and some requests got through so that I was able to accomplish my recovery well before Amazon announced that things were getting better.

  • Judgment Day = 2011-04-21 (in some timelines)

Lessons learned

I’m happy with AWS and my current strategies except for one weakness: I need to keep backups of everything required to bring up my servers in a different EC2 region, not just a different availability zone. This would also work for moving to a different provider if that became necessary.

At my company, Campus Explorer, we have implemented this higher level of disaster recovery and have even tested it with a fire drill. We were able to recreate our entire service on a different hosting provider using only off-AWS backups in about 24 hours of work. Fortunately for my sleep last night, none of our company services were affected in today’s EC2 outage.