My Experience With the EC2 Judgment Day Outage

Amazon designs availability zones so that it is extremely unlikely that a single failure will take out multiple zones at once. Unfortunately, whatever happened last night seemed to cause problems in all us-east-1 zones for one of my accounts.

Of the 20+ full time EC2 instances that I am responsible for (including my startup and my personal servers) only one instance was affected. As it turns out it was the one that hosts Alestic.com along with some other services that are important to me personally.

Here are some steps I took in response to the outage:

  1. My instance was responding to pings, but no other services were responding. I tried a reboot through the EC2 API, but that did not seem to take effect as the pings never stopped responding.

  2. I then tried to stop/start the instance as this is an easy way to move an EBS boot instance to new hardware. Unfortunately, the instance stayed in a “stopping” state (for what turned out to be at least 7 hours).

  3. After the stop did not take effect in 20 minutes, I posted on the EC2 forum and noticed that a slew of others were having similar issues.

  4. I was regularly checking the AWS status page and within a short time, Amazon posted that they were having issues.

  5. With an instance wedged, my next move was to bring up a new replacement instance. This is fairly easy for me with complete documented steps and automated scripts that I have used in the past in similar situations.

    Unfortunately, my attempts to bring up new EBS boot instances failed in all four availability zones in us-east-1. In three of the zones the new instances went from pending to terminated. In the fourth zone, I got the insufficient capacity error for the c1.medium size.

    At this point, I realized that though my disaster recovery strategy for this particular server did span multiple availability zones, it did not extend beyond a single EC2 region. My databases are backed up outside of EC2, so I won’t lose key data, but there are other files that I would have to recreate in order for all systems to operate smoothly.

  6. I decided to trust Amazon to bring things back faster than I could rebuild everything in a new region and went to sleep for a few hours.

  7. Shortly after I woke up, I was able to start a new instance in a different availability zone. To move the services from the wedged instance to the new instance, I:

    a. Created snapshots of the EBS volumes on the wedged instance.

    b. Created new volumes in the new availability zone based on those snapshots.

    c. Stopped the new EBS boot instance.

    d. Detached the EBS boot volume from the new instance.

    e. Attached the new EBS volumes (created from the snapshots) to the new instance.

    f. Started the new EBS boot instance.

  8. Sure ‘nuff. The new instance looked and ran exactly like the old, wedged instance and I was able to assign my Elastic IP Address to the new instance and move on with my day.
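
The volume swap in step 7 can be sketched roughly as follows. This is an illustration only, not the commands used during the outage: it assumes the present-day boto3 library, and the function name, the `wedged_volumes` mapping, and the instance and zone arguments are all hypothetical. The `ec2` argument is expected to behave like a `boto3.client("ec2")` object.

```python
def move_volumes_to_new_instance(ec2, wedged_volumes, new_instance_id, new_zone):
    """Re-create a wedged instance's EBS volumes on a replacement instance.

    ec2             -- object with the boto3 EC2 client methods (assumption)
    wedged_volumes  -- {device_name: volume_id} on the wedged instance
    new_instance_id -- replacement EBS boot instance, already launched
    new_zone        -- availability zone of the replacement instance
    """
    # a. Snapshot every EBS volume on the wedged instance.
    snapshots = {}
    for device, volume_id in wedged_volumes.items():
        snap = ec2.create_snapshot(VolumeId=volume_id,
                                   Description="recovery of " + volume_id)
        snapshots[device] = snap["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=list(snapshots.values()))

    # b. Create new volumes from those snapshots in the new zone.
    new_volumes = {}
    for device, snapshot_id in snapshots.items():
        vol = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=new_zone)
        new_volumes[device] = vol["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=list(new_volumes.values()))

    # c. Stop the new instance so its volumes can be swapped.
    ec2.stop_instances(InstanceIds=[new_instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[new_instance_id])

    # d. Detach the volumes currently on the devices we are about to reuse
    #    (for an EBS boot instance this includes the boot volume).
    reservations = ec2.describe_instances(InstanceIds=[new_instance_id])["Reservations"]
    for mapping in reservations[0]["Instances"][0]["BlockDeviceMappings"]:
        if mapping["DeviceName"] in new_volumes:
            old_volume_id = mapping["Ebs"]["VolumeId"]
            ec2.detach_volume(VolumeId=old_volume_id)
            ec2.get_waiter("volume_available").wait(VolumeIds=[old_volume_id])

    # e. Attach the volumes created from the snapshots.
    for device, volume_id in new_volumes.items():
        ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=list(new_volumes.values()))

    # f. Start the instance again; it now boots from the restored volumes.
    ec2.start_instances(InstanceIds=[new_instance_id])
    return new_volumes
```

With real credentials this might be called as `move_volumes_to_new_instance(boto3.client("ec2", region_name="us-east-1"), {"/dev/sda1": "vol-..."}, "i-...", "us-east-1b")`, with the IDs filled in from your own account. During an outage like this one, any of these calls can fail or hang (snapshots in particular), so each step may need to be repeated by hand.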

Notes

  • Since AWS does not post their status updates to Twitter (that I’ve found), I follow ylastic. Ylastic sends out updates on Twitter as soon as Amazon updates the status page. In fact, Ylastic often sends out problem alerts before Amazon updates their status page!

  • Though I got many errors in trying to run API commands, I kept trying and some requests got through so that I was able to accomplish my recovery well before Amazon announced that things were getting better.

  • Judgment Day = 2011-04-21 (in some timelines)
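
The "keep trying" approach from the second note can be automated. The sketch below is one possible way to do it, not what was actually run during the outage: the helper name and its parameters are hypothetical, and the exponential backoff with random jitter is an addition that the notes above do not mention.

```python
import random
import time


def call_with_retries(request, max_attempts=8, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call request() until it succeeds or max_attempts is used up.

    request -- zero-argument callable wrapping one API call
    sleep   -- injectable so the wait can be skipped or recorded in tests
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception:
            # Assumption: every exception is treated as transient. Real code
            # would catch only the API client's error type.
            if attempt == max_attempts:
                raise
            # Double the ceiling after each failure, capped at max_delay, and
            # pick a random wait below it so retries do not arrive in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Any single API call can be wrapped in a lambda and passed in, for example `call_with_retries(lambda: ec2.start_instances(InstanceIds=[instance_id]))`, where `ec2` and `instance_id` stand in for your own client and instance.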

Lessons learned

I’m happy with AWS and my current strategies except for one weakness: I need to keep backups of everything required to bring up my servers in a different EC2 region, not just a different availability zone. This would also work for moving to a different provider if that became necessary.

At my company, Campus Explorer, we have implemented this higher level of disaster recovery and have even tested it with a fire drill. We were able to recreate our entire service on a different hosting provider using only off-AWS backups in about 24 hours of work. Fortunately for my sleep last night, none of our company services were affected in today’s EC2 outage.

13 Comments

Fantastic article Eric! Thanks for sharing the info :)

I am the CTO of a startup using AWS and we had to suffer from this outage this morning.

About 2 weeks ago we moved one of the services to AWS following an outage on a dedicated server. We had no hot backup and were down for almost 20 hours although our (well-known) provider had told us via email that we should expect 1 hour recovery time just a few weeks prior.

For our new setup I provisioned 2 servers with MySQL Master-Slave replication, one on the East coast, the other one on the West coast. We had to switch over before the setup was finalized for monetary reasons (that's the life of a startup)...

Early this morning our master database, hosted on EBS in US-EAST-1, was not responding. There was no way to recover the service, so I decided to switch over to the slave server in US-WEST-1.

It was much harder than anticipated because our setup and disaster recovery plan were not fully in place, but none of our customers have reported their service down. We were supposed to do our first drill over the weekend, but instead we got a live opportunity to test this new setup, and it was a lot of stress due to the nature of this service.

We had proper documentation that was crucial for the fast and full recovery.

The AWS management console worked flawlessly during the incident, although we could not take snapshots on the East coast because the EBS volumes were not reachable.

Overall I am very happy with the end result, as it rewarded a lot of hard work. Cloud or not, AWS is neither better nor worse than other solutions. At least AWS makes it somewhat easy to set up across different regions, whereas the other provider required that we have both our servers in the same rack! That's how they lost me and our startup as a customer.

Startups have to build for failures: hardware, network, power, data centers, but also human error (drop table replicated on slave server, oops) and vandalism (delete snapshots and everything else, #@!t).

Thank you Eric, for this post.

My experience has been exactly the same as yours right up to "Created snapshots of the EBS volumes on the wedged instance" which I've been unable to accomplish.

By good luck (not planning) I had an AMI backup from this past weekend which I was able to launch for the time being. If necessary I do have daily incrementals saved to S3 that I can restore.

Meanwhile I will give AWS some time to get the remaining issues resolved and go from there. I don't know what to expect from the wedged instance or the permanently pending snapshots.

Lessons learned for us are the same: Ensure that we have copies of everything in multiple regions and a plan to rapidly move to a different provider if needed.

Eric,

Good article. My company was also affected by today's outage, and my first thought was to bring up volumes in us-west from snapshots. I found that when I went into us-west, though, none of my us-east snapshots were available in my control panel. Is the ability to access snapshots across regions an API-only thing?

Thanks

Jeremy

Jeremy:

Each EC2 region is independent. They do not share AMIs, snapshots, ssh keys, or much of anything.

Let's hope this will push Amazon towards providing an easy way to transfer copies of snapshots to other regions without having to copy everything manually.

Jeremy, the fact that they don't share anything between regions turned out to be a good thing. In the east region, the multi-zone failure is probably related to some shared infrastructure, namely snapshots, AMIs, ssh keys, and personnel. We still don't know the cause; it could have been vandalism.

So in order to build the multi-region redundancy you need to do it yourself. This is why we have built our MySQL replication across the two regions that you mentioned.

This is also why I chose not to go with RDS, because it does not provide for multi-region redundancy. RDS is one level above the hardware, which is why it can and should span across regions.

You can then do snapshots in one or more regions.

Ideally, the better approach is to replicate between two independent cloud providers, because bugs can and will ripple across the world at a speed close to that of light.

This will eventually require standardized APIs between cloud providers, which would also allow moving capacity based on demand, availability, and pricing (e.g. spot instances).

Thanks for the post (and all the other general goodness on the site).

I too got stuck at the same point ianfhood did -- creating snapshots of the wedged instance(s).

tbochud:

I don't find it that difficult to copy stuff between regions, but an API solution for volumes, snapshots, AMIs, buckets without the need for running instances would be cool. Existing API calls are directed at and handled by a single region, so this cross-region interaction would be a new thing, but I'm sure Amazon's up to the challenge.

fmdan:

Yep, my "live" snapshot was stuck for hours and eventually got unwedged by the time I was able to start new instances in other availability zones. HOWEVER, this demonstrates the importance of creating regular snapshots. I could have restored my instance with a recent snapshot and the loss of a short period of data. I was just a bit luckier in that my new snapshot worked.

From what I've seen and heard of the outage so far, AWS users got back on their feet quickly as long as they were performing regular snapshots and architected so that they could move easily between availability zones.

If you can't suffer any short period of data loss, then it's important to keep streaming backups (e.g., replication) in multiple regions or service providers.

Without following these principles, this outage is going to hurt and you'll need to wait for AWS to restore the EBS volumes, which it sounds like they're doing steadily.

We've got a script that takes a new snapshot every 4 hours and deletes an old one. But it does it all on the west coast.

I couldn't tell from any of the reports: was S3 available during the outage? Would it have been possible to copy the snapshot over? Also, how long does it take to copy an 8 GB snapshot between the coasts?
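
The rotation half of a script like the one mentioned in this comment (a new snapshot every 4 hours, an old one deleted) could be sketched like this. It is not the commenter's script: the function name, the `(snapshot_id, start_time)` tuple format, and the default of keeping six snapshots are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, keep=6):
    """Return the IDs of the oldest snapshots beyond the `keep` newest.

    snapshots -- iterable of (snapshot_id, start_time) tuples
    keep      -- how many of the most recent snapshots to retain
    """
    ordered = sorted(snapshots, key=lambda item: item[1])  # oldest first
    excess = max(0, len(ordered) - keep)
    return [snapshot_id for snapshot_id, _ in ordered[:excess]]


# Example: eight snapshots taken four hours apart; keep the newest six.
start = datetime(2011, 4, 21)
history = [("snap-%d" % i, start + timedelta(hours=4 * i)) for i in range(8)]
print(snapshots_to_delete(history, keep=6))  # ['snap-0', 'snap-1']
```

Keeping the selection logic separate from the API calls makes it easy to check what would be deleted before anything is actually removed.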

drormata:

As I understand it and based on my experience, S3 was not affected by the EC2 troubles, but attempting snapshots in the affected availability zone would hang.

To copy a snapshot between regions, you need to create a volume from the snapshot, attach/mount it on an instance, and copy it to another instance in the remote region, save to an EBS volume, and snapshot that volume. Speed depends on your compression algorithm and general Internet weather as regions are connected over public lines.
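
For a rough sense of how long such a copy might take, the arithmetic is just size divided by sustained throughput. The sketch below uses hypothetical numbers: the function name is invented, and the throughput and compression ratio are placeholders to replace with your own measurements, not figures for the links between EC2 regions.

```python
def transfer_seconds(size_gib, throughput_mbit_s, compression_ratio=1.0):
    """Estimate the time to push a volume image over the network.

    size_gib          -- uncompressed size in GiB
    throughput_mbit_s -- sustained throughput in megabits per second (assumed)
    compression_ratio -- uncompressed size / compressed size (1.0 = none)
    """
    bits_on_wire = size_gib * 1024 ** 3 * 8 / compression_ratio
    return bits_on_wire / (throughput_mbit_s * 1_000_000)


# Hypothetical: an 8 GiB volume at a sustained 100 Mbit/s, with and without
# 2:1 compression.
print(round(transfer_seconds(8, 100) / 60, 1), "minutes uncompressed")
print(round(transfer_seconds(8, 100, compression_ratio=2.0) / 60, 1), "minutes at 2:1")
```

At those assumed numbers the uncompressed copy works out to about eleven and a half minutes, before counting the time to create the volume, attach it, and snapshot the result in the remote region.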

Some of our customers do mirroring between EBS volumes and Zadara Storage volumes. Due to the fact that Zadara Storage volumes are in a totally different system, they are not affected when EBS goes down.
Zadara Storage volumes are accessible from all AZs in US East.

http://blog.zadarastorage.com/2012/10/comparing-provisioned-iops-ebs-vs.html
