Amazon designs availability zones so that it is extremely unlikely that a single failure will take out multiple zones at once. Unfortunately, whatever happened last night seemed to cause problems in all us-east-1 zones for one of my accounts.
Of the 20+ full-time EC2 instances that I am responsible for (including my startup and my personal servers), only one instance was affected. As it turns out, it was the one that hosts Alestic.com along with some other services that are important to me personally.
Here are some steps I took in response to the outage:
My instance was responding to pings, but no other services were responding. I tried a reboot through the EC2 API, but that did not seem to take effect as the pings never stopped responding.
I then tried to stop/start the instance as this is an easy way to move an EBS boot instance to new hardware. Unfortunately, the instance stayed in a “stopping” state (for what turned out to be at least 7 hours).
After the stop did not take effect in 20 minutes, I posted on the EC2 forum and noticed that a slew of others were having similar issues.
I was regularly checking the AWS status page and within a short time, Amazon posted that they were having issues.
With an instance wedged, my next move was to bring up a new replacement instance. This is fairly easy for me with complete documented steps and automated scripts that I have used in the past in similar situations.
Unfortunately, my attempts to bring up new EBS boot instances failed in all four availability zones in us-east-1. In three of the zones the new instances went from “pending” to “terminated”. In the fourth zone, I got the insufficient capacity error for the c1.medium size.
At this point, I realized that though my disaster recovery strategy for this particular server did span multiple availability zones, it did not extend beyond a single EC2 region. My databases are backed up outside of EC2, so I won’t lose key data, but there are other files that I would have to recreate in order for all systems to operate smoothly.
I decided to trust Amazon to bring things back faster than I could rebuild everything in a new region and went to sleep for a few hours.
Shortly after I woke up, I was able to start a new instance in a different availability zone. To move the services from the wedged instance to the new instance, I:
a. Created snapshots of the EBS volumes on the wedged instance.
b. Created new volumes in the new availability zone based on those snapshots.
c. Stopped the new EBS boot instance.
d. Detached the EBS boot volume from the new instance.
e. Attached the new EBS volumes (created from the snapshots) to the new instance.
f. Started the new EBS boot instance.
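Steps a through f can be sketched in Python roughly as follows. This is only an illustration, not the tooling used during the outage: it assumes a boto3 EC2 client is passed in as `ec2`, and the function name, arguments, and device mapping are hypothetical.

```python
def move_to_new_instance(ec2, old_volumes, new_instance_id,
                         new_boot_volume_id, new_zone):
    """Re-home EBS volumes from a wedged instance onto a fresh instance.

    ec2                -- assumed to be a boto3 EC2 client, for example
                          boto3.client("ec2", region_name="us-east-1")
    old_volumes        -- {device name: volume id} on the wedged instance
    new_instance_id    -- the freshly launched EBS boot instance
    new_boot_volume_id -- the boot volume that came with the new instance
    new_zone           -- availability zone of the new instance
    """
    # a. Snapshot the EBS volumes on the wedged instance.
    snapshots = {}
    for device, volume_id in old_volumes.items():
        snap = ec2.create_snapshot(VolumeId=volume_id,
                                   Description=f"rescue of {volume_id}")
        snapshots[device] = snap["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(
        SnapshotIds=list(snapshots.values()))

    # b. Create new volumes in the new availability zone from the snapshots.
    new_volumes = {}
    for device, snapshot_id in snapshots.items():
        vol = ec2.create_volume(SnapshotId=snapshot_id,
                                AvailabilityZone=new_zone)
        new_volumes[device] = vol["VolumeId"]
    ec2.get_waiter("volume_available").wait(
        VolumeIds=list(new_volumes.values()))

    # c. Stop the new EBS boot instance.
    ec2.stop_instances(InstanceIds=[new_instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[new_instance_id])

    # d. Detach the boot volume that came with the new instance.
    ec2.detach_volume(VolumeId=new_boot_volume_id, InstanceId=new_instance_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_boot_volume_id])

    # e. Attach the volumes created from the snapshots, same device names.
    for device, volume_id in new_volumes.items():
        ec2.attach_volume(VolumeId=volume_id, InstanceId=new_instance_id,
                          Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=list(new_volumes.values()))

    # f. Start the new instance again.
    ec2.start_instances(InstanceIds=[new_instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_instance_id])
    return new_volumes
```

During an outage like this one, the snapshot in step a may hang; the waiter would then time out and raise, so a real script would want to handle that case rather than assume success.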
Sure ‘nuff. The new instance looked and ran exactly like the old, wedged instance and I was able to assign my Elastic IP Address to the new instance and move on with my day.
Notes
Since AWS does not post their status updates to Twitter (that I’ve found), I follow ylastic. Ylastic sends out updates on Twitter as soon as Amazon updates the status page. In fact, Ylastic often sends out problem alerts before Amazon updates their status page!
Though I got many errors in trying to run API commands, I kept trying and some requests got through so that I was able to accomplish my recovery well before Amazon announced that things were getting better.
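The "keep trying" approach can be automated with a small retry wrapper. The sketch below is an assumption about how one might do it, not what was run during the outage; `call_with_retries` and its parameters are hypothetical names, and it uses exponential backoff with jitter so that retries do not hammer an API that is already struggling.

```python
import random
import time


def call_with_retries(request, attempts=8, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call request() until it succeeds or the attempts run out.

    request -- a zero-argument callable wrapping one API call
    sleep   -- injectable for testing; defaults to time.sleep
    """
    for attempt in range(attempts):
        try:
            return request()
        except Exception:
            # A real script would catch only the API client's error type.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff, capped, with jitter to spread retries out.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

A call would then be wrapped as, for example, `call_with_retries(lambda: ec2.start_instances(InstanceIds=[instance_id]))`, where `ec2` and `instance_id` stand in for whatever client and instance are in use.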
Judgment Day = 2011-04-21 (in some timelines)
Lessons learned
I’m happy with AWS and my current strategies except for one weakness: I need to keep backups of everything required to bring up my servers in a different EC2 region, not just a different availability zone. This would also work for moving to a different provider if that became necessary.
At my company, Campus Explorer, we have implemented this higher level of disaster recovery and have even tested it with a fire drill. We were able to recreate our entire service on a different hosting provider using only off-AWS backups in about 24 hours of work. Fortunately for my sleep last night, none of our company services were affected in today’s EC2 outage.


Fantastic article Eric! Thanks for sharing the info :)
I am the CTO of a startup using AWS and we had to suffer from this outage this morning.
About 2 weeks ago we moved one of the services to AWS following an outage on a dedicated server. We had no hot backup and were down for almost 20 hours although our (well-known) provider had told us via email that we should expect 1 hour recovery time just a few weeks prior.
For our new setup I provisioned 2 servers with MySQL Master-Slave replication, one on the East coast, the other one on the West coast. We had to switch over before the setup was finalized for monetary reasons (that's the life of a startup)...
Early this morning our master database hosted on EBS in us-east-1 was not responding. There was no way to recover the service and I decided to switch over to the slave server in us-west-1.
It was much harder than anticipated because our setup and disaster recovery plan were not fully in place, but none of our customers have reported their service down. We were supposed to do our first drill over the weekend but we had a live opportunity to test this new setup and it was a lot of stress due to the nature of this service.
We had proper documentation that was crucial for the fast and full recovery.
The AWS management console worked flawlessly during the incident, although we could not take snapshots on the East coast because the EBS volumes were not reachable.
Overall I am very happy with the end result as it rewarded a lot of hard work. Cloud or not, AWS is neither better nor worse than other solutions. At least AWS makes it somewhat easy to set up across different regions, whereas the other provider required that we have both our servers in the same rack! That's how they lost me and our startup as a customer.
Startups have to build for failures: hardware, network, power, data centers, but also human error (drop table replicated on slave server, oops) and vandalism (delete snapshots and everything else, #@!t).
Thank you Eric, for this post.
My experience has been exactly the same as yours right up to "Created snapshots of the EBS volumes on the wedged instance" which I've been unable to accomplish.
By good luck (not planning) I had an AMI backup from this past weekend which I was able to launch for the time being. If necessary I do have daily incrementals saved to S3 that I can restore.
Meanwhile I will give AWS some time to get the remaining issues resolved and go from there. I don't know what to expect from the wedged instance or the permanently pending snapshots.
Lessons learned for us are the same: Ensure that we have copies of everything in multiple regions and a plan to rapidly move to a different provider if needed.
Eric,
Good article. My company was also affected by today's outage and my first thought was to bring up volumes in us-west from snapshots. I found that when I went into us-west, though, none of my us-east snapshots were available in my control panel. Is the ability to access snapshots across regions an API-only thing?
Thanks
Jeremy
Jeremy:
Each EC2 region is independent. They do not share AMIs, snapshots, ssh keys, or much of anything.
Let's hope this will push Amazon towards providing an easy way to transfer copies of snapshots to other regions without having to copy everything manually.
Jeremy, the fact that they don't share anything between regions turned out to be a good thing. In the east region, the multi-zone failure is probably related to some shared infrastructure, namely snapshots, AMIs, ssh keys, and personnel. We still don't know the cause; it could have been vandalism.
So in order to build the multi-region redundancy you need to do it yourself. This is why we have built our MySQL replication across the two regions that you mentioned.
This is also why I chose not to go with RDS, because it does not provide for multi-region redundancy. RDS is one level above the hardware which is why it can and should span across regions.
You can then do snapshots in one or more regions.
Ideally, it is better still to replicate between two independent cloud providers, because bugs can and will ripple across the world at a speed close to that of light.
This will eventually require standardized APIs between cloud providers, which would also allow capacity to be moved based on demand, availability and pricing (e.g. spot instances).
Thanks for the post (and all the other general goodness on the site).
I too got stuck at the same point ianfhood did -- creating snapshots of the wedged instance(s).
tbochud:
I don't find it that difficult to copy stuff between regions, but an API solution for volumes, snapshots, AMIs, buckets without the need for running instances would be cool. Existing API calls are directed at and handled by a single region, so this cross-region interaction would be a new thing, but I'm sure Amazon's up to the challenge.
fmdan:
Yep, my "live" snapshot was stuck for hours and eventually got unwedged by the time I was able to start new instances in other availability zones. HOWEVER, this demonstrates the importance of creating regular snapshots. I could have restored my instance with a recent snapshot and the loss of a short period of data. I was just a bit luckier in that my new snapshot worked.
From what I've seen and heard of the outage so far, AWS users got back on their feet quickly as long as they were performing regular snapshots and architected so that they could move easily between availability zones.
If you can't suffer any short period of data loss, then it's important to keep streaming backups (e.g., replication) in multiple regions or service providers.
Without following these principles, this outage is going to hurt and you'll need to wait for AWS to restore the EBS volumes, which it sounds like they're doing steadily.
We've got a script that takes a new snapshot every 4 hours, and deletes an old one. But it does it all in the west coast.
I couldn't tell from any of the reports: was S3 available during the outage? Would it have been possible to copy the snapshot over? Also, how long does it take to copy an 8 GB snapshot between the coasts?
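For readers who don't have one yet, a rotation script of the kind described (take a new snapshot, delete old ones) might look something like this. It is a sketch only, not the commenter's actual script: it assumes a boto3 EC2 client passed in as `ec2`, and `rotate_snapshots` and `keep` are hypothetical names. Like the script described, it works within a single region.

```python
def rotate_snapshots(ec2, volume_id, keep=6):
    """Start a new snapshot of volume_id, then prune old completed ones.

    ec2  -- assumed to be a boto3 EC2 client for the volume's region
    keep -- number of completed snapshots to retain (6 at a 4-hour
            interval is about one day of history)
    Returns (new snapshot id, list of deleted snapshot ids).
    """
    new = ec2.create_snapshot(VolumeId=volume_id,
                              Description=f"scheduled backup of {volume_id}")

    # Only completed snapshots are candidates for deletion, so a good
    # backup is never traded for one that is still pending (or stuck).
    # Caveat: this matches every completed snapshot of the volume, not
    # just the ones this script created, and ignores result pagination.
    response = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    newest_first = sorted(response["Snapshots"],
                          key=lambda s: s["StartTime"], reverse=True)

    deleted = []
    for snap in newest_first[keep:]:
        ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
        deleted.append(snap["SnapshotId"])
    return new["SnapshotId"], deleted
```

Run from cron every four hours, this keeps a rolling window of recent snapshots, but they all live in the same region as the volume.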
drormata:
As I understand it and based on my experience, S3 was not affected by the EC2 troubles, but attempting snapshots in the affected availability zone would hang.
To copy a snapshot between regions, you need to create a volume from the snapshot, attach/mount it on an instance, and copy it to another instance in the remote region, save to an EBS volume, and snapshot that volume. Speed depends on your compression algorithm and general Internet weather as regions are connected over public lines.
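To make the copy step and the timing question a little more concrete, here is a rough sketch. One possible way to move the data is a block-level stream with dd, gzip, and ssh; the first helper only builds that command line as a string. The second does the back-of-the-envelope arithmetic for transfer time. All names, hosts, devices, and figures here are illustrative assumptions, not measurements.

```python
import shlex


def build_copy_command(source_device, remote_host, remote_device,
                       gzip_level=1):
    """Return a shell pipeline that streams a block device to a remote host.

    Assumes GNU dd, gzip, and ssh on both instances, and enough privileges
    to read and write the devices. The volume being read should be
    unmounted (or mounted read-only) while it is copied.
    """
    remote = f"gunzip | dd of={shlex.quote(remote_device)} bs=1M"
    return (
        f"dd if={shlex.quote(source_device)} bs=1M"
        f" | gzip -{gzip_level}"
        f" | ssh {shlex.quote(remote_host)} {shlex.quote(remote)}"
    )


def estimate_copy_seconds(size_gib, compression_ratio, throughput_mbit_s):
    """Estimate transfer time for a compressed stream.

    compression_ratio -- compressed size / original size (0.5 = halves)
    throughput_mbit_s -- sustained network throughput in megabits/second
    """
    bits_on_wire = size_gib * 2**30 * 8 * compression_ratio
    return bits_on_wire / (throughput_mbit_s * 1_000_000)


# Hypothetical host and device names, for illustration only.
print(build_copy_command("/dev/sdf", "root@west.example.com", "/dev/sdg"))
print(round(estimate_copy_seconds(8, 0.5, 100) / 60, 1), "minutes")
```

With those made-up figures (8 GiB, data that compresses to half its size, a sustained 100 Mbit/s between regions), the estimate comes to a little under six minutes; real numbers will vary with the data and the network.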