Opinion: EC2 Outage Was Not an Outage


The Twitter wires are aflame with cute quotes on how lightning from a “cloud” took down Amazon’s EC2 “cloud” service. Snarky snippets sell well on Twitter with no research or understanding of the facts behind the issues involved.

Since “the press” is now asking for my opinion, I figured I’d jot down a quick overview of my thoughts on this non-event, which has been blown out of proportion. Sorry, press: we’re all the press now (for better or for worse), but you’re welcome to extract quotes with proper attribution :)

I don’t consider lightning taking out some racks of EC2 servers to be an “outage” even though this took down some customers’ running instances. EC2 and the rest of AWS were completely functional. If one or more EC2 instances fail for internal or external reasons, any customer who has built a reasonable elastic architecture on EC2 should be able to fire up new servers, automatically or even manually, and fail over with very little downtime, if any.

This was a “failure” or an “error” or a “fault”, not an outage. Architectures built on top of AWS should expect and plan for failures; that’s simply the way the service was designed. AWS provides extensive resources for detecting and dealing with big and small failures and for building highly redundant, fault-tolerant, distributed systems at a global level, rather than at the level of an individual API call or EC2 instance.

At a normal ISP, if your server goes down, it is a serious problem. You have to wait for the ISP to work to bring it up or drive over to the data center and work on it yourself. With EC2, servers are fairly disposable. When an EC2 server goes down (which is still rare) you have at your fingertips thousands of other servers in a half dozen data centers in multiple countries.

A well-designed architecture built on top of EC2 keeps important information (databases, log files, etc.) in easy-to-manage, persistent, and redundant data stores which can be snapshotted, duplicated, detached, and attached to new servers. EC2 provides advanced data center capabilities few companies can build on their own.
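
To make the snapshot-and-reattach idea concrete, here is a minimal sketch in Python. It assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`); the function names, the `/dev/sdf` device name, and the description string are illustrative choices, not anything prescribed by AWS or by this article.

```python
def snapshot_volume(ec2, volume_id):
    """Take a point-in-time snapshot of an EBS volume while it is healthy.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    snap = ec2.create_snapshot(VolumeId=volume_id, Description="routine backup")
    snapshot_id = snap["SnapshotId"]
    # Block until the snapshot is complete and usable for restores.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])
    return snapshot_id


def restore_snapshot_to_instance(ec2, snapshot_id, zone, instance_id,
                                 device="/dev/sdf"):
    """Create a fresh volume from a snapshot and attach it to a server.

    The new volume must be created in the same availability zone as the
    instance it will be attached to.
    """
    vol = ec2.create_volume(AvailabilityZone=zone, SnapshotId=snapshot_id)
    volume_id = vol["VolumeId"]
    # A volume can only be attached once it reaches the "available" state.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id
```

Because the snapshot is not tied to one availability zone, the restore step can target a zone other than the one that failed, which is the point of keeping data outside the instance itself.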

Yes, it can take some time and effort to learn this new way of working with on-demand, self-service, pay-as-you-go hardware infrastructure and sometimes the lessons are learned the hard way, but you’ll be better off in the end.

8 Comments

Your argument is a little inane -- Amazon sells servers, not clusters, and plenty of their customers run single servers.

Whether this is a good idea or not, it doesn't change the fact that a partial outage is still an outage.

Thanks for your comment. I'm expressing an opinionated stance on this to help offset all of the opinions that are being expressed on the other side :)

Yes, Amazon sells (rents) servers which are the building blocks for you to use in designing your architecture. Amazon explicitly does *not* sell expensive, highly redundant, fault tolerant servers, with permanent storage and frequent backups. They rent hours on near-commodity hardware where even the local disk vanishes if it goes down accidentally.

It looks like Amazon took pains to try to restore customer instances in this case, but to me this is above and beyond the call of duty. Most of those customers should have already started new instances and moved on by the time the old ones could be restored.

I don't consider an EC2 server going down to be an outage or even a "partial outage". It's a failure, and we should be building architectures which handle occasional failures, even on a single server.

Though I don't think EC2 is necessarily the best place for single servers, I do run single servers on EC2 for my personal use. If one goes down, I get notified within seconds and I can launch a replacement server within minutes using installation and customization scripts, EBS volumes, and Elastic IPs, the building blocks provided by AWS. If I cared more about uptime on these services, I would use multiple servers and automatic failover.
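
For readers wondering what such a replacement script might look like, here is a rough sketch in Python. It is not the actual script used here. It assumes `ec2` is a boto3 EC2 client, that the EBS volume is already detached from the failed server and lives in the target zone, and that the Elastic IP is identified by an allocation ID; the function name and all parameter names are hypothetical.

```python
def launch_replacement(ec2, ami_id, instance_type, zone, user_data,
                       volume_id, allocation_id, device="/dev/sdf"):
    """Start a new server and hand it the old server's disk and address.

    `ec2` is assumed to be a boto3 EC2 client. `user_data` is the
    installation/customization script the instance runs on first boot.
    """
    resp = ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        UserData=user_data,
        # EBS volumes can only attach to instances in the same zone.
        Placement={"AvailabilityZone": zone},
    )
    instance_id = resp["Instances"][0]["InstanceId"]

    # Volumes and addresses cannot be attached until the instance is running.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    # Reattach the persistent data, then repoint the public address.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    ec2.associate_address(AllocationId=allocation_id, InstanceId=instance_id)
    return instance_id
```

Moving the Elastic IP last means traffic only reaches the new server once its data volume is in place.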

I should also point folks to the EC2 SLA:
http://aws.amazon.com/ec2-sla/
You'll note that Amazon does not consider it an outage when a complete data center (availability zone) gets wiped off the map. Two whole data centers in a single region have to be unavailable before the SLA starts to think that there is a serious problem.
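
As a toy illustration of that threshold, the rule as described here can be written in a few lines of Python. This is a simplified reading of the paragraph above, not the SLA's full legal definition, and the zone names are made up.

```python
def region_unavailable(zone_available):
    """Return True if at least two availability zones are down.

    `zone_available` maps a zone name to True (reachable) or False
    (unavailable). One failed zone alone does not count as a region
    outage under this simplified rule.
    """
    zones_down = [zone for zone, ok in zone_available.items() if not ok]
    return len(zones_down) >= 2


# One zone lost: painful for instances in it, but not a region outage.
one_down = {"zone-a": False, "zone-b": True, "zone-c": True}
# Two zones lost: this is where the rule starts to apply.
two_down = {"zone-a": False, "zone-b": False, "zone-c": True}
```

Under this rule, `region_unavailable(one_down)` is false and `region_unavailable(two_down)` is true.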

That's how big Amazon thinks and how big we must start thinking before we complain about outages. The amazing thing is that AWS provides the tools to survive a complete data center failure and get back up and running promptly. What low-budget startup was able to do that a couple years ago?

I think even "failure" or "error" is still a bit strong. Servers (and the data centres that house them) are machines and they will fail at some point. It's not a matter of if, but when. So this really was just normal operation - a non-event really. Amazon don't give a 100% SLA on a single zone - obviously they anticipate the occasional failure, so why should users think they know better?

At least Amazon give you easy access to multiple availability zones and additional regions. If your application is critical enough, then you need to design it to make use of those resources. Like any tool, you have to use it the right way.

Unless it's an extended power outage, Amazon doesn't give you any advantages whatsoever. You can't just bring back up a downed database server in minutes, I don't care what all the tutorials say. A power outage for us is usually going to involve about an hour of downtime, even if the outage is just for 5 minutes. And for us, that's serious.

Treating failures as the norm is not a workable solution for most companies. It's not cost effective. Failures you plan for, yes, but treating them as the norm is crazy. You cannot equate how someone like Amazon or Google uses the cloud to how smaller companies use the cloud; they are completely different scenarios. The cloud is only part of your architecture. You don't start out by building the same architecture that a Google has.

That said, I think Amazon has reasonable uptime, and their cloud does give you protection from extended outages, which is definitely a plus. Our data center recently went out for 7 hours, and that would have been shortened significantly if we were on EC2. Unfortunately, EC2 isn't really ideal for running high volume databases, so we can't use it for now, but that's another story.

snacktime: Not true.

I spent about a working day figuring out how to use EC2 and its associated services to build a fault-tolerant server infrastructure for a single server.

I'm currently running a single small instance with Amazon that runs a small but steadily-busy web application that has about a 5GB database. In the case of failure, I can launch a new fully-operational server in about three minutes, with at most one hour of lost data. I know this, because I've done it a few times to make sure I had the process down cold.

For all this, I'm paying about $50/mo.

heya,

beandog: Yeah, I have a similar single-server setup (LAMP), and I'm trying to figure out how to make it fault-tolerant as well. Do you think you could provide any notes on your setup?

I currently have EBS set up, but need to have it so that it spins up a new instance if the first one dies, and takes over.

It's been mentioned that I can use ELB (Elastic Load Balancing) and CloudWatch. Not quite sure how this will work.

Also, are there alternative solutions?

Cheers,
Victor
