The Twitter wires are aflame with cute quotes on how lightning from a “cloud” took down Amazon’s EC2 “cloud” service. Snarky snippets sell well on Twitter with no research or understanding of the facts behind the issues involved.
Since “the press” is now asking for my opinion, I figured I’d jot down a quick overview of my thoughts on this non-event, which has been blown out of proportion. Sorry, press: we’re all the press now (for better or for worse), but you’re welcome to extract quotes with proper attribution :)
I don’t consider lightning taking out some racks of EC2 servers to be an “outage”, even though it took down some customers’ running instances. EC2 and the rest of AWS were completely functional. If one or more EC2 instances fail for internal or external reasons, any customer who has built a reasonable elastic architecture on EC2 should be able to fire up new servers, automatically or even manually, and fail over with very little downtime, if any.
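As a rough illustration of the automatic case, here is a minimal sketch in Python of a replace-on-failure pass over a fleet. Nothing in it is an AWS API: `replace_failed`, `is_healthy`, and `launch_replacement` are hypothetical names, standing in for whatever health check and provisioning call a given setup actually uses.

```python
from typing import Callable, List


def replace_failed(
    servers: List[str],
    is_healthy: Callable[[str], bool],
    launch_replacement: Callable[[], str],
) -> List[str]:
    """Return the fleet with every unhealthy server swapped for a fresh one.

    `is_healthy` and `launch_replacement` are hypothetical hooks supplied by
    the caller; they are not part of any AWS library.
    """
    fleet = []
    for server in servers:
        if is_healthy(server):
            fleet.append(server)
        else:
            # Treat the failed server as disposable: do not repair it,
            # just start a new one in its place.
            fleet.append(launch_replacement())
    return fleet
```

A monitor could run something like this every minute or so. The point is only that the failed server is never waited on or repaired; it is replaced.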
This was a “failure” or an “error” or a “fault”, not an outage. Architectures built on top of AWS should expect and plan for failures; that’s simply the way the service was designed. AWS provides extensive resources for detecting and dealing with failures big and small, and for building highly redundant, fault-tolerant, distributed systems at a global level, rather than at the level of an individual API call or EC2 instance.
At a normal ISP, if your server goes down, it is a serious problem. You have to wait for the ISP to bring it back up, or drive over to the data center and work on it yourself. With EC2, servers are fairly disposable. When an EC2 server goes down (which is still rare), you have at your fingertips thousands of other servers in a half dozen data centers in multiple countries.
A well-designed architecture built on top of EC2 keeps important information (databases, log files, etc.) in easy-to-manage, persistent, redundant data stores which can be snapshotted, duplicated, detached, and attached to new servers. EC2 provides advanced data center capabilities few companies can build on their own.
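To make the snapshot, detach, and attach steps concrete, here is a minimal sketch in Python. It assumes the `ec2` argument is a boto3 EC2 client (for example `boto3.client("ec2")`), which is my choice of tooling for illustration; `move_volume` and the default device name are hypothetical. It also assumes the volume and the replacement server are in the same Availability Zone, which attaching requires.

```python
def move_volume(ec2, volume_id: str, new_instance_id: str,
                device: str = "/dev/sdf") -> str:
    """Snapshot a volume, then detach it and attach it to a replacement server.

    `ec2` is assumed to be a boto3 EC2 client. Returns the ID of the
    snapshot started before the move.
    """
    # Start a point-in-time backup first, in case the move goes wrong.
    snapshot = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"pre-failover snapshot of {volume_id}",
    )

    # Release the volume from the old server and wait until it is free.
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the same data to the new server.
    ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId=new_instance_id,
        Device=device,
    )
    return snapshot["SnapshotId"]
```

If the old server is completely unresponsive, the detach step may need the `Force=True` option, which risks losing writes that had not yet reached the volume; that trade-off is left out of the sketch.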
Yes, it can take some time and effort to learn this new way of working with on-demand, self-service, pay-as-you-go hardware infrastructure, and sometimes the lessons are learned the hard way, but you’ll be better off in the end.