Amazon EC2 has been experiencing some power issues in a portion of one of their many data centers. Even though the relative percentage of people affected might be small, when you have as many customers as AWS does, a small fraction can still be a large absolute number of customers who are affected.
Naturally, some customers will be upset about not having access to their systems, but in the time it takes to write a complaint, you might be able to move your server to new hardware within EC2 and go on with your business.
First, let's assume that you are running an EBS boot instance. If you didn’t think they were the way to go before reading this article, I expect to convince you with this one example (and there are a number of other benefits).
Setup
For this demo, I’m going to start an instance. In your situation this instance represents the currently running server that you are depending on and which has valuable software, configuration, and perhaps data on its EBS volume(s). I’m also going to drop a note on the EBS root disk on my running instance so that I know it is the one I wanted to preserve. Again, this is just setting things up for the demo:
# SKIP THIS ENTIRE SECTION IF YOU ALREADY HAVE AN INSTANCE RUNNING
keypair=YOURKEYPAIR
sshkey=.ssh/$keypair.pem # or wherever you keep it
region=us-west-1 # pick your region
zone=${region}a # pick your availability zone
type=m1.small # pick your size
amiid=ami-cb97c68e # Ubuntu 10.04 Lucid, 32-bit, EBS boot in that region
oldinstanceid=$(ec2-run-instances \
--key $keypair \
--region $region \
--availability-zone $zone \
--instance-type $type \
$amiid |
egrep ^INSTANCE | cut -f2)
echo "instanceid=$oldinstanceid"
while host=$(ec2-describe-instances --region $region "$oldinstanceid" |
egrep ^INSTANCE | cut -f4) && test -z "$host"; do echo -n .; sleep 1; done
echo host=$host
echo "save the volume" | ssh -i $sshkey ubuntu@$host tee README.txt
Moving to a New Instance
Now, we pretend that the above created instance has failed in some way and we can no longer access it. Here’s how we get our server running on a new instance:
If you can, stop the old instance. This increases your chance of keeping the file system consistent on the EBS volume. If the instance has really failed and this step does not work, then skip it.
ec2-stop-instances --region $region $oldinstanceid
Run a new instance with the same startup parameters as your old instance.
newinstanceid=$(ec2-run-instances \
--key $keypair \
--region $region \
--availability-zone $zone \
--instance-type $type \
$amiid |
egrep ^INSTANCE | cut -f2)
echo "newinstanceid=$newinstanceid"
Wait until the new instance is running and then stop (not terminate) the new instance and detach the EBS boot volume from it. Delete the volume as it has nothing of importance, having just been created.
ec2-stop-instances --region $region $newinstanceid
newebsroot=$(ec2-describe-instances --region $region $newinstanceid | grep ^BLOCKDEVICE | grep /dev/sda1 | cut -f3)
ec2-detach-volume --force --region $region $newebsroot
ec2-delete-volume --region $region $newebsroot
Detach the old (valuable) EBS root volume from the old (broken) instance.
oldebsroot=$(ec2-describe-instances --region $region $oldinstanceid | grep ^BLOCKDEVICE | grep /dev/sda1 | cut -f3)
ec2-detach-volume --force --region $region $oldebsroot
Attach the old (valuable) EBS root volume to the new (stopped) instance.
ec2-attach-volume \
--region $region \
-d /dev/sda1 \
-i $newinstanceid \
$oldebsroot
If you had multiple EBS volumes attached to the old instance, you would move each one over in a similar manner.
Restart the new instance which is now going to boot with the original volume.
ec2-start-instances --region $region $newinstanceid
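The steps above say to wait until the new instance is running before stopping it, but none of the commands actually wait. Here is a rough sketch of a polling helper you could use for that. The function names, the describe and poll_interval variables, and the retry count are my own inventions for illustration, not part of the EC2 API tools, and I am assuming the instance state shows up in the sixth tab-separated field of the INSTANCE line.

```shell
# Hypothetical helpers (not part of the EC2 API tools).
# "describe" can be overridden, e.g. with a stub, for testing.
describe=${describe:-ec2-describe-instances}

instance_state() {
  # usage: instance_state REGION INSTANCEID
  # Assumption: the state is field 6 of the tab-separated INSTANCE line.
  $describe --region "$1" "$2" | grep '^INSTANCE' | cut -f6
}

wait_for_state() {
  # usage: wait_for_state REGION INSTANCEID STATE [MAXTRIES]
  # Polls until the instance reports STATE; gives up after MAXTRIES polls.
  tries=${4:-120}
  while [ "$tries" -gt 0 ]; do
    [ "$(instance_state "$1" "$2")" = "$3" ] && return 0
    tries=$((tries - 1))
    sleep "${poll_interval:-5}"
  done
  echo "timed out waiting for $2 to reach state $3" >&2
  return 1
}
```

You might call it as wait_for_state $region $newinstanceid running before stopping the new instance, and again with stopped before detaching its root volume.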
Voila! You have moved your server from an old, perhaps broken instance to new (or at least different) hardware keeping the same file system, and it took only a few minutes! If you’d like, ssh to the new instance and make sure that your valuable information is still there.
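If the old instance had more than one EBS volume attached, the same detach and attach pair can be looped over all of them. The following is only a sketch: run, list_volumes, and move_volumes are hypothetical names of mine, and I am assuming the BLOCKDEVICE lines carry the device in field 2 and the volume id in field 3 (which is what the cut -f3 used earlier relies on). Set DRY_RUN=1 to print the commands instead of running them.

```shell
# Hypothetical helpers (not part of the EC2 API tools).
run() {
  # With DRY_RUN set, print the command instead of executing it.
  if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

list_volumes() {
  # Reads ec2-describe-instances output on stdin, prints "device volumeid".
  # Assumption: BLOCKDEVICE lines hold the device in field 2, volume in field 3.
  grep '^BLOCKDEVICE' | cut -f2,3 | tr '\t' ' '
}

move_volumes() {
  # usage: move_volumes REGION OLDINSTANCEID NEWINSTANCEID
  ec2-describe-instances --region "$1" "$2" | list_volumes |
  while read -r device volume; do
    # </dev/null keeps the commands from eating the loop's input.
    run ec2-detach-volume --force --region "$1" "$volume" </dev/null
    run ec2-attach-volume --region "$1" -d "$device" -i "$3" "$volume" </dev/null
  done
}
```

Detaching is not instantaneous, so in real use an attach may fail until the volume is available again; try it with DRY_RUN=1 first and add a pause or retry if you need one. The new instance must also be stopped with its own root volume already removed, as in the steps above.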
If you had an Elastic IP address associated with the old instance, you would move it to the new instance.
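Moving the Elastic IP address is a single ec2-associate-address call. This sketch wraps it in a function so you can preview it first. The names run and move_elastic_ip are hypothetical, and the address in the usage note is a placeholder, not a real Elastic IP. Set DRY_RUN=1 to print the command instead of running it.

```shell
# Hypothetical helpers (not part of the EC2 API tools).
run() {
  # With DRY_RUN set, print the command instead of executing it.
  if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

move_elastic_ip() {
  # usage: move_elastic_ip REGION ELASTICIP NEWINSTANCEID
  # Associating the address with the new instance takes it away from the old one.
  run ec2-associate-address --region "$1" -i "$3" "$2"
}
```

For example, move_elastic_ip $region 203.0.113.10 $newinstanceid, once the new instance is running again.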
Cleanup
You may terminate the old instance if you are comfortable that you won’t need it any more. If you were following this demo as an exercise, you should also terminate the new instance. Since you manually attached the old volume to the new instance yourself, it will not be deleted automatically when the instance is terminated. You can modify the instance attributes to change the delete-on-termination flag for the volume or simply delete it manually.
# BEWARE! Don't copy these blindly, but think about what you should do
ec2-terminate-instances --region $region $oldinstanceid
ec2-terminate-instances --region $region $newinstanceid
ec2-delete-volume --region $region $oldebsroot
Tips
The above process can also be used when your instance is running fine, but you want to move to a different instance type (size) of the same architecture. For example, you could move from m1.small up to c1.medium, or from m2.4xlarge down to c1.xlarge. Update: I wasn’t thinking clearly when I wrote that last sentence. It is possible to change the instance type much more easily: Simply stop the instance, use ec2-modify-instance-attribute, and start it up again.
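The simpler stop, modify, start approach from the update might look something like the sketch below. The command in the EC2 API tools is named ec2-modify-instance-attribute, and it takes an --instance-type option. The run and change_instance_type names are hypothetical, and the wait loop assumes the instance state is the sixth tab-separated field of the INSTANCE line. Set DRY_RUN=1 to print the commands instead of running them.

```shell
# Hypothetical helpers (not part of the EC2 API tools).
run() {
  # With DRY_RUN set, print the command instead of executing it.
  if [ -n "${DRY_RUN:-}" ]; then echo "$@"; else "$@"; fi
}

change_instance_type() {
  # usage: change_instance_type REGION INSTANCEID NEWTYPE
  run ec2-stop-instances --region "$1" "$2"
  if [ -z "${DRY_RUN:-}" ]; then
    # The type can only be changed once the instance is fully stopped.
    # Assumption: the state is field 6 of the INSTANCE line.
    until ec2-describe-instances --region "$1" "$2" |
          grep '^INSTANCE' | cut -f6 | grep -q '^stopped$'; do
      sleep 5
    done
  fi
  run ec2-modify-instance-attribute --region "$1" --instance-type "$3" "$2"
  run ec2-start-instances --region "$1" "$2"
}
```

For example, change_instance_type $region $oldinstanceid c1.medium. Keep in mind that stopping an instance discards whatever is on its ephemeral disks.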
You can also resize the root disk of a running EC2 instance using the same basic principle of swapping out an EBS root volume on a running instance.


As tested by Twitter user @schmidtcw, this technique works for Windows instances as well.
http://twitter.com/schmidtcw/statuses/13432697433
Instead of going to the trouble of creating a new instance and moving the boot volume, can't you just stop the old instance, change the instance type, and restart? It seems that should move you to new hardware even faster?
You should also note that both of these processes will lose all the data on your ephemeral disks.
-tom
Tom: Yes, I caught this myself and updated the document before I received your comment, but thanks for pointing it out. Your observation on the local storage is also a good one. I tend not to think about it much since most everything I do is on EBS volumes these days.
Impeccable timing. I received a notice yesterday about a hardware failure on one of my instances. Thanks to all your other tutorials and this helpful reminder, I was up and running on a different instance in no time.
The only catch for me was that I couldn't get the instance to stop, nor could I detach the volumes. Fortunately I was able to create snapshots and go from there.
jedwood: ec2-detach-volume --force usually gets the job done, but I have seen cases where hardware failure prevents detaching. If you go the snapshot route, it might be easier to just register the snapshot as a new AMI and run an instance of it. My approach above (transferring the volume without a snapshot) is optimized for time to recovery, but snapshot AMIs are a bit easier.
As per my earlier comment, I am running a Drupal site on an EBS Lucid instance. When I try to relaunch an AMI created from the original instance, nothing works: Apache is not running, nor is MySQL, etc. Would it be because it's still transferring data from S3?
I can't debug your issue from here, but if you post complete instructions on how to reproduce your problem to http://groups.google.com/group/ec2ubuntu somebody might have some ideas.
Hi Eric,
I did a dry run of your steps here. If I am able to stop the system, then I can detach all EBS volumes, including the root volume. But if I do not stop the system, I am only able to detach the non-root EBS volumes, which makes sense.
So, if I have a system which suddenly becomes inaccessible and I am not able to stop it, I should still be able to detach the non-root EBS volumes. Then I can fire up a new system with my private root partition image and attach these non-root volumes to the new system.
In case I am not able to detach any of the volumes, then we need to recover from snapshots of the EBS volumes. The only issue here is that the system cannot be used for production, due to poor response time, until all the blocks are copied from the snapshots, which can take up to 4-5 hours for a 500GB volume.
Is there any better way to recover quickly if you are not able to salvage your EBS volumes from an inaccessible/hosed/failed system?
thanks,
anil
anil: If you set up the root EBS volume to persist after instance termination, then you might be able to terminate the instance and force detach the EBS root volume.
If your project requires rapid recovery times, you might want to keep a hot spare standing by, updating live from the master, and ready for a fast failover, perhaps even automated.