Fixing Files on the Root EBS Volume of an EC2 Instance

You can examine and edit files on the root EBS volume of an EC2 instance even if you find yourself in what you might consider a disastrous situation, such as:

  • You lost your ssh key or forgot your password

  • You made a mistake editing the /etc/sudoers file and can no longer gain root access with sudo to fix it

  • Your long-running instance is hung for some reason, cannot be contacted, and fails to boot properly

  • You need to recover files from the instance but cannot get to it

On a physical computer sitting at your desk, you could simply boot the system with a CD or USB stick, mount the hard drive, check out and fix the files, then reboot the computer to be back in business.

A remote EC2 instance, however, seems distant and inaccessible when you are in one of these situations. Fortunately, AWS provides us with the power and flexibility to recover a system like this, provided that we are running EBS boot instances rather than instance-store instances.

The approach on EC2 is somewhat similar to the physical solution, but we’re going to move and mount the faulty “hard drive” (root EBS volume) to a different instance, fix it, then move it back.

In some situations, it might simply be easier to start a new EC2 instance and throw away the bad one, but if you really want to fix your files, here is the approach that has worked for many:
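In outline: stop the broken instance, detach its root EBS volume, attach that volume to a second instance you can log in to, mount and repair the files, then move the volume back and start the original instance. Here is a minimal sketch of those steps using the aws-cli; the instance ids, volume id, and device names are placeholders, and the root device name varies by AMI (commonly /dev/sda1 or /dev/xvda):

# On your workstation: stop the broken instance and move its root volume
aws ec2 stop-instances --instance-ids i-BROKEN
aws ec2 detach-volume --volume-id vol-ROOT
aws ec2 attach-volume --volume-id vol-ROOT \
  --instance-id i-RESCUE --device /dev/sdf

# On the rescue instance: mount the volume, inspect, and fix the files
sudo mkdir -p /mnt/rescue
sudo mount /dev/xvdf1 /mnt/rescue   # partition/device name may differ
sudo vi /mnt/rescue/etc/sudoers     # ...or whatever needs repair
sudo umount /mnt/rescue

# Back on your workstation: return the volume and boot the original instance
aws ec2 detach-volume --volume-id vol-ROOT
aws ec2 attach-volume --volume-id vol-ROOT \
  --instance-id i-BROKEN --device /dev/sda1
aws ec2 start-instances --instance-ids i-BROKEN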

Identifying When a New EBS Volume Has Completed Initialization From an EBS Snapshot

On Amazon EC2, you can create a new EBS volume from an EBS snapshot using a command like

aws ec2 create-volume \
  --availability-zone us-east-1a \
  --snapshot-id snap-d53484bc

This returns almost immediately with a volume id which you can attach and mount on an EC2 instance. The EBS documentation describes the magic behind the immediate availability of the data on the volume:

“New volumes created from existing Amazon S3 snapshots load lazily in the background. This means that once a volume is created from a snapshot, there is no need to wait for all of the data to transfer from Amazon S3 to your Amazon EBS volume before your attached instance can start accessing the volume and all of its data. If your instance accesses a piece of data which hasn’t yet been loaded, the volume will immediately download the requested data from Amazon S3, and then will continue loading the rest of the volume’s data in the background.”

This is a cool feature that lets you start using the volume quickly without waiting for all of the blocks to be populated. In practice, however, there is a period when you experience high iowait whenever your application touches disk blocks that still need to be loaded. This can manifest as anything from mild to extreme slowness for minutes or hours after the EBS volume is created.

For some applications (mine included) the initial slowness is acceptable, since performance eventually picks up to normal EBS speed once all of the blocks have been populated. As Clint Popetz pointed out on the ec2ubuntu group, other applications might need to know when the EBS volume has been fully populated and is ready for high performance use.

Though there is no API status to poll or trigger to notify you when all of the blocks on the EBS volume have been populated, it occurred to me that there is a way to guarantee that you have waited until the EBS volume is ready: simply request every block on the volume and throw the data away.

Here’s a simple command which reads all of the blocks on an EBS volume attached to device /dev/xvdX or /dev/sdX (substitute your device name) and does nothing with them:

sudo dd if=/dev/xvdX of=/dev/null bs=10M

OR

sudo dd if=/dev/sdX of=/dev/null bs=10M

By the time this command is done, you know that all of the data blocks have been copied from the EBS snapshot in S3 to the EBS volume. At this point you should be able to start using the EBS volume in your application without the high iowait you would have experienced if you started earlier. Early reports from Clint are that this method works in practice.
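If you want to measure how long the initialization takes, you can wrap the command in time; newer versions of GNU dd also accept status=progress to report how far along it is (substitute your device name as above):

time sudo dd if=/dev/xvdX of=/dev/null bs=10M status=progress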

[Update 2012-04-02: We now have to support device names like sda1 and xvda1]

[Update 2019-08-07: New aws-cli command format]

Three Ways to Protect EC2 Instances from Accidental Termination and Loss of Data

Here are a few little-publicized benefits that launched with Amazon EC2’s new EBS boot instances: you can lock them against accidental termination; you can prevent termination even when someone tries to shut down the server from inside the instance; and you can automatically preserve your EBS storage when they are terminated.

In discussions with other EC2 users, I’ve found that a feeling of near panic is common when you go to terminate an instance and check very carefully that you are deleting the right one and not an active production system. Slightly less common, but even worse, is the feeling of dread when you realize you just casually terminated the wrong EC2 instance.

It is always recommended to set up your AWS architecture so that you are able to restart production systems on EC2 easily, as they could, in theory, be lost at any point due to hardware failure or other actions. However, new features released with the EBS boot instances help reduce the risks associated with human error.

The following examples demonstrate these features with the EC2 API command line tools ec2-run-instances, ec2-modify-instance-attribute, and ec2-terminate-instances. Since these are all exposed through Amazon’s web service API, they are (or soon will be) available in other tools, packages, and interfaces built on that API.

  1. Shutdown Behavior

First, let’s look at what happens when you run a command like the following in an EC2 instance:

sudo shutdown -h now
# or, equivalently and much easier to type:
sudo halt

With the legacy S3-based AMIs, either of the above terminates the instance and you lose all local and ephemeral storage (boot disk and /mnt) forever. Hope you remembered to save the important stuff elsewhere!

A shutdown from within an EBS boot instance, by default, will initiate a “stop” instead of a “terminate”. This means that your instance is not in a running state and not getting charged, but the EBS volumes still exist and you can “start” the same instance id again later, losing nothing.

You can explicitly set or change this with the --instance-initiated-shutdown-behavior option in ec2-run-instances. For example:

ec2-run-instances \
  --instance-initiated-shutdown-behavior stop \
  [...]

This is the first safety precaution and, as mentioned, should already be built in, though it doesn’t hurt to document your intentions with an explicit option.

  2. Delete on Termination

Though EBS volumes created and attached to an instance at instantiation are preserved through a “stop”/“start” cycle, by default they are destroyed and lost when an EC2 instance is terminated. This behavior can be changed with a delete-on-termination boolean value buried in the documentation for the --block-device-mapping option of ec2-run-instances.

Here is an example that says “Don’t delete the root EBS volume when this instance is terminated”:

ec2-run-instances \
  --block-device-mapping /dev/sda1=::false \
  [...]

If you are associating other EBS snapshots with the instance at run time, you can also specify that those created EBS volumes should be preserved past the lifetime of the instance:

  --block-device-mapping /dev/sdh=SNAPSHOTID::false

When you use these options, you’ll need to manually clean up the EBS volume(s) if you no longer want to pay for the storage costs after an instance is gone.

Note: EBS volumes attached to an instance after it is already running are, by default, left alone on termination (i.e., not deleted). The default rule is: if the EBS volume was created as part of creating the instance, then terminating the instance deletes the volume; if you created the volume explicitly, then you must delete it explicitly.

  3. Disable Termination

This is my favorite new safety feature. Years ago, I asked Amazon for the ability to lock an EC2 instance from being accidentally terminated, and with the launch of EBS boot instances, this is now possible. Using ec2-run-instances, the key option is:

ec2-run-instances \
  --disable-api-termination \
  [...]

Now, if you try to terminate the instance, you will get an error like:

Client.OperationNotPermitted: The instance 'i-XXXXXXXX' may not be terminated.
Modify its 'disableApiTermination' instance attribute and try again.

Yay!

To unlock the instance and allow termination through the API, use a command like:

ec2-modify-instance-attribute \
  --disable-api-termination false \
  INSTANCEID

Then end it all with:

ec2-terminate-instances INSTANCEID

Oh, wait! Did you make sure you unlocked and terminated the right instance?! :)

Note: disableApiTermination is also available today when you run S3-based AMIs, but since those instances can still be terminated from the inside (shutdown/halt), I am moving towards EBS-based instances for all-around safety.

Put It Together

Here’s a command line I just used to start up an EC2 instance of an Ubuntu 9.10 Karmic EBS boot AMI which I intend to use as a long-term production server with strong uptime and data safety requirements. I wanted to add all the protection available:

ec2-run-instances \
  --key $keypair \
  --availability-zone $availabilityzone \
  --user-data-file $startupscript \
  --block-device-mapping /dev/sda1=::false \
  --block-device-mapping /dev/sdh=$snapshotid::false \
  --instance-initiated-shutdown-behavior stop \
  --disable-api-termination \
  $amiid

With regular EBS snapshots of the volumes, copies to off-site backups, and documented/automated restart processes, I feel pretty safe.
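With the newer aws-cli, a roughly equivalent launch might look like the following sketch; the AMI id and snapshot id are placeholders, and the shell variables are the same ones used above:

aws ec2 run-instances \
  --image-id ami-XXXXXXXX \
  --key-name $keypair \
  --placement AvailabilityZone=$availabilityzone \
  --user-data file://$startupscript \
  --block-device-mappings '[
    {"DeviceName":"/dev/sda1","Ebs":{"DeleteOnTermination":false}},
    {"DeviceName":"/dev/sdh","Ebs":{"SnapshotId":"snap-XXXXXXXX","DeleteOnTermination":false}}
  ]' \
  --instance-initiated-shutdown-behavior stop \
  --disable-api-termination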

Runtime Modification

If you have a valuable running EC2 instance, but forgot to specify the above options to protect it when you started it, or you are now ready to turn a test system into a production system, you can still lock things after the fact using the ec2-modify-instance-attribute command (or equivalent API call).

For example:

ec2-modify-instance-attribute --disable-api-termination true INSTANCEID
ec2-modify-instance-attribute --instance-initiated-shutdown-behavior stop INSTANCEID
ec2-modify-instance-attribute --block-device-mapping /dev/sda1=:false INSTANCEID
ec2-modify-instance-attribute --block-device-mapping /dev/sdh=:false INSTANCEID

Notes:

  • Only one type of option can be specified with each invocation.

  • The --disable-api-termination option has no argument value when used in ec2-run-instances, but it takes a value of true or false in ec2-modify-instance-attribute.

  • You don’t specify the snapshot id when changing the delete-on-terminate state of an attached EBS volume.

  • You may change the delete-on-terminate to “true” for an EBS volume you created and attached after the instance was running. By default it will not be deleted since you created it.

  • The above --block-device-mapping option requires recent ec2-api-tools to avoid a bug.

You can find out the current state of the options using ec2-describe-instance-attribute, which only takes one option at a time:

ec2-describe-instance-attribute --disable-api-termination INSTANCEID
ec2-describe-instance-attribute --instance-initiated-shutdown-behavior INSTANCEID
ec2-describe-instance-attribute --block-device-mapping INSTANCEID

Unfortunately, the block-device-mapping output does not currently show the state of the delete-on-termination value, but thanks to Andrew Lusk for pointing out in a comment below that it is available through the API. Here’s a hack which extracts the information from the debug output:

ec2-describe-instance-attribute -v --block-device-mapping INSTANCEID | 
  perl -0777ne 'print "$1\t$2\t$3\n" while 
  m%<deviceName>(.*?)<.*?<volumeId>(.*?)<.*?<deleteOnTermination>(.*?)<%sg'
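The newer aws-cli exposes this directly, so the same information can be retrieved without parsing debug output (the instance id is a placeholder):

aws ec2 describe-instances \
  --instance-ids i-XXXXXXXX \
  --query 'Reservations[].Instances[].BlockDeviceMappings[].[DeviceName,Ebs.VolumeId,Ebs.DeleteOnTermination]' \
  --output text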

While we’re on the topic of EC2 safety, I should mention the well known best practice of separating development and production systems by using a different AWS account for each. Amazon lets you create and use as many accounts as you’d like even with the same credit card as long as you use a unique email address for each. Now that you can share EBS snapshots between accounts, this practice is even more useful.

What additional safety precautions do you take with your EC2 instances and data?

[Update 2011-09-13: Corrected syntax for modifying block-device-mapping on running instance (only one “:”)]

New --mysql-stop option for ec2-consistent-snapshot

The ec2-consistent-snapshot software tries its best to flush and lock a MySQL database on an EC2 instance while it initiates the EBS snapshot, and for many environments it does a pretty good job.

However, there are situations where the database may spend time performing crash recovery from the log file when it is started from a copy of the snapshot. We are seeing this behavior at CampusExplorer.com, where the database is constantly active and we have innodb_log_file_size set (probably too) high. The delay is doubtless exacerbated by the fact that the blocks on the new EBS volume are still being loaded lazily from S3 as it is built from the snapshot.

Google has created an innodb_disallow_writes MySQL patch which I think points out the problem we may be hitting.

“Note that it is not sufficient to run FLUSH TABLES WITH READ LOCK as there are background IO threads used by InnoDB that may still do IO.”

It would be very nice to have this patch incorporated in MySQL on Ubuntu. It looks like the OurDelta folks have already incorporated the patch. [Update: See rsimmons' comment below which explains why this particular patch might not be the answer.]

In any case, when we bring up a database using an EBS volume created from an EBS snapshot of an active database, it can spend up to 45 minutes in crash recovery before it lets normal clients connect. This is too long for us, so we’re trying a new approach.

ec2-consistent-snapshot now has a --mysql-stop option which shuts down the MySQL server, initiates the snapshot, and then restarts the database. Our hope is that this will give us a snapshot which can be restored and run without delay. If any MySQL experts can point out potential flaws in this, please do.
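For example, an invocation on the snapshot host might look something like this; the volume id is a placeholder, and the MySQL credential, description, and region options are documented in the man page:

ec2-consistent-snapshot --mysql-stop vol-XXXXXXXX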

Since we obviously can’t stop and start our production database every hour, we are performing this snapshot activity on a replication slave that is dedicated to snapshots and backups.

We continue to perform occasional snapshots on the production database EBS volume just to help keep it reliable per Amazon’s instructions, but we don’t expect to be able to restore it without crash recovery.

If you’d like to test the new --mysql-stop option, please upgrade your ec2-consistent-snapshot package from the Alestic PPA and let me know how it goes.

Creating Consistent EBS Snapshots with MySQL and XFS on EC2

In the article Running MySQL on Amazon EC2 with Elastic Block Store I describe the principles involved in using EBS on EC2. Though originally published in 2008, it is still relevant today and is worth reviewing to get context for this article.

In the above tutorial, I included a sample script which followed the basic instructions in the article to initiate EBS snapshots of an XFS file system containing a MySQL database. For the most part this script worked for basic installations with low volume.

Over the last year, as my co-workers and I have used this code on production systems, we identified a number of ways it could be improved. Or, put another way: some serious issues came up when the idealistic world view of the original, simplistic script met the complexities that can and do arise in the brutal real world.

We gradually improved the code over the course of the year, to the point where it has been running smoothly on production systems with no serious issues. This doesn’t mean there are no areas left for improvement, but it does seem ready for the general public to give it a try.

The name of the new program is ec2-consistent-snapshot.

Features

Here are some of the ways in which the ec2-consistent-snapshot program has improved over the original:

  • Command line options for passing in AWS keys, MySQL access information, and more.

  • Can be run with or without a MySQL database on the file system. This lets you use the command to initiate snapshots for any EBS volume.

  • Can be used with or without XFS file systems, though if you don’t use XFS, you run the risk of the file system not being consistent when the EBS volume is restored.

  • Instead of using the painfully slow ec2-create-snapshot command written in Java, this Perl program accesses the EC2 API directly, which is orders of magnitude faster.

  • A preliminary FLUSH is performed on the MySQL database before the FLUSH TABLES WITH READ LOCK. This preparation reduces the total time the tables are locked (see the sketch after this list).

  • A preliminary sync is performed on the XFS file system before the xfs_freeze. This preparation reduces the total time the file system is locked.

  • The MySQL LOCK now has timeouts and retries around it. This prevents horrible blocking interactions between the database lock, long running queries, and normal transactions. The length of the timeout and the number of retries are configurable with command line options.

  • The MySQL FLUSH is performed in such a way that the statement does not propagate to slave databases, where it could hurt their performance and/or cause blocking interactions with long-running queries.

  • Cleans up MySQL and XFS locks if it is interrupted, if a timeout happens, or if other errors occur. This prevents a number of serious locking issues when things go wrong with the environment or EC2 API.

  • Can snapshot EBS volumes in a region other than the default (e.g., eu-west-1).

  • Can initiate snapshots of multiple EBS volumes at the same time while everything is consistently locked. This has been used to create consistent snapshots of RAIDed EBS volumes.
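To make the sequence concrete, here is a rough sketch of the core flush/freeze/snapshot steps the program automates, stripped of the timeouts, retries, and error cleanup listed above. The \! lines use the mysql client’s shell-escape command so that the connection holding the read lock stays open; the mount point, volume id, and credentials are placeholders, and the real program calls the EC2 API directly instead of shelling out to another tool:

mysql --user="$mysql_user" --password="$mysql_password" <<'EOF'
FLUSH LOCAL TABLES;                -- preliminary flush (LOCAL keeps it out of the binlog)
FLUSH LOCAL TABLES WITH READ LOCK; -- the lock is held while this connection stays open
\! sync
\! sudo xfs_freeze -f /vol
\! aws ec2 create-snapshot --volume-id vol-XXXXXXXX --description 'consistent snapshot'
\! sudo xfs_freeze -u /vol
UNLOCK TABLES;
EOF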

Installation

On Ubuntu, you can install the ec2-consistent-snapshot package using the new Alestic PPA (personal package archive) hosted on Launchpad.net. Here are the steps to set up access to packages in the Alestic PPA and install the software package and its dependencies:

sudo add-apt-repository ppa:alestic &&
sudo apt-get update &&
sudo apt-get install -y ec2-consistent-snapshot

Now you can read the documentation using:

man ec2-consistent-snapshot

and run the ec2-consistent-snapshot command itself.
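A typical invocation names the MySQL and XFS options described above plus the volume(s) to snapshot. This is an illustrative sketch; the volume id is a placeholder, and the exact option names are documented in the man page:

ec2-consistent-snapshot \
  --mysql \
  --xfs-filesystem /vol \
  --region us-east-1 \
  --description "$(hostname) $(date +%Y-%m-%d)" \
  vol-XXXXXXXX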

Feedback

If you find any problems with ec2-consistent-snapshot, please create bug reports in launchpad. The same mechanism can be used to submit ideas for improvement, which are especially welcomed if you include a patch.

Other questions and feedback are accepted in the comments section for this article. If you’re reading this on a planet (blog aggregator), please click through on the title to read the comments.

[Update 2011-02-09: Simplify install instructions to not require use of CPAN.]

Hidden Dangers in Creating Public EBS Snapshots on EC2

Amazon EC2 recently released a feature which lets you share an EBS snapshot so that other accounts can access it. The snapshot can be shared with specific individual accounts or with the public at large.

You should obviously be careful what files you put on a shared EBS snapshot because other people are going to be able to read them. What may not be so obvious is that you also need to be wary of files that are not currently on the snapshot but once were.

For example, if you copied some files onto the EBS volume, then realized a few contained sensitive information, you might think it’s sufficient to delete the private files and continue on to create a public EBS snapshot of the volume.

The problem with this is that EBS is an elastic block store device, not an interface at the file system level. Any block which was once written to on the block device will be available on the shared EBS volume, even if it is not being used by a visible file on the file system.

Since popular Linux file systems do not generally wipe data when a file is deleted, it is often possible to recover the contents of the deleted files. Even attempting to overwrite a file may, depending on the application, leave the original content available on the disk.

This means any content that touched your EBS volume at any point may still be available to users of your shared EBS snapshot.
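To make the risk concrete, once a volume created from a shared snapshot is attached to one of your instances, readable remnants of deleted files can often be pulled straight off the raw block device. A rough sketch (the snapshot id, volume id, instance id, device names, and search string are examples):

aws ec2 create-volume --availability-zone us-east-1a --snapshot-id snap-XXXXXXXX
aws ec2 attach-volume --volume-id vol-XXXXXXXX --instance-id i-XXXXXXXX --device /dev/sdf

# On the instance (the attached device may show up as /dev/xvdf):
sudo strings -n 8 /dev/xvdf | grep -i -C 2 'gift'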

To be clear: I do not consider this to be a security flaw in EC2 or EBS. It is merely a security risk for people who do not understand and take precautions against the combination of interactions with file systems, block devices, EBS volumes, and snapshots.

$100 Reward

To demonstrate the security risk, I have created a simple challenge with a tangible reward. Here is a public EBS snapshot:

snap-d53484bc

This EBS snapshot contains two files. The first file is README-1.txt which has nothing sensitive in it but will let you know that you’ve got the right device mounted on your EC2 instance.

The second file created on the source EBS volume contained an Amazon.com gift certificate for $100. I deleted this second file, then took an EBS snapshot of the volume and released it to the public.

The first person who successfully recovers the deleted file on this shared EBS snapshot and enters the gift certificate code into their Amazon.com account will win the $100 prize. Subsequent solvers will get a notice from Amazon that the certificate has already been redeemed, but you still get credit for solving it and helping demonstrate the risks.

Feel free to post a comment on this blog entry if you recovered the deleted file on the shared EBS snapshot. Recipes for doing so are welcome even if you were not the first. I tested this, so I know it’s possible and that the deleted file is still accessible (but I did not redeem the gift certificate, of course).

Good luck!

Using RAID on EC2 EBS Volumes to Break the 1TB Barrier and Increase Performance

Amazon EC2 currently has a limit of 1,000 GB (1 TB) per EBS (Elastic Block Store) volume. It is possible to create file systems larger than this limit using RAID 0 across multiple EBS volumes. Using RAID 0 can also improve the performance of the file system, reducing total IO wait, as demonstrated in a number of published EBS performance tests.

The following instructions walk through one way to set up RAID 0 across multiple EBS volumes. Note that 32-bit instances limit the maximum size of a file system, but on 64-bit instances a file system can get unreasonably large. This test was run with 40 EBS volumes of 1,000 GB each for a total of 40,000 GB (40 TB) in the resulting file system.

Actual command line output showing the size of the RAID:

# df /vol
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md0             41942906368      1312 41942905056   1% /vol

# df -h /vol
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               40T  1.3M   40T   1% /vol

These commands can run in less than 10 minutes, and this could probably be reduced further by parallelizing the creation and attachment of the EBS volumes (a sketch of this follows the create/attach loop below).

Note that the default limit is 20 EBS volumes per EC2 account. You can request an increase from Amazon if you need more.

Caution: 40 TB of EBS storage on EC2 will cost $4,000 per month plus usage charges.

Instructions

Start a 64-bit instance (say, Ubuntu 8.04 Hardy from https://alestic.com). Use your own KEYPAIR:

ec2-run-instances \
  --key KEYPAIR \
  --instance-type c1.xlarge \
  --availability-zone us-east-1a \
  ami-0772946e

Configurable parameters (set on both local host and on EC2 instance):

instanceid=i-XXXXXXXX
volumes=40
size=1000
mountpoint=/vol

On the local host (with EC2 API tools installed)…

Create and attach EBS volumes:

devices=$(perl -e 'for$i("h".."k"){for$j("",1..15){print"/dev/sd$i$j\n"}}'|
           head -$volumes)
devicearray=($devices)
volumeids=
i=1
while [ $i -le $volumes ]; do
  volumeid=$(ec2-create-volume -z us-east-1a --size $size | cut -f2)
  echo "$i: created  $volumeid"
  device=${devicearray[$(($i-1))]}
  ec2-attach-volume -d $device -i $instanceid $volumeid
  volumeids="$volumeids $volumeid"
  let i=i+1
done
echo "volumeids='$volumeids'"

On the EC2 instance (after setting parameters as above)…

Install software:

sudo apt-get update &&
sudo apt-get install -y mdadm xfsprogs

Set up the RAID 0 device:

devices=$(perl -e 'for$i("h".."k"){for$j("",1..15){print"/dev/sd$i$j\n"}}'|
           head -$volumes)

yes | sudo mdadm \
  --create /dev/md0 \
  --level 0 \
  --metadata=1.1 \
  --chunk 256 \
  --raid-devices $volumes \
  $devices

echo DEVICE $devices       | sudo tee    /etc/mdadm.conf
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf

Create the file system (pick your preferred file system type):

sudo mkfs.xfs /dev/md0

Mount:

echo "/dev/md0 $mountpoint xfs noatime 0 0" | sudo tee -a /etc/fstab
sudo mkdir $mountpoint
sudo mount $mountpoint

Check it out:

df -h $mountpoint

When you’re done with it and want to destroy the data and stop paying for storage, tear it down:

sudo umount $mountpoint
sudo mdadm --stop /dev/md0

Terminate the instance:

sudo shutdown -h now

On the local host (with EC2 API tools installed)…

Detach and delete volumes:

for volumeid in $volumeids; do
  ec2-detach-volume $volumeid
done

for volumeid in $volumeids; do
  ec2-delete-volume $volumeid
done

Credits

This article was originally posted on the EC2 Ubuntu group.

Thanks to M. David Peterson for the basic mdadm instructions.

[Update 2012-01-21: Added --chunk 256 based on community-recognized best practices.]

Keeping File Ownership (UIDs) Consistent when Using EBS on EC2

Persistent storage on Amazon EC2 is accomplished through the use of Elastic Block Store (EBS) volumes. EBS is basically a storage area network (SAN) and can be thought of as an on-demand, virtual, redundant hard drive plugged into the server, with super-powers like snapshot/restore.

An EBS volume can be detached from one EC2 instance and attached to another. You can create a snapshot of an EBS volume and create new volumes from the snapshot to attach to other instances. Though this flexibility provides some useful abilities, it also presents some challenges.

In particular, the files stored on the EBS volume will be owned by specific numeric UIDs (users) and GIDs (groups). When you fire up and configure a new instance, the UIDs and GIDs on the EBS volume may not exactly match the numeric ids of the users and groups on the new instance, depending on how you set it up.

For example, when you install the MySQL software, the package will generally create a new “mysql” user with the next available UID. If you don’t create the various users in exactly the same order on new instances, you may end up with your database files owned by the “postfix” user instead of the “mysql” user. It’s happened to me and I’m not the only one.

There is a discussion about this topic on the ec2ubuntu Google Group and it has also been raised on Canonical’s EC2 beta mailing list.

Here are some of the different approaches to avoiding or fixing this problem:

  1. Bundle your own AMIs and always run instances of the same AMI when attaching EBS volumes with files. This works if you already have to bundle your AMIs for other reasons, but I often recommend against AMI rebundling because of the effort involved, the lack of reproducibility, and the maintenance problems when the base image is updated or has bugs fixed.

  2. Automate the creation of users and installation of packages in exactly the same order every time. This is likely to give you the same UID/GID values for each user, but it starts to get messy if you end up with an order that mixes human users and software package users.

  3. Create all users/groups with hardcoded UIDs/GIDs before installing software packages. If you automate the creation of users and groups, you can force the “mysql” and “postfix” users to have specific UID values. Then, when you install the MySQL and Postfix packages, the software will use the users which already exist on the system. We ended up following this approach with our EC2 servers at CampusExplorer.com (a minimal sketch appears after this list).

  4. Correct the ownership of files after mounting the EBS volume. This feels a bit messy to me, but it might be the only option in some cases. I must admit that I’ve done this manually a number of times, but only after finding problems like MySQL not starting because the files aren’t owned by the correct user. For example, say you needed to change files currently owned by “postfix” to be correctly owned by “mysql”:

     find /vol -user postfix -print0 | xargs -0 chown mysql
    

    If you are changing ownership of files after mounting the EBS volume, make sure you do it in an order which does not lose information. For example, if you have to swap “postfix” and “mysql” users, you’ll need to use a temporary third UID as a placeholder.

  5. On the ec2ubuntu Google group it was suggested that a central authority for user and group ids might be a way to solve the problem. I’ve never used this approach on Linux and am not sure how much work it would be to set up a reliable service like this on EC2.
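Here is a minimal sketch of approach 3 on a Debian/Ubuntu system. The UID/GID values are arbitrary examples; the only requirement is that they be consistent across your instances and EBS volumes:

# Create users and groups with fixed, known ids before installing packages
sudo addgroup --gid 3101 mysql
sudo adduser  --system --uid 3101 --ingroup mysql --no-create-home \
              --disabled-login mysql
sudo addgroup --gid 3102 postfix
sudo adduser  --system --uid 3102 --ingroup postfix --no-create-home \
              --disabled-login postfix

# The packages then reuse the existing users instead of picking the next free UID
sudo apt-get install -y mysql-server postfix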

No matter what approach you use, it might be a good idea to add some checks after you mount an EBS volume to make sure that the files are owned by the appropriate users. For example, you might verify that the mysql directory is owned by the mysql user.
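For example, a crude check after mounting might look like this (the path and expected owner are examples):

if [ "$(stat -c %U /vol/mysql)" != "mysql" ]; then
  echo "WARNING: /vol/mysql is not owned by the mysql user" >&2
fi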

Solving this problem is something that I have only begun to work on, so I would appreciate any comments, pointers, and solutions that you may have.