New Release of ec2-consistent-snapshot and Screencast by Ahmed Kamal

ec2-consistent-snapshot is a tool that uses the Amazon EC2 API to initiate a snapshot of an EBS volume with some additional work to help ensure that an XFS file system and/or MySQL database are in a consistent state on that snapshot.

Ahmed Kamal pointed out to me yesterday that we can save lots of trouble installing ec2-consistent-snapshot by adding a dependency on the new libnet-amazon-ec2-perl package in Ubuntu instead of forcing people to install the Net::Amazon::EC2 Perl package through CPAN (not easy for the uninitiated).

I released a new version of ec2-consistent-snapshot which has this new dependency and updated documentation. Installing this software on Ubuntu 10.04 Lucid, 10.10 Maverick, and the upcoming Natty release is now as easy as:

New --mysql-stop option for ec2-consistent-snapshot

The ec2-consistent-snapshot software tries its best to flush and lock a MySQL database on an EC2 instance while it initiates the EBS snapshot, and for many environments it does a pretty good job.

However, there are situations where the database may spend time performing crash recovery from the log file when it is started from a copy of the snapshot. We are seeing this behavior at CampusExplorer.com where the database is constantly active and we have innodb_log_file_size set (probably too) high. The delay is doubtless exacerbated by the fact that the blocks on the new EBS volume are being recovered from S3 as it is being built from the snapshot.

Google has created an innodb_disallow_writes MySQL patch which I think points out the problem we may be hitting.

“Note that it is not sufficient to run FLUSH TABLES WITH READ LOCK as there are background IO threads used by InnoDB that may still do IO."

It would be very nice to have this patch incorporated in MySQL on Ubuntu. It looks like the OurDelta folks have already incorporated the patch. [Update: See rsimmons' comment below which explains why this particular patch might not be the answer.]

In any case, when we bring up a database using an EBS volume created from an EBS snapshot of an active database, it can take up to 45 minutes recovering before it lets normal clients connect. This is too long for us so we’re trying a new approach.

The ec2-consistent-snapshot now has a --mysql-stop option which shuts down the MySQL server, initiates the snapshot, and then restarts the database. Our hope is that this will get us a snapshot which can be restored and run without delay. If any MySQL experts can point out the potential flaws in this, please do.

Since we obviously can’t stop and start our production database every hour, we are performing this snapshot activity on a replication slave that is dedicated to snapshots and backups.

We continue to perform occasional snapshots on the production database EBS volume just to help keep it reliable per Amazon’s instructions, but we don’t expect to be able to restore it without crash recovery.

If you’d like to test the new --mysql-stop option, please upgrade your ec2-consistent-snapshot package from the Alestic PPA and let me know how it goes.

Creating Consistent EBS Snapshots with MySQL and XFS on EC2

In the article Running MySQL on Amazon EC2 with Elastic Block Store I describe the principles involved in using EBS on EC2. Though originally published in 2008, it is still relevant today and is worth reviewing to get context for this article.

In the above tutorial, I included a sample script which followed the basic instructions in the article to initiate EBS snapshots of an XFS file system containing a MySQL database. For the most part this script worked for basic installations with low volume.

Over the last year as I and my co-workers have been using this code in production systems, we identified a number of ways it could be improved. Or, put another way, some serious issues came up when the idealistic world view of the original simplistic script met the complexities which can and do arise in the brutal real world.

We gradually improved the code over the course of the year, until the point where it has been running smoothly on production systems with no serious issues. This doesn’t mean that there aren’t any areas left for improvement, but does seem like it’s ready for the general public to give it a try.

The name of the new program is ec2-consistent-snapshot.

Features

Here are some of the ways in which the ec2-consistent-snapshot program has improved over the original:

  • Command line options for passing in AWS keys, MySQL access information, and more.

  • Can be run with or without a MySQL database on the file system. This lets you use the command to initiate snapshots for any EBS volume.

  • Can be used with or without XFS file systems, though if you don’t use XFS, you run the risk of not having a consistent file system on EBS volume restore.

  • Instead of using the painfully slow ec2-create-snapshot command written in Java, this Perl program accesses the EC2 API directly with orders of magnitude speed improvement.

  • A preliminary FLUSH is performed on the MySQL database before the FLUSH WITH READ LOCK. This preparation reduces the total time the tables are locked.

  • A preliminary sync is performed on the XFS file system before the xfs_freeze. This preparation reduces the total time the file system is locked.

  • The MySQL LOCK now has timeouts and retries around it. This prevents horrible blocking interactions between the database lock, long running queries, and normal transactions. The length of the timeout and the number of retries are configurable with command line options.

  • The MySQL FLUSH is done in such a way that the statement does not propagate through to slave databases, negatively impacting their performance and/or causing negative blocking interactions with long running queries.

  • Cleans up MySQL and XFS locks if it is interrupted, if a timeout happens, or if other errors occur. This prevents a number of serious locking issues when things go wrong with the environment or EC2 API.

  • Can snapshot EBS volumes in a region other than the default (e.g., eu-west-1).

  • Can initiate snapshots of multiple EBS volumes at the same time while everything is consistently locked. This has been used to create consistent snapshots of RAIDed EBS volumes.

Installation

On Ubuntu, you can install the ec2-consistent-snapshot package using the new Alestic PPA (personal package archive) hosted on Launchpad.net. Here are the steps to set up access to packages in the Alestic PPA and install the software package and its dependencies:

sudo add-apt-repository ppa:alestic &&
sudo apt-get update &&
sudo apt-get install -y ec2-consistent-snapshot

Now you can read the documentation using:

man ec2-consistent-snapshot

and run the ec2-consistent-snapshot command itself.

Feedback

If you find any problems with ec2-consistent-snapshot, please create bug reports in launchpad. The same mechanism can be used to submit ideas for improvement, which are especially welcomed if you include a patch.

Other questions and feedback are accepted in the comments section for this article. If you’re reading this on a planet, please click through on the title to read the comments.

[Update 2011-02-09: Simplify install instructions to not require use of CPAN.]

Solving: "I can't connect to my server on Amazon EC2"

Help! I can’t connect to my EC2 instance!
Woah! My box just stopped talking to me!
Hey! I can’t access the server!

These and other variations on the connectivity theme are some of the most common problems raised on the Amazon EC2 forum.

The EC2 community and Amazon employees do a valiant job helping users track down and solve these issues despite the facts that (1) there are hundreds of reasons why a server or service might not be accessible, (2) connectivity is one of the harder problems to diagnose, especially without being hands-on, and (3) users complaining about a problem generally don’t provide the clues necessary to solve the issue (because the ones who knew what those clues were probably solved it themselves and didn’t post).

This article is an attempt to provide some general assistance to folks who are experiencing connectivity issues with Amazon EC2. Please post additional help in the comments; this document will be updated over time.

Questions

First off, you should understand that it’s ok to ask for help. When you do, though, you should provide as many details as possible about what you are trying to do and what results you are seeing. It also helps if you drop some clues about your level of expertise. A person using Linux for the first time is likely to make different mistakes on EC2 than a person who is having problems connecting to a custom AMI they built from scratch.

The more specific you can be about your problem and the more information you can provide, the more likely somebody will be able to help. Here are some common questions which are important to have answered for connectivity problems on EC2:

  1. When you say you “can’t connect” what application are you trying to use and on what port? For example: “ssh to port 22” or “accessing port 80 with Firefox”. If you don’t know what a port is, then provide as many details as possible about the application you’re using and what command or steps you are taking to initiate the connection.

  2. What, specifically, happens when you try to connect? Does it hang for a long time and eventually time out? Do you get an error message? What is the exact text you see? (Copy and paste, don’t summarize.)

  3. What is the AMI id which the instance is running? If it is not a public AMI, then what is the AMI id of the public AMI it is based on?

  4. What Linux distro and release is the instance running? E.g., Ubuntu 9.04 Jaunty, Debian etch, Fedora 8, CentOS 5.

  5. What is the instance id of the instance you are trying to contact? Providing this can let Amazon employees take a look at the internals of what might be going on.

  6. What are the internal and external IP addresses and/or host names for the instance you are trying to reach? Providing this information is, in effect, giving permission to the community to try to contact your server over the network so that they can gather information about connectivity and help solve your problem.

  7. Have you ever been able to contact this instance in the past? How recently?

  8. How long has the instance been running?

  9. Have you ever been able to contact another instance of the same AMI?

  10. Is there a difference in connectivity when you try from another EC2 instance instead of from the Internet?

  11. What were you doing when the connectivity stopped?

  12. What is the console output of the instance? You can get this through an API client or a command like:

     ec2-get-console-output INSTANCE_ID
    

There are so many reasons that connectivity might be down to a remote server or service that it would be impossible to get a significant percentage of them listed in one article. I’ll start by listing some of the more common problems here; please add to the comments as you run into or remember others.

You

By far the most common cause of the problem is you (the person experiencing the problem) and that’s ok. We all make mistakes. It’s important, though, that you start with this attitude: open to the possibilities that you typed something wrong, forgot a step, or didn’t quite understand the complex instructions. Ninety percent of the people reading this paragraph think I’m talking to somebody else; oddly, they also think this sentence is not about them.

Here are some of the most common reasons folks (including me) can’t connect to their Amazon EC2 instance. Really.

  • You’re not connecting to the right instance or to the instance you think you’re trying to connect to. Servers on EC2 are identified by opaque instance ids like i-ae1df2c6 and opaque host names like ec2-75-101-182-20.compute-1.amazonaws.com. It’s easy for anybody to get these confused or mistype them.

  • The instance you’re trying to connect to has not completed the boot process yet. Though some AMIs are ready to connect in under a minute, others can take 10+ minutes.

  • The instance you’re trying to connect to has been terminated. (Did you just shut down what you thought was a different instance?)

  • The service you are trying to reach on the instance is not running on that instance.

  • The service you are trying to reach on the instance is not listening on that port or that network interface.

  • You did not open the port in the security group.

  • You did not start the instance with the correct security group.

  • You did not start the instance with the same ssh keypair as you are using to access it.

  • Your local firewall is preventing you from getting out to that port on any server outside your network. Talk to your local network administrators.

  • Your firewall on the instance is preventing access to the service. Try shutting down iptables temporarily to see if that helps.

You “experts” laugh when you read these, but if you’re having trouble reaching a server, I recommend you go through each one carefully and double check that your assumptions are correct and the world is really as you remember it. Remember: We all make mistakes. A lot of these come from personal experience.

If you’re not quite sure what terms like “security group” and “keypair” mean in the EC2 context, I recommend going back and reading some introductory material. These are important concepts for beginners.

ssh

The ssh connectivity problems generally fall into a couple major buckets

  1. ssh is not accessible, or

  2. ssh is rejecting the connection due to a failure to authenticate or authorize

You can find out which type of problem you have by using a command like

telnet HOSTNAME 22

If this connects, then ssh is running and accepting connections on port 22. (Hit [Enter] a couple times to disconnect from the telnet session). If you don’t connect, then it’s important to note if the attempt basically hung forever or if you got a “Connection refused” type of message immediately. (Hit [Ctrl]-[C] to stop the telnet command.)

If the connection attempt hangs, then there might be a problem with the security group, iptables, or your instance might not be running at that IP address.

If the telnet connection attempt gets rejected, then there might be a problem with iptables, ssh configuration, ssh not running on the instance, or perhaps it’s listening on different port if the admin likes to configure things a bit more securely. The console output can be helpful in determining if sshd was started at boot.

If you can get connected to the ssh port with telnet, then you need to start debugging why ssh is not letting you in. The most important information can be gathered by running the ssh connection attempt in verbose (-v) mode:

ssh -v -i KEYPAIR.pem USERNAME@HOSTNAME

The complete output of this command can be very helpful to post when asking for help.

The most common problems with ssh relate to:

  • Forgetting to specify -i KEYPAIR.pem in the ssh command

  • Not starting the instance specifying a keypair

  • Using a different keypair than the one which was used to start the instance

  • Not ssh’ing with the correct username. Amazon Linux instances require a first ssh connection with ec2-user@... while mages published by Canonical require a first connection with ubuntu@... and others might need you to ssh using root@...

  • Not having the correct ownership or mode on the .ssh directory or authorized_keys file.

  • Not having the correct Allow* or *Authentication settings in /etc/ssh/sshd_config

Apache

Web servers are much easier to connect to than other applications because there is generally no authentication and authorization involved to get a basic web page. If you can’t reach your web server on EC2, then it’s generally one of the simple problems described above like using the wrong IP address, trying to reach a terminated instance, or not having the web port opened in the security group.

MySQL

The most common problem specific to MySQL connectivity on EC2 is the fact that MySQL is configured securely by default to not allow access by remote hosts. If you need to allow a connection from your other instances running in EC2, then edit /etc/mysql/my.cnf and replace this line:

bind-address            = 127.0.0.1

with

bind-address            = 0.0.0.0

and restart the mysqld server.

IMPORTANT! You should not open the MySQL port in the EC2 security group. You only want your own EC2 instances to connect to the database and the default security group allows your EC2 instances to connect to any port on your other EC2 instances. If you open up the port to the public, then your database will be attacked by the Internet at large.

If you need to talk to your MySQL database running on EC2 from a server running outside EC2, then do it over a secure channel like an ssh tunnel or openvpn. You don’t need the MySQL port open in the security group to do this. The MySQL protocol is not by itself encrypted and your usernames and passwords would be sent in the clear for anybody else to intercept if you didn’t talk over a secure channel.

Custom AMIs

If you are building your own custom AMIs from scratch, then there are a number of complicated barriers to getting network and ssh connectivity working. Unfortunately it is nearly impossible to debug these problems since you don’t have access to the machine to see what went wrong. Console output is your only friend in these cases.

Here are some examples of odd things which others in the EC2 community have run into and solved:

  • Make sure you start networking on instance boot. It should come up with DHCP on eth0.

  • Make sure your Linux distro does not save the MAC address somewhere, preventing the network from functioning in the next instance. Ubuntu stores this in the /etc/udev/rules.d/70-persistent-net.rules file and Debian stores this in the /etc/udev/rules.d/z25_persistent-net.rules file.

  • Make sure your image downloads the ssh keypair and installs it in authorized_keys.

  • Make sure you have the right devices created and file systems mounted.

  • Make sure you’re using a udev lower than v144 as higher versions are incompatible with Amazon’s 2.6.21 kernel.

  • Make sure you’re using the right libc6 and related configurations including /lib/tls

Amazon

I realize this was your first thought, but it’s such a rare cause, I’ve put it here at the end. Sometimes there are problems with Amazon EC2. The hardware running your instance may fail or the networks might have temporary glitches. There are a couple different classes of problems here:

  1. Small scale problems local to the hardware running your instance. Though these are rare for any single instance, they are happening all the time for some customer somewhere given that AWS has hundreds of thousands of customers. Amazon often sends you an email when they notice that an instance is starting to have problems, and you should move to a new instance as soon as possible. If the failure happens without the warning, the only solution is to move to a new instance anyway, so you should always be prepared to do this.

  2. Large scale problems which affect a large number of customers simultaneously. These are very rare, and generally don’t affect more than a single availability zone given the way that Amazon has spread out the risk in their architecture.

You can check the AWS service health dashboard to see if Amazon is aware of any widespread problems with the EC2 service. If there are problems with a specific availability zone, you may want to move your servers to a different availability zone until the issues get resolved.

First Responses

For general cases where you can’t immediately figure out what went wrong with the connectivity, here are two things which are almost always recommended on EC2: reboot the instance and replace the instance.

Reboot your EC2 instance using the EC2 Console, another API client, or a command like:

    ec2-reboot-instances INSTANCE_ID

After giving it sufficient time to come up, see if that fixed the connectivity problem. Do not reboot your instance if you currently have a working ssh connection to it, but other ssh connections are failing!

If you have a production service running on Amazon EC2 and you lose connectivity to an instance, then I recommend your first reaction be to kick off a replacement instance so that it boots and configures itself while you investigate the original issue. If you don’t solve the problem by the time the replacement is ready, simply switch over to the new server. You may want to continue investigating what happened with the old server, though I generally don’t care what the problem was unless it happens more than once or twice in a short time period.

If your installation environment does not allow you to easily start replacement instances, then you should reconsider how you are using EC2 and work to improve this.

Seeking Help

If the above did not help you solve your problem reaching your EC2 instance, you may want to reach out to the community including some AWS employees on the EC2 forum.

Amazon also has premium AWS support available.

Requests for connectivity help by posting a comment on this particular thread will not be published or answered. Please only post a comment if you have corrections or additional information to share for users experiencing problems. I do occasionally receive and respond to questions posted on other articles, but for this topic, please use the EC2 forum.

[Update 2011-10-01: Amazon Linux requires ssh using ubuntu@...]

EBS Snapshots of a MySQL Slave Database on EC2

At our company, CampusExplorer.com, we regularly snapshot the EBS volume which holds our MySQL database using the basic procedure I outlined in the article “Running MySQL on Amazon EC2 with Elastic Block Store”, though the snapshot code has been significantly improved through our experience in the last year.

As others have reported, we also found that the background EBS snapshot process on EC2 increased IO wait on the source EBS volume holding the production database, which had a negative impact on the performance of the production web site itself. So, we moved the frequent snapshot process to a slave database which is always completely up to date with the production master database.

Aside: Having complete backups is not the only reason to do EBS snapshots. They also increase the reliability and failsafe-ness of the EBS volume itself, so we still run occasional snapshots on the production database, just during off-peak hours.

Steve Caldwell (chief tech at CampusExplorer.com) has automated some great push button EC2 system launches and configurations including EBS database volumes created from snapshots, attached, mounted, etc. This is convenient to do things like setting up a temporary development system for a contractor, starting a staging environment for QA, or running some database intensive reports on near-production data.

skip-slave-start

When EBS volumes were created from the snapshots of the slave database, Steve found that as soon as the MySQL server came up, it started replication from the master thinking it was a slave database. In some cases this is fine, but in others we want to see the database in exactly the state it was in at the time of the snapshot. We tried running “STOP SLAVE” just before creating the snapshot, but replication still resumed when the server started.

We finally found the solution in the option –skip-slave-start which tells MySQL to not start the replication process when the server comes up.

I haven’t looked into how to pass options to mysqld when running /etc/init.d/mysql so we settled on adding this line to the [mysqld] section of /etc/mysql/my.cnf by default on our non-slave servers:

skip-slave-start

On a system which we know should start as a slave, we omit this directive. On a system we want to turn into a slave, we simply run “START SLAVE” and remove this line for future restarts. If the binary logs available on the master go back to the point where the snapshot was created, then replication begins and it eventually catches up to the present.

[Update 2009-08-06: Clarify skip-start-slave meaning]

Updated Tutorial: Running MySQL on Amazon EC2 with EBS (now supports AppArmor)

The following tutorial (originally published in Aug ‘08) has been extensively updated today:

Running MySQL on Amazon EC2 with Elastic Block Store (EBS)

This tutorial explains one approach to using Amazon’s persistent storage mechanism as the backing for a database and includes pointers on how to create snapshots for secure backups.

The primary goal of the updates was to put forth an approach which works not only on the current Ubuntu and Debian AMIs published on https://alestic.com but also with new AMIs which use the Canonical kernels as well as the new Ubuntu AMIs published by Canonical.

Ubuntu AMIs which use the new Canonical kernels may have AppArmor enabled. The original tutorial required workarounds to function in this environment, but the new tutorial keeps files right where MySQL and the AppArmor configuration expect them to be, while at the same time keeping them on the EBS volume.

There is also a plethora of “sudo"s spread around the tutorial so that it will work if you connected to your instance using a normal, non-root user, as is required by the Canonical AMIs.

I have tested these instructions on a few different AMIs. Please let me know if you run into any problems or have suggestions for improvement.

=> Go read the tutorial