Solving: "I can't connect to my server on Amazon EC2"

Help! I can’t connect to my EC2 instance!
Whoa! My box just stopped talking to me!
Hey! I can’t access the server!

These and other variations on the connectivity theme are some of the most common problems raised on the Amazon EC2 forum.

The EC2 community and Amazon employees do a valiant job helping users track down and solve these issues despite the facts that (1) there are hundreds of reasons why a server or service might not be accessible, (2) connectivity is one of the harder problems to diagnose, especially without being hands-on, and (3) users complaining about a problem generally don’t provide the clues necessary to solve the issue (because the ones who knew what those clues were probably solved it themselves and didn’t post).

This article is an attempt to provide some general assistance to folks who are experiencing connectivity issues with Amazon EC2. Please post additional help in the comments; this document will be updated over time.

Questions

First off, you should understand that it’s ok to ask for help. When you do, though, you should provide as many details as possible about what you are trying to do and what results you are seeing. It also helps if you drop some clues about your level of expertise. A person using Linux for the first time is likely to make different mistakes on EC2 than a person who is having problems connecting to a custom AMI they built from scratch.

The more specific you can be about your problem and the more information you can provide, the more likely somebody will be able to help. Here are some common questions which are important to have answered for connectivity problems on EC2:

  1. When you say you “can’t connect” what application are you trying to use and on what port? For example: “ssh to port 22” or “accessing port 80 with Firefox”. If you don’t know what a port is, then provide as many details as possible about the application you’re using and what command or steps you are taking to initiate the connection.

  2. What, specifically, happens when you try to connect? Does it hang for a long time and eventually time out? Do you get an error message? What is the exact text you see? (Copy and paste, don’t summarize.)

  3. What is the AMI id which the instance is running? If it is not a public AMI, then what is the AMI id of the public AMI it is based on?

  4. What Linux distro and release is the instance running? E.g., Ubuntu 9.04 Jaunty, Debian etch, Fedora 8, CentOS 5.

  5. What is the instance id of the instance you are trying to contact? Providing this can let Amazon employees take a look at the internals of what might be going on.

  6. What are the internal and external IP addresses and/or host names for the instance you are trying to reach? Providing this information is, in effect, giving permission to the community to try to contact your server over the network so that they can gather information about connectivity and help solve your problem.

  7. Have you ever been able to contact this instance in the past? How recently?

  8. How long has the instance been running?

  9. Have you ever been able to contact another instance of the same AMI?

  10. Is there a difference in connectivity when you try from another EC2 instance instead of from the Internet?

  11. What were you doing when the connectivity stopped?

  12. What is the console output of the instance? You can get this through an API client or a command like:

     ec2-get-console-output INSTANCE_ID
    

There are so many reasons why connectivity to a remote server or service might be down that it would be impossible to list a significant percentage of them in one article. I’ll start by listing some of the more common problems here; please add to the comments as you run into or remember others.

You

By far the most common cause of the problem is you (the person experiencing the problem) and that’s ok. We all make mistakes. It’s important, though, that you start with this attitude: open to the possibilities that you typed something wrong, forgot a step, or didn’t quite understand the complex instructions. Ninety percent of the people reading this paragraph think I’m talking to somebody else; oddly, they also think this sentence is not about them.

Here are some of the most common reasons folks (including me) can’t connect to their Amazon EC2 instance. Really.

  • You’re not connecting to the right instance or to the instance you think you’re trying to connect to. Servers on EC2 are identified by opaque instance ids like i-ae1df2c6 and opaque host names like ec2-75-101-182-20.compute-1.amazonaws.com. It’s easy for anybody to get these confused or mistype them.

  • The instance you’re trying to connect to has not completed the boot process yet. Though some AMIs are ready to connect in under a minute, others can take 10+ minutes.

  • The instance you’re trying to connect to has been terminated. (Did you just shut down what you thought was a different instance?)

  • The service you are trying to reach on the instance is not running on that instance.

  • The service you are trying to reach on the instance is not listening on that port or that network interface.

  • You did not open the port in the security group.

  • You did not start the instance with the correct security group.

  • You did not start the instance with the same ssh keypair as you are using to access it.

  • Your local firewall is preventing you from getting out to that port on any server outside your network. Talk to your local network administrators.

  • Your firewall on the instance is preventing access to the service. Try shutting down iptables temporarily to see if that helps.

You “experts” laugh when you read these, but if you’re having trouble reaching a server, I recommend you go through each one carefully and double check that your assumptions are correct and the world is really as you remember it. Remember: We all make mistakes. A lot of these come from personal experience.

If you’re not quite sure what terms like “security group” and “keypair” mean in the EC2 context, I recommend going back and reading some introductory material. These are important concepts for beginners.

ssh

The ssh connectivity problems generally fall into a couple of major buckets:

  1. ssh is not accessible, or

  2. ssh is rejecting the connection due to a failure to authenticate or authorize

You can find out which type of problem you have by using a command like:

telnet HOSTNAME 22

If this connects, then ssh is running and accepting connections on port 22. (Hit [Enter] a couple times to disconnect from the telnet session). If you don’t connect, then it’s important to note if the attempt basically hung forever or if you got a “Connection refused” type of message immediately. (Hit [Ctrl]-[C] to stop the telnet command.)

If the connection attempt hangs, then there might be a problem with the security group, iptables, or your instance might not be running at that IP address.

If the telnet connection attempt gets rejected, then there might be a problem with iptables, ssh configuration, ssh not running on the instance, or perhaps it’s listening on a different port if the admin likes to configure things a bit more securely. The console output can be helpful in determining if sshd was started at boot.
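The three outcomes described above (connects, hangs, gets rejected) can also be told apart from a script. What follows is only a sketch: the function name probe_port and the five second default are made up for this example, and it uses bash's /dev/tcp redirection plus the coreutils timeout command instead of telnet.

```shell
#!/usr/bin/env bash
# Classify a TCP connection attempt as "open", "timeout", or "refused".
# Usage: probe_port HOST PORT [SECONDS]   (hypothetical helper, not an EC2 tool)
probe_port() {
  local host=$1 port=$2 wait=${3:-5} status
  # bash opens a TCP connection when a /dev/tcp/HOST/PORT path is redirected;
  # timeout(1) kills the attempt after $wait seconds and exits with 124.
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
  status=$?
  case $status in
    0)   echo "open" ;;     # something accepted the connection
    124) echo "timeout" ;;  # hung: think security group, iptables, wrong address
    *)   echo "refused" ;;  # failed right away (also: host name did not resolve)
  esac
}
```

Run it as probe_port HOSTNAME 22 and include the result when you ask for help.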

If you can get connected to the ssh port with telnet, then you need to start debugging why ssh is not letting you in. The most important information can be gathered by running the ssh connection attempt in verbose (-v) mode:

ssh -v -i KEYPAIR.pem USERNAME@HOSTNAME

The complete output of this command can be very helpful to post when asking for help.

The most common problems with ssh relate to:

  • Forgetting to specify -i KEYPAIR.pem in the ssh command

  • Not starting the instance specifying a keypair

  • Using a different keypair than the one which was used to start the instance

  • Not ssh’ing with the correct username. Amazon Linux instances require a first ssh connection with ec2-user@... while images published by Canonical require a first connection with ubuntu@... and others might need you to ssh using root@...

  • Not having the correct ownership or mode on the .ssh directory or authorized_keys file.

  • Not having the correct Allow* or *Authentication settings in /etc/ssh/sshd_config
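For the ownership and mode item in the list above, here is a minimal sketch of the usual repair, to be run on the instance as the user you log in as. The function name fix_ssh_perms is made up for this example, and the 700 and 600 modes are the conventional values that sshd's default StrictModes check accepts, not anything specific to EC2.

```shell
# Tighten modes on a user's home, .ssh directory, and authorized_keys file.
# Usage: fix_ssh_perms [HOME_DIR]   (defaults to the current user's home)
fix_ssh_perms() {
  local home=${1:-$HOME}
  chmod go-w "$home" &&                    # home must not be group/world writable
  chmod 700 "$home/.ssh" &&                # directory: owner only
  chmod 600 "$home/.ssh/authorized_keys"   # key list: owner read/write only
}
```

This only fixes modes. If the files are owned by the wrong user, that needs a chown run as root.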

Apache

Web servers are much easier to connect to than other applications because there is generally no authentication and authorization involved to get a basic web page. If you can’t reach your web server on EC2, then it’s generally one of the simple problems described above like using the wrong IP address, trying to reach a terminated instance, or not having the web port opened in the security group.

MySQL

The most common problem specific to MySQL connectivity on EC2 is the fact that MySQL is configured securely by default to not allow access by remote hosts. If you need to allow a connection from your other instances running in EC2, then edit /etc/mysql/my.cnf and replace this line:

bind-address            = 127.0.0.1

with

bind-address            = 0.0.0.0

and restart the mysqld server.
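If you would rather script that edit than make it by hand, something like the following should work. It is a sketch under assumptions: the function name set_bind_address is invented here, and it expects a single bind-address line like the one shown above.

```shell
# Rewrite the bind-address line in a MySQL config file, keeping a .bak copy.
# Usage: set_bind_address ADDRESS [CONFIG_FILE]
set_bind_address() {
  local address=$1 conf=${2:-/etc/mysql/my.cnf}
  # Keep "bind-address = " and its spacing (group 1); swap only the value.
  sed -i.bak -E \
    "s/^([[:space:]]*bind-address[[:space:]]*=[[:space:]]*).*/\1$address/" \
    "$conf"
}
```

Run it with sudo rights on the real file, for example set_bind_address 0.0.0.0, then restart MySQL the way your distro expects.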

IMPORTANT! You should not open the MySQL port in the EC2 security group. You only want your own EC2 instances to connect to the database and the default security group allows your EC2 instances to connect to any port on your other EC2 instances. If you open up the port to the public, then your database will be attacked by the Internet at large.

If you need to talk to your MySQL database running on EC2 from a server running outside EC2, then do it over a secure channel like an ssh tunnel or openvpn. You don’t need the MySQL port open in the security group to do this. The MySQL protocol is not by itself encrypted and your usernames and passwords would be sent in the clear for anybody else to intercept if you didn’t talk over a secure channel.
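As an illustration of the ssh tunnel option, here is one way it might look. The function name mysql_tunnel and the local port 3307 are arbitrary choices for this sketch; KEYPAIR.pem, USERNAME, and HOSTNAME are the same placeholders used in the ssh section.

```shell
# Forward a local port to MySQL on the instance through ssh.
# Usage: mysql_tunnel KEYPAIR.pem USERNAME@HOSTNAME [LOCAL_PORT]
mysql_tunnel() {
  local key=$1 target=$2 local_port=${3:-3307}
  # -N: no remote command, just forwarding.
  # -L: connections to 127.0.0.1:$local_port on this machine come out at
  #     127.0.0.1:3306 on the instance, where MySQL listens.
  ssh -i "$key" -N -L "$local_port:127.0.0.1:3306" "$target"
}
```

Leave the tunnel running in one terminal and, in another, connect with something like mysql -h 127.0.0.1 -P 3307 -u DBUSER -p. Only the ssh port needs to be open in the security group, and because the traffic arrives on the instance's loopback interface, this works even with the default bind-address.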

Custom AMIs

If you are building your own custom AMIs from scratch, then there are a number of complicated barriers to getting network and ssh connectivity working. Unfortunately it is nearly impossible to debug these problems since you don’t have access to the machine to see what went wrong. Console output is your only friend in these cases.

Here are some examples of odd things which others in the EC2 community have run into and solved:

  • Make sure you start networking on instance boot. It should come up with DHCP on eth0.

  • Make sure your Linux distro does not save the MAC address somewhere, preventing the network from functioning in the next instance. Ubuntu stores this in the /etc/udev/rules.d/70-persistent-net.rules file and Debian stores this in the /etc/udev/rules.d/z25_persistent-net.rules file.

  • Make sure your image downloads the ssh keypair and installs it in authorized_keys.

  • Make sure you have the right devices created and file systems mounted.

  • Make sure you’re using a udev lower than v144 as higher versions are incompatible with Amazon’s 2.6.21 kernel.

  • Make sure you’re using the right libc6 and related configurations including /lib/tls.
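For the saved MAC address item in the list above, one common approach (an assumption on my part, not the only fix) is to delete the rules files from the image before bundling it. The function name and the optional IMAGE_ROOT argument are made up for this sketch.

```shell
# Remove saved MAC-address rules from an image before bundling it.
# Usage: clear_persistent_net_rules [IMAGE_ROOT]   (defaults to the live system)
clear_persistent_net_rules() {
  local root=${1:-}
  # Ubuntu file first, then the Debian one; rm -f ignores whichever is absent.
  rm -f "$root/etc/udev/rules.d/70-persistent-net.rules" \
        "$root/etc/udev/rules.d/z25_persistent-net.rules"
}
```

Pass the directory where the image is mounted, or run it with no argument (as root) on the system you are about to bundle.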

Amazon

I realize this was your first thought, but it’s such a rare cause, I’ve put it here at the end. Sometimes there are problems with Amazon EC2. The hardware running your instance may fail or the networks might have temporary glitches. There are a couple different classes of problems here:

  1. Small scale problems local to the hardware running your instance. Though these are rare for any single instance, they are happening all the time for some customer somewhere given that AWS has hundreds of thousands of customers. Amazon often sends you an email when they notice that an instance is starting to have problems, and you should move to a new instance as soon as possible. If the failure happens without the warning, the only solution is to move to a new instance anyway, so you should always be prepared to do this.

  2. Large scale problems which affect a large number of customers simultaneously. These are very rare, and generally don’t affect more than a single availability zone given the way that Amazon has spread out the risk in their architecture.

You can check the AWS service health dashboard to see if Amazon is aware of any widespread problems with the EC2 service. If there are problems with a specific availability zone, you may want to move your servers to a different availability zone until the issues get resolved.

First Responses

For general cases where you can’t immediately figure out what went wrong with the connectivity, here are two things which are almost always recommended on EC2: reboot the instance and replace the instance.

Reboot your EC2 instance using the EC2 Console, another API client, or a command like:

    ec2-reboot-instances INSTANCE_ID

After giving it sufficient time to come up, see if that fixed the connectivity problem. Do not reboot your instance if you currently have a working ssh connection to it, but other ssh connections are failing!
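Rather than guessing how long "sufficient time" is, you could poll the ssh port until it answers. This is a rough sketch: the function names and the defaults (60 tries, 10 seconds apart, about ten minutes in total) are assumptions, and it relies on bash's /dev/tcp redirection and the coreutils timeout command.

```shell
# Return success once a TCP connection to HOST:PORT is accepted.
port_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Poll the ssh port until it answers or we give up.
# Usage: wait_for_ssh HOST [TRIES] [DELAY_SECONDS]
wait_for_ssh() {
  local host=$1 tries=${2:-60} delay=${3:-10} n=1
  while [ "$n" -le "$tries" ]; do
    if port_open "$host" 22; then
      echo "ssh port answered on attempt $n"
      return 0
    fi
    sleep "$delay"
    n=$((n + 1))
  done
  echo "ssh port still unreachable after $tries attempts" >&2
  return 1
}
```

A port that answers only tells you sshd is listening again; you still need to log in to confirm the original problem is gone.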

If you have a production service running on Amazon EC2 and you lose connectivity to an instance, then I recommend your first reaction be to kick off a replacement instance so that it boots and configures itself while you investigate the original issue. If you don’t solve the problem by the time the replacement is ready, simply switch over to the new server. You may want to continue investigating what happened with the old server, though I generally don’t care what the problem was unless it happens more than once or twice in a short time period.

If your installation environment does not allow you to easily start replacement instances, then you should reconsider how you are using EC2 and work to improve this.

Seeking Help

If the above did not help you solve your problem reaching your EC2 instance, you may want to reach out to the community including some AWS employees on the EC2 forum.

Amazon also has premium AWS support available.

Requests for connectivity help posted as comments on this article will not be published or answered. Please only post a comment if you have corrections or additional information to share for users experiencing problems. I do occasionally receive and respond to questions posted on other articles, but for this topic, please use the EC2 forum.

[Update 2011-10-01: Amazon Linux requires ssh using ec2-user@...]

Using RAID on EC2 EBS Volumes to Break the 1TB Barrier and Increase Performance

Amazon EC2 currently has a limit of 1,000 GB (1 TB) for EBS volumes (Elastic Block Store). It is possible to create file systems larger than this limit using RAID 0 across multiple EBS volumes. Using RAID 0 can also improve the performance of the file system, reducing total IO wait, as demonstrated in a number of published EBS performance tests.

The following instructions walk through one way to set up RAID 0 across multiple EBS volumes. Note that there is a limit on the size of a file system on 32-bit instances, but file systems on 64-bit instances can get unreasonably large. This test was run with 40 EBS volumes of 1,000 GB each for a total of 40,000 GB (40 TB) in the resulting file system.

Actual command line output showing the size of the RAID:

# df /vol
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/md0             41942906368      1312 41942905056   1% /vol

# df -h /vol
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               40T  1.3M   40T   1% /vol

These commands can run in less than 10 minutes and this could probably be reduced further by parallelizing the creation and attaching of the EBS volumes.

Note that the default limit is 20 EBS volumes per EC2 account. You can request an increase from Amazon if you need more.

Caution: 40 TB of EBS storage on EC2 will cost $4,000 per month plus usage charges.
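To redo the size and cost arithmetic for a different number of volumes, a few lines of shell are enough. The storage price below is an assumption: this article does not state a rate, and 10 cents per GB-month is simply what 40,000 GB costing $4,000 implies.

```shell
volumes=40          # number of EBS volumes in the array
size=1000           # GB per volume
cents_per_gb=10     # assumed price in cents per GB-month (implied by the figures above)

total_gb=$((volumes * size))
monthly_dollars=$((total_gb * cents_per_gb / 100))
echo "RAID 0 capacity: $total_gb GB"
echo "Monthly storage cost: \$$monthly_dollars"
```

Usage charges come on top of this, so treat the result as a floor rather than an estimate of the full bill.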

Instructions

Start a 64-bit instance (say, Ubuntu 8.04 Hardy from https://alestic.com). Use your own KEYPAIR:

ec2-run-instances \
  --key KEYPAIR \
  --instance-type c1.xlarge \
  --availability-zone us-east-1a \
  ami-0772946e

Configurable parameters (set on both local host and on EC2 instance):

instanceid=i-XXXXXXXX
volumes=40
size=1000
mountpoint=/vol

On the local host (with EC2 API tools installed)…

Create and attach EBS volumes:

devices=$(perl -e 'for$i("h".."k"){for$j("",1..15){print"/dev/sd$i$j\n"}}'|
           head -$volumes)
devicearray=($devices)
volumeids=
i=1
while [ $i -le $volumes ]; do
  volumeid=$(ec2-create-volume -z us-east-1a --size $size | cut -f2)
  echo "$i: created  $volumeid"
  device=${devicearray[$(($i-1))]}
  ec2-attach-volume -d $device -i $instanceid $volumeid
  volumeids="$volumeids $volumeid"
  let i=i+1
done
echo "volumeids='$volumeids'"
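If the perl one-liner that builds the device list is hard to read, this is roughly what it does, written as a plain shell function. The name device_names is made up for this sketch; the naming scheme and order are taken from the one-liner itself.

```shell
# Print the first COUNT device names in the order /dev/sdh, /dev/sdh1 ...
# /dev/sdh15, /dev/sdi, /dev/sdi1 ... through /dev/sdk15 (64 names at most).
device_names() {
  local count=$1 letter suffix n=0
  for letter in h i j k; do
    for suffix in "" 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do
      [ "$n" -lt "$count" ] || return 0
      echo "/dev/sd$letter$suffix"
      n=$((n + 1))
    done
  done
}
```

Calling device_names 40 gives the same 40 names that the perl pipeline with head produces, which is why the same snippet has to be run on both the local host and the instance with the same value of volumes.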

On the EC2 instance (after setting parameters as above)…

Install software:

sudo apt-get update &&
sudo apt-get install -y mdadm xfsprogs

Set up the RAID 0 device:

devices=$(perl -e 'for$i("h".."k"){for$j("",1..15){print"/dev/sd$i$j\n"}}'|
           head -$volumes)

yes | sudo mdadm \
  --create /dev/md0 \
  --level 0 \
  --metadata=1.1 \
  --chunk 256 \
  --raid-devices $volumes \
  $devices

echo DEVICE $devices       | sudo tee    /etc/mdadm.conf
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf

Create the file system (pick your preferred file system type):

sudo mkfs.xfs /dev/md0

Mount:

echo "/dev/md0 $mountpoint xfs noatime 0 0" | sudo tee -a /etc/fstab
sudo mkdir $mountpoint
sudo mount $mountpoint

Check it out:

df -h $mountpoint

When you’re done with it and want to destroy the data and stop paying for storage, tear it down:

sudo umount $mountpoint
sudo mdadm --stop /dev/md0

Terminate the instance:

sudo shutdown -h now

On the local host (with EC2 API tools installed)…

Detach and delete volumes:

for volumeid in $volumeids; do
  ec2-detach-volume $volumeid
done

for volumeid in $volumeids; do
  ec2-delete-volume $volumeid
done

Credits

This article was originally posted on the EC2 Ubuntu group.

Thanks to M. David Peterson for the basic mdadm instructions.

[Update 2012-01-21: Added --chunk 256 based on community recognized best practices.]

Keeping File Ownership (UIDs) Consistent when Using EBS on EC2

Persistent storage on Amazon EC2 is accomplished through the use of Elastic Block Store (EBS) volumes. EBS is basically a storage area network (SAN) and can be thought of as an on-demand, virtual, redundant hard drive plugged in to the server with super-powers like snapshot/restore.

An EBS volume can be detached from one EC2 instance and attached to another. You can create a snapshot of an EBS volume and create new volumes from the snapshot to attach to other instances. Though this flexibility provides some useful abilities, it also presents some challenges.

In particular, the files stored on the EBS volume will be owned by specific numeric UIDs (users) and GIDs (groups). When you fire up and configure a new instance, the UIDs and GIDs on the EBS volume may not exactly match the numeric ids of the users and groups on the new instance, depending on how you set it up.

For example, when you install the MySQL software, the package will generally create a new “mysql” user with the next available UID. If you don’t create the various users in exactly the same order on new instances, you may end up with your database files owned by the “postfix” user instead of the “mysql” user. It’s happened to me and I’m not the only one.

There is a discussion about this topic on the ec2ubuntu Google Group and it has also been raised on Canonical’s EC2 beta mailing list.

Here are some of the different approaches to avoiding or fixing this problem:

  1. Bundle your own AMIs and always run instances of the same AMI when attaching EBS volumes with files. This works if you already have to bundle your AMIs for other reasons, but I often recommend against AMI rebundling because of the efforts involved, lack of reproducibility, and maintenance problems when the base image gets updated or has bugs fixed.

  2. Automate the creation of users and installation of packages in exactly the same order every time. This is likely to give you the same UID/GID values for each user, but it starts to get messy if you end up with an order mixing human users and software package users.

  3. Create all users/groups with hardcoded UIDs/GIDs before installing software packages. If you automate the creation of users and groups you can force the “mysql” and “postfix” users to have a specific UID value. Then you install the MySQL and Postfix packages and the software will use the users which already exist on the system. We ended up following this approach with our EC2 servers at CampusExplorer.com.

  4. Correct the ownership of files after mounting the EBS volume. This feels a bit messy to me, but it might be the only option in some cases. I must admit that I’ve done this manually a number of times, but only after finding problems like MySQL not starting because the files aren’t owned by the correct user. For example, say you needed to change files currently owned by “postfix” to be correctly owned by “mysql”:

     find /vol -user postfix -print0 | xargs -0 chown mysql
    

    If you are changing ownership of files after mounting the EBS volume, make sure you do it in an order which does not lose information. For example, if you have to swap “postfix” and “mysql” users, you’ll need to use a temporary third UID as a placeholder.

  5. On the ec2ubuntu Google group it was suggested that a central authority might be a way to solve the problem. I’ve never used this approach on Linux and am not sure how much work it would be setting up a reliable service like this on EC2.
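Approach 3 above (hardcoded UIDs/GIDs created before the packages are installed) might be scripted along these lines. This is a sketch, not the script we actually use: the function name is invented, and the numeric ids in the usage note are arbitrary values picked for illustration.

```shell
# Create a group and user with a fixed numeric id unless they already exist.
# Usage: ensure_fixed_user NAME ID   (run as root, before installing packages)
ensure_fixed_user() {
  local name=$1 id=$2
  # getent succeeds if the entry exists, so creation only runs when needed.
  getent group "$name" >/dev/null ||
    groupadd --gid "$id" "$name" || return
  getent passwd "$name" >/dev/null ||
    useradd --uid "$id" --gid "$name" --no-create-home --shell /bin/false "$name"
}
```

For example, run ensure_fixed_user mysql 2001 and ensure_fixed_user postfix 2002 on every new instance before installing the MySQL and Postfix packages. Pick ids that no package or human user on your systems will claim first.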

No matter what approach you use, it might be a good idea to add in some checks after you mount an EBS volume to make sure that the files are owned by the appropriate users. For example, you might verify that the mysql directory is owned by the mysql user.
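A check like that could be as small as the following sketch. The function name check_owner is made up, the path in the usage note is a hypothetical location for the MySQL data directory on the volume, and stat -c is the GNU coreutils form.

```shell
# Succeed only if TARGET is owned by USER; complain on stderr otherwise.
# Usage: check_owner TARGET USER
check_owner() {
  local target=$1 want=$2 have
  have=$(stat -c %U "$target") || return   # GNU stat: %U prints the owner's name
  if [ "$have" != "$want" ]; then
    echo "$target is owned by $have, expected $want" >&2
    return 1
  fi
}
```

Something like check_owner /vol/mysql mysql || exit 1 in the script that mounts the volume would stop things before MySQL starts against files it cannot read.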

Solving this problem is something that I have only begun to work on, so I would appreciate any comments, pointers, and solutions that you may have.