S3 Bucket Notification to SQS/SNS on Object Creation

A fantastic new and oft-requested AWS feature was released during AWS re:Invent, but has gotten lost in all the hype about AWS Lambda functions being triggered when objects are added to S3 buckets. AWS Lambda is currently in limited Preview mode and you have to request access, but this related feature is already available and ready to use.

I’m talking about automatic S3 bucket notifications to SNS topics and SQS queues when new S3 objects are added.

Unlike AWS Lambda, with S3 bucket notifications you do need to maintain the infrastructure to run your code, but if you’re already running EC2 instances for application servers and job processing, this will fit right in.

To detect and respond to S3 object creation in the past, you needed to either have every process that uploaded to S3 subsequently trigger your back end code in some way, or you needed to poll the S3 bucket to see if new objects had been added. The former adds code complexity and tight coupling dependencies. The latter can be costly in performance and latency, especially as the number of objects in the bucket grows.

With the new S3 bucket notification configuration options, the addition of an object to a bucket can send a message to an SNS topic or to an SQS queue, triggering your code quickly and effortlessly.

Here’s a working example of how to set up and use S3 bucket notification configurations to send messages to SNS on object creation and update.
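The core setup with a recent aws-cli might look something like the sketch below. The bucket name, topic name, region, and account number are placeholders, and the SNS topic policy is trimmed to the essentials needed to let S3 publish to the topic:

# Create an SNS topic to receive the notifications (placeholder name).
aws sns create-topic --name s3-object-created-example

# Allow S3 to publish to the topic (substitute your topic ARN, account
# number, and bucket name).
aws sns set-topic-attributes \
  --topic-arn arn:aws:sns:us-east-1:123456789012:s3-object-created-example \
  --attribute-name Policy \
  --attribute-value '{
    "Version": "2008-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "s3.amazonaws.com" },
      "Action": "SNS:Publish",
      "Resource": "arn:aws:sns:us-east-1:123456789012:s3-object-created-example",
      "Condition": { "ArnLike": { "aws:SourceArn": "arn:aws:s3:*:*:my-example-bucket" } }
    }]
  }'

# Tell the bucket to send all object-creation events to the topic.
aws s3api put-bucket-notification-configuration \
  --bucket my-example-bucket \
  --notification-configuration '{
    "TopicConfigurations": [{
      "TopicArn": "arn:aws:sns:us-east-1:123456789012:s3-object-created-example",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'

Once this is in place, anything subscribed to the topic (email, SQS, HTTP endpoints) receives a message for each new object, and you can confirm the setup by uploading a test object with aws s3 cp.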

Reset S3 Object Timestamp for Bucket Lifecycle Expiration

use aws-cli to extend expiration and restart the delete or archive countdown on objects in an S3 bucket

Background

S3 buckets allow you to specify lifecycle rules that tell AWS to automatically delete or archive any objects in that bucket after a specific number of days. You can also specify a prefix with each rule so that different objects in the same bucket stay for different amounts of time.

Example 1: I created a bucket named logs.example.com (not the real name) that automatically archives an object to AWS Glacier after it has been sitting in S3 for 90 days.

Example 2: I created a bucket named tmp.example.com (not the real name) that automatically deletes a file after it has been sitting there for 30 days.
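For reference, rules like those two can be set with the aws-cli; here is a sketch using the placeholder bucket names above:

# Example 1: archive every object to Glacier 90 days after creation.
aws s3api put-bucket-lifecycle-configuration \
  --bucket logs.example.com \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-after-90-days",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Transitions": [{ "Days": 90, "StorageClass": "GLACIER" }]
    }]
  }'

# Example 2: delete every object 30 days after creation.
aws s3api put-bucket-lifecycle-configuration \
  --bucket tmp.example.com \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-after-30-days",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }]
  }'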

This works great until you realize that there are specific files that you want to keep around for just a bit longer than their original expiration.

You could download and then upload the object to reset its creation date, thus starting the countdown from zero; but through a little experimentation, I found that one easy way to reset the creation/modification timestamp on an S3 object is to ask S3 to change the object storage method to the same storage method it currently has.

The following example uses the new aws-cli command line tool to reset the timestamp of an S3 object, thus restarting the lifecycle counter. This has an effect similar to the Linux/Unix touch command.
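A sketch of that approach, using aws s3api copy-object to copy the object onto itself while explicitly specifying its current storage class (placeholder bucket and key, assuming the object is in STANDARD storage):

bucket=tmp.example.com      # placeholder bucket from the example above
key=path/to/important/file  # placeholder key

aws s3api copy-object \
  --bucket "$bucket" \
  --key "$key" \
  --copy-source "$bucket/$key" \
  --metadata-directive COPY \
  --storage-class STANDARD

The copy updates the object’s Last-Modified timestamp, so the lifecycle countdown starts over from zero.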

Cost of Transitioning S3 Objects to Glacier

how I was surprised by a large AWS charge and how to calculate the break-even point

Glacier Archival of S3 Objects

Amazon recently introduced a fantastic new feature where S3 objects can be automatically migrated over to Glacier storage based on the S3 bucket, the key prefix, and the number of days after object creation.

This makes it trivially easy to drop files in S3, have fast access to them for a while, then have them automatically saved to long-term storage where they can’t be accessed as quickly, but where the storage charges are around a tenth of the price.

…or so I thought.

Seeding Torrents with Amazon S3 and s3cmd on Ubuntu

Amazon Web Services is such a huge, complex service with so many products and features that sometimes very simple but powerful features fall through the cracks when you’re reading the extensive documentation.

One of these features, which has been around for a very long time, is the ability to use AWS to seed (serve) downloadable files using the BitTorrent™ protocol. You don’t need to run EC2 instances and set up software. In fact, you don’t need to do anything except upload your files to S3 and make them publicly available.

Any file available for normal HTTP download in S3 is also available for download through a torrent. All you need to do is append the string ?torrent to the end of the URL and Amazon S3 takes care of the rest.

Steps

Let’s walk through uploading a file to S3 and accessing it with a torrent client using Ubuntu as our local system. This approach uses s3cmd to upload the file to S3, but any other S3 software can get the job done, too.
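In outline, the steps might look like this, with placeholder bucket and file names:

# Install and configure s3cmd (it will ask for your access keys).
sudo apt-get install s3cmd
s3cmd --configure

# Create a bucket and upload the file with a public-read ACL.
s3cmd mb s3://my-torrent-bucket
s3cmd put --acl-public big-download.iso s3://my-torrent-bucket/big-download.iso

# Normal HTTP download:
#   http://my-torrent-bucket.s3.amazonaws.com/big-download.iso
# The same object as a torrent: just append ?torrent
#   http://my-torrent-bucket.s3.amazonaws.com/big-download.iso?torrent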

Identifying When a New EBS Volume Has Completed Initialization From an EBS Snapshot

On Amazon EC2, you can create a new EBS volume from an EBS snapshot using a command like

aws ec2 create-volume \
  --availability-zone us-east-1a \
  --snapshot-id snap-d53484bc

This returns almost immediately with a volume id which you can attach and mount on an EC2 instance. The EBS documentation describes the magic behind the immediate availability of the data on the volume:

“New volumes created from existing Amazon S3 snapshots load lazily in the background. This means that once a volume is created from a snapshot, there is no need to wait for all of the data to transfer from Amazon S3 to your Amazon EBS volume before your attached instance can start accessing the volume and all of its data. If your instance accesses a piece of data which hasn’t yet been loaded, the volume will immediately download the requested data from Amazon S3, and then will continue loading the rest of the volume’s data in the background.”

This is a cool feature which allows you to start using the volume quickly without waiting for the blocks to be completely populated. In practice, however, there is a period of time during which you experience high iowait when your application accesses disk blocks that need to be loaded. This manifests as anything from mild to extreme slowness for minutes to hours after the creation of the EBS volume.

For some applications (mine) the initial slowness is acceptable, knowing that it will eventually pick up and perform with the normal EBS access speed once all blocks have been populated. As Clint Popetz pointed out on the ec2ubuntu group, other applications might need to know when the EBS volume has been populated and is ready for high performance usage.

Though there is no API status to poll or trigger to notify you when all the blocks on the EBS volume have been populated, it occurred to me that there was a method which could be used to guarantee that you are waiting until the EBS volume is ready: simply request all of the blocks from the volume and throw them away.

Here’s a simple command which reads all of the blocks on an EBS volume attached to device /dev/xvdX or /dev/sdX (substitute your device name) and does nothing with them:

sudo dd if=/dev/xvdX of=/dev/null bs=10M

OR

sudo dd if=/dev/sdX of=/dev/null bs=10M

By the time this command is done, you know that all of the data blocks have been copied from the EBS snapshot in S3 to the EBS volume. At this point you should be able to start using the EBS volume in your application without the high iowait you would have experienced if you started earlier. Early reports from Clint are that this method works in practice.
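Putting the pieces together, a complete create, attach, and warm-up sequence might look roughly like this (the instance id and device names are placeholders; the device often appears as /dev/xvdf on the instance even when attached as /dev/sdf):

# Create the volume from the snapshot and capture its id.
volume_id=$(aws ec2 create-volume \
  --availability-zone us-east-1a \
  --snapshot-id snap-d53484bc \
  --query VolumeId \
  --output text)

# Wait for the volume to become available, then attach it to the instance.
aws ec2 wait volume-available --volume-ids "$volume_id"
aws ec2 attach-volume \
  --volume-id "$volume_id" \
  --instance-id i-0123456789abcdef0 \
  --device /dev/sdf

# On the instance: read every block once to force population from the snapshot.
sudo dd if=/dev/xvdf of=/dev/null bs=10M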

[Update 2012-04-02: Examples updated to support device names like sda1 and xvda1]

[Update 2019-08-07: New aws-cli command format]

Understanding Access Credentials for AWS/EC2

Amazon Web Services (AWS) has a dizzying proliferation of credentials, keys, ids, usernames, certificates, passwords, and codes which are used to access and control various account and service features and functionality. I have never met an AWS user who, when they started, did not have trouble figuring out which ones to use when and where, much less why.

Amazon is fairly consistent across the documentation and interfaces in the specific terms they use for the different credentials, but nowhere have I found these all listed in one place. (Update: Shlomo pointed out Mitch Garnaat’s article on this topic which, upon examination, may even have been my subconscious inspiration for this. Mitch goes into a lot of great detail in his two-part post.)

Pay close attention to the exact names so that you use the right credentials in the right places.

(1) AWS Email Address and (2) Password. This pair is used to log in to your AWS account on the AWS web site. Through this web site you can access and change information about your account including billing information. You can view the account activity. You can control many of the AWS services through the AWS console. And, you can generate and view a number of the other important access keys listed in this article. You may also be able to order products from Amazon.com with this account, so be careful. You should obviously protect your password. What you do with your email address is your business. Both of these values may be changed as needed.

(3) MFA Authentication Code. If you have ordered and activated a multi-factor authentication device, then parts of the AWS site will be protected not only by the email address and password described above, but also by an authentication code. This is a 6 digit code displayed on your device which changes every 30 seconds or so. The AWS web site will prompt you for this code after you successfully enter your email address and password.

(4) AWS Account Number. This is a 12 digit number separated with dashes in the form 1234-5678-9012. You can find your account number under your name on the top right of most pages on the AWS web site (when you are logged in). This number is not secret and may be available to other users in certain circumstances. I don’t know of any situation where you would use the number in this format with dashes, but it is needed to create the next identifier:

(5) AWS User ID. This is a 12 digit number with no dashes. In fact, it is simply the previously mentioned AWS Account Number with the dashes removed (e.g., 123456789012). Your User ID is needed by some API and command line tools, for example when bundling a new image with ec2-bundle-vol. It can also be entered into the ElasticFox plugin to help display the owners of public AMIs. Again, your User ID does not need to be kept private. It is shown to others when you publish an AMI and make it public, though it might take some detective work to figure out who the number really belongs to if you don’t publicize that, too.

(6) AWS Access Key ID and (7) Secret Access Key. This is the first of two pairs of credentials which can be used to access and control basic AWS services through the API including EC2, S3, SimpleDB, CloudFront, SQS, EMR, RDS, etc. Some interfaces use this pair, and some use the next pair below. Pay close attention to the names requested. The Access Key ID is 20 alpha-numeric characters like 022QF06E7MXBSH9DHM02 and is not secret; it is available to others in some situations. The Secret Access Key is 40 alpha-numeric-slash-plus characters like kWcrlUX5JEDGM/LtmEENI/aVmYvHNif5zB+d9+ct and must be kept very secret.

You can change your Access Key ID and Secret Access Key if necessary. In fact, Amazon recommends regular rotation of these keys by generating a new pair, switching applications to use the new pair, and deactivating the old pair. If you forget either of these, they are both available from AWS.
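For example, many tools and SDKs read this pair from environment variables (the values here are the sample strings shown above, not real credentials):

export AWS_ACCESS_KEY_ID=022QF06E7MXBSH9DHM02
export AWS_SECRET_ACCESS_KEY=kWcrlUX5JEDGM/LtmEENI/aVmYvHNif5zB+d9+ct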

(8) X.509 Certificate and (9) Private Key. This is the second pair of credentials that can be used to access the AWS API (SOAP only). The EC2 command line tools generally need these as might certain 3rd party services (assuming you trust them completely). These are also used to perform various tasks for AWS like encrypting and signing new AMIs when you build them. These are the largest credentials, taking the form of short text files with long names like cert-OHA6ZEBHMCGZ66CODVHEKKVOCYWISYCS.pem and pk-OHA6ZEBHMCGZ66CODVHEKKVOCYWISYCS.pem respectively.

The Certificate is supposedly not secret, though I haven’t found any reason to publicize it. The Private Key should obviously be kept private. Amazon keeps a copy of the Certificate so they can confirm your requests, but they do not store your Private Key, so don’t lose it after you generate it. Two of these pairs can be associated with your account at any one time, so they can be rotated as often as you rotate the Access Key ID and Secret Access Key. Keep records of all historical keys in case you have a need for them, like decrypting an old AMI bundle which you created with an old Certificate.
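To illustrate where these fit together, bundling an image with ec2-bundle-vol takes the Private Key, the Certificate, and the AWS User ID described above (the file names are the sample names from this article and the account number is a placeholder):

sudo ec2-bundle-vol \
  -k pk-OHA6ZEBHMCGZ66CODVHEKKVOCYWISYCS.pem \
  -c cert-OHA6ZEBHMCGZ66CODVHEKKVOCYWISYCS.pem \
  -u 123456789012 \
  -d /mnt \
  -r x86_64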

(10) Linux username. When you ssh to a new EC2 instance you need to connect as a user that already exists on that system. For almost all public AMIs, this is the root user, but on Ubuntu AMIs published by Canonical, you need to connect using the ubuntu user. Once you gain access to the system, you can create your own users.

(11) public ssh key and (12) private ssh key. These are often referred to as a keypair in EC2. The ssh keys are used to make sure that only you can access your EC2 instances. When you run an instance, you specify the name of the keypair and the corresponding public key is provided to that instance. When you ssh to the above username on the instance, you specify the private key so the instance can authenticate you and let you in.

You can have multiple ssh keypairs associated with a single AWS account; they are created through the API or with tools like the ec2-add-keypair command. The private key must be protected as anybody with this key can log in to your instances. You generally never see or deal with the public key as EC2 keeps this copy and provides it to the instances. You download the private key and save it when it is generated; Amazon does not keep a record of it.
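A typical sequence with the EC2 command line tools, using a placeholder keypair name and hostname, might look like this:

# Generate a new keypair; EC2 keeps the public key and prints the
# private key, which Amazon does not store, so save it right away.
ec2-add-keypair mykey > ~/.ssh/ec2-mykey.pem
chmod 600 ~/.ssh/ec2-mykey.pem

# Launch instances specifying the keypair name (e.g., with --key mykey),
# then connect using the matching private key and the appropriate username.
ssh -i ~/.ssh/ec2-mykey.pem ubuntu@ec2-198-51-100-1.compute-1.amazonaws.com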

(13) ssh host key. Just to make things interesting, each EC2 instance which you run will have its own ssh host key. This is a private file generated by the host on first boot which is used to protect your ssh connection to the instance so it cannot be intercepted and read by other people. In order to make sure that your connection is secure, you need to verify that the (14) ssh host key fingerprint, which is provided to you on your first ssh attempt, matches the fingerprint listed in the console output of the EC2 instance.
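For example, with the EC2 API tools you can dump the console output of a new instance and compare the host key fingerprints it printed at boot with the fingerprint ssh shows you (the instance id is a placeholder):

ec2-get-console-output i-abcd1234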

At this point you may be less or more confused, but here are two rules which may help:

A. Create a file or folder where you jot down and save all of the credentials associated with each AWS account. You don’t want to lose any of these, especially the ones which Amazon does not store for you. Consider encrypting this information with (yet another) secure passphrase.

B. Pay close attention to what credentials are being asked for by different tools. A “secret access key” is different from a “private key” is different from a “private ssh key”. Use the above list to help sort things out.

Use the comment section below to discuss what I missed in this overview.