Monitor an SNS Topic with AWS Lambda and Cronitor.io

get alerted when an expected event does NOT happen

Last week I announced the availability of a public SNS Topic that may be used to run AWS Lambda functions on a recurring schedule. To encourage folks to realize the implications of a free community service maintained by an individual, I named it the “Unreliable Town Clock”.

Even with this understanding, some folks in the AWS community have (again) placed their faith in me and are already starting to depend on the Unreliable Town Clock public SNS Topic to drive their own AWS Lambda functions and SQS queues. I want to make sure this service is as reliable as I can reasonably make it.

Here are some of the steps I have taken to increase the reliability of the Unreliable Town Clock:

  1. Runs in a dedicated AWS account. This helps prevent human error and accidents when working on other projects.

  2. Uses restrictive IAM roles/policies and good security practices. EC2 security groups don’t allow any incoming connections, not even ssh. I destroyed the root AWS account password and there are no IAM users.

  3. Uses an Auto Scaling group to automatically launch a replacement instance if the running one fails. In my tests, this takes a matter of minutes.

  4. Built reproducibly using a CloudFormation template. This means it’s easy to re-create in the event of a complete disaster, though it would still be bad if the SNS Topic disappeared, as clients would need to resubscribe.

  5. The SNS Topic itself is protected from deletion even if a delete request were somehow submitted for the CloudFormation stack.

  6. The SNS Topic is constantly monitored using AWS Lambda and Cronitor.io. The first delayed or missed chime will trigger alerts to a human and will keep alerting until corrected.

The rest of this article elaborates on this last point of protection.

Delayed/Missing SNS Message Monitoring and Alerting

Most monitoring and alerting services are designed to poll your resources and sound the alarm when the polled resource fails to respond, reports an error, or exceeds some threshold.

This works great for finding out when your web site is down or your server is unpingable. It doesn’t work so well for letting you know when your hourly cron job didn’t run, your ETL aborted mid-stream, or your expected daily email was not received.

And normal monitoring and alerting also can’t tell you when it’s been more than 15 minutes since the last message was published to your SNS Topic, which is exactly what I need to know in order to respond quickly to any failures of the Unreliable Town Clock that aren’t automatically handled by the AWS architecture.

Fortunately, this is exactly the type of monitoring and alerting that Cronitor is designed for. Here’s how I set it up:

  1. Sign up on Cronitor.io and create a new monitor (first monitor is free with email alerts). In my case, I selected “Notify me if [time since run exceeds] [16] [minutes]”.

  2. Create a simple AWS Lambda function that does an HTTP GET on your monitor’s run URL (e.g., https://cronitor.link/d3x0/run). See the sample code below.

  3. Subscribe the AWS Lambda function to the SNS Topic. See example instructions on the Unreliable Town Clock post.

Now, if the SNS Topic goes longer than 16 minutes between chimes, I get personally alerted so I can go investigate and whip the Unreliable Town Clock back into shape.
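The rule behind this kind of alert is simple enough to sketch in a few lines. This is only an illustration of the "time since run exceeds N minutes" idea, not Cronitor's actual implementation; the function name `isOverdue` and the example timestamps are made up for the demonstration.

```javascript
// Decide whether a heartbeat-style monitor should alert.
// lastPingMs and nowMs are millisecond timestamps; thresholdMinutes is the
// allowed gap between pings (16 in the setup described above).
// Hypothetical helper for illustration only.
function isOverdue(lastPingMs, nowMs, thresholdMinutes) {
    return nowMs - lastPingMs > thresholdMinutes * 60 * 1000;
}

var lastChime = Date.parse('2015-06-01T12:00:00Z');

// 15.5 minutes after the last chime: still inside the 16-minute window.
console.log(isOverdue(lastChime, Date.parse('2015-06-01T12:15:30Z'), 16)); // false

// 17 minutes after the last chime: the monitor would start alerting.
console.log(isOverdue(lastChime, Date.parse('2015-06-01T12:17:00Z'), 16)); // true
```

Setting the threshold a minute above the 15-minute figure mentioned earlier leaves a little slack for delivery delays before anyone gets woken up.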

Here’s some simplified AWS Lambda code that demonstrates how easy it is to ping a Cronitor.io monitor. The code I am running is slightly more involved with extra logging and parameterization of my monitor URL outside of the code, but this would do the job if you plugged in your own monitor run URL.

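// The "request" module is a third-party npm package, not part of Node.js;
// bundle it in the function's deployment package.
var request = require('request');

exports.handler = function(event, context) {
    // An HTTP GET on the monitor's run URL tells Cronitor the chime arrived.
    request('https://cronitor.link/d3x0/run',
            function(error, response, body) {
                // End the invocation, reporting any request error.
                context.done(error, body);
            }
    );
};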

Disclaimer: I am not a Node.js expert. I just Google what I want to do and try Stack Overflow answers until it seems to work. Ideas for improvement welcomed.

I suspect that I should be able to do some similar monitoring and alerting with CloudWatch Metrics and CloudWatch Alarms, and I may eventually work this out, but I still like to have some monitoring managed by an external party who is taking responsibility to make sure their system is running and who will notify me when mine is not.
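As a rough idea of what the CloudWatch route might look like, the sketch below only builds the parameter object for the CloudWatch `PutMetricAlarm` API; it does not call AWS. I have not run this against the Unreliable Town Clock, so treat every value as an assumption: the function name, the alarm name, and the choice of four 5-minute periods are mine. The alarm would also have to live in the account that owns the topic, since that is where the SNS metrics are reported.

```javascript
// Build PutMetricAlarm parameters for "no messages published recently".
// Hypothetical helper: topicName is the SNS Topic being watched and
// alertTopicArn is a separate SNS Topic that delivers the alarm to a human.
function missedChimeAlarmParams(topicName, alertTopicArn) {
    return {
        AlarmName: 'missed-chime-' + topicName,
        Namespace: 'AWS/SNS',
        MetricName: 'NumberOfMessagesPublished',
        Dimensions: [{ Name: 'TopicName', Value: topicName }],
        Statistic: 'Sum',
        Period: 300,               // seconds per datapoint
        EvaluationPeriods: 4,      // 4 x 5 minutes = 20 minutes of silence
        Threshold: 1,
        ComparisonOperator: 'LessThanThreshold',
        TreatMissingData: 'breaching', // no datapoint at all counts as a miss
        AlarmActions: [alertTopicArn]
    };
}

// Example with placeholder names; pass the result to a CloudWatch client.
console.log(JSON.stringify(
    missedChimeAlarmParams('my-topic', 'arn:aws:sns:us-east-1:123456789012:my-alerts'),
    null, 2));
```

Four consecutive quiet 5-minute periods are used instead of a single 15-minute one so that a chime landing a few seconds either side of a period boundary should not raise a false alarm.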

I rarely plug non-AWS services on this blog, but I love the simple design and powerful functionality of Cronitor.io and think the service fills an important need. In my brief time using the service, August and Shane have been incredibly helpful, generous, and responsive to suggestions for improvements.

If you become a paying customer, don’t let me stop you from suggesting that Cronitor support direct SNS Topic monitoring (eliminating the AWS Lambda step above) if you think that would be something you would use ;-)

Oh, and in case it wasn’t completely obvious, you can use the procedure described in this article to directly monitor the Unreliable Town Clock public SNS Topic yourself and get your own alerts if it ever misses a chime. Or, you can use it to monitor the reliability of the AWS Lambda function you subscribe to the SNS Topic, making sure that it completes successfully as often as it is supposed to.
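For that second use, the ping has to happen only after your own work has finished successfully. One way to arrange that is sketched below; `withCompletionPing`, `job`, and `ping` are hypothetical names, and `ping` stands in for whatever does the HTTP GET on your monitor's run URL.

```javascript
// Wrap a job so the monitor is pinged only when the job finishes without
// throwing. `job` and `ping` are hypothetical async functions supplied by
// the caller.
function withCompletionPing(job, ping) {
    return async function (event) {
        const result = await job(event); // a throw here skips the ping
        await ping();
        return result;
    };
}

// Usage sketch: the real job and ping would replace these stand-ins.
const handler = withCompletionPing(
    async (event) => 'processed ' + event.Records.length + ' record(s)',
    async () => console.log('monitor pinged')
);
handler({ Records: [{}] }).then(console.log);
```

If the job throws, the monitor never hears from the function, and the missed ping turns into an alert once the threshold passes.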