From an engineering point of view this is really cool, but as an ex-sysadmin I feel I need to reiterate and emphasize something that is alluded to in the second paragraph.
Too many things can go wrong, and you are all-around better off outsourcing this to something like Pingdom. You don't have sufficient levels of reliability; you aren't dual-homed across Twilio and another phone system. Maybe the cause of your outage is that AWS is having issues. Now your site and your monitoring are both down.
Much better to outsource to people who obsess over doing this right and making sure they are properly redundant.
It's the same as the "Twitter clone" that just posts messages with a 140-character limit, or "build a blog in 15 minutes".
Alerting on a downed website is sorta like a glacier: there's so much under the surface, and if you only see the surface you're missing out.
1. Multiple locations
2. Multiple check intervals
3. SMS/email provider switch on fail (a rough sketch follows this list)
4. Auto recovery of your checkers
5. Multiple providers with a single storage backend.
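Just to make point 3 concrete, here's a rough sketch of what "switch provider on fail" tends to look like (Python; the send_via_* functions are hypothetical stand-ins for real Twilio/SES/etc. clients, not any real API):

    # Try each notification channel in order; fall through to the next on failure.
    # The provider functions are hypothetical placeholders.
    def send_via_sms_primary(msg): ...
    def send_via_sms_backup(msg): ...
    def send_via_email(msg): ...

    PROVIDERS = [send_via_sms_primary, send_via_sms_backup, send_via_email]

    def alert(msg):
        for send in PROVIDERS:
            try:
                send(msg)
                return True
            except Exception:
                continue  # this provider is down or rejecting us; try the next
        return False  # every channel failed -- and that's the case you really need to notice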
> Now your site and your monitoring are both down. Much better to outsource to people who obsess over doing this right and making sure they are properly redundant.
With your own solution you will likely encounter the same problems Pingdom faced, including this one. The benefit of a service like Pingdom is that they have already solved those problems for you, or if they haven't, you still don't have to waste your own time solving them. It's not very efficient if everyone solves the same problems over and over again.
My favorite issue recently came up with a Django app of mine which was set up to email me when a request errors out. Turns out, when I switched which server it ran on, I misconfigured the email settings, and one of the errors was caused by the inability to send an email. Thankfully it only took a few days to figure this out.
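For the curious, the setup is roughly the standard Django "email admins on errors" configuration (the host and addresses below are placeholders); if EMAIL_HOST points at the wrong server, the very handler that is supposed to report errors is the thing that fails:

    # settings.py -- standard Django settings; host/addresses are placeholders.
    ADMINS = [("Me", "me@example.com")]
    SERVER_EMAIL = "django@example.com"

    EMAIL_HOST = "smtp.example.com"  # the part that was misconfigured after moving servers
    EMAIL_PORT = 587
    EMAIL_USE_TLS = True

    LOGGING = {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {
            "mail_admins": {
                "level": "ERROR",
                "class": "django.utils.log.AdminEmailHandler",
            },
        },
        "loggers": {
            "django.request": {
                "handlers": ["mail_admins"],
                "level": "ERROR",
                "propagate": False,
            },
        },
    }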
We’ve had issues with Pingdom at work. We don’t use them ourselves, but we host web sites, and some customer of ours used Pingdom to monitor their web site hosted on our servers. The customer would complain to us about downtime reported by Pingdom, but we would read the logs and find everything OK, with multiple successful accesses from other people during the time which Pingdom reported our customer’s site as being down. A huge pain.
Don't services like Pingdom support multiple ping locations? If all of those fail, there's a very high chance there's an actual problem, if not with your server then with your (ISP's) connectivity.
The question is what customers of monitoring systems expect from the monitoring. Does a service like Pingdom explain what a failure means, or does it only provide data, leaving it up to the customer to interpret that data?
Multiple ping locations help by bringing more data points, but they don't address the problem of explaining what the data means. For example, Pingdom could triangulate the failure if fault identification were part of the monitoring business model.
I would describe the criticism of Pingdom as a failure of expectations. Pingdom is not a security service, a monitoring service, or a fault-identification service. It is a single test, and the data you get back is useless unless interpreted and verified.
If our ISP was down, we would not have had successful accesses from other people at the same time. If some transit ISP was down somewhere between us and Pingdom, well, that’s the Internet for you, eh? Regardless, Pingdom would report us as down, even though we weren’t at fault.
Yes, you were down for some of your users. If that's OK for you, that's fine. But if I were you I would be calling my ISP and trying to sort out why customers from location X can't connect but customers from location Y can.
If you're providing a service to your users, and they tell you (via Pingdom) that the service is down, you should be looking into it, not just saying "works on my machine".
Why should we be the ones to look into it? It was a random, intermittent, short-duration fault in the middle of the Internet, at some unknown place on the then-current path between us and Pingdom. Why shouldn’t Pingdom be at least equally obligated to look into it? After all, they’re the ones actually using the failing connection in order to monitor our and others’ services. But no, Pingdom simply reports us as being down and leaves the hard part to us, i.e. the part where we have to explain to our customers that the Pingdom report is actually provably incorrect.
I mean, what qualifies as “being up”? If some random link in the middle of the Internet goes down, and you suddenly, for 30 seconds, are unreachable for the few hundred people going through that exact link because it happens to be the best path between those people and your server, can they claim that you have failed to provide adequate uptime? If such a fault happens, are you then responsible for troubleshooting it? I say no. The Internet is the ISP’s responsibility, and the only faults actually meaningful to report to your ISP are the repeatable or long-lasting ones. Small stuff like this is not worth anybody’s time (except the ISPs’) to go digging into.
Well, if you're not providing a service to others, then you shouldn't be the one to look into it. But if you're providing a service to users and they tell you it's down, then you should. It might be that your ISP has a misconfigured route that is flapping and sometimes causes errors in some locations. Or a netmask is wrong somewhere and certain IP addresses can't be reached. It might not be a temporary thing. And if it's your ISP's fault, they might be able to fix it.
You seem to think that you have to investigate the issues yourself. On the contrary, you bump it up to your ISP to investigate. If your ISP is regularly having these issues, then it might be time to change ISPs to one with better peering agreements.
We've seen the same with Pingdom. ISPs are (mostly) multi-homed and Pingdom might use just one (affected) route to the ISP in question and then fail badly.
If Pingdom can't get to your site, it's highly likely your users can't either.
That was not the case with us; Pingdom would report short outages, like a few seconds here, a couple of minutes there, and only a handful of occurrences for the whole report duration (IIRC).
On the other hand, it's 'set up once and it just keeps chugging along', and isn't Yet Another SaaS To Manage.
Also, if you want a 'proper' ops alerting SaaS, you're looking at something along the lines of $50/user/mo or $15/server/mo, neither of which is trivial.
Yeah, assuming nothing falls apart with the custom implementation maintenance-wise. Programmers have a hard time focusing on their real goals, though; we often re-implement things that really aren't worth the time or money.
Well, sure, of course it will, but I don't think Nick is advocating replacing a complete, full featured monitoring system with this.
It could be very useful to, for example, keep an eye on your monitoring system. At $work, we have a pretty extensive monitoring system that we've built out. We use an external service to watch over the monitoring system, though, to alert us of any issues with it that we haven't otherwise caught.
Of course it can go down, and you can have CloudWatch alarms to alert you about that. But your Nagios server sending pings can go down too, and so can the fancy SaaS you signed up for.
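For the Lambda case, that "alert on the alerter" bit is basically a one-off alarm on the function's Errors (or Invocations) metric; roughly something like this with boto3, where the function name and SNS topic ARN are placeholders:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm if the monitoring function reports any errors in a 5-minute window.
    # FunctionName and the SNS topic ARN below are placeholders.
    cloudwatch.put_metric_alarm(
        AlarmName="monitoring-lambda-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "site-monitor"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )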
Yes, because they're different services running on different architecture and distributed differently. I challenge you to find one time in the past five years where CloudWatch was down at the same time as other services. Even if you can, I'm sure your custom built Nagios server in your datacenter has gone down as many or more times too.
But my bigger point here is that you're essentially asking "well how do you monitor your monitor?" At which point up the chain do you have enough? Also, I think the original post was simply a demo of what is possible. Yet whenever someone posts something, people go in the comments to belittle it. "Yeah, you built a monitoring solution... Well what happens if that goes down?"
Which is a legitimate question. But obviously if your production service is that critical to your business, you won't be monitoring it with a service that costs $0.0000002 per execution.
> they're different services running on different architecture and distributed differently
I think you underestimate the interdependency of services in AWS. Historically, if there were problems with S3 or EBS in us-east-1, you could expect the entire API to be flaky, and things like autoscaling to fail. These have been better distributed, but failures still cascade.
> I think the original post was simply a demo of what is possible
No, it wasn't a demo, it was an actual production issue. No alarms, no error logs, no way to tell it wasn't working other than someone noticing the queues were getting larger and contacting AWS.
> people go in the comments to belittle it
Only because the original project presents AWS Lambda as "the solution" for such problems, not realizing that it is just as fallible a solution as everything else.
> Well what happens if that goes down?
The solution to this is well known - two monitoring systems in physically separate locations that monitor each other as well as mission critical systems. Nagios, Icinga, and a dozen other well-tested solutions work remarkably well for these roles, yet people keep writing "new" solutions over and over and over.
> But obviously if your production service is that critical to your business, you won't be monitoring it with [this] service
Then what's its value, other than as an intellectual exercise?
Launch two Lambda functions, heck, 8 Lambda functions, one in each AWS region that supports it. They all monitor one another, plus run your checks. Next, are you going to say all 8 regions will go down at once?
The whole setup will still cost $0/month.
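In sketch form, each region's function would be something like the following (Python; the site URL, peer endpoints, and the notify step are placeholders -- a real setup would publish to SNS, call Twilio, etc.):

    # Rough per-region Lambda handler: check the site, check the other regions'
    # health endpoints, and complain if anything is unreachable.
    import urllib.request

    SITE = "https://example.com/"
    PEERS = [
        # hypothetical health endpoints exposed by the functions in other regions
        "https://monitor.us-west-2.example.com/ping",
        "https://monitor.eu-west-1.example.com/ping",
    ]

    def is_up(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def notify(problems):
        # Placeholder: publish to SNS, hit Twilio, send email, ...
        print(problems)

    def handler(event, context):
        problems = []
        if not is_up(SITE):
            problems.append("site unreachable from this region")
        problems += ["peer monitor unreachable: " + p for p in PEERS if not is_up(p)]
        if problems:
            notify(problems)
        return {"problems": problems}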
> The solution to this is well known - two monitoring systems in physically separate locations that monitor each other as well as mission critical systems. Nagios, Icinga, and a dozen other well-tested solutions work remarkably well for these roles, yet people keep writing "new" solutions over and over and over.
Because not everyone needs heavy solutions to do something simple. Side projects, small sites, etc. And some people enjoy implementing old use cases using new technology. When Go was rising in popularity, half the posts on the front page were re-implementing fairly common features in Go.
Even if you're not going to implement this yourself, there can still be some value for other readers.
> are you going to say all 8 regions will go down at once
I hope not. But then it's not just Lambda triggered by CloudWatch alarms anymore. You'd probably have to set up something to ensure that Lambda, when called via CloudWatch alarms, is actually being triggered properly. Useful, but suddenly a lot more complicated.
> The whole setup will still cost $0/month.
Unlikely. A small amount, but certainly not 0. Especially when you start adding Lambda heartbeats.
> And some people enjoy implementing old use cases using new technology.
Which is fine; call it an experiment, call it exploration, I have no problem with that. It's frustrating to see such a stripped-down article treat it like it's going to be the one, without any reasonable discussion of how it could fail. There are at least three failure points in this system alone, with no discussion of how to compensate for them.
I now use okmeter.io and am really happy with it (especially for nginx and Postgres monitoring). They improve it constantly, and installation took just a couple of minutes. SMS/email/Slack notifications work great (although for Slack I needed to set up a webhook).
:D I wish SMS weren't so awkward; you're pretty much forced to have a credit system since it's so expensive. I'll probably still do it at some point. Makes it awkward for the customer as well if you have to babysit the credits.
I suggested it before, but I think you can work around it: Let your customers give you their Twilio API key (with a big disclaimer that any charges by Twilio are not your responsibility...).
That would be the nicest user experience for your users, but it is a bit risky. You probably have a "reasonable number of notifications per user per month" in mind. As you sign up new users, you will sooner or later get some that will exceed that number by a lot -- without any malicious intent.
You should just allow your users to provide their own Twilio API credentials and pass the costs on to the user. You can even make this option available only on your plus and pro plans, increasing your ARPU[1].
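If you go that route, the mechanics are pretty small with twilio-python; something like this sketch, where the customer.* fields are hypothetical attributes on your own customer model and the charges land on the customer's Twilio account:

    from twilio.rest import Client

    def send_alert_sms(customer, body):
        # customer.twilio_sid, customer.twilio_token, etc. are hypothetical
        # per-customer fields you'd store; billing goes to their account.
        client = Client(customer.twilio_sid, customer.twilio_token)
        return client.messages.create(
            to=customer.phone_number,
            from_=customer.twilio_from_number,
            body=body,
        )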
Thanks for this. I would use it but you only do HTTP requests, right? It would be great if you could also do a true ICMP ping (like the name suggests!)
Simplepush looks like a cool service - thanks for the heads up. It seems to accomplish the author's main need - that of a constant buzzing which needs to be picked up and dealt with.
I use https://aremysitesup.com/ and I've found it really helpful, as it's one of the few inexpensive services I've found that will CALL me if things are down. SMS is nice, but I use the do-not-disturb feature on my phone in the evenings, and at least on iOS, the only way to punch through that is with a call from a number on my favorites list. This meets that need very well, and I've found the service to be quite spot-on at alerting me (both the one time things hit the fan and during scheduled maintenance). I'd highly recommend it.
The advantage of this type of cloud solution over a one-size-fits-all cloud service like Pingdom (which I use) is flexibility. You can configure cloud agents to perform nearly any task you can envision.
Just keep in mind that these aren't incredibly reliable across the board. Others have very low or arbitrary autoban or blacklist policies. I eventually caved and paid Twilio to hassle with SMS logistics for me, rather than deal with the weirdness.
I did this for non-critical ops; it's just a forward to my phone's email address, really simple.
The problem is that not every company is reliable or even has that email.
You could literally just store a .csv file in S3 with the on-call schedule in it and run SQL queries against it with Athena; that would be cheap... you'd be querying a few KB, though DynamoDB is probably better for this use case, honestly. Athena is great for scanning huge datasets very quickly.
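As a sketch of the Athena route (bucket, database, and table names here are made up; the table would be defined over the .csv in S3 beforehand):

    import boto3

    athena = boto3.client("athena")

    # "Who is on call right now?" against a tiny CSV-backed table.
    resp = athena.start_query_execution(
        QueryString="""
            SELECT engineer, phone
            FROM oncall_schedule
            WHERE date(start_date) <= current_date
              AND date(end_date) >= current_date
        """,
        QueryExecutionContext={"Database": "ops"},
        ResultConfiguration={"OutputLocation": "s3://my-ops-bucket/athena-results/"},
    )
    print(resp["QueryExecutionId"])  # results land in the S3 output location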