From an engineering point of view this is really cool, but as an ex-sysadmin I feel I need to reiterate and emphasize something that is alluded to in the second paragraph.
Too many things can go wrong, and you are all-around better off outsourcing this to something like Pingdom. You don't have sufficient levels of reliability; you aren't dual-homed across Twilio and another phone system. Maybe the cause of your outage is that AWS is having issues. Now your site and your monitoring are both down.
Much better to outsource to people who obsess over doing this right and making sure they are properly redundant.
It's the same as the "Twitter clone" that just posts messages with a 140-character limit, or "build a blog in 15 minutes".
Alerting on a downed website is sorta like a glacier: there's so much under the surface, and if you only see the surface you're missing out.
1. Multiple locations
2. Multiple check intervals
3. SMS/email provider switch on fail (a rough sketch follows this list)
4. Auto recovery of your checkers
5. Multiple providers with a single storage backend.
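Just to make point 3 concrete, here's a rough sketch of what "switch provider on fail" tends to look like (Python; the send_via_* functions are hypothetical stand-ins for real Twilio/SES/etc. clients, not any real API):

    # Try each notification channel in order; fall through to the next on failure.
    # The provider functions are hypothetical placeholders.
    def send_via_sms_primary(msg): ...
    def send_via_sms_backup(msg): ...
    def send_via_email(msg): ...

    PROVIDERS = [send_via_sms_primary, send_via_sms_backup, send_via_email]

    def alert(msg):
        for send in PROVIDERS:
            try:
                send(msg)
                return True
            except Exception:
                continue  # this provider is down or rejecting us; try the next
        return False  # every channel failed -- and that's the case you really need to notice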
> Now your site and your monitoring are both down. Much better to outsource to people who obsess over doing this right and making sure they are properly redundant.
With your own solution you will likely encounter the same problems Pingdom faced, including this one. The benefit of a service like Pingdom is that they have already solved those problems for you, or if they haven't, you still don't have to waste your own time solving them. It's not very efficient if everyone solves the same problems over and over again.
My favorite issue recently came up with a Django app of mine which was set up to email me when a request errors out. Turns out, when I switched which server it ran on, I misconfigured the email settings, and one of the errors was caused by the inability to send an email. Thankfully it only took a few days to figure this out.
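For the curious, the setup is roughly the standard Django "email admins on errors" configuration (the host and addresses below are placeholders); if EMAIL_HOST points at the wrong server, the very handler that is supposed to report errors is the thing that fails:

    # settings.py -- standard Django settings; host/addresses are placeholders.
    ADMINS = [("Me", "me@example.com")]
    SERVER_EMAIL = "django@example.com"

    EMAIL_HOST = "smtp.example.com"  # the part that was misconfigured after moving servers
    EMAIL_PORT = 587
    EMAIL_USE_TLS = True

    LOGGING = {
        "version": 1,
        "disable_existing_loggers": False,
        "handlers": {
            "mail_admins": {
                "level": "ERROR",
                "class": "django.utils.log.AdminEmailHandler",
            },
        },
        "loggers": {
            "django.request": {
                "handlers": ["mail_admins"],
                "level": "ERROR",
                "propagate": False,
            },
        },
    }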
We’ve had issues with Pingdom at work. We don’t use them ourselves, but we host web sites, and some customer of ours used Pingdom to monitor their web site hosted on our servers. The customer would complain to us about downtime reported by Pingdom, but we would read the logs and find everything OK, with multiple successful accesses from other people during the time which Pingdom reported our customer’s site as being down. A huge pain.
Don't services like Pingdom support multiple ping locations? If all of those fail, there's a very high chance there's an actual problem, if not with your server then with your (ISP's) connectivity.
The question is what customers of monitoring systems expect from the monitoring. Does a service like Pingdom explain what a failure means, or does it only provide data, leaving it up to the customer to interpret that data?
Multiple ping locations help by bringing more data points, but they don't address the problem of explaining what the data means. For example, Pingdom could triangulate the failure if fault identification were part of the monitoring business model.
I would describe the criticism of Pingdom as a failure of expectations. Pingdom is not a security service, a monitoring service, or a fault-identification service. It is a single test, and the data you get back is useless unless interpreted and verified.
If our ISP was down, we would not have had successful accesses from other people at the same time. If some transit ISP was down somewhere between us and Pingdom, well, that’s the Internet for you, eh? Regardless, Pingdom would report us as down, even though we weren’t at fault.
Yes, you were down for some of your users. If that's OK for you, that's fine. But if I were you I would be calling my ISP and trying to sort out why customers from location X can't connect but customers from location Y can.
If you're providing a service to your users, and they tell you (via Pingdom) that the service is down, you should be looking into it, not just saying "works on my machine".
Why should we be the ones to look into it? It was a random, intermittent, short-duration fault in the middle of the Internet, at some unknown place on the then-current path between us and Pingdom. Why shouldn’t Pingdom be at least equally obligated to look into it? After all, they’re the ones actually using the failing connection in order to monitor our and others’ services. But no, Pingdom simply reports us as being down and leaves the hard part to us, i.e. the part where we have to explain to our customers that the Pingdom report is actually provably incorrect.
I mean, what qualifies as “being up”? If some random link in the middle of the Internet goes down, and you suddenly, for 30 seconds, are unreachable for the few hundred people going through that exact link because it happens to be the best path between those people and your server, can they claim that you have failed to provide adequate uptime? If such a fault happens, are you then responsible for troubleshooting it? I say no. The Internet is the ISP’s responsibility, and the only faults actually meaningful to report to your ISP are the repeatable or long-lasting ones. Small stuff like this is not worth anybody’s time (except the ISPs’) to go digging into.
Well, if you're not providing a service to others, then you shouldn't be the one to look into it. But if you're providing a service to users and they tell you it's down, then you should. It might be that your ISP has a misconfigured route that is flapping and sometimes causes errors in some locations. Or a netmask is wrong somewhere and certain IP addresses can't be reached. It might not be a temporary thing. And if it's your ISP's fault, they might be able to fix it.
You seem to think that you have to investigate the issues yourself. On the contrary, you bump it up to your ISP to investigate. If your ISP is regularly having these issues, then it might be time to change ISPs to one with better peering agreements.
We've seen the same with Pingdom. ISPs are (mostly) multi-homed and Pingdom might use just one (affected) route to the ISP in question and then fail badly.
If Pingdom can't get to your site, it's highly likely your users can't either.
That was not the case with us; Pingdom would report short outages, like a few seconds here, a couple of minutes there, and only a handful of occurrences for the whole report duration (IIRC).
On the other hand, it's 'set up once and it just keeps chugging along', and isn't Yet Another SaaS To Manage.
Also, if you want a 'proper' ops alerting SaaS, you're looking at something along the lines of $50/user/mo or $15/server/mo, neither of which is trivial.
Yeah, assuming nothing falls apart with the custom implementation maintenance-wise. Programmers have a hard time focusing on their real goals, though; we often re-implement things that really aren't worth the time or money.
Well, sure, of course it will, but I don't think Nick is advocating replacing a complete, full featured monitoring system with this.
It could be very useful to, for example, keep an eye on your monitoring system. At $work, we have a pretty extensive monitoring system that we've built out. We use an external service to watch over the monitoring system, though, to alert us of any issues with it that we haven't otherwise caught.
Of course it can go down, and you can have CloudWatch alarms to alert you about that. But your Nagios server sending pings can go down too, and so can the fancy SaaS you signed up for.
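For the Lambda case, that "alert on the alerter" bit is basically a one-off alarm on the function's Errors (or Invocations) metric; roughly something like this with boto3, where the function name and SNS topic ARN are placeholders:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Alarm if the monitoring function reports any errors in a 5-minute window.
    # FunctionName and the SNS topic ARN below are placeholders.
    cloudwatch.put_metric_alarm(
        AlarmName="monitoring-lambda-errors",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "site-monitor"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )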
Yes, because they're different services running on different architecture and distributed differently. I challenge you to find one time in the past five years where CloudWatch was down at the same time as other services. Even if you can, I'm sure your custom built Nagios server in your datacenter has gone down as many or more times too.
But my bigger point here is that you're essentially asking "well how do you monitor your monitor?" At which point up the chain do you have enough? Also, I think the original post was simply a demo of what is possible. Yet whenever someone posts something, people go in the comments to belittle it. "Yeah, you built a monitoring solution... Well what happens if that goes down?"
Which is a legitimate question. But obviously if your production service is that critical to your business, you won't be monitoring it with a service that costs $0.0000002 per execution.
> they're different services running on different architecture and distributed differently
I think you underestimate the interdependency of services in AWS. Historically, if there were problems with S3 or EBS in us-east-1, you could expect the entire API to be flaky, and things like autoscaling to fail. These have been better distributed, but failures still cascade.
> I think the original post was simply a demo of what is possible
No, it wasn't a demo, it was an actual production issue. No alarms, no error logs, no way to tell it wasn't working other than someone noticing the queues were getting larger and contacting AWS.
> people go in the comments to belittle it
Only because the original project presents AWS Lambda as "the solution" for such problems, not realizing that it is just as fallible a solution as everything else.
> Well what happens if that goes down?
The solution to this is well known - two monitoring systems in physically separate locations that monitor each other as well as mission critical systems. Nagios, Icinga, and a dozen other well-tested solutions work remarkably well for these roles, yet people keep writing "new" solutions over and over and over.
> But obviously if your production service is that critical to your business, you won't be monitoring it with [this] service
Then what's its value, other than as an intellectual exercise?
Launch two Lambda functions, heck, 8 Lambda functions, one in each AWS region that supports it. They all monitor one another, plus run your checks. Next, are you going to say all 8 regions will go down at once?
The whole setup will still cost $0/month.
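In sketch form, each region's function would be something like the following (Python; the site URL, peer endpoints, and the notify step are placeholders -- a real setup would publish to SNS, call Twilio, etc.):

    # Rough per-region Lambda handler: check the site, check the other regions'
    # health endpoints, and complain if anything is unreachable.
    import urllib.request

    SITE = "https://example.com/"
    PEERS = [
        # hypothetical health endpoints exposed by the functions in other regions
        "https://monitor.us-west-2.example.com/ping",
        "https://monitor.eu-west-1.example.com/ping",
    ]

    def is_up(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except Exception:
            return False

    def notify(problems):
        # Placeholder: publish to SNS, hit Twilio, send email, ...
        print(problems)

    def handler(event, context):
        problems = []
        if not is_up(SITE):
            problems.append("site unreachable from this region")
        problems += ["peer monitor unreachable: " + p for p in PEERS if not is_up(p)]
        if problems:
            notify(problems)
        return {"problems": problems}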
> The solution to this is well known - two monitoring systems in physically separate locations that monitor each other as well as mission critical systems. Nagios, Icinga, and a dozen other well-tested solutions work remarkably well for these roles, yet people keep writing "new" solutions over and over and over.
Because not everyone needs heavy solutions to do something simple. Side projects, small sites, etc. And some people enjoy implementing old use cases using new technology. When Go was rising in popularity, half the posts on the front page were re-implementing fairly common features in Go.
Even if you're not going to implement this yourself, there can still be some value for other readers.
> are you going to say all 8 regions will go down at once
I hope not. But then it's not just Lambda triggered by CloudWatch alarms anymore. You'd probably have to set up something to ensure that Lambda, when called via CloudWatch alarms, is actually being triggered properly. Useful, but suddenly a lot more complicated.
> The whole setup will still cost $0/month.
Unlikely. A small amount, but certainly not 0. Especially when you start adding Lambda heartbeats.
> And some people enjoy implementing old use cases using new technology.
Which is fine; call it an experiment, call it exploration, I have no problem with that. It's frustrating to see such a stripped-down article treat it like it's going to be the one, without any reasonable discussion of how it could fail. There are at least three failure points in this system alone, with no discussion of how to compensate for them.
I now use okmeter.io and am really happy with it (especially for nginx and Postgres monitoring). They improve it constantly, and installation took just a couple of minutes. SMS/email/Slack notifications work great (although for Slack I needed to set up a webhook).
:D I wish SMS weren't so awkward; you're pretty much forced to have a credit system since it's so expensive. I'll probably still do it at some point. Makes it awkward for the customer as well if you have to babysit the credits.
I suggested it before, but I think you can work around it: Let your customers give you their Twilio API key (with a big disclaimer that any charges by Twilio are not your responsibility...).
That would be the nicest user experience for your users, but it is a bit risky. You probably have a "reasonable number of notifications per user per month" in mind. As you sign up new users, you will sooner or later get some that will exceed that number by a lot -- without any malicious intent.
You should just allow your users to provide their own Twilio API credentials and pass the costs on to the user. You can even make this option available only on your plus and pro plans, increasing your ARPU[1].
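If you go that route, the mechanics are pretty small with twilio-python; something like this sketch, where the customer.* fields are hypothetical attributes on your own customer model and the charges land on the customer's Twilio account:

    from twilio.rest import Client

    def send_alert_sms(customer, body):
        # customer.twilio_sid, customer.twilio_token, etc. are hypothetical
        # per-customer fields you'd store; billing goes to their account.
        client = Client(customer.twilio_sid, customer.twilio_token)
        return client.messages.create(
            to=customer.phone_number,
            from_=customer.twilio_from_number,
            body=body,
        )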
Thanks for this. I would use it but you only do HTTP requests, right? It would be great if you could also do a true ICMP ping (like the name suggests!)
Simplepush looks like a cool service - thanks for the heads up. It seems to accomplish the author's main need - that of a constant buzzing which needs to be picked up and dealt with.
I use https://aremysitesup.com/ and I've found it really helpful, as it's one of the few inexpensive services I've found that will CALL me if things are down. SMS is nice, but I use the do-not-disturb feature on my phone in the evenings, and at least on iOS, the only way to punch through that is with a call from a number on my favorites list. This meets that need very well, and I've found the service to be quite spot-on at alerting me (both the one time things hit the fan and during scheduled maintenance). I'd highly recommend it.
The advantage of this type of cloud solution over a one-size-fits-all cloud service like Pingdom (which I use) is flexibility. You can configure cloud agents to perform nearly any task you can envision.
Just keep in mind that these aren't incredibly reliable across the board. Others have very low or arbitrary autoban or blacklist policies. I eventually caved and paid Twilio to hassle with SMS logistics for me, rather than deal with the weirdness.
I did this for non-critical ops; it's just a forward to my phone's email address, really simple.
The problem is that not every company is reliable or even has that email.
You could literally just store a .csv file in S3 with the on-call schedule in it and run SQL queries against it with Athena; that would be cheap... you'd be querying a few KB, though DynamoDB is probably better for this use case, honestly. Athena is great for scanning huge datasets very quickly.
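As a sketch of the Athena route (bucket, database, and table names here are made up; the table would be defined over the .csv in S3 beforehand):

    import boto3

    athena = boto3.client("athena")

    # "Who is on call right now?" against a tiny CSV-backed table.
    resp = athena.start_query_execution(
        QueryString="""
            SELECT engineer, phone
            FROM oncall_schedule
            WHERE date(start_date) <= current_date
              AND date(end_date) >= current_date
        """,
        QueryExecutionContext={"Database": "ops"},
        ResultConfiguration={"OutputLocation": "s3://my-ops-bucket/athena-results/"},
    )
    print(resp["QueryExecutionId"])  # results land in the S3 output location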