I’m surprised at the delay in impact detection: it took their internal health service more than five minutes to notice (or at least alert) that their main protocol’s traffic had abruptly dropped to around 10% of expected and was staying there. Without ever having been involved in monitoring at that kind of scale, I’d have pictured alarms firing for something that extreme within a minute. I’m curious to hear how and why that might be, and whether it seems reasonable or surprising to professionals in that space too.
There's a constant tension between speed of detection and false positive rates.
Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
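The logic is roughly this (a toy Python sketch of the idea, not the actual Nagios/Icinga internals):

```python
# Toy sketch of the "retry before you alert" idea. This is not how
# Nagios/Icinga implement it internally; 3 is just the conventional default.
CONSECUTIVE_FAILURES_REQUIRED = 3

def evaluate_check(check, state):
    """check() returns True on success; state remembers consecutive failures."""
    state.setdefault("failures", 0)
    if check():
        state["failures"] = 0
        return "OK"
    state["failures"] += 1
    if state["failures"] >= CONSECUTIVE_FAILURES_REQUIRED:
        return "ALERT"        # now open an event / page someone
    return "SOFT_FAILURE"     # re-check soon, but stay quiet for now
```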
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction becomes "let's wait and see if it fixes itself".
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
I work on the SSO stack at a B2B company with about 200k monthly active users. One blind spot in our monitoring is when an error occurs on the client's identity provider because of a problem on our side. The service is unusable, but we don't have any error logs to raise an alert on. We tried to set up an alert based on expected vs. actual traffic, but we concluded that it would create more problems, for the reason you describe.
At Cloudflare’s scale on 1.1.1.1, I’d imagine you could do something comparatively simple, like tracking ten-minute and ten-second rolling averages (I know, I know, I make that sound much easier and more practical than it actually would be), and if they differ by more than 50%, sound the alarm. (Maybe the exact numbers would need to be tweaked, e.g. 20 seconds or 80%, but that’s the idea.)
Were it much less than 1.1.1.1 itself, taking longer than a minute to alarm probably wouldn’t surprise me, but this is 1.1.1.1; they’re dealing with vast amounts of probably fairly consistent traffic.
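To make the idea concrete, a rough sketch of the dual-window comparison (toy Python, using the made-up numbers from above, and glossing over all the real work of aggregating request counts globally):

```python
from collections import deque
import time

# Toy sketch: compare a short rolling window of request rates against a long
# baseline window and alarm when the short-term rate collapses.  The window
# sizes and the 50% threshold are the made-up numbers from above.
LONG_WINDOW_S = 600    # ten-minute baseline
SHORT_WINDOW_S = 10    # ten-second "right now"
DROP_THRESHOLD = 0.5   # alarm if current rate < 50% of baseline

samples = deque()      # (timestamp, requests_per_second) pairs

def record(requests_per_second):
    now = time.time()
    samples.append((now, requests_per_second))
    while samples and samples[0][0] < now - LONG_WINDOW_S:
        samples.popleft()

def should_alarm():
    now = time.time()
    baseline_rates = [r for _, r in samples]
    current_rates = [r for t, r in samples if t >= now - SHORT_WINDOW_S]
    if not baseline_rates or not current_rates:
        return False   # not enough data yet - "no data at all" is its own problem
    baseline = sum(baseline_rates) / len(baseline_rates)
    current = sum(current_rates) / len(current_rates)
    return current < DROP_THRESHOLD * baseline
```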
I work on something at a similar scale to 1.1.1.1; if we had this kind of setup our oncall would never be asleep (well, that is almost already the case, but alas). It's easy to say "just implement X monitor and you'd have caught this", but there's a real human cost, and you have to work extremely vigilantly at deleting monitors or you'll be absolutely swamped with endless false-positive pages. I don't think a 5 minute delay is unreasonable for a service at this scale.
This just seems kinda fundamental: the entire service was basically down, and it took 6+ minutes to notice? I’m just increasingly perplexed at how that could be. This isn’t an advanced monitor, this is perhaps the first and most important monitor I’d expect to implement (based on no closely relevant experience).
I don’t want to devolve this into an argument from authority, but - there are a lot of trade-offs in monitoring systems, especially at that scale. Among other things, aggregation takes time at scale, and with enough metrics and numbers coming in, your variance is all over the place. A core fact about distributed systems at this scale is that something is always broken somewhere in the stack - the law of averages demands it - and so if you’re going to set off every fire alarm any time part of the system isn’t working, you’ve got alarms going off 24/7. Actually detecting that an actual incident is actually happening on a machine of the size and complexity we’re talking about within 5 minutes is absolutely fantastic.
I'm sure some engineer at Cloudflare is evaluating something like this right now, running it against historical data to see how many false positives it would have generated in the past, if any.
Thing is, it's probably still a real engineering effort, and most orgs only really improve their monitoring after it has turned out to be sub-optimal.
This is hardly the first 1.1.1.1 outage. It’s also probably about the first external monitoring check you’d come up with. That’s why I’m surprised - more surprised the longer I think about it, actually; more than five minutes is a really long delay to notice such a fundamental breakage.
Is your external monitor working? How many checks failed, in what order? Across how many different regions or systems? Was it a transient failure? How many times do you retry, and at what cadence? Do you push your success or failure metrics? Do you pull? What if your metrics don’t make it back? How long do you wait before considering it a problem? What other checks do you run, and how long do those take? What kind of latency is acceptable for checks like that? How many false alarms are you willing to accept, and at what cadence?
This is one of those graphs that would have been on the giant wall in the NOC in the old days - someone would glance up and see it had dropped and say “that’s not right” and start scrambling.
Let's say you've got a metric aggregation service, and that service crashes.
What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.
Most orchestration systems take a while to redeploy in this case, on the assumption that it could be a temporary outage of the node (like a network blip of some sort).
Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.
What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.
I know you can often differentiate between no data and real drops, but the overall point - "if you page people constantly, people will quit" - is I think the important one. If people keep getting paged because of overly tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
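FWIW, the "differentiate no data from real drops" part is sketchable too (toy Python, invented thresholds): if datapoints simply stopped arriving, that's a pipeline problem and should go to a quieter alert, not a 2am page.

```python
import time

# Toy sketch: before paging on "traffic dropped", first check whether the
# metrics pipeline has simply stopped delivering data.  Thresholds invented.
STALENESS_LIMIT_S = 120    # no datapoints for 2 min -> suspect the pipeline
SUSTAINED_DROP_S = 300     # require the drop to persist before paging

def classify(datapoints, baseline_rps):
    """datapoints: list of (timestamp, requests_per_second), newest last."""
    now = time.time()
    if not datapoints or now - datapoints[-1][0] > STALENESS_LIMIT_S:
        return "METRICS_PIPELINE_STALE"   # ticket / low-severity alert, not a page
    recent = [r for t, r in datapoints if t >= now - SUSTAINED_DROP_S]
    if recent and all(r < 0.5 * baseline_rps for r in recent):
        return "PAGE_ONCALL"              # sustained, real drop against baseline
    return "OK"
```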
The real issue in your hypothetical scenario is a single bad metrics instance can bring the entire thing down. You could deploy multiple geographically distributed metrics aggregation services which establish the “canonical state” through a RAFT/PAXOS quorum. Then as long as a majority of metric aggregator instances are up the system will continue to work.
When you are building systems like 1.1.1.1 having an alert rollup of five minutes is not acceptable as it will hide legitimate downtime that lasts between 0 and 5 minutes.
You need to design systems which do not rely on orchestration to remediate short transient errors.
Disclosure: I work on a core SRE team for a company with over 500 million users.
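To illustrate the quorum point (this is nowhere near a real Raft/Paxos deployment - just the "majority of fresh reads wins" shape, with invented names and numbers):

```python
import statistics
import time

# Toy sketch: ask several geographically distributed aggregators for the
# current traffic rate, and only trust the answer if a strict majority
# responded with fresh data; take the median so one bad instance can't
# skew the result.  Names and numbers are invented.
FRESHNESS_LIMIT_S = 60

def canonical_rate(readings):
    """readings: {aggregator_name: (timestamp, requests_per_second)}."""
    now = time.time()
    fresh = [rps for ts, rps in readings.values() if now - ts <= FRESHNESS_LIMIT_S]
    if len(fresh) * 2 <= len(readings):
        return None   # quorum lost - that's a monitoring-infra alert, not "service down"
    return statistics.median(fresh)
```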
It's not wrong for smaller companies. But there's an argument that a big, system-critical company/provider like Cloudflare should be able to afford its own always-on team with a night shift.
Please don’t. It doesn’t make sense, doesn’t help, doesn’t improve anything, and is just a waste of money, time, power and people.
Now, without the dramatics: I’ve seen multiple big companies get rid of the NOC and replace it with on-call duties in multiple focused teams. Instead of 12 people sitting 24/7 in groups of 4, doing some basic analysis and first steps before calling others, you page the correct people within 3-5 minutes, with an exact and specific alert.
Incident resolution times went down dramatically (2-10x, depending on the company), people don’t have to sit around overnight sleeping most of the time, and nobody takes pointless actions like a blind service restart that only slow down incident resolution.
And I don’t love that some platforms hire 1,500 people for a job that could be done with 50-100, but in terms of incident response, if you already have teams with separate responsibilities, then a NOC is "legacy".
I'm not convinced that the SWE crowd of HN, particularly the crowd showing up to every thread about AI 'agents' really knows what it takes to run a global network or what a NOC is. I know saying this on here runs the risk of Vint Cerf or someone like that showing up in my replies, but this is seriously getting out of hand now. Every HN thread that isn't about fawning over AI companies is devolving into armchair redditor analysis of topics people know nothing about. This has gotten way worse since the pre-ChatGPT days.
Over the last few years I've mostly just tuned out such responses and tried not to engage with them. The whole uninformed "Well, if it were me, I would simply not do that" style of comment has been pervasive on this site for longer than AI, though, IMO.
> Every HN thread that isn't about fawning over AI companies is devolving into armchair redditor analysis of topics people know nothing about.
It took me a very long time to realize that^. I've worked with two NOCs at two huge companies, and I know they still exist as teams at those companies. I'm not an SWE, though. And I'm not certain I'd qualify either company as truly "global" except in the loosest sense - as in, one has "American" in the name of the primary subsidiary.
^ I even regularly use the phrase "the comments were people incorrecting each other about <x>", so I knew subconsciously that HN is just a different subset of general internet comments. The issue comes from this site appearing to be moderated, and from the group of people who self-select into commenting here seeming like they would be above average at understanding and backing up claims. The "incorrecting" label comes from n-gate, which hasn't been updated since the early '20s, last I checked.
The question is, which is better: 24/7 shift work (so that someone is always at work to respond, with disrupted sleep schedules at regular planned intervals) or 24/7 on-call (with monitoring and alerting that results in random intermittent disruptions to sleep, sometimes for false positives)?
There are big step changes as the size of a company goes up.
Step 1: You start out with the founders being on call 24x7x365, or people in the first 10 or 20 hires "carrying the pager" on weekends and evenings, and your entire company doing unpaid rostered on-call.
Step 2: You steal all the underwear.
Step 3: You have follow-the-sun office-hours support staff teams distributed around the globe with sufficient coverage for vacations and unexpected illness or resignations.
It's not rocket science. You do a two-stage thing: why not check whether the aggregation service has crashed before firing the alarm, if it's within the first 5 minutes? How many types of false positives can there be? You just need to eliminate the most common ones, and you gradually end up with fewer of them.
Before you fire a quick alarm, check that the node is up, check that the service is up etc.
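Roughly like this (toy Python, with hypothetical check names standing in for whatever health probes already exist - the point is just the ordering: cheap sanity checks before the expensive page):

```python
# Toy sketch of the two-stage idea: before paging on a scary traffic drop,
# run cheap sanity checks that rule out the most common self-inflicted
# causes.  The check names are hypothetical stand-ins for whatever health
# probes you already have.

def handle_traffic_drop_alarm(region, checks):
    """checks: dict mapping a check name to a callable returning True if healthy."""
    sanity_checks = [
        ("metrics_pipeline_fresh", "ticket: metrics pipeline stale, not a traffic event"),
        ("aggregator_healthy", "ticket: aggregator down, wait for redeploy"),
        ("monitoring_node_up", "ticket: monitoring node down, fail over the check"),
    ]
    for name, quiet_outcome in sanity_checks:
        if not checks[name]():
            return quiet_outcome                        # downgrade: no 2am page
    return f"page: traffic genuinely down in {region}"  # nothing cheap explains it
```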
* traffic appears to be down 90% but we're only getting metrics from the regions of the world that are asleep because of some pipeline error
* traffic appears to be down 90% but someone put in a firewall rule causing the metrics to be dropped
* traffic appears to be down 90% but actually the counter rolled over and prometheus handled it wrong
* traffic appears to be down 90% but the timing of the new release just caused polling to show weird numbers
* traffic appears to be down 90% but actually there was a metrics reporting spike and there was pipeline lag
* traffic appears to be down 90% but it turns out that the team that handles transit links forgot to put the right acls around snmp so we're just not collecting metrics for 90% of our traffic
* I keep getting alerts for traffic down 90%.... thousands and thousands of them, but it turns out that really it's just that this rarely used alert had some bitrot and doesn't use the aggregate metrics but the per-system ones.
* traffic is actually down 90% because there's an internet routing issue (not the dns team's problem)
* traffic is actually down 90% at one datacenter because of a fiber cut somewhere
* traffic is actually down 90% because the normal usage pattern is that trough traffic volume is 10% of peak traffic volume
* traffic is down 90% from 10s ago, but 10s ago there was an unusual spike in traffic.
And then you get into all sorts of additional issues caused by the scale and distributed nature of a metrics system that monitors a huge global network of datacenters.
Having alarms firing within a minute just becomes a stress test for your alarm infrastructure. Is your alarm infrastructure able to get metrics and perform calculations consistently within a minute of real time?
The service almost certainly wasn't completely hard down at the time the impact began, especially if that's the start of a global rollout. It would have taken time for the impact to become measurable.