I’m surprised at the delay in impact detection: it took their internal health service more than five minutes to notice (or at least alert) that their main protocol’s traffic had abruptly dropped to around 10% of expected and was staying there. Without ever having been involved in monitoring at that kind of scale, I’d have pictured alarms firing for something that extreme within a minute. I’m curious to hear how and why that might be, and whether it seems reasonable or surprising to professionals in that space too.
There's a constant tension between speed of detection and false positive rates.
Traditional monitoring systems like Nagios and Icinga have settings where they only open events/alerts if a check failed three times in a row, because spurious failed checks are quite common.
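The logic is roughly this (a toy Python sketch of the idea, not the actual Nagios/Icinga internals):

```python
# Toy sketch of the "retry before you alert" idea. This is not how
# Nagios/Icinga implement it internally; 3 is just the conventional default.
CONSECUTIVE_FAILURES_REQUIRED = 3

def evaluate_check(check, state):
    """check() returns True on success; state remembers consecutive failures."""
    state.setdefault("failures", 0)
    if check():
        state["failures"] = 0
        return "OK"
    state["failures"] += 1
    if state["failures"] >= CONSECUTIVE_FAILURES_REQUIRED:
        return "ALERT"        # now open an event / page someone
    return "SOFT_FAILURE"     # re-check soon, but stay quiet for now
```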
If you spam your operators with lots of alerts for monitoring checks that fix themselves, you stress them unnecessarily and create alert blindness, because the first reaction becomes "let's wait and see if it fixes itself".
I've never operated a service with as much exposure as CF's DNS service, but I'm not really surprised that it took 8 minutes to get a reliable detection.
I work on the SSO stack at a B2B company with about 200k monthly active users. One blind spot in our monitoring is when an error occurs on the client's identity provider because of a problem on our side. The service is unusable, but we don't have any error logs to raise an alert on. We tried to set up an alert based on expected vs. actual traffic, but we concluded that it would create more problems, for the reason you describe.
At Cloudflare’s scale on 1.1.1.1, I’d imagine you could do something comparatively simple, like tracking ten-minute and ten-second rolling averages (I know, I know, I make that sound much easier and more practical than it actually would be), and if they differ by more than 50%, sound the alarm. (Maybe the exact numbers would need to be tweaked, e.g. 20 seconds or 80%, but that’s the idea.)
Were it much less than 1.1.1.1 itself, taking longer than a minute to alarm probably wouldn’t surprise me, but this is 1.1.1.1; they’re dealing with vast amounts of probably fairly consistent traffic.
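To make the idea concrete, a rough sketch of the dual-window comparison (toy Python, using the made-up numbers from above, and glossing over all the real work of aggregating request counts globally):

```python
from collections import deque
import time

# Toy sketch: compare a short rolling window of request rates against a long
# baseline window and alarm when the short-term rate collapses.  The window
# sizes and the 50% threshold are the made-up numbers from above.
LONG_WINDOW_S = 600    # ten-minute baseline
SHORT_WINDOW_S = 10    # ten-second "right now"
DROP_THRESHOLD = 0.5   # alarm if current rate < 50% of baseline

samples = deque()      # (timestamp, requests_per_second) pairs

def record(requests_per_second):
    now = time.time()
    samples.append((now, requests_per_second))
    while samples and samples[0][0] < now - LONG_WINDOW_S:
        samples.popleft()

def should_alarm():
    now = time.time()
    baseline_rates = [r for _, r in samples]
    current_rates = [r for t, r in samples if t >= now - SHORT_WINDOW_S]
    if not baseline_rates or not current_rates:
        return False   # not enough data yet - "no data at all" is its own problem
    baseline = sum(baseline_rates) / len(baseline_rates)
    current = sum(current_rates) / len(current_rates)
    return current < DROP_THRESHOLD * baseline
```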
I work on something at a similar scale to 1.1.1.1; if we had this kind of setup our oncall would never be asleep (well, that is almost already the case, but alas). It's easy to say "just implement X monitor and you'd have caught this", but there's a real human cost, and you have to work extremely vigilantly at deleting monitors or you'll be absolutely swamped with endless false-positive pages. I don't think a 5 minute delay is unreasonable for a service at this scale.
This just seems kinda fundamental: the entire service was basically down, and it took 6+ minutes to notice? I’m just increasingly perplexed at how that could be. This isn’t an advanced monitor, this is perhaps the first and most important monitor I’d expect to implement (based on no closely relevant experience).
I don’t want to devolve this into an argument from authority, but - there are a lot of trade-offs in monitoring systems, especially at that scale. Among other things, aggregation takes time at scale, and with enough metrics and numbers coming in, your variance is all over the place. A core fact about distributed systems at this scale is that something is always broken somewhere in the stack - the law of averages demands it - and so if you’re going to set off every fire alarm any time part of the system isn’t working, you’ve got alarms going off 24/7. Actually detecting that an actual incident is actually happening on a machine of the size and complexity we’re talking about within 5 minutes is absolutely fantastic.
I'm sure some engineer at Cloudflare is evaluating something like this right now, running it against historical data to see how many false positives it would have generated in the past, if any.
Thing is, it's probably still a real engineering effort, and most orgs only really improve their monitoring after it has turned out to be sub-optimal.
This is hardly the first 1.1.1.1 outage. It’s also probably about the first external monitoring check you’d come up with. That’s why I’m surprised - more surprised the longer I think about it, actually; more than five minutes is a really long delay to notice such a fundamental breakage.
Is your external monitor working? How many checks failed, in what order? Across how many different regions or systems? Was it a transient failure? How many times do you retry, and at what cadence? Do you push your success or failure metrics? Do you pull? What if your metrics don’t make it back? How long do you wait before considering it a problem? What other checks do you run, and how long do those take? What kind of latency is acceptable for checks like that? How many false alarms are you willing to accept, and at what cadence?
This is one of those graphs that would have been on the giant wall in the NOC in the old days - someone would glance up and see it had dropped and say “that’s not right” and start scrambling.
Let's say you've got a metric aggregation service, and that service crashes.
What does that result in? Metrics get delayed until your orchestration system redeploys that service elsewhere, which looks like a 100% drop in metrics.
Most orchestration systems take a while to redeploy in this case, on the assumption that it could be a temporary outage of the node (like a network blip of some sort).
Sooo, if you alert after just a minute, you end up with people getting woken up at 2am for nothing.
What happens if you keep waking up people at 2am for something that auto-resolves in 5 minutes? People quit, or eventually adjust the alert to 5 minutes.
I know you can often differentiate between no data and real drops, but the overall point - "if you page people constantly, people will quit" - is I think the important one. If people keep getting paged because of overly tight alarms, the alarms can and should be loosened... and that's one way you end up at 5 minutes.
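FWIW, the "differentiate no data from real drops" part is sketchable too (toy Python, invented thresholds): if datapoints simply stopped arriving, that's a pipeline problem and should go to a quieter alert, not a 2am page.

```python
import time

# Toy sketch: before paging on "traffic dropped", first check whether the
# metrics pipeline has simply stopped delivering data.  Thresholds invented.
STALENESS_LIMIT_S = 120    # no datapoints for 2 min -> suspect the pipeline
SUSTAINED_DROP_S = 300     # require the drop to persist before paging

def classify(datapoints, baseline_rps):
    """datapoints: list of (timestamp, requests_per_second), newest last."""
    now = time.time()
    if not datapoints or now - datapoints[-1][0] > STALENESS_LIMIT_S:
        return "METRICS_PIPELINE_STALE"   # ticket / low-severity alert, not a page
    recent = [r for t, r in datapoints if t >= now - SUSTAINED_DROP_S]
    if recent and all(r < 0.5 * baseline_rps for r in recent):
        return "PAGE_ONCALL"              # sustained, real drop against baseline
    return "OK"
```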
The real issue in your hypothetical scenario is a single bad metrics instance can bring the entire thing down. You could deploy multiple geographically distributed metrics aggregation services which establish the “canonical state” through a RAFT/PAXOS quorum. Then as long as a majority of metric aggregator instances are up the system will continue to work.
When you are building systems like 1.1.1.1 having an alert rollup of five minutes is not acceptable as it will hide legitimate downtime that lasts between 0 and 5 minutes.
You need to design systems which do not rely on orchestration to remediate short transient errors.
Disclosure: I work on a core SRE team for a company with over 500 million users.
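To illustrate the quorum point (this is nowhere near a real Raft/Paxos deployment - just the "majority of fresh reads wins" shape, with invented names and numbers):

```python
import statistics
import time

# Toy sketch: ask several geographically distributed aggregators for the
# current traffic rate, and only trust the answer if a strict majority
# responded with fresh data; take the median so one bad instance can't
# skew the result.  Names and numbers are invented.
FRESHNESS_LIMIT_S = 60

def canonical_rate(readings):
    """readings: {aggregator_name: (timestamp, requests_per_second)}."""
    now = time.time()
    fresh = [rps for ts, rps in readings.values() if now - ts <= FRESHNESS_LIMIT_S]
    if len(fresh) * 2 <= len(readings):
        return None   # quorum lost - that's a monitoring-infra alert, not "service down"
    return statistics.median(fresh)
```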
It's not wrong for smaller companies. But there's an argument that a big, system-critical company/provider like Cloudflare should be able to afford its own always-on team with a night shift.
Please don’t. It doesn’t make sense, doesn’t help, doesn’t improve anything, and is just a waste of money, time, power and people.
Now, without the dramatics: I’ve seen multiple big companies get rid of the NOC and replace it with on-call duties in multiple focused teams. Instead of 12 people sitting 24/7 in groups of 4, doing some basic analysis and first steps before calling others, you page the correct people within 3-5 minutes, with an exact and specific alert.
Incident resolution times went down dramatically (2-10x, depending on the company), people don’t have to sit around overnight sleeping most of the time, and nobody takes pointless actions like a blind service restart that only slow down incident resolution.
And I don’t love that some platforms hire 1,500 people for a job that could be done with 50-100, but in terms of incident response, if you already have teams with separate responsibilities, then a NOC is "legacy".
I'm not convinced that the SWE crowd of HN, particularly the crowd showing up to every thread about AI 'agents' really knows what it takes to run a global network or what a NOC is. I know saying this on here runs the risk of Vint Cerf or someone like that showing up in my replies, but this is seriously getting out of hand now. Every HN thread that isn't about fawning over AI companies is devolving into armchair redditor analysis of topics people know nothing about. This has gotten way worse since the pre-ChatGPT days.
Over the last few years I've mostly just tuned out such responses and tried not to engage with them. The whole uninformed "Well, if it were me, I would simply not do that" style of comment has been pervasive on this site for longer than AI, though, IMO.
> Every HN thread that isn't about fawning over AI companies is devolving into armchair redditor analysis of topics people know nothing about.
It took me a very long time to realize that^. I've worked with two NOCs at two huge companies, and I know they still exist as teams at those companies. I'm not an SWE, though. And I'm not certain I'd qualify either company as truly "global" except in the loosest sense - as in, one has "American" in the name of the primary subsidiary.
^ I even regularly use the phrase "the comments were people incorrecting each other about <x>", so I knew subconsciously that HN is just a different subset of general internet comments. The issue comes from this site appearing to be moderated, and from the group of people who self-select into commenting here seeming like they would be above average at understanding and backing up claims. The "incorrecting" label comes from n-gate, which hasn't been updated since the early '20s, last I checked.
The question is, which is better: 24/7 shift work (so that someone is always at work to respond, with disrupted sleep schedules at regular planned intervals) or 24/7 on-call (with monitoring and alerting that results in random intermittent disruptions to sleep, sometimes for false positives)?
There are big step changes as the size of a company goes up.
Step 1: You start out with the founders being on call 24x7x365, or people in the first 10 or 20 hires "carrying the pager" on weekends and evenings, and your entire company doing unpaid rostered on-call.
Step 2: You steal all the underwear.
Step 3: You have follow-the-sun office-hours support staff teams distributed around the globe with sufficient coverage for vacations and unexpected illness or resignations.
It's not rocket science. You do a two-stage thing: why not check whether the aggregation service has crashed before firing the alarm, if it's within the first 5 minutes? How many types of false positives can there be? You just need to eliminate the most common ones, and you gradually end up with fewer of them.
Before you fire a quick alarm, check that the node is up, check that the service is up etc.
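Roughly like this (toy Python, with hypothetical check names standing in for whatever health probes already exist - the point is just the ordering: cheap sanity checks before the expensive page):

```python
# Toy sketch of the two-stage idea: before paging on a scary traffic drop,
# run cheap sanity checks that rule out the most common self-inflicted
# causes.  The check names are hypothetical stand-ins for whatever health
# probes you already have.

def handle_traffic_drop_alarm(region, checks):
    """checks: dict mapping a check name to a callable returning True if healthy."""
    sanity_checks = [
        ("metrics_pipeline_fresh", "ticket: metrics pipeline stale, not a traffic event"),
        ("aggregator_healthy", "ticket: aggregator down, wait for redeploy"),
        ("monitoring_node_up", "ticket: monitoring node down, fail over the check"),
    ]
    for name, quiet_outcome in sanity_checks:
        if not checks[name]():
            return quiet_outcome                        # downgrade: no 2am page
    return f"page: traffic genuinely down in {region}"  # nothing cheap explains it
```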
* traffic appears to be down 90% but we're only getting metrics from the regions of the world that are asleep because of some pipeline error
* traffic appears to be down 90% but someone put in a firewall rule causing the metrics to be dropped
* traffic appears to be down 90% but actually the counter rolled over and prometheus handled it wrong
* traffic appears to be down 90% but the timing of the new release just caused polling to show weird numbers
* traffic appears to be down 90% but actually there was a metrics reporting spike and there was pipeline lag
* traffic appears to be down 90% but it turns out that the team that handles transit links forgot to put the right acls around snmp so we're just not collecting metrics for 90% of our traffic
* I keep getting alerts for traffic down 90%.... thousands and thousands of them, but it turns out that really it's just that this rarely used alert had some bitrot and doesn't use the aggregate metrics but the per-system ones.
* traffic is actually down 90% because there's an internet routing issue (not the dns team's problem)
* traffic is actually down 90% at one datacenter because of a fiber cut somewhere
* traffic is actually down 90% because the normal usage pattern is that trough traffic volume is 10% of peak traffic volume
* traffic is down 90% from 10s ago, but 10s ago there was an unusual spike in traffic.
And then you get into all sorts of additional issues caused by the scale and distributed nature of a metrics system that monitors a huge global network of datacenters.
Having alarms firing within a minute just becomes a stress test for your alarm infrastructure. Is your alarm infrastructure able to get metrics and perform calculations consistently within a minute of real time?
The service almost certainly wasn't completely hard down at the time the impact began, especially if that's the start of a global rollout. It would have taken time for the impact to become measurable.