Quick takes on the recent OpenAI public incident write-up (surfingcomplexity.blog)
108 points by azhenley 15 days ago | 69 comments



> In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

Something that I wish all databases and API servers would do, and that few actually do in practice, is to allocate a certain amount of headroom (memory and CPU) to "break glass in case of emergency" sessions. Have a handler, woken periodically, that listens exclusively on a port used only for emergency instructions (with the same security measures as production, and visible only internally). Ensure that it can allocate against a preallocated block of memory, and allow it to schedule higher-priority threads. A small concession to make in the usual course of business, but when it's needed it's vital.
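As a rough sketch of what reserving that kind of headroom can look like at the OS level (assuming systemd on cgroup v2; the slice name and values here are invented, and memory.min protection also depends on how ancestor cgroups are configured):

  # Run a break-glass admin shell in its own cgroup with reserved memory
  # and a high CPU weight, so it stays responsive when the box is saturated
  sudo systemd-run --slice=emergency.slice -p MemoryMin=512M -p CPUWeight=1000 --pty bash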


I have a 1 GiB file to rm in case of emergency on every filesystem I manage - both personal and professional. I've only had to delete it maybe three or four times that I remember, but it's been a system-saver each time. I've long considered writing a process that just consumes some CPU, memory, and bandwidth as the same kind of emergency buffer. I've even considered having it examine `last` every second and free up those resources when I SSH in.
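If you want to set one up, something like this works (path and size are just examples):

  # Preallocate a 1 GiB ballast file; fallocate is near-instant on ext4/xfs,
  # with dd as a fallback for filesystems that don't support it
  sudo fallocate -l 1G /var/ballast || sudo dd if=/dev/zero of=/var/ballast bs=1M count=1024

  # In an emergency, reclaim the space immediately
  sudo rm /var/ballast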


It is also common for filesystems to reserve a small percentage for the root user. I think the ext4 default is still 5% (which can be quite a bit more than 1 GiB on modern drives!)
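You can check and tune that reserve per filesystem (device name is an example):

  # Show the current reserved block count on ext4
  sudo dumpe2fs -h /dev/sda1 | grep -i 'reserved block count'

  # Shrink the root reserve to 2% (tune2fs -m 0 removes it entirely)
  sudo tune2fs -m 2 /dev/sda1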


I haven't used root directly for over a decade. Modern usage is to log in as an unprivileged user, and to use sudo for all root operations.


Haha, I'm going to steal that!


What’s your point here? Using sudo is using root.


If you can't log in as user because resources are reserved for root... you can't sudo.


I think you missed the joke here


I don't think they were joking?

I personally tend to leave a root shell open (and both my feet remain largely hole-free to this day), but the advice is pretty common - it's meant to avoid accidentally typing something unfortunate like rm ./* into the wrong shell.


Yeah, no. If you're going to accidentally type `rm /`, you can just as easily accidentally type `sudo rm /`. It's the same security model, and honestly the distinction is quite funny.


No, there's a distinction if you look again. People don't accidentally type sudo in front of commands, but people do type things into the wrong window, or the wrong tab.


> People don't accidentally type sudo in front of commands

And you're basing this assumption off...?


Why, I am acutely tuned to the hivemind, of course.


There is no joke here.


I missed the joke too. Care to share?


sudo is equivalent to root


But it’s not. It’s a subset of what root can do.

The entire purpose of the /etc/sudoers file is to configure which users have access to sudo and which commands they can use.

Your top comment's parent didn't say the SSH login user had all sudo permissions. For best security, there should be many users, each with different limited permissions. Navigating the multiple `sudo su` hops is frustrating, but it has a purpose.


> The entire purpose of the /etc/sudoers file is to configure which users have access to sudo and which commands they can use.

In all of my career I have seen that at exactly one company. Everyone else just leaves it unrestricted. I would be impressed to see sudo used the way it was intended in more places. Some places even use passwordless sudo and SSH multiplexing, which together with simple phishing give unfettered and unlogged access to production.


Yup. 'tune2fs -m0' has saved my bacon more than once.


We used to refer to this as a "cork file" back in the day.


Resource exhaustion can be super frustrating, and always feels like a third world slum situation when it happens.

Like, why would operating systems allow themselves to run out of headroom entirely in this day and age?


> a third world slum situation when it happens

> why would operating systems allow themselves to run out of headroom entirely in this day and age?

Following the analogy, perhaps for the same reason the richest cities in the richest, most developed Western nations are rapidly beginning to look very much like third-world slums?

Over-optimization is the name of the game. All systems need some amount of slack in them to stay flexible, robust (and livable). Unfortunately, cutting into the slack is always profitable on the margin, so without top-down intervention, the slack will be cut until it's gone entirely. "Look, these machines are utilized to only 75% of capacity; adding this new system will only increase that by 5%." "Look, we can save $XX/month by cutting compute; the machines will still max out at 90%, so we'll still have a buffer." "Oh, this new telemetry service will bump that by only 1%."

"Look, there's so much free space here in between these blocks of flats; adding another block won't hurt."

And so on. Until your machines are running at 99% capacity and you risk a global outage every time someone sneezes near the server room. Until your city starts to look like London, and if you're from Central Europe like me, you may start to realize that being a few months or years behind on the most recent gadgets is a small price to pay in exchange for cities that are affordable and clean.


They usually don’t? If they did, they would have just crashed. If they are still running, then they allowed themselves headroom. The real question is who gets to use that headroom and how? What happens when the administrative processes use “too many” resources?

There are tons of tools built into modern OSes to manage prioritization of processes, etc. Whether it’s worth the tradeoff for you to deal with the operational overhead of harnessing those features is up to you.


Because it makes the computer slower, and people pay for compute.

I think consumer systems do it. macOS shows the "quit some app, you're out of RAM" dialog, but the system itself keeps working.

But if you are asking whether the OS knows this is a container process and this is a control-plane process and treats them differently - I think no one does that.

Imagine how frustrating it would be if your OS got it wrong and you ended up paying 10x because it nerfed your main process, thinking something more important was running.


> the OS knows this is a container process and this is a control-plane process and treats them differently - I think no one does that

cgroups on Linux do exactly this, and they are a standard part of ensuring that containers don't exceed their allocated resources.


Most software running on a Linux server is not running in a container, which itself has (minimal) resource overhead.

Unless you are talking about VMs in cloud computing, but even in those cases the VM is usually abstracted away and the end client only sees a VPS (e.g. with EC2 or GCE).


You don't need containers to use cgroups. For instance, systemd uses cgroups to control resource allocation to various services.
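Concretely (the unit name is a placeholder; this assumes the service is already running under systemd):

  # Cap a single service's CPU and memory through its cgroup - no container runtime involved
  sudo systemctl set-property --runtime some-agent.service CPUQuota=50% MemoryMax=2G

  # Watch per-unit resource usage across the cgroup tree
  systemd-cgtop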


OK, with some configuration a server OS can do it too. Looks like ClosedAI doesn't pay its sysadmins enough if they didn't configure such a standard thing.


I was replying to your comment where you suggested that OS-level resource reservation wasn't a thing that happened in production systems. The situation here is that a single process was being overloaded by one type of load, preventing it from doing the other things that process was responsible for. Making that work properly would require load-shedding/prioritization within the process itself.


> Kubernetes API servers saturated because they were receiving too much traffic. Once that happened, the API servers no longer functioned properly. As a consequence, their DNS-based service discovery mechanism ultimately failed

I am not sure where it says this is about "one process"?


> Imagine how frustrating it would be if your OS got it wrong and you ended up paying 10x because it nerfed your main process, thinking something more important was running

Imagine how frustrating it is when the Linux on your desktop suddenly decides to start randomly killing processes, or worse, attempts to swap some memory out, causing a feedback loop of delays that completely freezes the system until you power-cycle the machine.

That's one of the two things Windows always did better (the other thing is disabling write buffering for removable storage, on account of that storage being, well, removable).

Resource limits are not something you want to discover when you've exceeded them; they need to be managed and alerted about in advance. "Makes computer slower" and alerting the "people who pay for compute" is preferable to crashing - especially in distributed systems, where failures cascade (particularly when the whole system has been penny-pinched / overoptimized to the same degree as any single computer within it).


> That's one of the two things Windows always did better

Someone replied about cgroups on Linux and how it is bog standard stuff (that ClosedAI just didn't know how to use apparently?)


You always have a load balancer in front of your control plane (apiservers) if you have more than one; that's what I would use to break the glass. You need your engineers to know how to do that, though.

You don't even need to take everything down: pull one apiserver out of the pool for admin access (provided etcd is not overwhelmed too).


Yep - my understanding of https://github.com/kubernetes/kubeadm/blob/main/docs/ha-cons... is that Kubernetes doesn't usually control that load balancer (nor should it, since you could accidentally tell Kubernetes to take down the control plane LB, then not be able to get it back up again!).

I suppose one could set up a "fast pass lane" kubeconfig that adds a header that haproxy would understand, and route to a priority class in its queue with e.g. https://www.haproxy.com/documentation/haproxy-configuration-... . But there's no easy `kubectl --with-priority` (or, to my knowledge, good guidelines for the various gitops solutions) that follows this pattern out of the box.


I just connect to the backend server directly after taking that backend out of the load-balanced pool. Easier than going through the load-balancer in a special way.
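Roughly (addresses and the kubeconfig path are examples; you may hit a TLS SAN mismatch if the apiserver cert doesn't cover the node address):

  # Bypass the load balancer and talk to one apiserver directly
  kubectl --kubeconfig /etc/kubernetes/admin.conf \
          --server https://10.0.0.11:6443 \
          get pods -A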


As I understand it you can configure API rate limiting per user, but you'd need to work out "reasonable" values on your own to leave enough headroom for admin requests.

This would require per-cluster testing (and is complex to test since you need to induce representative load) so I suppose hardly anyone does it.


There's a whole set of features built[0] to prevent this exact scenario - runaway controllers consuming all API resources. Not sure why it didn't help here; maybe they are running an old release.

[0] - https://kubernetes.io/docs/concepts/cluster-administration/f...
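You can see what API Priority and Fairness is doing on a given cluster (assuming a version recent enough to have it enabled by default):

  # The flow schemas and priority levels that shape apiserver traffic
  kubectl get flowschemas
  kubectl get prioritylevelconfigurations

  # Rejection counts per priority level show whether load shedding is kicking in
  kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total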


Microsoft SQL Server has a dedicated admin connection (DAC) feature.
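For anyone unfamiliar: the server reserves resources for a single DAC session, so it tends to work even when the instance is otherwise unresponsive. Roughly (the server name is a placeholder; by default only local connections are allowed unless 'remote admin connections' is enabled):

  # Connect over the dedicated admin connection
  sqlcmd -S myserver -A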


In Zig you have to handle the case where malloc does not succeed.


   In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
The DNS song seems appropriate.

https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tun...


Something doesn't add up - CoreDNS's kubernetes plugin should be serving Service RRs from its internal cache even if the APIServer is down, because it's using cache.Indexer. The records would be stale, but unless their application pods all restarted (which they could not, since the APIServer was down) or all the CoreDNS pods got restarted (which, again, they could not), records merely expiring from the cache shouldn't have caused a full discovery outage.


Wouldn't it be that CoreDNS caches the information and records from the API server for X amount of time (it seems like this might be 20 minutes?), and then once the 20 minutes expired CoreDNS would query the API server, receive no response, and fail?

I think the idea of serving cached responses indefinitely when the API server is unreachable is what you're describing, but I'm not sure if that's the default (and it probably has other tradeoffs that I'm not sure about, too).


Based on my understanding of the plugin code, it is the default. The way cache.Indexer works is that it continuously streams resources from the APIServer using the Watch API and updates an internal map. I think if the Watch API is down it just sits there and doesn't purge anything, but I haven't tested that. The 20 min expiry is probably referring to the CoreDNS cache stanza, which is a separate plugin[0].

[0] - https://coredns.io/plugins/cache
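One way to check what a given cluster actually does is to look at the deployed Corefile (assuming the stock kube-system ConfigMap name used by kubeadm):

  # Dump the live Corefile and inspect the kubernetes and cache stanzas and their TTLs
  kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'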


I had the same thought! https://news.ycombinator.com/item?id=42446318

My guess is they were running CoreDNS on control plane nodes since that's the kubeadm default.


I caused an API server outage once with a monitoring tool, though in my case it was a monstrosity of a 20,000-line script. We quickly realized what we had done and turned it off, and I have seen in very large clusters with 1000+ nodes that you need to be especially sensitive about monitoring API server resource usage, depending on what precisely you are doing. Surprised they hadn't learned this lesson yet, given the likely scale of their workloads.
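A couple of quick checks before a monitoring change goes wide (the metric and endpoint names are from recent apiserver versions):

  # Apiserver health with per-check detail
  kubectl get --raw '/readyz?verbose' | tail

  # In-flight and total request counts the apiserver exposes
  kubectl get --raw /metrics | grep -E 'apiserver_current_inflight_requests|apiserver_request_total' | head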


> 20,000 line script

Dude.


They meant manuscript I assume.


Yeah, "script" is just an abbreviation of "manuscript", right? ;-)


Right, and a (manu)script of that size makes for a long scroll. Humanity had to more or less invent pagination to deal with this.

(I'll see myself out.)


Thanks dad. ;-)

Job well done.


This was most definitely not my choice or preference, but you know how it goes.


This quote cracked me up:

“I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS”


James Mickens is a comedic genius. The linked article always makes me laugh out loud.

https://www.usenix.org/system/files/1311_05-08_mickens.pdf


Excellent article, but the typesetting / justification in that pdf is horrendous.

The text columns look like the side of that hallway rubber mat that my dog keeps chewing on.

  spend a lot of time trying
  edge. However, as someone
  lieve that true progress is
  mes, and for the chickens
  y zombies, and the polite
  to eat your brain to acquire
  be prepared; thus, in the
  e scientific breakthroughs,
  ast inevitably becomes
  he main thing that I ponder is
  post-apocalyptic survival
  ag-tag group of associates.
  cruit: a locksmith (to open
  ith has run out of ideas);
  row snakes at my enemies
  g is a reasonable way to
  ble in my ultimate success



That was an amazing read. Thanks for linking it.


Wow, sounds like a nightmare. Operations staff definitely have real jobs.


Recent and related:

ChatGPT Down - https://news.ycombinator.com/item?id=42394391 - Dec 2024 (30 comments)


Splitting the control and data plane is a great way to improve resilience and prevent everything from being hard down. I wonder how it could be accomplished with service discovery / routing.

Maybe instead of relying on Kubernetes DNS for discovery it could be closer to something like Envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies the data plane still has access to the location information of other peer clusters.


> Maybe instead of relying on Kubernetes DNS for discovery it could be closer to something like Envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies the data plane still has access to the location information of other peer clusters.

That is generally how it is done... there's this never ending conflict between "push" architectures and "pull" architectures, and this scenario sure makes "push" seem better, and it is... until you're in one of those scenarios where "pull" is better. ;-)


I imagine a peer-to-peer scheme could also work. All nodes are both control plane and data plane, and advertise/broadcast to other nodes (similar to how network routing tables are distributed).

That'd be a pretty big architectural change, though


Surprised they don't have a slower rollout across multiple regions / Kubernetes clusters, given that the K8s APIs are a SPOF, as shown here where a change brought the control plane down.

Also, stale-if-error is a far safer pattern for service discovery than TTL'd DNS.


For me, this was stunning: “2:51pm to 3:20pm: The change was applied to all clusters”

How can such a large change not be staged in some manner or other? Feedback loops have a way of catching up later, which is why it's important to roll out gradually.


In the words of DevOps Borat...

"To make error is human. To propagate error to all server in automatic way is #devops."


So sad that the account isn’t posting anymore.


DNS makes for an atypically slow feedback loop. If you're not aware of it, then for an otherwise safe-looking change, you may test and complete the gradual roll-out before the failure hits you.


Seems like automated node access could also have been helpful here. Kill the offending pods directly on the nodes to relieve API server pressure long enough to roll back.


The only problem is that you have to find out where those pods are, and your primary source of that information is currently under DoS attack by said pods.


Not if you connect to all the nodes
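Roughly, assuming SSH access to the nodes and a containerd runtime (the node list and the pod-name filter are hypothetical):

  # On each node, stop the offending pod sandboxes without touching the apiserver.
  # The kubelet may recreate them, but it can buy enough breathing room to roll back.
  for node in $(cat nodes.txt); do
    ssh "$node" 'sudo crictl pods --name telemetry -q | xargs -r sudo crictl stopp'
  done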



