Quick takes on the recent OpenAI public incident write-up (surfingcomplexity.blog)
108 points by azhenley 15 days ago | 69 comments



> In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

Something that I wish all databases and API servers would do, and that few actually do in practice, is to allocate a certain amount of headroom (memory and CPU) to "break glass in case of emergency" sessions. Have a handler, woken periodically, that listens exclusively on a port used only for emergency instructions (with the same security measures as production, and visible only internally). Ensure that it can allocate against a preallocated block of memory, and allow it to schedule higher-priority threads. A small concession to make in the usual course of business, but when it's needed it's vital.
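As a rough sketch of what reserving that kind of headroom can look like at the OS level (assuming systemd on cgroup v2; the slice name and values here are invented, and memory.min protection also depends on how ancestor cgroups are configured):

  # Run a break-glass admin shell in its own cgroup with reserved memory
  # and a high CPU weight, so it stays responsive when the box is saturated
  sudo systemd-run --slice=emergency.slice -p MemoryMin=512M -p CPUWeight=1000 --pty bash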


I have a 1 GiB file to rm in case of emergency on every filesystem I manage - both personal and professional. I've only had to delete it maybe three or four times that I remember, but it's been a system-saver each time. I've long considered writing a process that just consumes some CPU, memory, and bandwidth as the same kind of emergency buffer. I've even considered having it examine `last` every second and free up those resources when I SSH in.
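If you want to set one up, something like this works (path and size are just examples):

  # Preallocate a 1 GiB ballast file; fallocate is near-instant on ext4/xfs,
  # with dd as a fallback for filesystems that don't support it
  sudo fallocate -l 1G /var/ballast || sudo dd if=/dev/zero of=/var/ballast bs=1M count=1024

  # In an emergency, reclaim the space immediately
  sudo rm /var/ballast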


It is also common for filesystems to reserve a small percentage for the root user. I think the ext4 default is still 5% (which can be quite a bit more than 1 GiB on modern drives!)
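You can check and tune that reserve per filesystem (device name is an example):

  # Show the current reserved block count on ext4
  sudo dumpe2fs -h /dev/sda1 | grep -i 'reserved block count'

  # Shrink the root reserve to 2% (tune2fs -m 0 removes it entirely)
  sudo tune2fs -m 2 /dev/sda1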


I haven't used root directly for over a decade. Modern usage is to log in as an unprivileged user, and to use sudo for all root operations.


Haha, I'm going to steal that!


What’s your point here? Using sudo is using root.


If you can't log in as user because resources are reserved for root... you can't sudo.


I think you missed the joke here


I don't think they were joking?

I personally tend to leave a root shell open (and both my feet remain largely hole-free to this day), but the advice is pretty common - it's meant to avoid accidentally typing something unfortunate like rm ./* into the wrong shell.


Yeah, no. If you're going to accidentally type `rm /`, you can just as easily accidentally type `sudo rm /`. It's the same security model, and honestly the distinction is quite funny.


No, there's a distinction if you look again. People don't accidentally type sudo in front of commands, but people do type things into the wrong window, or the wrong tab.


> People don't accidentally type sudo in front of commands

And you're basing this assumption off...?


Why, I am acutely tuned to the hivemind, of course.


There is no joke here.


I missed the joke too. Care to share?


sudo is equivalent to root


But it’s not. It’s a subset of what root can do.

The entire purpose of the /etc/sudoers file is to configure which users have access to sudo and which commands they can use.

Your top comment's parent didn't say the SSH login user had all sudo permissions. For best security, there should be many users, each with different limited permissions. Navigating the multiple `sudo su` hops is frustrating, but it has a purpose.


> The entire purpose of the /etc/sudoers file is to configure which users have access to sudo and which commands they can use.

In all of my career I have seen that at exactly one company. Everyone else just leaves it unrestricted. I would be impressed to see sudo used the way it was intended in more places. Some places even use passwordless sudo and SSH multiplexing, which together with simple phishing give unfettered and unlogged access to production.


Yup. 'tune2fs -m0' has saved my bacon more than once.


We used to refer to this as a "cork file" back in the day.


Resource exhaustion can be super frustrating, and always feels like a third world slum situation when it happens.

Like, why would operating systems allow themselves to run out of headroom entirely in this day and age?


> a third world slum situation when it happens

> why would operating systems allow themselves to run out of headroom entirely in this day and age?

Following the analogy, perhaps for the same reason the richest cities in the richest, most developed Western nations are rapidly beginning to look very much like third-world slums?

Over-optimization is the name of the game. All systems need some amount of slack in them to stay flexible, robust (and livable). Unfortunately, cutting into the slack is always profitable on the margin, so without top-down intervention, the slack will be cut until it's gone entirely. "Look, these machines are utilized to only 75% of capacity; adding this new system will only increase that by 5%." "Look, we can save $XX/month by cutting compute; the machines will still max out at 90%, so we'll still have a buffer." "Oh, this new telemetry service will bump that by only 1%."

"Look, there's so much free space here in between these blocks of flats; adding another block won't hurt."

And so on. Until your machines are running at 99% capacity and you risk a global outage every time someone sneezes near the server room. Until your city starts to look like London, and if you're from Central Europe like me, you may start to realize that being a few months or years behind on the most recent gadgets is a small price to pay in exchange for cities that are affordable and clean.


They usually don’t? If they did, they would have just crashed. If they are still running, then they allowed themselves headroom. The real question is who gets to use that headroom and how? What happens when the administrative processes use “too many” resources?

There are tons of tools built into modern OSes to manage prioritization of processes, etc. Whether it’s worth the tradeoff for you to deal with the operational overhead of harnessing those features is up to you.


Because it makes the computer slower, and people pay for compute.

I think consumer systems do it. macOS shows the "quit some app, you're out of RAM" dialog, but the system itself keeps working.

But if you are asking whether the OS knows this is a container process and this is a control-plane process and treats them differently - I think no one does that.

Imagine how frustrating it would be if your OS got it wrong and you ended up paying 10x because it nerfed your main process, thinking something more important was running.


> the OS knows this is a container process and this is a control-plane process and treats them differently - I think no one does that

cgroups on Linux do exactly this, and they are a standard part of ensuring that containers don't exceed their allocated resources.


Most software running on a Linux server is not running in a container, which itself has (minimal) resource overhead.

Unless you are talking about VMs in cloud computing, but even in those cases the VM is usually abstracted away and the end client only sees a VPS (e.g. with EC2 or GCE).


You don't need containers to use cgroups. For instance, systemd uses cgroups to control resource allocation to various services.
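Concretely (the unit name is a placeholder; this assumes the service is already running under systemd):

  # Cap a single service's CPU and memory through its cgroup - no container runtime involved
  sudo systemctl set-property --runtime some-agent.service CPUQuota=50% MemoryMax=2G

  # Watch per-unit resource usage across the cgroup tree
  systemd-cgtop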


OK, with some configuration a server OS can do it too. Looks like ClosedAI doesn't pay its sysadmins enough if they didn't configure such a standard thing.


I was replying to your comment where you suggested that OS-level resource reservation wasn't a thing that happened in production systems. The situation here is that a single process was being overloaded by one type of load, preventing it from doing the other things that process was responsible for. Making that work properly would require load-shedding/prioritization within the process itself.


> Kubernetes API servers saturated because they were receiving too much traffic. Once that happened, the API servers no longer functioned properly. As a consequence, their DNS-based service discovery mechanism ultimately failed

I am not sure where it says this is about "one process"?


> Imagine how frustrating it would be if your OS got it wrong and you ended up paying 10x because it nerfed your main process, thinking something more important was running

Imagine how frustrating it is when the Linux on your desktop suddenly decides to start randomly killing processes, or worse, attempts to swap some memory out, causing a feedback loop of delays that completely freezes the system until you power-cycle the machine.

That's one of the two things Windows always did better (the other thing is disabling write buffering for removable storage, on account of that storage being, well, removable).

Resource limits are not something you want to discover when you've exceeded them; they need to be managed and alerted about in advance. "Makes computer slower" and alerting the "people who pay for compute" is preferable to crashing - especially in distributed systems, where failures cascade (particularly when the whole system has been penny-pinched / overoptimized to the same degree as any single computer within it).


> That's one of the two things Windows always did better

Someone replied about cgroups on Linux and how it is bog standard stuff (that ClosedAI just didn't know how to use apparently?)


You always have a load balancer in front of your control plane (apiservers) if you have more than one; that's what I would use to break the glass. You need your engineers to know how to do that, though.

You don't even need to take everything down: pull one apiserver out of the pool for admin access (provided etcd is not overwhelmed too).


Yep - my understanding of https://github.com/kubernetes/kubeadm/blob/main/docs/ha-cons... is that Kubernetes doesn't usually control that load balancer (nor should it, since you could accidentally tell Kubernetes to take down the control plane LB, then not be able to get it back up again!).

I suppose one could set up a "fast pass lane" kubeconfig that adds a header that haproxy would understand, and route to a priority class in its queue with e.g. https://www.haproxy.com/documentation/haproxy-configuration-... . But there's no easy `kubectl --with-priority` (or, to my knowledge, good guidelines for the various gitops solutions) that follows this pattern out of the box.


I just connect to the backend server directly after taking that backend out of the load-balanced pool. Easier than going through the load-balancer in a special way.
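Roughly (addresses and the kubeconfig path are examples; you may hit a TLS SAN mismatch if the apiserver cert doesn't cover the node address):

  # Bypass the load balancer and talk to one apiserver directly
  kubectl --kubeconfig /etc/kubernetes/admin.conf \
          --server https://10.0.0.11:6443 \
          get pods -A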


As I understand it you can configure API rate limiting per user, but you'd need to work out "reasonable" values on your own to leave enough headroom for admin requests.

This would require per-cluster testing (and is complex to test since you need to induce representative load) so I suppose hardly anyone does it.


There's a whole set of features built[0] to prevent this exact scenario - runaway controllers consuming all API resources. Not sure why it didn't help here; maybe they are running an old release.

[0] - https://kubernetes.io/docs/concepts/cluster-administration/f...
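You can see what API Priority and Fairness is doing on a given cluster (assuming a version recent enough to have it enabled by default):

  # The flow schemas and priority levels that shape apiserver traffic
  kubectl get flowschemas
  kubectl get prioritylevelconfigurations

  # Rejection counts per priority level show whether load shedding is kicking in
  kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total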


Microsoft SQL Server has a dedicated admin connection (DAC) feature.
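For anyone unfamiliar: the server reserves resources for a single DAC session, so it tends to work even when the instance is otherwise unresponsive. Roughly (the server name is a placeholder; by default only local connections are allowed unless 'remote admin connections' is enabled):

  # Connect over the dedicated admin connection
  sqlcmd -S myserver -A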


In Zig you have to handle the case where malloc does not succeed.


   In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
The DNS song seems appropriate.

https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tun...


Something doesn't add up - CoreDNS's kubernetes plugin should be serving Service RRs from its internal cache even if the APIServer is down, because it's using cache.Indexer. The records would be stale, but unless their application pods all restarted (which they could not, since the APIServer was down) or all the CoreDNS pods got restarted (which, again, they could not), records merely expiring from the cache shouldn't have caused a full discovery outage.


Wouldn't it be that CoreDNS caches the information and records from the API server for X amount of time (it seems like this might be 20 minutes?), and then once the 20 minutes expired CoreDNS would query the API server, receive no response, and fail?

I think the idea of serving cached responses indefinitely when the API server is unreachable is what you're describing, but I'm not sure if that's the default (and it probably has other tradeoffs that I'm not sure about, too).


Based on my understanding of the plugin code, it is the default. The way cache.Indexer works is that it continuously streams resources from the APIServer using the Watch API and updates an internal map. I think if the Watch API is down it just sits there and doesn't purge anything, but I haven't tested that. The 20 min expiry is probably referring to the CoreDNS cache stanza, which is a separate plugin[0].

[0] - https://coredns.io/plugins/cache
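One way to check what a given cluster actually does is to look at the deployed Corefile (assuming the stock kube-system ConfigMap name used by kubeadm):

  # Dump the live Corefile and inspect the kubernetes and cache stanzas and their TTLs
  kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'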


I had the same thought! https://news.ycombinator.com/item?id=42446318

My guess is they were running CoreDNS on control plane nodes since that's the kubeadm default.


I caused an API server outage once with a monitoring tool, though in my case it was a monstrosity of a 20,000-line script. We quickly realized what we had done and turned it off, and I have seen in very large clusters with 1000+ nodes that you need to be especially sensitive about monitoring API server resource usage, depending on what precisely you are doing. Surprised they hadn't learned this lesson yet, given the likely scale of their workloads.
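A couple of quick checks before a monitoring change goes wide (the metric and endpoint names are from recent apiserver versions):

  # Apiserver health with per-check detail
  kubectl get --raw '/readyz?verbose' | tail

  # In-flight and total request counts the apiserver exposes
  kubectl get --raw /metrics | grep -E 'apiserver_current_inflight_requests|apiserver_request_total' | head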


> 20,000 line script

Dude.


They meant manuscript I assume.


Yeah, "script" is just an abbreviation of "manuscript", right? ;-)


Right, and a (manu)script of that size makes for a long scroll. Humanity had to more or less invent pagination to deal with this.

(I'll see myself out.)


Thanks dad. ;-)

Job well done.


This was most definitely not my choice or preference, but you know how it goes.


This quote cracked me up:

“I HAVE NO TOOLS BECAUSE I’VE DESTROYED MY TOOLS WITH MY TOOLS”


James Mickens is a comedic genius. The linked article always makes me laugh out loud.

https://www.usenix.org/system/files/1311_05-08_mickens.pdf


Excellent article, but the typesetting / justification in that pdf is horrendous.

The text columns look like the side of that hallway rubber mat that my dog keeps chewing on.

  spend a lot of time trying
  edge. However, as someone
  lieve that true progress is
  mes, and for the chickens
  y zombies, and the polite
  to eat your brain to acquire
  be prepared; thus, in the
  e scientific breakthroughs,
  ast inevitably becomes
  he main thing that I ponder is
  post-apocalyptic survival
  ag-tag group of associates.
  cruit: a locksmith (to open
  ith has run out of ideas);
  row snakes at my enemies
  g is a reasonable way to
  ble in my ultimate success



That was an amazing read. Thanks for linking it.


Wow, sounds like a nightmare. Operations staff definitely have real jobs.


Recent and related:

ChatGPT Down - https://news.ycombinator.com/item?id=42394391 - Dec 2024 (30 comments)


Splitting the control and data plane is a great way to improve resilience and prevent everything from being hard down. I wonder how it could be accomplished with service discovery / routing.

Maybe instead of relying on Kubernetes DNS for discovery it could be closer to something like Envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies the data plane still has access to the location information of other peer clusters.


> Maybe instead of relying on Kubernetes DNS for discovery it could be closer to something like Envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies the data plane still has access to the location information of other peer clusters.

That is generally how it is done... there's this never ending conflict between "push" architectures and "pull" architectures, and this scenario sure makes "push" seem better, and it is... until you're in one of those scenarios where "pull" is better. ;-)


I imagine a peer-to-peer scheme could also work. All nodes are both control plane and data plane, and advertise/broadcast to other nodes (similar to how network routing tables are distributed).

That'd be a pretty big architectural change, though


Surprised they don't have a slower rollout across multiple regions / Kubernetes clusters, given that the K8s APIs are a SPOF, as shown here where a change brought the control plane down.

Also, stale-if-error is a far safer pattern for service discovery than TTL'd DNS.


For me, this was stunning: “2:51pm to 3:20pm: The change was applied to all clusters”

How can such a large change not be staged in some manner or other? Feedback loops have a way of catching up later, which is why it's important to roll out gradually.


In the words of DevOps Borat...

"To make error is human. To propagate error to all server in automatic way is #devops."


So sad that the account isn’t posting anymore.


DNS makes for an atypically slow feedback loop. If you're not aware of it, then for an otherwise safe-looking change, you may test and complete the gradual roll-out before the failure hits you.


Seems like automated node access could also have been helpful here. Kill the offending pods directly on the nodes to relieve API server pressure long enough to roll back.


The only problem is that you have to find out where those pods are, and your primary source of that information is currently under DoS attack by said pods.


Not if you connect to all the nodes
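Roughly, assuming SSH access to the nodes and a containerd runtime (the node list and the pod-name filter are hypothetical):

  # On each node, stop the offending pod sandboxes without touching the apiserver.
  # The kubelet may recreate them, but it can buy enough breathing room to roll back.
  for node in $(cat nodes.txt); do
    ssh "$node" 'sudo crictl pods --name telemetry -q | xargs -r sudo crictl stopp'
  done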



