The server chose violence (cliffle.com)
306 points by lukastyrychtr 6 months ago | 62 comments



> ‘But REPLY_FAULT also provides a way to define and implement new kinds of errors — application-specific errors — such as access control rules. For instance, the Hubris IP stack assigns IP ports to tasks statically. If a task tries to mess with another task’s IP port, the IP stack faults them. This gets us the same sort of “fail fast” developer experience, with the smaller and simpler code that results from not handling “theoretical” errors that can’t occur in practice.‘

This sounds good when the system is small and tight and applications are written mostly by people who designed the whole system.

But as an application developer, I’d be somewhat scared to interface with third-party code over an IPC model where the other service can at any time send back an instant death pill to my process.

I guess I just don’t trust other app developers that much. The world is full of terrible drivers and background processes written by stressed-out developers harassed by management. They’ll drop in a bunch of potentially unsuitable default REPLY_FAULTs if it means they get to go home before 8pm.
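
For concreteness, here is a rough sketch of the kind of server-side policy the quoted passage describes: the IP stack checks whether the sending task owns the port it is touching and, if not, asks the kernel to fault the sender rather than returning an error code. This is plain Rust and all names are illustrative; it is not the real Hubris or Idol interface.

    type TaskId = u16;
    type Port = u16;

    // Port ownership fixed at build time, mirroring "ports are assigned statically".
    const PORT_OWNERS: &[(Port, TaskId)] = &[(1000, 3), (2000, 4)];

    enum Outcome {
        Reply(Vec<u8>),
        // Ask the kernel to fault the caller: the request is not a runtime
        // error but a protocol violation a correct client can never make.
        ReplyFault(&'static str),
    }

    fn handle_bind(sender: TaskId, port: Port) -> Outcome {
        match PORT_OWNERS.iter().find(|(p, _)| *p == port) {
            Some((_, owner)) if *owner == sender => Outcome::Reply(vec![]),
            _ => Outcome::ReplyFault("task touched a port it does not own"),
        }
    }

    fn main() {
        // Task 3 binding its own port is fine; task 4 poking port 1000 gets faulted.
        assert!(matches!(handle_bind(3, 1000), Outcome::Reply(_)));
        assert!(matches!(handle_bind(4, 1000), Outcome::ReplyFault(_)));
        println!("ok");
    }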


> This sounds good when the system is small and tight and applications are written mostly by people who designed the whole system.

I think that's intentional because that's what Hubris is aimed at.


...and in that circumstance, the author reports finding, apparently serendipitously, that it helped with development: "Initially I was concerned that I’d made the kernel too aggressive, but in practice, this has meant that errors are caught very early in development. A fault is hard to miss, and literally cannot be ignored the way an error code might be."


Indeed, this happened with Symbian. An IPC server could panic the client. As an application developer without access to the OS source code, this was pretty terrible. Not all preconditions were easily understood, and they could vary between devices and OS versions.


> This sounds good when the system is small and tight and applications are written mostly by people who designed the whole system.

Swift death to deviance is a way to keep the system tight. The designed scope probably keeps it small anyway. Scopes have a way of creeping, but I don't think people will want to force tasks into Hubris that would be better on the host rather than in its embedded controllers.


It seems like in an embedded environment, it's good to resolve these misunderstandings immediately when they occur, regardless of whose fault it is.

The server says "that client is bad!" so the kernel kills it. The problem is really that the two didn't understand each other.


> But as an application developer, I’d be somewhat scared to interface with third-party code over an IPC model where the other service can at any time send back an instant death pill to my process.

For service, think "OS interface". If you make a bogus kernel call on a monolithic kernel, it would be reasonable for the OS to kill you. Also note that when you say "process" it might be different from what you think, because threads all share the same address space on Hubris.


> not handling “theoretical” errors that can’t occur in practice

The Dennis Nedry approach to counting dinosaurs in Jurassic Park.


To be fair, this is how abort works in any library you call into that’s in your process.


Does REPLY_FAULT cascade? Meaning, if A is waiting in a SEND to B, and B is waiting in a SEND to C, and C does REPLY_FAULT, does A get killed along with B (and any further tasks that may be waiting on A)? Because if not, a malicious task could just delegate its experiments to a helper task. And if yes, that seems rather brittle overall (without having any further familiarity with Hubris). Furthermore, if SENDs can be circular/reciprocal, a task may also inadvertently kill itself that way — which (for scenarios like B -> A -> B) may incentivize not using REPLY_FAULT.


It seems that Hubris is not designed as a general-purpose operating system. Processes are defined at build time.

The reason why servers can shoot back at their clients is reliability, not security. Errors are thought to originate from bugs, not from deliberate attacks. The extreme reaction of the kernel ensures that developers find them as soon as possible.

Of course, there is an overlap with security, and this can be a useful fallback measure in the event that a process tries to do something that it isn't supposed to do.


> It seems that Hubris is not designed as a general-purpose operating system. Processes are defined at build time.

These are both correct.

Well, I mean, Hubris is general in the sense that, if you're doing an embedded system and you can deal with the constraints it has (like the latter), it can work for your projects. But it's not trying to be anything other than a good embedded OS, or to handle every kind of project.


I think when B gets faulted, A would get an error about a dead server and would have the opportunity to resend the same message to a newly reset server, not a cascading crash.
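
A rough sketch of that non-cascading behavior, with made-up names and types rather than the actual Hubris API: the blocked SEND returns a dead-peer indication carrying the restarted server's new generation, and the client simply re-aims and resends.

    #[derive(Clone, Copy, Debug)]
    struct TaskId {
        index: u16,
        generation: u8,
    }

    enum SendResult {
        Ok(u32),
        // The peer died (e.g. it was REPLY_FAULTed) and now runs as this generation.
        DeadPeer { new_generation: u8 },
    }

    // Stand-in for a blocking IPC send; pretends the server was faulted once.
    fn send(_to: TaskId, attempt: u32) -> SendResult {
        if attempt == 0 {
            SendResult::DeadPeer { new_generation: 1 }
        } else {
            SendResult::Ok(42)
        }
    }

    fn main() {
        let mut server = TaskId { index: 5, generation: 0 };
        let mut attempt = 0;
        let reply = loop {
            match send(server, attempt) {
                SendResult::Ok(v) => break v,
                SendResult::DeadPeer { new_generation } => {
                    // Re-aim at the restarted server and resend the same message.
                    server.generation = new_generation;
                    attempt += 1;
                }
            }
        };
        println!("reply from {server:?}: {reply}");
    }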


Hubris and Humility (its debugger) are two pieces of tech I would love to be deeply engrossed in if I had the time (or the mandate). But alas, that is not possible.


It’s interesting how in a system where one team writes all the code, nuking your clients from orbit when they look at you funny can improve iteration speed.

It’s funny to wake up and read this after falling asleep reading about algebraic effects.

If you squint the right way, this is a kernel that lets a server perform an effect that the client cannot handle.

I feel like this would make code reuse and composition much harder, but provides a much simpler execution model. Definitely the right trade off in a static embedded system. You can always just vendor and modify a task if you need to reuse it.


I don't think this will make reuse much worse even in general programs, as long as there is a good division between expected errors (file not found) and unexpected ones (invalid operation code). In fact, there are a lot of ignorable errors in Unix which IMHO should have been raising a fatal signal instead, as this would substantially improve general software quality.

As an example: trying to close() an invalid FD is a non-fatal error which is very often ignored. But it is actually super dangerous, especially in multi-threaded apps: closing the wrong fd will harmlessly fail most of the time, but 1% of the time you'll close a logging socket or a database lock file or some unrelated IPC connection. That's how you get unreliable software that everyone hates.
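
A minimal illustration of that hazard, compressed into a single thread (plain Rust plus the libc crate; the files are arbitrary stand-ins):

    use std::fs::File;
    use std::os::unix::io::AsRawFd;

    fn main() {
        let log = File::open("/dev/null").expect("open");
        let stale_fd = log.as_raw_fd();
        drop(log); // fd is closed; the *number* is now free for reuse

        // Meanwhile "another thread" opens something important and will usually
        // be handed the very same number back.
        let db_lock = File::open("/dev/zero").expect("open");
        println!("stale fd = {stale_fd}, db_lock fd = {}", db_lock.as_raw_fd());

        // Buggy cleanup code still holding the stale number:
        let rc = unsafe { libc::close(stale_fd) };
        // If the number was reused (the usual case), rc == 0 and we just closed
        // db_lock's descriptor out from under it. If it was not reused, we get a
        // "harmless" EBADF, which is exactly the ignorable error argued about above.
        println!("close({stale_fd}) returned {rc}");
    }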


I agree with you in general.

However, in your example it’s the kernel that is deciding the request (message) is bad. In Hubris it is the message receiver.

This is a bit contrived, but imagine you’re receiving some stringly typed data from an external source and sending a message to a parsing task that either throws or messages you back with a list of some type t. Maybe it is returning ints and you as the client know that if something isn’t parsable as an int you want it to treat it as a ‘0’ because you’re summing the list. Somewhere else you want to call the same task, but you want strings that can’t be parsed to be treated as ‘1’ unless they can’t be parsed due to overflow (in which case you rethrow) because you’re taking the product.

In some situations it’s natural for the client to know more than the server about how to handle errors. With this nuke from orbit model, there’s some forced coupling between the client and server (mutual agreement over what causes a REPLY_FAULT).
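
A small illustration of that coupling in plain Rust (all names are invented for the example): if the parsing service reports an error value instead of faulting the caller, each client can apply its own policy.

    #[derive(Debug)]
    enum ParseError {
        NotANumber,
        Overflow,
    }

    // Stand-in for the parsing task's reply.
    fn parse(s: &str) -> Result<i64, ParseError> {
        s.parse::<i64>().map_err(|e| match e.kind() {
            std::num::IntErrorKind::PosOverflow | std::num::IntErrorKind::NegOverflow => {
                ParseError::Overflow
            }
            _ => ParseError::NotANumber,
        })
    }

    fn main() {
        let input = ["3", "oops", "5"];

        // Client 1 is summing, so unparsable entries become 0.
        let sum: i64 = input.iter().map(|s| parse(s).unwrap_or(0)).sum();

        // Client 2 is taking a product, so unparsable entries become 1,
        // unless the failure was an overflow, which it treats as fatal.
        let product: i64 = input
            .iter()
            .map(|s| match parse(s) {
                Ok(n) => n,
                Err(ParseError::NotANumber) => 1,
                Err(ParseError::Overflow) => panic!("overflow"),
            })
            .product();

        println!("sum = {sum}, product = {product}");
    }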


Reminded me of the line from Errand of Mercy, "You will find there are many rules and regulations. They will be posted. Violation of the smallest of them will be punished by death."


OK, we need to get this as an April Fools RFC for HTTP.

I propose HTTP 499 “Shame on you.” A client receiving 499 (perhaps only for a request it originated with a specific header like “Strict: true”) must terminate, in a language-dependent manner, the task which issued the request.

It perfectly balances the “WTF... But actually, hey” that one sees in those contexts.


Very enjoyable read, and this single supervisor is similar to how I set up an application at a previous startup, where we unwrapped everything. This reminds me of one of my favorite posts https://medium.com/@mattklein123/crash-early-and-crash-often...


I’m wondering if this really is too aggressive.

On Linux, sure it’s not possible to directly crash another program you’re talking to via a socket alone (ignoring bad data on the socket).

But you can absolutely kill them. Anything running as root can kill anything else. Can even reboot and bring down the whole system.

Maybe a bit harder and a bit more unusual, but at least for containers, root privileges are common. And yeah, sure, there’s a cgroup there and you’re more limited. But you get the idea.

It’s also a bit different from the (conventional?) wisdom about being “liberal in what you accept, conservative in what you emit”, though that’s a bit more tied to networked systems.

Though maybe it’s inevitable that a system has to be liberal in what it accepts.

How else can you change the api slightly without breaking existing programs?


Hubris isn't a general-purpose OS, it runs on a low-level processor inside the Oxide server rack. I believe Hubris doesn't even allow new kinds of processes at runtime; all possible executables must be determined at compile time.


> I believe Hubris doesn't even allow new kinds of processes at runtime; all possible executables must be determined at compile time.

Correct. From [0]:

"Hubris is an aggressively static system. The configuration file defines the full set of tasks that may ever be running in the application. These tasks are assigned to sections of address space by the build system, and they will forever occupy those sections.

Hubris has no operations for creating or destroying tasks at runtime. Task resource requirements are determined during the build and are fixed once deployed. This takes the kernel out of the resource allocation business. Memories are the most visible resources we handle this way, but it applies to all allocatable or routable resources, including hardware interrupts and memory-mapped registers – all are explicitly wired up at compile time and cannot be changed at runtime."

[0] https://cliffle.com/blog/on-hubris-and-humility/
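
Purely to illustrate what "aggressively static" looks like (this is not the actual Hubris build system or its data structures, and the addresses are invented), the idea is that the complete task table, with memory regions and priorities, is a compile-time constant, so the kernel never creates, destroys, or resizes anything:

    #[derive(Debug)]
    struct TaskDesc {
        name: &'static str,
        priority: u8,
        flash: core::ops::Range<u32>, // code region, assigned by the build
        ram: core::ops::Range<u32>,   // data/stack region, assigned by the build
    }

    // The full set of tasks that may ever run, fixed for the life of the image.
    static TASKS: &[TaskDesc] = &[
        TaskDesc { name: "supervisor", priority: 0, flash: 0x0800_4000..0x0800_8000, ram: 0x2000_0000..0x2000_1000 },
        TaskDesc { name: "net",        priority: 2, flash: 0x0800_8000..0x0801_0000, ram: 0x2000_1000..0x2000_3000 },
        TaskDesc { name: "sensor",     priority: 3, flash: 0x0801_0000..0x0801_4000, ram: 0x2000_3000..0x2000_4000 },
    ];

    fn main() {
        for t in TASKS {
            println!("{t:?}");
        }
    }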


> I believe Hubris doesn't even allow new kinds of processes at runtime; all possible executables must be determined at compile time.

This is correct, yes.


"Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away."


> There is no way to “fix” the problem and resume the task. This was a conscious choice to avoid some subtle failure modes and simplify reasoning about the system.

One of Einstein's famous quotes is, "...as simple as possible, but no simpler." I'm pretty sure this design violates the latter portion. I'm not interested in operating environments that can tolerate no real-world chaos, and I'm not aware of any commercially viable realms which would either. What -- push it back to the init system to keep trying again? But by what mechanism would that strategy be able to understand the fault that occurred, in order to try again better?

Anyway, kudos for purity of conviction (I guess).


Hubris is not an academic exercise: it runs at the heart of every element of the Oxide rack (compute sled, switch, power shelf controller) -- and its design is informed by delivered utility above all else. Indeed -- and as Cliff elaborated in the blog -- REPLY_FAULT was something that he thought initially perhaps too aggressive, but it was our own experience in building, deploying, and (it must be said!) debugging the system that gave him the confidence that it would make our systems more robust, not capriciously faulty.

For more details on the thinking here and what it looks like in practice, see (e.g.) [0] and [1].

[0] https://www.mattkeeter.com/blog/2024-03-25-packing/

[1] https://cliffle.com/blog/who-killed-the-network-switch/


> that can tolerate no real-world chaos, and I'm not aware of any commercially viable realms which would either.

Watchdog timers will happily kill/restart your processes that don't poke them often enough. Even in my hobby exercises I've seen I2C busses hang up often enough (and bring the whole system down!) when some protocol bit goes wrong that I think the design is actually quite inspired. As I understand it this isn't talking about known error cases (that are handled) but protocol mismatches and other things that shouldn't ever happen.

Many other comments touched on it, but it's a purpose-built OS. Much in the same way I'm not going to build a UI in Erlang, Hubris seems well positioned for the space that it occupies.
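
For comparison, a toy software watchdog in plain Rust, in the spirit of the hardware ones mentioned above: if the main loop stops petting it, the whole process gets killed. Real MCUs do this with a dedicated peripheral; this is only a sketch.

    use std::process;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    fn main() {
        let petted = Arc::new(AtomicBool::new(true));

        let flag = Arc::clone(&petted);
        thread::spawn(move || loop {
            thread::sleep(Duration::from_millis(500));
            // If nobody pet the dog since the last check, assume the system hung.
            if !flag.swap(false, Ordering::SeqCst) {
                eprintln!("watchdog: no pet, resetting");
                process::abort();
            }
        });

        for i in 0..10 {
            // ... real work; a hung I2C transaction here would stop the petting ...
            petted.store(true, Ordering::SeqCst);
            thread::sleep(Duration::from_millis(100));
            println!("tick {i}");
        }
        // Once the loop ends we stop petting, so the watchdog fires shortly after.
        thread::sleep(Duration::from_secs(2));
    }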


> But by what mechanism would that strategy be able to understand the fault that occurred, in order to try again better?

I think the general idea is to apply this to problems which are clearly the result of an invalid program state, and therefore not reasonably recoverable. They are either caused by bugs, an attack, or corrupted hardware. In all cases you shouldn't continue, because there's something seriously wrong with the caller. If the caller continues, it could only cause more damage.

It sounds a bit like Erlang/OTP's "let it crash" philosophy. Erlang is used in quite a lot of mission-critical hardware and is famous for its reliability, so it might not be such a huge dealbreaker in practice.


> It sounds a bit like Erlang/OTP's "let it crash" philosophy.

Which was based partly on ideas from Tandem Computers' NonStop / Guardian. Hardware and software were fail-fast, i.e. they would work correctly or stop, so they couldn't corrupt data. If there was a problem, the whole processor / process would be stopped and a backup would take over, which seems somewhat similar to the "supervisor" tasks in Hubris.

Quite different use cases, though: an embedded OS for microcontrollers vs. large OLTP applications. They both could be considered "mission critical", at least for the people who own/make money with them.


From a "system engineering" (not to be confused with software engineering) perspective they seem quite similar; in my view even something like a watchdog timer (which just about every CPU/core has these days) is just a hardware version of similar philosophies. This[1] is one of my favorite overviews of Erlang and what drives some of those design decisions. You can absolutely apply the same systematic thinking to other domains/places without having to bring OTP or even Erlang into the conversation.

[1] https://ferd.ca/the-zen-of-erlang.html


It’s a 2000-line Rust embedded systems kernel that doesn’t support adding new tasks at runtime. It is written to go deep in the guts of the Oxide server racks.


> Since attempts at exploitation often manifest first as errors or misuse of APIs, a system that responds to any misbehavior by wiping the state of the misbehaving component ought to be harder to exploit.

In this case your application is one that is a little more rigorous in checking what it accepts. So it has a security benefit, but not the kind you think it does: an attacker is not set back because you destroyed their progress; it’s that certain invalid states that could previously be chained into more desirable invalid states no longer work. So an attacker will look elsewhere instead of trying to do that.


I find Humility is a great name for a debugger. Many are the programmers that refuse to use debuggers and just stare the code down until it yields errors, under the assumption that "good" code doesn't need debugging!


I find more bugs with a debugger. There’s typically the bug I was looking for, and then smaller bugs that didn’t technically cause the problem but contributed, and may be involved in the next issue. I want to fix those too, and sometimes first.


I find this attitude bizarre. Just earlier today I used the Python debugger to quickly figure out why an error was occurring. Being able to see the state of the variables without having to print each one helped solve it instantly.


It's partly a religious thing, partly what you're used to and partly using the right tool for the job. Some programmers use debuggers as a crutch and some complex systems (e.g. that involve multiple distributed components or are timing dependent) can't be easily debugged using traditional debuggers.

EDIT: yet another factor is sometimes you may not even have access to the system you need to troubleshoot. Being able to reason about code execution without observing it is a useful skill (and still a debugger is a useful tool).


I recall server ABENDs in Novell NetWare. I think it was the OG of server violence.


Title sounds like it concerns a really fed-up waiter.


In a sense, it does: waiting is one of the main jobs of an OS kernel.


It sounds like this may be similar to using signals for error handling in a Unix system?


In some sense, yes, this is kind of like the kernel sending SIGKILL to a process.


Which some kernels do, actually. Not Linux but the ones that think you’re messing with things you shouldn’t be will SIGKILL you at the earliest opportunity.


Linux will sometimes, e.g., if a process violates seccomp.


Ah yes good point


I wonder if they’re going to find this creates security issues.

Processes keep state to analyze abuse of various kinds, and killing a process presumably wipes its memory. Unless there’s some way to retain state across restarts?


Yes, we have an in situ dump facility, which Cliff mentioned at the end of [0]; it's been essential for debugging these issues when we hit them.

[0] https://cliffle.com/blog/who-killed-the-network-switch/



I’m really enjoying his posts on this


That's QNX-type interprocess communication. QNX doesn't offer interprocess kill, though.


The designer of Hubris (and several folks who work on it) are familiar with QNX, for sure.


> Take Unix for example. If you call close on a file descriptor you never opened, you get an error code back. If you call open and hand it a null pointer instead of a pathname? You get an error code back. Both of these are violations of a system call’s preconditions, and both are handled through the same error mechanism that handles “file not found” and other cases that can happen in a correct program.

> On Hubris, if you break a system call’s preconditions, your task is immediately destroyed with no opportunity to do anything else.

Oh, yeah. I've long thought EBADF and EINVALs (and EFAULT, I guess) should basically always be fatal.
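
One way to approximate that in ordinary Unix code is to treat those errno values as precondition violations and crash on them instead of returning them. A sketch using the libc crate, not a complete policy:

    use std::io::Error;
    use std::os::unix::io::RawFd;

    fn close_or_die(fd: RawFd) {
        let rc = unsafe { libc::close(fd) };
        if rc != 0 {
            let err = Error::last_os_error();
            match err.raw_os_error() {
                // Closing an fd we never owned is a bug, not a runtime condition:
                // crash now rather than silently hit someone else's fd later.
                Some(libc::EBADF) | Some(libc::EINVAL) => {
                    panic!("close({fd}) precondition violated: {err}")
                }
                // EINTR and friends are "real" runtime outcomes; handle or log them.
                _ => eprintln!("close({fd}) failed: {err}"),
            }
        }
    }

    fn main() {
        close_or_die(999); // never opened: panics, which is the point
    }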


> The Hubris IPC scheme is deliberately designed to work a lot like a function call, at least from the perspective of the client.

That's a bona fide remote procedure call, isn't it?
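
Roughly what "works a lot like a function call" means on the client side: a stub packs the arguments, blocks in a send, and unpacks the reply, so the caller just sees a function. The ipc_send below is invented for illustration; the real Hubris syscall and its generated stubs differ.

    // Hypothetical blocking send: returns a response code and fills `reply`.
    fn ipc_send(_task: u32, _op: u16, _msg: &[u8], reply: &mut [u8]) -> u32 {
        reply[..4].copy_from_slice(&7u32.to_le_bytes()); // pretend the server answered 7
        0
    }

    const TEMP_SENSOR: u32 = 3; // task id, fixed at build time
    const OP_READ_MILLIDEGREES: u16 = 1;

    // The client just calls a function; the IPC is hidden inside.
    fn read_temperature() -> Result<u32, u32> {
        let mut reply = [0u8; 4];
        match ipc_send(TEMP_SENSOR, OP_READ_MILLIDEGREES, &[], &mut reply) {
            0 => Ok(u32::from_le_bytes(reply)),
            rc => Err(rc), // or, on Hubris, a bad request may simply fault the caller
        }
    }

    fn main() {
        println!("temp = {:?}", read_temperature());
    }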


In a sense, though most would think of "remote" as being "over the network," and that's not the case here.


I would advise the author to read up on "asynchronous exceptions" and check out how many systems have had them at some point and removed them.

I'm not saying that's because they're fundamentally impossible, but because they have a track record of tripping up language designers, and it's good to cross-check against those experiences.

Recommended languages are Java (ultimately a failure despite vast effort), and Haskell and Erlang, where they work, but a lot of work of very different kinds was put in to make them work. I definitely get Erlang vibes from this piece, so it's possible the preconditions for correct asynchronous exceptions are met or can be met here. But they are very subtle and have a tempestuous history of working 99.9% of the time while being literally impossible to get to 100%. This could be a big, big, big trap.


I am not familiar with what you're talking about (though Cliff may already know); I'll have to look into it. But Hubris is a synchronous system, and also, these faults aren't catchable, so I'm not sure how directly relevant it is. What's the specific issue you're worried about?

Your Erlang vibes are there for good reason, it's certainly an influence.


Reaching out and nuking other... whatever you call them ("execution contexts" is what I go for, to be maximally generic: thread/async task/continuation/generator/etc.) can particularly cause problems if the context was going to do X, Y, and Z and expected to be guaranteed to run Z, but Y killed it. The standard example is for X to be taking a lock and Z to release it, but there are a lot of ways to get into trouble, and the obvious first solutions don't work.

Erlang solves it by putting the things that can have that problem behind other execution contexts that don't get killed when the main one dies, so they can still clean up. ("Ports", in their terminology.) Haskell solves it by being a functional language and by the community beating its collective head against it for several years. (Immutability helped a lot; laziness took it out back.)

If that sounds impossible... hey, great! Then I just pattern matched on something that wasn't a match. If that doesn't sound impossible, then it may be worth a look around.

Synchronousness may not really matter, I've kind of thought that "asynchronous exception" is not a good name for the issue for a while, but it's what it gets called. It's really about one execution context lobbing errors/exceptions into others. Although being synchronous would avoid the worst timing issues.


Ah, that problem in general I am familiar with, yes.

Tasks in Hubris are independently compiled programs, not threads in a shared context. So I don't believe that it's an issue. You don't share locks between tasks; you create a task that holds the shared resource and have the two tasks that want to share it talk to that task. Patterns like that.
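
A sketch of that pattern using std threads and channels in place of Hubris tasks and IPC (illustrative only): the resource lives in a single owner task and everyone else talks to it by message passing, so a dying client never leaves a lock stranded, because there is no lock.

    use std::sync::mpsc;
    use std::thread;

    enum Request {
        Add(u64),
        Get(mpsc::Sender<u64>),
    }

    fn main() {
        let (tx, rx) = mpsc::channel::<Request>();

        // The resource-owner "task": the counter lives here and nowhere else.
        let owner = thread::spawn(move || {
            let mut counter = 0u64;
            for req in rx {
                match req {
                    Request::Add(n) => counter += n,
                    Request::Get(reply) => {
                        let _ = reply.send(counter); // the client may be gone; that's fine
                    }
                }
            }
        });

        // Two "client tasks" sharing the counter purely by message passing.
        for _ in 0..2 {
            let tx = tx.clone();
            thread::spawn(move || tx.send(Request::Add(21)).unwrap())
                .join()
                .unwrap();
        }

        let (reply_tx, reply_rx) = mpsc::channel();
        tx.send(Request::Get(reply_tx)).unwrap();
        println!("counter = {}", reply_rx.recv().unwrap());

        drop(tx); // closing the channel lets the owner exit
        owner.join().unwrap();
    }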


>Recommended languages are Java (ultimately a failure despite vast effort)

Why is Java a failure? Recent JVMs have come a long way, and GraalVM makes it somewhat comparable to Go-like languages.

I understand the historical hate and how Oracle bought it, but it really isn’t that bad of a language if you’re using modern Java.


They're referring to that specific feature of Java being a failure, not the language in general.


The parent was talking about async exceptions in Java -- like InterruptedException. They are hard to work with or reason about.


Huh? I don't see anything asynchronous in the author's work. It's all synchronous, because IPCs in Hubris are synchronous too.



