There are so many things in here that tempt me to comment about, so here goes:
1) For me, this is a prime example of why I personally like programming environments with exceptions. If libnih could throw an exception (I know it can't), then they could do that which would allow the caller to at least deal with the exception and not bring the system down. If they don't handle the exception, well, we're were we are today, but as it stands now, fixing this will require somebody to actually patch libnih.
Yes. libnih could also handle that error by returning an error code itself, but the library developers clearly didn't want to bother with that in other callers of the affected function.
By using exceptions, for the same amount of work it took to add the assertion they could also have at least provided the option for the machine to not go down.
Also, I do understand the reservations against exceptions, but stuff like this is what makes me personally prefer having exceptions over not having them.
2) I read some passive aggressive "this is what happens if your init system is too complicated" assertions between the lines.
Being able to quickly change something in /etc/init and then have the system react to that is actually very convenient (and a must if you are pid 1 and don't want to force restarts on users).
Yes, the system is not prepared to handle 10Ks of init scripts changing, but if you're root (which you have to be to trigger this), there are way more convenient (and quicker!) ways to bring down a machine (shutdown -h being one of them).
Just removing a convenient feature because of some risk that the feature could possibly be abused by an admin IMHO isn't the right thing to do.
3) I agree with not accepting the patch. You don't (ever! ever!) fix a problem by ignoring it somewhere down the stack. You also don't call exit() or an equivalent in a library either of course :-).
The correct fix would be to remove the assertion, to return an error code and to fix all call sites (good luck with the public API that you've just now changed).
Or to throw an exception which brings us back to point 1.
I'm not complaining, btw: Stuff has happened (great analysis of the issue, btw. Much appreciated as it allowed me to completely understand the issue and be able to write this comment without having to do the analysis myself). I see why and I also understand that fixing it isn't that easy. Software is complicated.
The one thing that I'm heavily disagreeing though is above point 2). Being able to just edit a file is way more convenient than also having to restart some daemon (especially if that has PID 1). The only fix from upstarts perspective would be to forego the usage of libnih (where the bug lives), but that would mean a lot of additional maintenance work in order to protect against a totally theoretical issue as this bug requires root rights to use.
The library noticed the error and was (effectively) calling exit().
If the library was at liberty to throw, they would not have to change any internal callers and would get the same feature for them (not having to deal with errors anywhere) but they would still give the external caller a chance not to terminate.
Yes. By default, a caller would terminate, but it would get a chance not to in a way that doesn't involve changing the libraries public interface.
Exceptions are not a silver bullet. One can't simply turn an assert into an exception and expect things to work. That's a great way to add new and tricky bugs unrelated to the original issue.
If an exception crosses a stack frame that isn't expecting one, things could be left in an inconsistent state. Invariants may vary, constraints may be unconstrained.
To retrofit exceptions into the existing code at this point would be very difficult, as you would have to audit every caller of the function in question, a nd the callersof those, and on and on...
Imagine problems like Apple's "goto fail" SSL bug, but more obscure and harder to find.
To have used exceptions from the beginning would require an amount of work roughly equivalent to returning error codes. They aren't fundamentally different from error codes in that way. Well, unless you're a cowboy who ignores errors.
There are a lot of things that aren't silver bullets. Oddly though, they are still helpful.
> One can't simply turn an assert into an exception and expect things to work.
Actually, if you did that, you'd expect them to fail... but with better semantics.
> That's a great way to add new and tricky bugs unrelated to the original issue.
Only if your runtime doesn't allow for the clean expression of stack unwinding semantics.
> To retrofit exceptions into the existing code at this point would be very difficult.
Yes, I don't think that was at all related to the original point though. It'd be hard to retrofit almost any new language feature in to the existing code.
> To have used exceptions from the beginning would require an amount of work roughly equivalent to returning error codes. They aren't fundamentally different from error codes in that way. Well, unless you're a cowboy who ignores errors.
Not at all. Error codes require error handling in every caller up the call chain. Exceptions at least allow you to put your error handling logic just where you have handlers.
> Only if your runtime doesn't allow for the clean expression of stack unwinding semantics.
The point here is that if a library has already chosen to not cleanly unwind anything, your application catching the exception isn't going to help the garbled state created by the library.
I think cbsmith's point is that a a good language makes it so easy to correctly unwind the stack that it's almost harder to not do it right. If we could go back in time and do it over using exceptions and a better language, this likely would have been an exception from the start, which would have made this bug easier to fix. For example, Python's "assert" raises instead of exiting the process. The caller could just catch the exception from the failed assertion and deal with it, without need to update any other code. The problem is, we can't easily change languages now, nor can we safely retrofit exceptions. Hindsight is 20/20.
It's worth noting that that argument depends on the exception-raising assertion having been present from the beginning. The correctness of stack unwinding code can sometimes depend of which exceptions a function raises. A particular piece of code that is currently correct may break if one of the functions it calls is changed to raise a new type of exception.
[optiplex /home/chris/tmp/atexit_main]
$ ./testme
This is main. I will sleep a while, then try to exit.
PANIC! main() has been restarted due to a unscheduled exit!
This is main. I will sleep a while, then try to exit.
PANIC! main() has been restarted due to a unscheduled exit!
This is main. I will sleep a while, then try to exit.
PANIC! main() has been restarted due to a unscheduled exit!
This is main. I will sleep a while, then try to exit.
1) Exceptions are not a solution and are in fact no different than error codes in this case. If your init system throws an exception, what does that mean? It means your computer goes down.
The actual solution here is Maybe / Optionals (depending on your language of choice) as Haskell and Rust and Scala have. These would require the caller to unwrap the result, and thus make this bug actually impossible. The "returns -1" really is the equivalent of a null pointer exception which is an issue that is solved by Options not by Exceptions. Exceptions are definitely not a solution in any way.
2) I didn't see the "too complicated" between the lines bit. However, you're wrong about having to be root to trigger this. The watch, if you missed it, was on all of /etc. There are many folders in /etc which are not root owned. Furthermore, just because you have to be root to cause unexpected behavior means little in this case. Sure, this won't likely be exploited maliciously to take down a box and if that were the fear it wouldn't matter. That wasn't the concern though. The concern is this happening by accident. Even if a program is running as root, it shouldn't be able to bring down the box purely by accident.
3) A patch that prevents an init system from crashing at the expense of missing a file change (which was under /etc and quite possibly not even relevant to the init system) isn't that bad. Take it and add an issue to improve it (e.g. by directory walking as suggested). Seriously, your init system crashing because a few programs decide to write a few hundred default config files to /etc, is dumb.
For an easy example: using ansible-galaxy to fetch a role/playbook, by default (at least in the past) installed in /etc/ansible/roles/$ROLENAME. These roles can be on the order of several thousand files. The init system need not care about this at all.
The most important point here, though, is that Exceptions are absolutely not the solution here and, in fact, we've known that exceptions are fundamentally wrong for the last 40 years and yet people still learn them as part of Java and think they're good design.
> Being able to quickly change something in /etc/init and then have the system react to that is actually very convenient (and a must if you are pid 1 and don't want to force restarts on users).
I'm not sure I see how it's a must. Why not just have init respond to SIGHUP like so many daemons and reload its configuration then?
And that points to a fairly simple defensive practice to avoid this type of problem:
Run it in a different process. Have said process signal if a change is found. If they want to handle the case where /etc/init.d is potentially huge and you don't want to rescan everything, have it write changes via a pipe.
(In fact there's a program that will do that for you: inotifywait)
Agreed, it's pretty standard behaviour to signal daemons to cause them to reload config. SIGHUP'ing it is how you get sysvinit to reload its config too.
I see why and I also understand that fixing it isn't that easy. Software is complicated.
The flip side of this is that perhaps we should aim to reduce the complexity of software. In this case, it's not like libnih is providing any significant increase in functionality over just using inotify anyway.
1. Exceptions are basically 'goto' with a pretty wrapper. Over time, I've come to believe that the best place to handle an error is close to where it occurs, rather than throwing some hail mary message up the stack and hope someone else can deal with it.
2. Well, it certainly is true that more code and more functionality is likely to contain more bugs. Whether your pid 1 needs to be doing all the things that upstart does is obviously a contentious issue.
3. This is a toss-up. I usually like 'crash loudly' over 'fail silently', but I think rejecting the patch outright was a poor decision.
> Famously, "for loops" are basically 'goto' with a pretty wrapper.
Trivial response: for loops don't potentially involve jumping into code in a completely different file, which could have been written by someone else.
More substantive response: for loops don't imply a specific philosophy about error handling.
That said, I do like exceptions, as I think they encourage a healthy ignorance in most code, as long as the code which can handle the exception does handle it. But that's just a "don't write bad code" admonition, and you can abuse any philosophy to write bad code.
1. It is a different mindset. Handling each error code until your program is correct top to bottom makes for a much more stable existence, but it's harder when you rely on something that hasn't handled an error code. With exceptions it is easier to work with shitty middleware. Also, in environments where exceptions are idiomatic, there are ways of cleaning up aborted code and safely entering exception handlers. Far more straightforward to reason about than `GOTO 34550`.
I like Go's approach: no asserts, and only use exceptions for really exceptional stuff. (Go calls exceptions panic/recover, and tweaks the recover syntax in such a way that it's much less tempting to use it for normal error handling.)
From an outsiders perspective, with zero-knowledge, it seems like libnih should be propagating back the buffer-full error.
I agree that silently failing to reload config files in only certain circumstances could be just as bad or worse behavior than just crashing, but both behaviors seem really really bad in core system software, n'est pas?
So assert is an interesting combination of several things:
- A check that can be disabled at compile time
- A check that blows up the program if it fails
- A check that has a really convenient syntax
The "disabled at compile time" bit is pretty much insane (at least in a language with side-effects) -- you'll inevitably end up accidentally having some sort of side-effect in your asserts, and if you turn them off for your production build, you're basically inserting a whole brand-new untested configuration, running in a build that is particularly hard to debug. But probably most people just leave asserts "on" all the time. (Even if you don't have side-effects in your asserts, you're still enabling untested run-paths, which is going to give weird, hard to debug error reports.)
The second thing is blowing up the program. Sometimes, this really is all you can do, and every language has some way for this to happen -- take the head of an empty list, access beyond an array bounds, whatever.
The problem is combining it with the really convenient syntax -- it's just very tempting to assert conditions which you should actually be handling more robustly, because assert is the easiest thing to write.
I'm doing a mix of Go server-side dev and Obj-C client side dev at the moment, and I do use asserts in Obj-C land.
I try to only use them only for truly "programmer screwed up" conditions, which should all be caught during development. But we still get crashes now and then from real users from asserts that shouldn't have been asserts; we fall into temptation.
So I suspect that actually we might be better of in Obj-C land, too, avoiding asserts.
(I also like the invariant-style thinking and documentation quality of asserts, though.)
Seems to me a better approach might be to catch the error, and then retry the operation until it succeeds, maybe crashing out if no process is made after enough time (the reader died).
This will cause the library to stall, which might be unacceptable to some people, but if simply noting the error and dropping it is unacceptable then you really don't have a choice.
1) For me, this is a prime example of why I personally like programming environments with exceptions. If libnih could throw an exception (I know it can't), then they could do that which would allow the caller to at least deal with the exception and not bring the system down. If they don't handle the exception, well, we're were we are today, but as it stands now, fixing this will require somebody to actually patch libnih.
Yes. libnih could also handle that error by returning an error code itself, but the library developers clearly didn't want to bother with that in other callers of the affected function.
By using exceptions, for the same amount of work it took to add the assertion they could also have at least provided the option for the machine to not go down.
Also, I do understand the reservations against exceptions, but stuff like this is what makes me personally prefer having exceptions over not having them.
2) I read some passive aggressive "this is what happens if your init system is too complicated" assertions between the lines.
Being able to quickly change something in /etc/init and then have the system react to that is actually very convenient (and a must if you are pid 1 and don't want to force restarts on users).
Yes, the system is not prepared to handle 10Ks of init scripts changing, but if you're root (which you have to be to trigger this), there are way more convenient (and quicker!) ways to bring down a machine (shutdown -h being one of them).
Just removing a convenient feature because of some risk that the feature could possibly be abused by an admin IMHO isn't the right thing to do.
3) I agree with not accepting the patch. You don't (ever! ever!) fix a problem by ignoring it somewhere down the stack. You also don't call exit() or an equivalent in a library either of course :-).
The correct fix would be to remove the assertion, to return an error code and to fix all call sites (good luck with the public API that you've just now changed).
Or to throw an exception which brings us back to point 1.
I'm not complaining, btw: Stuff has happened (great analysis of the issue, btw. Much appreciated as it allowed me to completely understand the issue and be able to write this comment without having to do the analysis myself). I see why and I also understand that fixing it isn't that easy. Software is complicated.
The one thing that I'm heavily disagreeing though is above point 2). Being able to just edit a file is way more convenient than also having to restart some daemon (especially if that has PID 1). The only fix from upstarts perspective would be to forego the usage of libnih (where the bug lives), but that would mean a lot of additional maintenance work in order to protect against a totally theoretical issue as this bug requires root rights to use.