.NET's FileSystemWatcher documentation says that there's an internal buffer of a...

MaulingMonkey · on Nov 24, 2014

I'd generally take a config file failing to update (this may happen anyways from the looks of it, e.g. if "/proc/sys/fs/inotify/max_queued_events" isn't available) over the box dying with a terrible crash report.

I'd much prefer to accept the patch with a followup request, tracking issue, or TODO. There wass good reason to not consider the issue resolved, but I'm not of the opinion there was good reason to reject the patch either.

Morgawr · on Nov 24, 2014

>I'd generally take a config file failing to update (this may happen anyways from the looks of it, e.g. if "/proc/sys/fs/inotify/max_queued_events" isn't available) over the box dying with a terrible crash report.

I'm not entirely sure I'd agree with that line of reasoning, to be honest. While the crash is a big problem and falls into the category of "obscure enough" problems that might leave your sysadmins scratching their heads for quite some time (especially because of the contrived and terrible crash report message), I would much prefer my system to crash and halt instead of failing to recognize/load a config file. Not loading config files properly (and failing silently, at that) means that you might be exposed to security attacks, your system might be in an inconsistent state until the next reboot and you might not even know it. I'll take a failure over prolonged inconsistent state of a machine any day. That's what redundancy solves after all.

vidarh · on Nov 24, 2014

If you are concerned about config changes taking effect, you should be explicitly reloading affected services anyway and/or verifying that the change has taken effect, not blindly rely on your init to have picked up the change.

This is a false dichotomy anyway: The solution is not to not apply the patch if the submitter doesn't provide an alternative, but recognise that this is a potentially serious problem and fix it anyway.

There are multiple alternatives:

* Fix libnih.

* Fork libnih for upstart, and fix it there.

* Ditch libnih for this, and catch the problem yourself.

* fork() and treat the child failing as a sign you need to rescan the entire config directory after restarting it.

* Regularly rescanning the directory anyway, on the assumption that Things-Go-Wrong and events will get missed out for some reason or the other.

More than one of those can be applied at once. Personally I'd opt for the two last in combination.

sergiosgc · on Nov 24, 2014

Failing silently should no be an option. This is a case where the best is enemy of the good. The patch is good, albeit not optimal. Hanging on to a bad solution, in presence of a good solution, because you can imagine an optimal solution was a bad call.

The patch should log a warning/error and avoid the crash. It does not preempt an optimal solution, by traversing the file tree, being developed afterwards.

x0x0 · on Nov 24, 2014

loud suicide rather than quiet misbehavior (nobody reads logs until someone notices something wrong) is often preferred, at least for distributed systems.

MaulingMonkey · on Nov 24, 2014

> Failing silently should no be an option.

In this case, by my standards, it already is. Some relatively unrelated bit of code is later failing loudly.

> This is a case where the best is enemy of the good.

Neither best nor good happened in this case - so I'd flip your terms around. The best of intentions and standards prevented good.

> The patch should log a warning/error and avoid the crash. It does not preempt an optimal solution, by traversing the file tree, being developed afterwards.

On this we can agree wholeheartedly. That'd be a great followup changelist in lieu of doing the optimal solution, assuming the latter is too time consuming.

MaulingMonkey · on Nov 24, 2014

> your system might be in an inconsistent state until the next reboot and you might not even know it

With the kernel panic, you'll be ending up in an inconsistent state even after the reboot - assuming it reboots - when it happens in the middle of un-taring your configuration files.

Morgawr · on Nov 24, 2014

At least you WILL know that your system is in an inconsistent state and you know something wrong happened and you should take action.

Plus, there are modern features in kernels that allow you to audit your kernel panics, figure out what is wrong and, hopefully, everything should be automatically reverted back to how it was thanks to backups and journaling on the file system (to prevent corrupted images, that is).

Of course this won't automatically fix your problems for you or finish your un-taring of config files, but that's why we have sysadmins, isn't it?

masklinn · on Nov 24, 2014

> It's mentioned that a patch wasn't accepted, but it was with good reason - it doesn't fix the problem (by traversing the directory).

The patch can not fix the problem: upstart can't scan the directory (or ignore the overflow, or handle it however it would) because libnih blows up without yielding to upstart. It does not call a NihWatch handler and does not return an error code.

So the first step is to have libnih at least not blow up (which the patch did), ideally notify the user (which the patch didn't). The purpose of the patch was not to fix the issue, it was to fix libnih's broken handling of inotify.

tankenmate · on Nov 24, 2014

If I was using this system I would say that it isn't up to libnih to fix the problem; inform init and let init do a complete config file(s) reload. This is very similar to the way the kernel handles interrupts; when there aren't many it handles hardware in default interrupt mode (e.g. one interrupt per packet), but when the rate of interrupts gets too high it moves to non blocking synchronous polling (e.g. keep iterating through packets until a buffer fills/reschedule or the input queue empties).

masklinn · on Nov 24, 2014

> If I was using this system I would say that it isn't up to libnih to fix the problem

It isn't up to libnih to handle the overflow, but the problem right now is in libnih itself[0] and only libnih can fix that.

[0] and is twofold: libnih blows up on inotify overflow instead of not blowing up on a relatively normal (if rare) situation, and consequently (and obviously) libnih doesn't provide any way for its caller to handle the overflow.

vidarh · on Nov 24, 2014

That's true. What upstart could have done, though, is to watch for config changes in a separate process, so that it does not need to have an in-process dependency on something like this.

That would give upstart a way of handling the overflow and other (unknown) potential problems: rescan the config if the config watcher dies (after respawning the watcher).

lsaferite · on Nov 24, 2014

This seems like a reasonable solution. And one that sandboxes failures in that code path to prevent PID 1 from crashing. It's a shame this is the only comment (that I've seen) even mentioning this.

barrkel · on Nov 24, 2014

There's a risk that this is a case of "best is the enemy of the good".

Rather than fix an issue that can crash a whole box, hold out until the "right thing" is done.

rikkus · on Nov 24, 2014

That's a good point, but there's also the argument that it's better to stop and log why, than silently fail and let people try and figure out why something's broken.

I don't think either are great here, so the only real way to fix this is to, well, fix it!

justincormack · on Nov 24, 2014

As far as I remember all the Linux kernel to userspace notifications eg netlink as well behave like this. Of course it is a pain to do and hard to test so few people do but if you are writing a library you clearly should especially if it is being run by init.