> Crash Dump Analysis...In an environment like ours (patched LTS kernels running...

tytso · on Sept 5, 2017

Crash dumps don't matter as much if your distributed architecture has to account for hardware failures. (Or VM failures, or network hiccups, etc.)

Kernel developers still have to use crash dumps to root-cause an individual crash, but crash dumps are most useful for extremely hard-to-reproduce crashes that are rare (but if you are using the "Pet" model as opposed to the "Cattle" model, even a single failure of a critical DB instances can't be tolerated). For crashes that are easy to trigger, crash dumps are useful, but they are much less critical to figure out what's going on. If your distributed architecture can tolerate rare crashes, then you might not even consider worth the support contract cost to root cause and fix every last kernel crash.

Yes, it's ugly. But if you are administrating a very large number of systems, this can be a very useful way of looking at the world.

rodgerd · on Sept 5, 2017

> As the order of magnitude of systems administered increases, rare changes to occasional changes to frequent. Especially when it is not running in a VM.

I think the perf engineer for Netflix is quite aware of this.

jsiepkes · on Sept 5, 2017

While I really respect Brendan's opinion (I've got most of his books and he is one of my IT heroes) I do think he is very netflix-IT-scale minded. When your Netflix you can maintain your own kernel with ZFS, DTrace, etc. and have a good QA setup for your own kernel / userland. Basically maintain your own distro. However when your in a more "enterprisy" environment you don't have the luxury of making Ubuntu with ZoL stable yourself. I know from first hand experience that ZoL is definitely not as stable as FreeBSD ZFS or Solaris ZFS.

Annatar · on Sept 6, 2017

And there it is - the elephant in the room noone mentioned. People in 99% of the IT shops get an existential crisis if you mention during the interview that you want to do kernel engineering. Thank you!