Hacker News

> Crash Dump Analysis...In an environment like ours (patched LTS kernels running in VMs), panics are rare.

As the order of magnitude of systems you administer increases, "rare" shades into "occasional," and "occasional" into "frequent." Especially when the systems are not running in VMs.

Also, from time to time you just get a really bad version of a distro kernel, or some flaky piece of hardware that is ubiquitous in your setup, and then these crashes become both more frequent and more serious.

(Recent example of a distro kernel bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1674838 . I foolishly upgraded to Ubuntu 17.04 on its release day instead of letting it get banged around for a few weeks. For the next five weeks it crashed my desktop about once a day, until a fix rolled out through Ubuntu's -proposed pocket.)

Most companies I've worked at want official support channels, so usually we'd be running RHEL. If I saw the same crash more than once, I'd send the dump to Red Hat; if the crash pointed at the hardware, then to the server maker (HP, Dell...) or the hardware driver vendor (QLogic, Avago/Broadcom).

Solaris crash dumps worked really well, though - they worked smoothly for years before kdump was even merged into the Linux kernel. It's one of those cases where you benefit from the hardware and software being made by the same company.
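For context, on a RHEL-family system the kdump workflow mentioned above looks roughly like this (a sketch only - package names, paths, and the crashkernel= sizing vary by distro and release):

```shell
# Reserve memory for the capture kernel on the kernel command line, e.g.:
#   crashkernel=256M
# Then enable the kdump service so a panic kexec's into the capture kernel,
# which writes a vmcore under /var/crash before rebooting:
sudo systemctl enable --now kdump
sudo kdumpctl status    # RHEL/Fedora helper; reports whether kdump is operational

# After a panic, open the dump with the matching debuginfo vmlinux:
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
      /var/crash/127.0.0.1-*/vmcore
# Inside crash(8): 'bt' shows the panicking backtrace, 'log' the dmesg ring.
```

These are standard administrative steps, not something specific to any poster's setup; the vmcore path pattern in particular depends on the configured dump target.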




Crash dumps don't matter as much if your distributed architecture has to account for hardware failures. (Or VM failures, or network hiccups, etc.)

Kernel developers still need crash dumps to root-cause an individual crash, and dumps are most valuable for rare, extremely hard-to-reproduce crashes. (Though if you run the "Pet" model rather than the "Cattle" model, even a single failure of a critical DB instance can't be tolerated.) For crashes that are easy to trigger, dumps are still useful but far less critical, since you can reproduce the problem at will. And if your distributed architecture tolerates rare crashes, it may not even be worth the support-contract cost to root-cause and fix every last kernel crash.

Yes, it's ugly. But if you are administering a very large number of systems, this can be a very useful way of looking at the world.
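In that "cattle" view, a common compromise is to let a panicked machine recover on its own instead of holding it for analysis. A sketch of the relevant sysctl knobs (the values here are illustrative, not recommendations):

```shell
# /etc/sysctl.d/99-panic.conf - fail fast and come back up, rather than hang
kernel.panic = 10          # reboot automatically 10 seconds after a panic
kernel.panic_on_oops = 1   # escalate an oops to a panic, so it also reboots
```

This trades debuggability for availability: the node rejoins the fleet quickly, and you fall back on dump collection only when a crash signature keeps recurring.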


> As the order of magnitude of systems administered increases, rare changes to occasional changes to frequent. Especially when it is not running in a VM.

I think the perf engineer for Netflix is quite aware of this.


While I really respect Brendan's opinion (I've got most of his books and he is one of my IT heroes), I do think he is very Netflix-IT-scale minded. When you're Netflix you can maintain your own kernel with ZFS, DTrace, etc., and have a good QA setup for your own kernel and userland - basically maintain your own distro. But when you're in a more "enterprisy" environment, you don't have the luxury of stabilizing Ubuntu with ZoL yourself. I know from first-hand experience that ZoL is definitely not as stable as FreeBSD ZFS or Solaris ZFS.


And there it is - the elephant in the room no one mentioned. People in 99% of IT shops get an existential crisis if you mention during the interview that you want to do kernel engineering. Thank you!




