Kernel debugging for newbies

bcantrill · on Dec 18, 2017

For those of us coming from non-Linux systems (and I'm speaking for illumos/SmartOS here), the effort required to debug the kernel displayed here is galling. That this is so difficult reflects Torvalds' historic disposition against kernel debuggers[1]:

  I don't like debuggers. Never have, probably never will. I use gdb all the
  time, but I tend to use it not as a debugger, but as a disassembler on
  steroids that you can program.
  
  None of the arguments for a kernel debugger has touched me in the least.
  And trust me, over the years I've heard quite a lot of them. In the end,
  they tend to boil down to basically:
 
  - it would be so much easier to do development, and we'd be able to add
    new things faster.

  And quite frankly, I don't care. I don't think kernel development should
  be "easy". I do not condone single-stepping through code to find the bug.
  I do not think that extra visibility into the system is necessarily a good
  thing.

To be fair, this was a long time ago (17 years ago!), but the experience relayed here leaves one believing that Torvalds' historic attitude has cast a long shadow.

And if it needs to be said, Torvalds' arguments are themselves deeply confused; he has conflated single-step in situ debugging (which does in fact suffer from limited utility in the context of an OS kernel) with debugging writ large. So as he rejected in situ debugging, he also implicitly rejected postmortem debugging and dynamic instrumentation -- both of which have proved absolutely essential for kernel development. (Indeed, it is likely that DTrace alone would have allowed the author to debug their problem, as its design center is exactly the kind of non-fatal failure described.)

The tutorial will certainly save others pain, but that such pain still exists at all in Linux is deeply unfortunate, and a vivid example of Linux not representing anything close to the state-of-the-art in systems development.

[1] https://lwn.net/2000/0914/a/lt-debugger.php3

alambert · on Dec 18, 2017

I agree! For Windows, getting symbols was trivial: you just pointed your debugger to the publicly-available symbol server[1]. You can see this attitude continuing in [2]: a beginner is warned away from using a debugger to understand the kernel behavior.

[1] https://msdn.microsoft.com/en-us/library/windows/desktop/ee4...

[2] https://lists.kernelnewbies.org/pipermail/kernelnewbies/2016...

bcantrill · on Dec 18, 2017

Wow, that second example is very telling. For whatever it's worth, before we started work on DTrace (ca. 2001), we prioritized a project to add not just symbol information but also debugging information to production kernels. This project -- the Compact C Type Format (CTF)[1][2] -- has proved essential many times over in the years since (and is a major reason why we can consider kernel debugging a part of the core system functionality). It is clear that a project like CTF is unlikely to even be understood let alone prioritized by those displaying such a dismissive attitude towards debugging.

[1] http://illumos.org/man/4/ctf

[2] http://www.smnd.sk/lovasko/paper.pdf

harry8 · on Dec 19, 2017

Brian, I love your work but seriously. You've misspelled "read" as "understood" there and to write it off as a dismissive attitude toward debugging in toto is not something you should lower yourself to pretend to believe. Deep breath. DTrace is awesome, Oracle is not.

Yeah my kissing may not be up to scratch either... ;-)

bcantrill · on Dec 19, 2017

The attitude towards debugging is dismissive -- and I'm not the only one to have drawn that conclusion. Indeed, it's the original author of this article who came across that and drew that inference -- one that I (obviously) share. To flip this around: do you think that the work outlined in the original article is work that should be expected of anyone wishing to debug the kernel?

harry8 · on Dec 19, 2017

I think there is room for reasonable people to disagree reasonably about most issues. Flip that around?

Linux is a long way from being the buggiest OS kernel I've ever used, how about you? Perhaps they've found and fixed some bugs? Perhaps they've prevented some from being written? Perhaps their approach is something that can be disagreed with, even strongly so, on the grounds of being less than optimal without suggesting that it has zero merit and by extension its proponents are somehow to be considered with derision? The inference that anyone hacking any OS kernel is too stupid to understand a differing idea is probably not necessary and unlikely to be justified in my humble opinion. You may of course reasonably disagree and maybe one of us did understand something the other did not on the point? Anyway this is now dull.

But don't be less opinionated, that would be the wrong response!

Dylan16807 · on Dec 19, 2017

> Perhaps their approach is something that can be disagreed with, even strongly so, on the grounds of being less than optimal without suggesting that it has zero merit

If an attitude results in fewer bugfixes, all else being equal, then that attitude does have exactly zero merit.

makomk · on Dec 18, 2017

Errm, the email at https://lists.kernelnewbies.org/pipermail/kernelnewbies/2016... seems to be warning him against using an unusual set of custom compiler optimisation options that none of the kernel developers have ever tested with rather than against using a debugger full stop.

alambert · on Dec 18, 2017

The beginner wants to disable optimizations, so he can better understand the kernel behavior by stepping through it in a debugger; he's told this isn't possible. Compare this with the availability of checked builds of Windows components, which support debugging by disabling optimizations: https://docs.microsoft.com/en-us/windows-hardware/drivers/de...

vbitz · on Dec 19, 2017

Checked builds are no longer distributed.

Source: https://social.msdn.microsoft.com/Forums/windowsdesktop/en-U...

yuhong · on Dec 19, 2017

I think checked builds did not disable optimizations, though they did include useful stuff.

ploxiln · on Dec 19, 2017

That's a great Torvalds email, which I hadn't seen before! Here's the "point" though:

> I happen to believe that not having a kernel debugger forces people to think about their problem on a different level than with a debugger. I think that without a debugger, you don't get into that mindset where you know how it behaves, and then you fix it from there. Without a debugger, you tend to think about problems another way. You want to understand things on a different _level_.

I think that's really pretty reasonable. Surely you've seen a programmer who ran into a bug and submitted a pull request saying "I fixed it" but it's clearly just a messy work-around which misunderstands the problem and leaves a bunch of analogous problem cases unfixed.

I prefer printf-and-inspection debugging myself; I only reach for gdb once a week or so. I'm pretty diligent and my results are quite solid. It's fine that we have different styles, and because of well-designed interfaces, like process boundaries, we can co-exist in peace.

Linus is really counter-cultural in today's hip software world, where everyone says everything should be easier and easier. We just can't ever make threads or async or crypto easy enough. Meanwhile the popular applications and frameworks in use seem to get bigger and slower and buggier. Correlation, not causation, but still an annoying trend.

racer-v · on Dec 18, 2017

> he has conflated single-step in situ debugging (which does in fact suffer from limited utility in the context of an OS kernel) with debugging writ large.

Where does he do this? The quote is only about single-step debugging.

> So as he rejected in situ debugging, he also implicitly rejected postmortem debugging and dynamic instrumentation

Do you have any evidence of this? It seems to me that dump traces and kernel probes have been an important part of Linux for many years.

bcantrill · on Dec 18, 2017

No, the quote is not only about single-step debugging; his next sentence is: "I do not think that extra visibility into the system is necessarily a good thing." And if that still leaves too much ambiguity for you as to his thinking, read the rest of the screed -- and then remember how we got here: because it took someone two painful days to configure something that is considered core functionality on many other systems.

richardwhiuk · on Dec 18, 2017

It's a lot easier to configure using the modern API - the author is intentionally choosing an API which doesn't give much feedback....

xenophonf · on Dec 18, 2017

That's true, but the author has solid reasons for doing so:

> I want my application to work on OS X and Linux, so I’m targeting PF_KEYv2 instead of OS-specific APIs.

makomk · on Dec 18, 2017

If the reason he's using a horribly unfriendly and difficult to debug API is because it's cross-platform and the more developer-friendly replacement is Linux only, then the fact that the API is so hostile certainly isn't proof - as bcantrill was claiming - that Linux is somehow going out of its way to be hostile to developers compared to competing OSes.

alambert · on Dec 18, 2017

PF_KEYv2's difficult interface definitely isn't a Linux issue. But PF_KEYv2's limited error reporting means that understanding what's going wrong requires kernel debugging. The difficulty of kernel debugging is a Linux issue. ("I do not condone single-stepping through code to find the bug."[1])

[1] https://news.ycombinator.com/item?id=15953644

AstralStorm · on Dec 18, 2017

Not to mention relocatable kernel and kexec crash dump analysis.

rhinoceraptor · on Dec 18, 2017

This is Linux after all, why would you need postmortem debugging if the system never kills itself when it finds itself in an invalid state? :-)

Twirrim · on Dec 18, 2017

For what it's worth, Microsoft has the same approach to debugging the kernel via WinDBG.

Avery3R · on Dec 18, 2017

You can debug the kernel locally using windbg, though it's pretty easy to freeze your system doing that.

richardwhiuk · on Dec 18, 2017

How do you suggest a kernel debugger should work?

lowleveldesign · on Dec 18, 2017

Debugging a syscall or start of the process was a great way for me to learn the system internals. I have some experience with Windows debugging and, after reading the article, I find that configuring the kernel debugging in Windows is quite easy. And I really like the live kernel debugging feature, when you either use windbg (that requires the debug boot flag) or simply run livekd [1] to analyze the running system data (for instance ALPC connections, handles, or loaded drivers data). Is there anything similar available in Linux? I plan to learn Linux internals and would love to use the kernel debugger next to reading the source code and books.

Tangential, but if there is anyone interested in Windows debugging (including kernel debugging) have a look at the Inside Windows Debugging book by Tarik Soulami [2]

[1] https://docs.microsoft.com/en-us/sysinternals/downloads/live...

[2] https://www.amazon.com/Inside-Windows-Debugging-Developer-Re...

Timothycquinn · on Dec 19, 2017

Great to see this and I hope I never have to do this on Linux.

A few years back, when I was finally getting my personal dev box off of Windows, I took a very close look at FreeBSD or a derivative. I hit the wall with poor touchpad support, which made the laptop difficult to use. I tried very hard to debug their touchpad kernel library but had no luck getting the results I needed. Anybody have links on how to do the same kind of remote kernel debugging on FreeBSD like above, but using a physical box rather than a VM?

I would still like to give that can another good kick'n :)

cuckcuckspruce · on Dec 18, 2017

Two machines seems like overkill in this case - why not debug using User Mode Linux[1]? Debugging with two machines makes perfect sense if you are debugging a hardware driver, but not here.

[1] http://opensourceforu.com/2010/09/user-mode-linux-setup-and-...

AstralStorm · on Dec 18, 2017

Because UML cannot interact with most hardware and that is where bugs happen. Even VMs do not provide real hardware or allow you to bring down the machine with it.

Instead you can debug a crash using a trace and a kexec dump. You can also use the quite fast ftrace infrastructure, which is much better than plain old printf.

There are also kprobes, gcov and oprofile.
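
For anyone who hasn't used kprobes through ftrace, here is a rough sketch of what defining a probe looks like, in Python only for illustration. The definition string follows the kprobe_events syntax from the kernel docs. The tracefs path, the event names, and the probed symbol pfkey_sendmsg are my assumptions, so check /proc/kallsyms on your own kernel. install_kprobe is a hypothetical helper that needs root and a kernel built with kprobe events; I haven't run it against the setup in the article.

```python
from pathlib import Path

# Assumed tracefs mount point; older systems expose the same files
# under /sys/kernel/debug/tracing instead.
TRACEFS = Path("/sys/kernel/tracing")


def kprobe_definition(event, symbol, group="kprobes", on_return=False):
    """Build one line in the kprobe_events syntax: p|r:GROUP/EVENT SYMBOL."""
    kind = "r" if on_return else "p"  # "p" = probe at entry, "r" = return probe
    return f"{kind}:{group}/{event} {symbol}"


def install_kprobe(definition, tracefs=TRACEFS):
    """Hypothetical helper: append the probe, then enable it. Needs root."""
    # Appending (rather than truncating) keeps any probes already defined.
    with open(tracefs / "kprobe_events", "a") as f:
        f.write(definition + "\n")
    # "p:kprobes/pfkey_send pfkey_sendmsg" -> "kprobes/pfkey_send"
    group_event = definition.split()[0].split(":", 1)[1]
    (tracefs / "events" / group_event / "enable").write_text("1")


if __name__ == "__main__":
    # pfkey_sendmsg is an assumed symbol name for the PF_KEY send path.
    print(kprobe_definition("pfkey_send", "pfkey_sendmsg"))
```

Once a probe is enabled, hits show up in the trace and trace_pipe files under the same tracefs directory, so you get visibility into a code path without rebuilding the kernel or attaching a second machine.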

chowyuncat · on Dec 19, 2017

Network debugging over VMware to a 2.6 kernel (e.g. RHEL 6) will not work much of the time; a virtual serial port must be used instead.

CalChris · on Dec 18, 2017

The BLIT had a debugger which was named joff. Unix had printf.