The Linux kernel can spawn processes on its own (uninformativ.de)
149 points by zdw on June 11, 2022 | 50 comments



When the kernel is the gatekeeper for the system, obviously it can do anything userspace can, and then some.


Also, I can speak 10 languages fluently... in the sense that nobody has the power to stop me from learning 9 more languages if I should decide to do that.


It can do anything theoretically. It doesn't do anything architecturally. For example, unlike NT, syscalls can't call back to userspace.


> For example, unlike NT, syscalls can't call back to userspace.

There are some mechanisms that can call back into userspace during syscalls, such as seccomp filters, FUSE, ptrace, userfaultfd, fanotify, and the syscall_user_dispatch feature used by Wine... There's the core_pattern handler too.

Someone once summarized it as: a mov instruction can end up being serviced by starting a Python process.
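
Rough sketch of that scenario with userfaultfd(2), if anyone wants to play with it. Error handling is omitted, a 4K page size is assumed, and on newer kernels unprivileged use may need the vm.unprivileged_userfaultfd sysctl:

  /* A load from `region` faults; the kernel parks the faulting thread
     and a userspace thread supplies the page contents. */
  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #define PAGE 4096

  static void *fault_handler(void *arg)
  {
      int uffd = (int)(long)arg;
      struct uffd_msg msg;
      static char page[PAGE];

      /* Block until the kernel reports a missing-page fault. */
      read(uffd, &msg, sizeof(msg));

      /* Resolve it by copying in a page we control. */
      strcpy(page, "hello from the fault handler\n");
      struct uffdio_copy copy = {
          .dst = msg.arg.pagefault.address & ~(unsigned long)(PAGE - 1),
          .src = (unsigned long)page,
          .len = PAGE,
      };
      ioctl(uffd, UFFDIO_COPY, &copy);
      return NULL;
  }

  int main(void)
  {
      int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
      struct uffdio_api api = { .api = UFFD_API };
      ioctl(uffd, UFFDIO_API, &api);

      char *region = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      struct uffdio_register reg = {
          .range = { .start = (unsigned long)region, .len = PAGE },
          .mode  = UFFDIO_REGISTER_MODE_MISSING,
      };
      ioctl(uffd, UFFDIO_REGISTER, &reg);

      pthread_t t;
      pthread_create(&t, NULL, fault_handler, (void *)(long)uffd);

      printf("%s", region);   /* this read is the "mov" in question */
      pthread_join(t, NULL);
      return 0;
  }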


Genuine question: how do you know such things? I'd love to learn them, so I'm wondering how others learn them (other than remembering random comments from here).


`ptrace(2)` and `userfaultfd(2)` show up when doing program tracing/analysis, among other places. I don't know of a great resource for the latter, but Eli Bendersky has a terrific series on debuggers that covers `ptrace(2)`[1].

[1]: https://eli.thegreenplace.net/2011/01/23/how-debuggers-work-...
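
If it helps, here's a skeletal version of the pattern that series builds on (error handling omitted):

  /* Child opts in to tracing; parent observes its stops via waitpid(2). */
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
      if (argc < 2) {
          fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
          return 1;
      }

      pid_t child = fork();
      if (child == 0) {
          /* The exec stops the child with SIGTRAP before it runs
             a single instruction of the new program. */
          ptrace(PTRACE_TRACEME, 0, NULL, NULL);
          execvp(argv[1], &argv[1]);
          perror("execvp");
          return 1;
      }

      int status;
      waitpid(child, &status, 0);   /* the post-exec stop */
      printf("tracee stopped at exec; resuming\n");

      /* A real debugger would plant breakpoints here with
         PTRACE_PEEKTEXT / PTRACE_POKETEXT before resuming. */
      ptrace(PTRACE_CONT, child, NULL, NULL);
      waitpid(child, &status, 0);   /* wait for exit */
      return 0;
  }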


This is getting off topic, but what the heck does the number in parentheses mean on man pages?


Man sections, a legacy from ye olden days when man pages were printed on paper and you'd use numbered sections for faster navigation (or splitting them into separate books if there are too many pages).

For example, `man 2 select` (or `man select.2`) gives you information about the `select` syscall, and `man 3 select` has information on the libc wrapper around that syscall.

A somewhat better example might be time: `time.1` has info on the command used to measure the running time of another command (as in `time make -j`), and `time.7` has low-level information on how to interact with the kernel's time-related functionality. Since both are named `time`, if you just use `man time` without the number, you get the first one (and no way to get to the second).

Read `man man`, it has a lot more on this stuff.

  The table below shows the section numbers of the manual followed by the types of pages they contain.
  
         1   Executable programs or shell commands
         2   System calls (functions provided by the kernel)
         3   Library calls (functions within program libraries)
         4   Special files (usually found in /dev)
         5   File formats and conventions, e.g. /etc/passwd
         6   Games
         7   Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7), man-pages(7)
         8   System administration commands (usually only for root)
         9   Kernel routines [Non standard]


TIL about the `.N` suffix. I've always done `man N page`!


I don't think it's very portable, actually. It doesn't work on any of the BSDs or OpenSolaris forks IIRC. Yet another alternative spelling, `man 'select(2)'`, should work everywhere, but many shells require you to escape parentheses (or put them in quotes as above), so I don't find it very useful.


Here is an important distinction to make: unlike NT, syscalls in a given thread do not call back to userspace in the same thread.

In NT, KeUserModeCallback implements some sort of a coroutine pattern, where the kernel space can call back to the user code synchronously in the same thread and expect it to return to the callsite. In Linux, there's no such thing as a user-mode callback: a given thread cannot have any user code running (or scheduled to be run) while simultaneously having the kernel call stack active.

> seccomp filters, FUSE, ptrace, userfaultfd, fanotify

> core_pattern handler

> usermode helpers

These either block the current thread waiting for another thread (e.g. a debugger, fault handler, or filesystem server) to complete the request, or fire off the request asynchronously.

> syscall_user_dispatch

This does not "call back" into userspace at all; it merely delivers SIGSYS. Signal delivery works more like NT's user APCs: there's no kernel-side context active in the thread while the signal handler (or the APC) is running.
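
For a concrete taste of the SIGSYS flavor, here's a sketch using a seccomp filter with SECCOMP_RET_TRAP (error handling omitted, and a real filter should also check seccomp_data.arch): the kernel doesn't "call into" the handler, it just fails the syscall over to ordinary signal delivery.

  #include <linux/filter.h>
  #include <linux/seccomp.h>
  #include <signal.h>
  #include <stddef.h>
  #include <sys/prctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void on_sigsys(int sig, siginfo_t *info, void *ctx)
  {
      /* An ordinary signal handler: no kernel frame is active in this
         thread while we're here. */
      (void)sig; (void)info; (void)ctx;
      const char msg[] = "getpid() trapped to userspace\n";
      write(STDERR_FILENO, msg, sizeof(msg) - 1);
      _exit(0);
  }

  int main(void)
  {
      struct sigaction sa = { .sa_sigaction = on_sigsys,
                              .sa_flags = SA_SIGINFO };
      sigaction(SIGSYS, &sa, NULL);

      /* BPF program: trap getpid(2), allow everything else. */
      struct sock_filter filter[] = {
          BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                   offsetof(struct seccomp_data, nr)),
          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
          BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
      };
      struct sock_fprog prog = {
          .len    = sizeof(filter) / sizeof(filter[0]),
          .filter = filter,
      };

      prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
      prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

      getpid();   /* delivered as SIGSYS instead of executing */
      return 1;
  }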


This is convenient but having the kernel block on userspace doing something is typically bad design :(


It really isn't. "The" kernel doesn't block; a context blocks, and typically that context is associated with a user process (directly or indirectly).

With sleep(3), the kernel blocks for a time specified by userspace. futex(2) can cause the kernel to block until userspace wakes it. Similarly with wait(2), and with read(2) from a pipe or localhost socket.

So whether or not it is a bad design depends highly on what gets blocked, in what situations, and what can be done to recover the situation if things misbehave.
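
To make that concrete, a minimal futex(2) sketch (glibc has no wrapper, so it's a raw syscall): the kernel parks one thread until another userspace thread wakes it, and whether that's "bad design" depends entirely on the caller.

  #include <linux/futex.h>
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static atomic_int flag = 0;

  static long futex(atomic_int *uaddr, int op, int val)
  {
      return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
  }

  static void *waker(void *arg)
  {
      (void)arg;
      sleep(1);
      atomic_store(&flag, 1);
      futex(&flag, FUTEX_WAKE, 1);   /* userspace wakes the sleeper */
      return NULL;
  }

  int main(void)
  {
      pthread_t t;
      pthread_create(&t, NULL, waker, NULL);

      /* The kernel blocks this thread for as long as flag stays 0. */
      while (atomic_load(&flag) == 0)
          futex(&flag, FUTEX_WAIT, 0);

      puts("woken by userspace");
      pthread_join(t, NULL);
      return 0;
  }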


Intel processors have a feature called Supervisor Mode Execution Prevention (SMEP) designed specifically to prevent that in most cases (because there's a decent chance the code might be attacker-controlled). It's totally optional, of course.

https://lwn.net/Articles/517475/


Arguably signals were the classic "upcall", and netlink is probably the main mechanism for upcalls in current Linux.

But NT's hidden VMS-esque API for upcalls would be a somewhat nicer option at times (or Solaris Doors).


Something I'd like to have: a magic SysRq combo to pause all processes and run a predefined binary. This would allow me, when userspace is broken, to do something useful and save the jobs running on the machine. Of course, some way to prevent it from turning into a DoS vector is important.


Something like stop_machine() along with freeze_processes() might get you fairly close to that.


Look at criu maybe?

https://criu.org/


This seems very close to hibernate-to-disk, so it seems the plumbing is already there.


how else would init get started?


After bootstrapping, the kernel sits in its idle task and does absolutely nothing except serving interrupts, the scheduler only switching between kernel threads at most. Then, after a set amount of time (5 minutes), another device with full DRAM access (or sufficiently configured IOMMU) and in the same coherency domain (or some contraption to make things coherent) starts to DMA the init task and the appropriate changes to the kernel's data structures into memory. When made coherent, the kernel's scheduler will see the new init task's thread in its set of runnable threads and eventually switch to it.

What? You did not ask for a method that is not completely batshit insane.


Except given the size of caches nowadays, without some cause, that coherency might never happen :-)


Yep. That’s why I kind of copped out with just mentioning that the device has to either be in the same coherency domain, or there must be some super vague hand-wavy “contraption” to achieve coherency. Be creative I guess! :P


This isn't new, right? Firmware loaders and hardware hotplug events (when not using sysfs' netlink firehose) are also spawned directly from the kernel, IIRC?


In the cases you mentioned, the kernel spawns a process loaded from a normal ELF executable on the filesystem, just like you could do with exec(). But in this bpfilter_umh case, the executable doesn't come from the filesystem; it's embedded in the kernel itself. That seems to be quite a new invention.


The interest is warranted, but it has been true for a long time that the kernel spawns processes where the executable is in the kernel itself.

https://www.quora.com/What-is-kswapd0-in-Linux-Kernel

https://stackoverflow.com/questions/9154042/where-can-i-find...

Here's some official documentation on it:

https://tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO-8.h...


Those are kernel threads, which aren't considered the same as spawning a userspace process.


"which means it’s probably safe to trust the process’s title and, well, that says bpfilter_umh, so there you have it. (Just saying: Don’t trust process titles in general. They’re easy to manipulate.)"

And then there's this: https://www.bleepingcomputer.com/news/security/bpfdoor-steal...
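
The comm name in particular is trivially self-assigned; any process can do this:

  /* Toy illustration of a spoofed process title: after PR_SET_NAME,
     ps and /proc report this process as "bpfilter_umh". */
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <unistd.h>

  int main(void)
  {
      prctl(PR_SET_NAME, "bpfilter_umh", 0, 0, 0);
      printf("pid %d now claims to be bpfilter_umh\n", getpid());
      pause();   /* linger so it can be inspected with ps */
      return 0;
  }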


This seems similar to nfsd, which in many UNIX implementations is present only to prompt the kernel to service NFS requests.

A lot of (local) filesystems seem to do this as well (XFS in particular).


A lot of kernel code spawns threads. I was not aware that it can also spawn usermode processes running code embedded in the kernel. That should be advertised and developed more, IMO. Tons of drivers have no business running in kernel mode.


I wonder what the parent PID of these processes is.


That's what the "kthreadd" kernel process is for.

Try this command:

  ps axo pid,ppid,comm | less
The kernel-created processes will all have "kthreadd" (PID 2) as their parent. Typically init and kthreadd are the only processes with a PPID of 0.


Yep, and aiui kthreadd on Linux always has a PID of 2. If you are ever trying to filter out kernel processes from ps output (and usermode helpers, apparently), you can check if the pid is 2 or ppid is 2.
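
As a rough sketch, here's that heuristic applied by hand over /proc (naive parse; a comm containing ")" would break it):

  /* List processes whose parent is PID 0 or 2, i.e. kernel-spawned. */
  #include <ctype.h>
  #include <dirent.h>
  #include <stdio.h>

  int main(void)
  {
      DIR *proc = opendir("/proc");
      struct dirent *ent;

      while ((ent = readdir(proc)) != NULL) {
          if (!isdigit((unsigned char)ent->d_name[0]))
              continue;

          char path[300], comm[64];
          int pid, ppid;
          snprintf(path, sizeof(path), "/proc/%s/stat", ent->d_name);

          FILE *f = fopen(path, "r");
          if (!f)
              continue;   /* process may have exited; that's fine */
          /* /proc/<pid>/stat: pid (comm) state ppid ... */
          if (fscanf(f, "%d (%63[^)]) %*c %d", &pid, comm, &ppid) == 3
              && (ppid == 0 || ppid == 2))
              printf("%6d  %-20s ppid=%d\n", pid, comm, ppid);
          fclose(f);
      }
      closedir(proc);
      return 0;
  }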


init has a ppid of 0, which is effectively a special value that means "the kernel". These processes probably do the same thing (and thus, ppid=0 is a good way to detect kernel-started processes).


The other kernel workers get spawned by `kthreadd`, which has PPID = 0 and PID = 2.


I think they could be spawned with ppid 1 as well.


I want to stop doing -j with make, and instead allow the OS scheduler (which alone is in the position to make the right call) to "pull" more batch jobs to run as system resources allow.

Perhaps one can do this or almost do this with io_uring?


That would be a really neat feature. I do run Slurm on single nodes and use sbatch with dependencies to emulate some of the effect you are asking for.


Glad to hear you think so too!


The reason it’s pushed down to make is that the OS scheduler doesn’t know what gcc is.

make, of course, doesn’t know what gcc is either but at least it has a better chance of learning.

(The OS does know what resources current processes are using, but that doesn’t help because it doesn’t know what any future process will do. Designs like GCD solve this with priority levels, but that doesn’t help because all your make tasks are the same high priority.)


> which alone is in the position to make the right call

Why is the OS scheduler alone in a better position than make itself to make the right call?


Everyone else is guessing how many "fork" calls to throw at the OS. The OS itself has much more information at its disposal about what resources are being used, where the bottlenecks are, etc.

An elegant design for doing this in userspace probably boils down to going a good way toward building a microkernel.


Specifically, one failure mode I've seen with Ninja is that it starts 12 processes for my 12-threaded CPU, each process eats 2 gigabytes of RAM, and I only have 16 gigabytes of RAM total, so it sends my system into swap or the oomkiller. Ideally, when the RAM fills up (whether because of compilers spawned by Ninja, or because I opened Discord), Ninja would kill one of the build processes and send it back onto the build queue, and reduce the number of compilers running at a time by 1, until either I tell it otherwise or (by some heuristic) more than 1.5 times the maximum RAM taken by a single compiler so far gets freed up.

I don't know if this is best handled by the OS, userspace Ninja, or some combination (like cgroups).


This sounds a lot like the idea behind Dispatch[0] (aka, Grand Central Dispatch) on Apple systems.

[0] https://developer.apple.com/documentation/DISPATCH


I don't think either make or the OS have enough understanding to do a perfect job here.

Think about the case where you have 32 CPUs, and so make starts 32 of the first tasks in the dependency graph. 31 of those tasks are routine compilations, but 1 task computes some data that a ton of other compilations need. What make and the OS don't know is that that task would be 32x faster if it were using all 32 CPUs. So by starting 31 random compilations, it's actually increasing the end-to-end latency dramatically, because so many things depend on that 1 job running as quickly as possible.

make and the OS basically only solve one tiny part of the problem of fast builds: maximizing CPU utilization right now. They don't have enough data to make the latency of the build faster. It's very much a "good enough" approach; lots of people have faster builds because of "make -j32", but it could hardly be considered general enough to guarantee the fastest possible end-to-end build time. (And even things like Bazel don't quite have enough information to do this. Imagine you have some test that does "sleep 60" in the middle; you really want parallelism to go up to 33 when the test enters that state, and you want to ensure that you start it 60 seconds before the rest of the tests will be finished. There isn't enough data in the BUILD file to do that, though, so you're just stuck.)

Anyway, one thing you can do is get more memory and set -j higher than the number of cores/vCPUs. I think the kernel does a good job ensuring that each job makes progress, so if you have a scenario where something like a test benefits from starting as early as possible in the build and then basically goes to sleep, you'll still use 100% of your CPU, but make progress on that wait. The downside is that these things use a lot of RAM. I remember trying to build Envoy on a 64 vCPU machine once; each C++ process used over a gig of RAM, and I simply didn't have enough. I got more, and found that running 64 parallel compiles with SMT was about 10% faster than 32 parallel compiles. But, requires double the RAM (think it was about 80GB).

TL;DR lots of constraints to balance and optimize. Make is probably going to be the worst performing build system that you can use, but the field is hardly a "solved problem", and further research could probably yield general improvements. Also, buy yourself a lot of RAM when it's cheap. Having some code paged in and ready to go when the CPU is free is good for throughput.


Yes, it would be nice to share that information with the scheduler too. This would be where some sort of microkernel userland scheduler, one that could be augmented with arbitrary information, would do well.


if i recall correctly, this is basically the premise of apple’s Grand Central Dispatch API. which, unfortunately, never seemed to go anywhere..


Interesting find. The kernel has all kinds of weird and wacky things in drivers.


I learned about the existence of usermode helpers last year by working with coredump helpers (/proc/sys/kernel/core_pattern) and found it quite interesting too. The modes where the kernel spawns a new process on behalf of the user (or as the init process) are quite obvious. But this case, where the kernel just executes a helper program in userspace which then immediately terminates, is slightly different. It's a neat design, because it allows customizing kernel behavior without having to compile a custom kernel.
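
For reference, the in-kernel side of this is the usermode-helper API. A toy (untested) module sketch, spawning a helper the same way the core_pattern machinery does:

  #include <linux/init.h>
  #include <linux/kmod.h>    /* call_usermodehelper() */
  #include <linux/module.h>

  static int __init umh_demo_init(void)
  {
      char *argv[] = { "/bin/sh", "-c",
                       "echo hello > /tmp/from-kernel", NULL };
      char *envp[] = { "PATH=/usr/bin:/bin", NULL };

      /* UMH_WAIT_PROC: block this kernel context until the userspace
         helper process exits. */
      return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
  }

  static void __exit umh_demo_exit(void) { }

  module_init(umh_demo_init);
  module_exit(umh_demo_exit);
  MODULE_LICENSE("GPL");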


If you use SELinux, you'd see that it has the 'kernel' domain.



