Use of eBPF in CPU Scheduler (linuxplumbersconf.org)
147 points by marcodiego on Sept 16, 2021 | hide | past | favorite | 26 comments



Whenever I read news about the kernel surfacing more traditionally OS level features to user space, I'm reminded of Stephen Kell's "Unix, Plan 9 and the Lurking Smalltalk"[1].

There are a lot of parallels between the Smalltalk VMs (where nearly everything about the system is editable and accessible to the user) and where Linux is heading in certain respects, except Linux is getting there in a very roundabout way, relying on very limited binary formats and C instead of high-level languages and live objects.

[1]: https://www.humprog.org/~stephen//research/papers/kell19unix...


IIRC, there were some efforts years ago to bring memory management and scheduling to user space. I think those efforts failed mostly because they were very intrusive and imposed performance regressions when not in use. If scheduling is finally allowed from user space without negative effects it will bring a new age of experimentation and finely tuned custom adaptations for the kernel.

Other recent related developments and news:

   https://github.com/rgushchin/atc

   https://www.phoronix.com/scan.php?page=news_item&px=Linux-BPF-Scheduler

   https://lore.kernel.org/bpf/CA+khW7i460ey-UFzpMSJ8AP9QeD8ufa4FzLA4PQckNP00ShQSw@mail.gmail.com/T/#u


Wouldn't putting the scheduler in userspace double the amount of context switching? You'd have to interrupt a userspace thread and jump to the kernel, then jump from the kernel to the scheduler program (switch 1), then jump from the scheduler back to the kernel, and then jump from the kernel to the next scheduled thread (switch 2).

Having the kernel accessible from userspace as it already is saves two steps, but you still have to use the kernel as a middleman unless there's some way for the scheduler program to directly jump into another process (which I don't think there is, because wouldn't that break memory protection?).


You kind of have to view this work in light of the fact that Google internally uses their private SwitchTo API that allows userspace to invoke a thread directly and much more efficiently than the kernel does.

They tried to upstream it but kernel maintainers lack the industry experience that would allow them to perceive the benefits.

https://lkml.org/lkml/2020/7/22/1202

Apparently now known as UMCG.

https://lwn.net/Articles/863386/


I think a more acceptable approach would be a system call with two arguments. Both arguments are lists of task_ids, task descriptors, pids, whatever uniquely identifies a task.

The first argument would be a list of tasks to be run before the syscall returns. The second argument would be a list of tasks to be run in that order (with the possibility of unlisted tasks being run between them) before the syscall returns.

This could not only replace switchTo, it also would enable scheduling from user space with a very small number of context switches. I'd call this syscall "schedule".
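
A rough sketch, purely hypothetical, of what that interface might look like from C userspace (the name, types, and signature are invented here to illustrate the idea, not an existing or proposed syscall):

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical interface, invented for illustration only.
     * 'run_all' : tasks that must each have run before the call returns.
     * 'ordered' : tasks that must run in this order (other tasks may run
     *             in between) before the call returns. */
    long schedule(const pid_t *run_all, size_t n_run_all,
                  const pid_t *ordered, size_t n_ordered);

    /* Example: behave like switchTo(a), and also require that c runs
     * before b at some point before we get the CPU back. */
    static void example(pid_t a, pid_t b, pid_t c)
    {
        pid_t run_all[] = { a };
        pid_t ordered[] = { c, b };
        schedule(run_all, 1, ordered, 2);
    }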


That's certainly a more complex interface, but I don't think it enables anything more.

I think switchto (which became futex_swap which became umcg_swap) from thread X is roughly "suspend the current thread (X) and unsuspend Y", where suspend is "wait on a futex". When all of the tasks you propose specifying are done, one of them can just unsuspend X (possibly suspending itself at the same time). (Caveat: I don't really know what changed between futex_swap and umcg_swap.)
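
In other words, the simpler primitive already composes to cover the multi-task case. Roughly (a sketch using a hypothetical swap_to() wrapper around that suspend-me-and-unsuspend-you behavior, not the actual futex_swap/UMCG API):

    /* Hypothetical wrapper: atomically suspend the calling thread and
     * unsuspend 'target' -- the switchto/futex_swap idea in one call. */
    void swap_to(pid_t target);

    /* Thread X hands the CPU to the first worker and goes to sleep. */
    void run_work_from_x(pid_t first_worker)
    {
        swap_to(first_worker);
        /* Each worker finishes by calling swap_to(next_worker); the last
         * one calls swap_to(x), so X wakes up here with everything done. */
    }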

btw, I don't think it's fair to say the kernel maintainers lacked the industry experience to perceive the benefits. I've glanced at some of the UMCG threads and it seems like reviewers have legitimate feedback aimed at improving the patchset rather than rejecting it. Assuming posk has enough time to keep making new drafts, I expect it to be merged eventually.


Eh, I take a dim view of the largely superficial reactions this patch set has garnered. The maintainers think their aesthetic concerns are more important than the Linux-using public getting access to this api. It doesn’t seem to me like years of debate have improved it, just delayed it.


Note: I'm far from an expert and I'm purely speculating here hoping for educated commenters to show me why I'm wrong.

The described approach allows, for example, a process to choose how its children will be scheduled. I don't think Google's proposal allows that.


Yes, I think you're right that UMCG is just for within a single process. It's meant to support Google's fibers library, which offers a nice structured-concurrency programming model within a process with nested schedulers. E.g., the outermost scheduler might schedule the earliest fiber tree first, and then within a fiber tree (which often represents one request) the scheduler is typically LIFO for efficiency. The idea is that when using this library, the process has more information about what its threads are doing, so it can make better scheduling decisions than the kernel and can do it more quickly. It doesn't replace the kernel scheduler exactly, just gives it less to do because there are fewer unblocked threads per process.

This ghOSt thing looks unrelated and system-wide. Like a more customizable version of the usual kernel scheduling. But I'm guessing based on one paragraph and will be interested to see the paper/presentation.

I'm not sure if there's any big gap between the two where neither does what you want.


What if it's a scheduler daemon running whenever-it-happens-to on one core, making policy decisions about when to preempt processes running on other cores (and which sleeping processes to schedule in their place onto those cores), and then queuing those decisions up into a "dumb" kernel-side scheduler using e.g. an io_uring?

Then the actual kernel-side scheduling process would just involve 1. the kernel popping things off the "to preempt" queue and setting interrupts on those cores; and 2. the preemption interrupt handler popping a process-block ref off the "next workload on this core" queue and switching to it. With no actual decision-making happening at preempt time.
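
A minimal sketch of the queue entries such a split might pass between the daemon and a dumb kernel-side executor; everything here (names, fields, the two queues) is invented to illustrate the idea, not an existing interface:

    #include <stdint.h>
    #include <sys/types.h>

    /* Pushed by the userspace scheduler daemon: "preempt whatever is
     * running on this core at the next opportunity". */
    struct preempt_req {
        uint32_t cpu;
    };

    /* Pushed by the daemon, popped by the preemption interrupt handler
     * on that core: "this is the task to switch to next". */
    struct next_task {
        uint32_t cpu;
        pid_t    task;   /* ref to the process/thread to run */
    };

    /* The kernel side does no policy work at preempt time: it pops a
     * preempt_req, interrupts that core, and the interrupt handler pops
     * the matching next_task and context-switches to it. */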

(This is irrelevant to the greater point, though, because eBPF isn't a userspace process in the first place.)


You also need to handle the synchronous IPC case, where a thread wakes up another thread and goes to sleep until some event happens (e.g. the other thread finishes running). You still need to intelligently choose what thread to wake up in that case, and ideally you don't have to wait for another core to make that decision.


That's true. But I guess it could be addressed by the IPC invoker placing the info on which process to wake into a BPF map before doing the blocking call, and the kernel code using that info to make the scheduling decision. That would however require an additional syscall per IPC. But maybe that information could be cached or derived from some other BPF-visible state, so that it gets cheaper?
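
As a sketch of what that hint might look like as an eBPF map (libbpf-style C; the map name, layout, and the idea that a scheduler-side BPF hook would consult it are my assumptions, not ghOSt's actual interface):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Assumed layout: caller thread id -> thread id the caller expects
     * to run next.  Userspace updates this before the blocking IPC call. */
    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 4096);
        __type(key, __u32);    /* caller tid */
        __type(value, __u32);  /* tid to wake/schedule next */
    } wake_hint SEC(".maps");

    /* A scheduler-side BPF hook (whatever attach point the BPF-scheduler
     * work ends up exposing) could then consult the hint like this: */
    static __always_inline __u32 next_for(__u32 caller_tid)
    {
        __u32 *next = bpf_map_lookup_elem(&wake_hint, &caller_tid);
        return next ? *next : 0;
    }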


You have to see it as a way to enable applications to inform and steer the kernel towards their scheduling policies. For instance, one of these policies could very well be the current Completely Fair Scheduler (CFS) Linux already uses[1]. And alternatively, you can have any other policy that reduces context switching, if this is your target.

[1] https://en.wikipedia.org/wiki/Completely_Fair_Scheduler


I think he's asking if implementing CFS in userspace, with no other changes, wouldn't in itself cause more context switches. I don't understand how your comment addresses that point. Sure you could implement something different, but you could do that without the userspace component too, and adding the userspace component is pretty expensive.


Nope, you want information to go from userspace into scheduling decisions. Where the code resides is an implementation detail. Having userspace tell the scheduler what it needs avoids the round trip if it applies to more than one scheduling decision.


BPF scripts run in kernel space.


I think not if you use eBPF.


You are correct that all attempts at bringing scheduling out of kernel space have failed; even the L4 folks found the performance impact intolerable. However, this uses eBPF to run logic in kernel space that is loaded from user space at runtime.

See https://ebpf.io/ for more info.
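
For a feel of the model, here's a minimal example (libbpf-style C, unrelated to ghOSt itself): the program text is compiled and loaded from userspace, but once verified it runs in kernel space on every scheduler context switch.

    // sched_switch_count.bpf.c -- counts scheduler context switches.
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } switch_count SEC(".maps");

    SEC("tracepoint/sched/sched_switch")
    int count_switches(void *ctx)
    {
        __u32 key = 0;
        __u64 *val = bpf_map_lookup_elem(&switch_count, &key);

        if (val)
            __sync_fetch_and_add(val, 1);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";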


Possibly more interesting (and a base for this work):

https://netdevconf.info/0x15/slides/25/ghOSt%20Talk%20(Netde...

"Fast & Flexible User-Space Delegation of Linux Scheduling"

from Google, July 2021


50% lower tail latency for web search is a hell of a result to be buried so deep in an obscure paper.


An updated version of the paper will be presented at SOSP'21: https://sosp2021.mpi-sws.org/accepted.html


Would be neat if there were materials. Code for this only started to trickle out earlier this year at https://github.com/google/ghost-userspace and https://github.com/google/ghost-kernel, but Google has been talking in public about userspace scheduling since 2013.



Is there a paper? I'm mostly interested in ghOSt itself. I already found the github projects [1] [2].

[1] https://giters.com/google/ghost-userspace

[2] https://github.com/google/ghost-kernel


So, is this finally an opening to plugging in some scheduling customizations, even to the degree of the -ck (Con Kolivas) kernels? Since he's recently retired his maintenance of those, it seems topical.


I’m curious how the ghOSt work presented here relates to the user managed concurrency groups (UMCG) work that also allows userspace to make scheduling decisions. Seems like both originate at Google.



