Out-of-memory victim selection with BPF (lwn.net)
83 points by mfrw on Aug 21, 2023 | 37 comments



Chrome does a good job of killing tabs when OOM, but Firefox seems to just hang the whole system. Is there a way to make FF more murderous?

edit: I found about:unloads which shows tabs memory usage and unload order


> edit: I found about:unloads which shows tabs memory usage and unload order

I've just taken a look at that and it's weird. It says all my tabs have been last accessed "right now". If I click refresh, the last access time updates to that moment. This is Firefox 116.0.3 on Linux.

I wonder if this could be related to my using a single tab per window (I have a window manager whose job is to... manage windows).


Definitely related. Any tab that’s currently visible has to remain loaded, and windows are considered visible even if they’re covered by other windows.


> windows are considered visible even if they’re covered by other windows

This is interesting. I usually use firefox with i3 in tabbed mode, so only one window visible at a time. On some sites, when I switch to another tab that hides the previous one, any sound playing would stop. But if I display the two windows side by side, switching to the other window wouldn't stop the sound in the initial window. So I'm guessing that window visibility must be somehow communicated to the browser, if websites can act on it.

Edit: this also happens with chromium in the same configuration: I have Teams running in PWA mode on the same desktop as Outlook, also in PWA mode, both tiled by the WM. Teams will get suspended after a while if it's not the selected tab, but it'll stay active if it's on top.


Is there a good reason why the app would not track window visibility? I have only used the X API for this, but I would have assumed all GUIs have such an API.


It may not be able to if the window manager doesn't tell the window it's hidden. Modern WMs that support compositing and window thumbnails tend not to do that anymore. That usually also has the nice side effect of hiding from web pages/apps whether the user is idle or not.


In my specific case, i3 must tell it, since some sites know to stop playing sound when the window is hidden, but keep playing when I switch to a separate window that doesn't obscure the noisy one.


100 MB of memory per learn.microsoft.com tab, for what feels like entirely static (a good thing) HTML.


What are situations where you might want to use this? It is easy enough to set oom_score_adj from regular userspace code, the only time you would benefit from computing it dynamically in the moment is if that score changes rapidly. Does that happen?
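Setting it statically really is just a write to /proc/<pid>/oom_score_adj; a minimal sketch (error handling kept short):

    /* Bias the OOM killer against (or away from) a process by writing
     * its /proc/<pid>/oom_score_adj: +1000 means "kill me first",
     * -1000 means "never kill" (negative values need CAP_SYS_RESOURCE). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int set_oom_score_adj(pid_t pid, int adj)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d", adj);
        return fclose(f);
    }

    int main(void)
    {
        /* Make the current process a preferred OOM victim. */
        return set_oom_score_adj(getpid(), 500) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
    }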


One situation might be that you roughly know how much memory your process should be consuming and want to kill it if it's overconsuming; something akin to, but more dynamic than, memory limits.
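The static version of that is a cgroup v2 hard limit. A rough sketch; the cgroup name "myservice" is made up, and this assumes cgroup v2 mounted at /sys/fs/cgroup:

    /* Cap a (hypothetical) "myservice" cgroup at 512 MiB; past that,
     * the kernel reclaims and eventually OOM-kills within the group. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/myservice/memory.max", "w");
        if (!f) {
            perror("memory.max");
            return 1;
        }
        fprintf(f, "%llu", 512ULL * 1024 * 1024);
        return fclose(f) ? 1 : 0;
    }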


That's probably something you would want to do as soon as the process is outside its limits, without waiting for an OOM situation.


That could end up with significant false-positives when updating the thing without updating the estimated memory usage (especially if slow bounded fragmentation over a long period of time applies). You might also have some expected ratio of memory usage (e.g. process B uses 3x the memory of process A), but want to allow the absolute usage to grow as more data is processed.


Having the wrong limits will cause the wrong thing to die, regardless of whether they kick in during OOM or orchestration. Better to find out during your planned rollout window if you ask me.


Right, but you may not catch it before the OOM situation arises I guess?


I always used `ulimit -v` and `ulimit -H -v` for this purpose. Maybe this is more fine grained?
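For comparison, `ulimit -v` is just a shell front end for setrlimit(RLIMIT_AS). A minimal sketch of doing the same from a program:

    /* Cap this process's virtual address space, like `ulimit -v`,
     * so oversized allocations fail with ENOMEM instead of dragging
     * the whole system into an OOM situation. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl = {
            .rlim_cur = 1UL << 30,   /* soft limit: 1 GiB (ulimit -v) */
            .rlim_max = 2UL << 30,   /* hard limit: 2 GiB (ulimit -H -v) */
        };
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* From here on, allocations beyond the limit return NULL/ENOMEM. */
        return 0;
    }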


One situation that I think is seen more often is where a child process or thread is consuming way too much memory (large dataset, memory leak etc.) and it would be beneficial to just kill that particular process/thread and not the parent.


That has been the default behavior of the OOM killer since the start.


In the Kubernetes world, you may want to kill the pods and keep all the tasks from the control plane (or whatever it is called).


Context: the patch author is from ByteDance, the Chinese company behind TikTok.

I can imagine this being useful in server farms.

If you can read Chinese, see also: https://twitter.com/Manjusaka_Lee/status/1692084961330028802...


I can't, but tesseract+deepl does wonders.

Text from the image, cut to abbreviate:

> 1. What is ByteDance's current usage scenario for such a feature? The only one I can think of is keeping critical processes like kubelet from being OOM-killed, to ensure better stability. I don't know whether my understanding is correct.

Yes. On the one hand, we want to make sure that important daemons like kubelet don't get killed by mistake. On the other hand, big companies nowadays tend to co-locate multiple types of tasks to improve resource utilization, and when OOM hits we want to make the decision based on quality-of-service requirements rather than on memory consumption alone. In the past, Alibaba and Google achieved this by letting users set a memcg priority, but BPF may be more flexible and applicable to more scenarios.

> Another question I have: if we introduce a BPF program into the OOM eviction path (in effect introducing the runtime unpredictability that comes with the additional programmability), couldn't that cause a system-stability mechanism like the OOM killer to stop working as expected, leading to further destabilization of the system?

Yes. I haven't done the exact math on the extra overhead, but I'm optimistic; we can already see BPF being used even for kernel tuning work. On the other hand, a user can easily write a "bad" or even "malicious" BPF program, which is really a matter of whether we trust the user completely. At the beginning of the implementation I also added a fallback on the kernel side to make sure that at least one victim could always be selected, but Michal thought this was probably unnecessary (https://lore.kernel.org/lkml/ZMzhDFhvol2VQBE4@dhcp22.suse.cz/).

Tweet text as seen on Nitter:

@flaneur2023 (Aug 17): A patch from ByteDance that attempts to make the kernel's OOM eviction behavior programmable via BPF. I find this interesting.

NadeshikoManju @Manjusaka_Lee (Aug 17), replying to @flaneur2023: This makes sense; being able to add "if X, don't kill this" logic at OOM time would work pretty well. Further discussion between me and the author: [image] (Aug 17, 2023, 8:04 AM UTC)

@ayanamist (Aug 17), replying to @Manjusaka_Lee @flaneur2023: It should mainly be a co-location ("mixed deployment") requirement; adjusting OOM scores per process is a pain as well.


That explanation doesn't really scan. It is the individual control groups that OOM. This new logic is for selecting a victim task in that control group. Processes in other pods, even in the same k8s namespace, would not be considered (or even visible, depending on the container runtime).


Isn't the control plane and everything else running on physically separate nodes for that reason already? (By default.)


oom_score_adj is really good at that already.


It is hilarious how leaking JS in some Firefox window kills the IDE first, then random processes inside containers, and finally freezes the desktop. I wonder if the OOM developers are sponsored by memory manufacturers, because 16 GB is not enough to open and keep the sites you want in the background.


I wonder if they'll ever make it possible to ask the user what they want to do...


Doesn't that kind of assume that

a) there is only one user

b) they are logged in

c) they will be able to respond in a timely enough manner to allow other programs to allocate memory shortly

edit: d) there will be enough memory to start a program that is responsible for asking the user, and if it needs to connect to a display server then the display server will be able to allocate memory for surfaces the program can draw to in order to display the request to the user.


Yes. You can make it work on desktop. Not really an option on the server of course.

I was indirectly pointing out how Linux on the Desktop is never really going to happen for normal people while the kernel devs ignore obvious desktop requirements like this.

Windows solved this problem over 20 years ago.


How has Windows solved the problem? If anything it's more irritating on Windows: it refuses to overcommit, so in a lot of setups, if you don't have a humongous swapfile, you will get OOM errors on random processes before you've even filled up physical RAM.


You press ctrl-alt-del and it brings up a privileged interface that lets you manually choose programs to kill. I don't know the details but I guess it's high priority or pauses other programs or something, but it fairly reliably works even if the rest of the system is unresponsive.

Lack of overcommit probably helps a bit, but it's a bit of a red herring. The real thing that Linux lacks is a GUI integrated into the kernel. There's literally no way for Linux to do anything like this.

(It might not actually require a kernel GUI but it definitely requires special treatment of the GUI by the kernel and it doesn't do that.)


Windows doesn't have kernel-mode GUI. To the extent that task manager (and the ctrl-alt-del screen) is prioritized it's prioritized in user mode. And it's far from reliable in my experience: if the rest of the system is bogging down then task manager also tends to suffer, and I don't think I've ever seen it work in an OOM situation.

Windows does have an early-oom-like solution where it pops up a warning if you are getting close to running out of memory, with an option to kill the highest-memory application, and it will also automatically expand the swap file if you have allowed it to (a great option if you don't like free disk space), but if both of those fail it will crash hard.


Well, however it does it, it works pretty well.

This post suggests it has at least some fairly tight integration with the kernel (and that it does work quite reliably):

https://www.reddit.com/r/explainlikeimfive/comments/1pbctv/e...


Disk space is comparatively cheap. While it's a silly requirement, it's more usable than most Linux distros that burn every byte of free memory on caching and then turn the OOM killer loose to wreak havoc rather than release some cache.


Linux will always try to release cache before it invokes the OOM killer. This is part of why OOM behavior on Linux can be so pathological: the 'cached' data includes the code in executable files on disk, so evicting it from cache in close-to-OOM situations will often completely bog down the system with disk thrashing before the OOM killer gets invoked (this is why earlyoom and the like are a thing).
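That's basically the gap earlyoom-style daemons fill: poll MemAvailable and act before the kernel is forced to evict hot executable pages. A simplified sketch of the idea (the 200 MiB threshold is arbitrary, and real earlyoom does quite a bit more):

    /* Poll MemAvailable from /proc/meminfo and react (here, just warn)
     * before the kernel starts dropping hot page cache and thrashing. */
    #include <stdio.h>
    #include <unistd.h>

    static long mem_available_kib(void)
    {
        char line[256];
        long kib = -1;
        FILE *f = fopen("/proc/meminfo", "r");
        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "MemAvailable: %ld kB", &kib) == 1)
                break;
        fclose(f);
        return kib;
    }

    int main(void)
    {
        for (;;) {
            long avail = mem_available_kib();
            if (avail >= 0 && avail < 200 * 1024)   /* arbitrary 200 MiB floor */
                fprintf(stderr, "low memory: %ld KiB available\n", avail);
            sleep(1);
        }
    }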


a) and b) are not really a problem (you don't need solutions to be universal to be useful), c) is solvable (e.g. pause any program trying to access unavailable memory), and d) is trivial: implement it as a daemon that has pre-allocated (and locked) sufficient memory.


PSI and cgroups should make that possible for typical situations.
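For example, memory pressure is exposed under /proc/pressure/memory on kernels built with PSI. A minimal sketch that just dumps it (a real daemon would register a PSI trigger and poll() it instead of parsing averages):

    /* Read the PSI memory pressure file. Output is typically two lines,
     * "some avg10=... avg60=... avg300=... total=..." and "full avg10=...";
     * high "full" values mean the whole system is stalling on memory. */
    #include <stdio.h>

    int main(void)
    {
        char buf[512];
        size_t n;
        FILE *f = fopen("/proc/pressure/memory", "r");
        if (!f) {
            perror("/proc/pressure/memory");
            return 1;
        }
        n = fread(buf, 1, sizeof(buf) - 1, f);
        buf[n] = '\0';
        fclose(f);
        fputs(buf, stdout);
        return 0;
    }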


Indeed, found something in this vein https://github.com/hakavlad/nohang/issues/58


BPF?

Berkeley Packet Filter?

14 years on here and I'm kinda alarmed by HN lately; it's gone from people reading 20% of the time to 1%. The article is inscrutable, and literally the entire set of 30 comments is:

- Firefox memory usage.

- why decide dynamically?

- why not just ask the user?



