Out-of-memory victim selection with BPF (lwn.net)
83 points by mfrw on Aug 21, 2023 | 37 comments



Chrome does a good job of killing tabs when OOM, but Firefox seems to just hang the whole system. Is there a way to make FF more murderous?

edit: I found about:unloads which shows tabs memory usage and unload order


> edit: I found about:unloads which shows tabs memory usage and unload order

I've just taken a look at that and it's weird. It says all my tabs have been last accessed "right now". If I click refresh, the last access time updates to that moment. This is Firefox 116.0.3 on Linux.

I wonder if this could be related to my using a single tab per window (I have a window manager whose job is to... manage windows).


Definitely related. Any tab that’s currently visible has to remain loaded, and windows are considered visible even if they’re covered by other windows.


> windows are considered visible even if they’re covered by other windows

This is interesting. I usually use firefox with i3 in tabbed mode, so only one window visible at a time. On some sites, when I switch to another tab that hides the previous one, any sound playing would stop. But if I display the two windows side by side, switching to the other window wouldn't stop the sound in the initial window. So I'm guessing that window visibility must be somehow communicated to the browser, if websites can act on it.

Edit: this also happens with chromium in the same configuration: I have Teams running in PWA mode on the same desktop as Outlook, also in PWA mode, both tiled by the WM. Teams will get suspended after a while if it's not the selected tab, but it'll stay active if it's on top.


Is there a good reason why the app would not track window visibility? I have only used the X API for this, but I would have assumed all GUIs have such an API.


It may not be able to if the window manager doesn't tell the window it's hidden. Modern WMs that support compositing and window thumbnails tend not to do that anymore. That usually also has the nice side effect of hiding from web pages/apps whether the user is idle or not.


In my specific case, i3 must tell it, since some sites know to stop playing sound when the window is hidden, but keep playing when I switch to a separate window that doesn't obscure the noisy one.


100 MB of memory per learn.microsoft.com tab, for what feels like entirely static (a good thing) HTML.


What are situations where you might want to use this? It is easy enough to set oom_score_adj from regular userspace code, the only time you would benefit from computing it dynamically in the moment is if that score changes rapidly. Does that happen?
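Setting it statically really is just a write to /proc/<pid>/oom_score_adj; a minimal sketch (error handling kept short):

    /* Bias the OOM killer against (or away from) a process by writing
     * its /proc/<pid>/oom_score_adj: +1000 means "kill me first",
     * -1000 means "never kill" (negative values need CAP_SYS_RESOURCE). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int set_oom_score_adj(pid_t pid, int adj)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d", adj);
        return fclose(f);
    }

    int main(void)
    {
        /* Make the current process a preferred OOM victim. */
        return set_oom_score_adj(getpid(), 500) == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
    }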


One situation might be that you roughly know how much memory your process should be consuming and want to kill it if it's overconsuming; something akin to, but more dynamic than, memory limits.
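The static version of that is a cgroup v2 hard limit. A rough sketch; the cgroup name "myservice" is made up, and this assumes cgroup v2 mounted at /sys/fs/cgroup:

    /* Cap a (hypothetical) "myservice" cgroup at 512 MiB; past that,
     * the kernel reclaims and eventually OOM-kills within the group. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/myservice/memory.max", "w");
        if (!f) {
            perror("memory.max");
            return 1;
        }
        fprintf(f, "%llu", 512ULL * 1024 * 1024);
        return fclose(f) ? 1 : 0;
    }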


That's probably something you would want to do as soon as the process is outside its limits, without waiting for an OOM situation.


That could end up with significant false-positives when updating the thing without updating the estimated memory usage (especially if slow bounded fragmentation over a long period of time applies). You might also have some expected ratio of memory usage (e.g. process B uses 3x the memory of process A), but want to allow the absolute usage to grow as more data is processed.


Having the wrong limits will cause the wrong thing to die, regardless of whether they kick in during OOM or orchestration. Better to find out during your planned rollout window if you ask me.


Right, but you may not catch it before the OOM situation arises I guess?


I always used `ulimit -v` and `ulimit -H -v` for this purpose. Maybe this is more fine grained?
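For comparison, `ulimit -v` is just a shell front end for setrlimit(RLIMIT_AS). A minimal sketch of doing the same from a program:

    /* Cap this process's virtual address space, like `ulimit -v`,
     * so oversized allocations fail with ENOMEM instead of dragging
     * the whole system into an OOM situation. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl = {
            .rlim_cur = 1UL << 30,   /* soft limit: 1 GiB (ulimit -v) */
            .rlim_max = 2UL << 30,   /* hard limit: 2 GiB (ulimit -H -v) */
        };
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* From here on, allocations beyond the limit return NULL/ENOMEM. */
        return 0;
    }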


One situation that I think is seen more often is where a child process or thread is consuming way too much memory (large dataset, memory leak etc.) and it would be beneficial to just kill that particular process/thread and not the parent.


That has been the default behavior of the OOM killer since the start.


In the Kubernetes world, you may want to kill the pods and keep all the tasks from the control plane (or whatever it is called).


Context: the patch author is from ByteDance, the Chinese company behind TikTok.

I can imagine this being useful in server farms.

If you can read Chinese, see also: https://twitter.com/Manjusaka_Lee/status/1692084961330028802...


I can't, but tesseract+deepl does wonders.

Text from the image, cut to abbreviate:

> 1. What is ByteDance's current usage scenario for such a feature? The only one I can think of is keeping critical processes like kubelet from being OOM-killed, to ensure better stability. I don't know whether my understanding is correct.

Yes. On the one hand, we want to make sure that important daemons like kubelet don't get killed by mistake. On the other hand, big companies nowadays tend to co-locate multiple types of tasks to improve resource utilization, and when OOM hits we want to make the decision based on quality-of-service requirements rather than on memory consumption alone. In the past, Alibaba and Google achieved this by letting users set a memcg priority, but BPF may be more flexible and applicable to more scenarios.

> Another question I have: if we introduce a BPF program into the OOM eviction path (in effect introducing the runtime unpredictability that comes with the additional programmability), couldn't that cause a system-stability mechanism like the OOM killer to stop working as expected, leading to further destabilization of the system?

Yes. I haven't done the exact math on the extra overhead, but I'm optimistic; we can already see BPF being used even for kernel tuning work. On the other hand, a user can easily write a "bad" or even "malicious" BPF program, which is really a matter of whether we trust the user completely. At the beginning of the implementation I also added a fallback on the kernel side to make sure that at least one victim could always be selected, but Michal thought this was probably unnecessary (https://lore.kernel.org/lkml/ZMzhDFhvol2VQBE4@dhcp22.suse.cz/).

Tweet text as seen on Nitter:

@flaneur2023 (Aug 17): A patch from ByteDance that attempts to make the kernel's OOM eviction behavior programmable via BPF. I find this interesting.

NadeshikoManju @Manjusaka_Lee (Aug 17), replying to @flaneur2023: This makes sense; being able to add "if X, don't kill this" logic at OOM time would work pretty well. Further discussion between me and the author: [image] (Aug 17, 2023, 8:04 AM UTC)

@ayanamist (Aug 17), replying to @Manjusaka_Lee @flaneur2023: It should mainly be a co-location ("mixed deployment") requirement; adjusting OOM scores per process is a pain as well.


That explanation doesn't really scan. It is the individual control groups that OOM. This new logic is for selecting a victim task in that control group. Processes in other pods, even in the same k8s namespace, would not be considered (or even visible, depending on the container runtime).


Isn't the control plane and everything else running on physically separate nodes for that reason already? (By default.)


oom_score_adj is really good at that already.


It is hilarious how leaking JS in some Firefox window kills the IDE first, then random processes inside containers, and finally freezes the desktop. I wonder if the OOM developers are sponsored by memory manufacturers, because 16 GB is not enough to open and keep the sites you want in the background.


I wonder if they'll ever make it possible to ask the user what they want to do...


Doesn't that kind of assume that

a) there is only one user

b) they are logged in

c) they will be able to respond in a timely enough manner to allow other programs to allocate memory shortly

edit: d) there will be enough memory to start a program that is responsible for asking the user, and if it needs to connect to a display server then the display server will be able to allocate memory for surfaces the program can draw to in order to display the request to the user.


Yes. You can make it work on desktop. Not really an option on the server of course.

I was indirectly pointing out how Linux on the Desktop is never really going to happen for normal people while the kernel devs ignore obvious desktop requirements like this.

Windows solved this problem over 20 years ago.


How has Windows solved the problem? If anything it's more irritating on Windows: it refuses to overcommit, so in a lot of setups, if you don't have a humongous swapfile, you will get OOM errors on random processes before you've even filled up physical RAM.


You press ctrl-alt-del and it brings up a privileged interface that lets you manually choose programs to kill. I don't know the details but I guess it's high priority or pauses other programs or something, but it fairly reliably works even if the rest of the system is unresponsive.

Lack of overcommit probably helps a bit, but it's a bit of a red herring. The real thing that Linux lacks is a GUI integrated into the kernel. There's literally no way for Linux to do anything like this.

(It might not actually require a kernel GUI but it definitely requires special treatment of the GUI by the kernel and it doesn't do that.)


Windows doesn't have kernel-mode GUI. To the extent that task manager (and the ctrl-alt-del screen) is prioritized it's prioritized in user mode. And it's far from reliable in my experience: if the rest of the system is bogging down then task manager also tends to suffer, and I don't think I've ever seen it work in an OOM situation.

Windows does have an early-oom-like solution where it pops up a warning if you are getting close to running out of memory, with an option to kill the highest-memory application, and it will also automatically expand the swap file if you have allowed it to (a great option if you don't like free disk space), but if both of those fail it will crash hard.


Well, however it does it, it works pretty well.

This post suggests it has at least some fairly tight integration with the kernel (and that it does work quite reliably):

https://www.reddit.com/r/explainlikeimfive/comments/1pbctv/e...


Disk space is comparatively cheap. While it's a silly requirement, it's more usable than most Linux distros that burn every byte of free memory on caching and then turn the OOM killer loose to wreak havoc rather than release some cache.


Linux will always try to release cache before it invokes the OOM killer. This is part of why OOM behavior on Linux can be so pathological: the 'cached' data includes the code in executable files on disk, so evicting it from cache in close-to-OOM situations will often completely bog down the system with disk thrashing before the OOM killer gets invoked (this is why earlyoom and the like are a thing).
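That's basically the gap earlyoom-style daemons fill: poll MemAvailable and act before the kernel is forced to evict hot executable pages. A simplified sketch of the idea (the 200 MiB threshold is arbitrary, and real earlyoom does quite a bit more):

    /* Poll MemAvailable from /proc/meminfo and react (here, just warn)
     * before the kernel starts dropping hot page cache and thrashing. */
    #include <stdio.h>
    #include <unistd.h>

    static long mem_available_kib(void)
    {
        char line[256];
        long kib = -1;
        FILE *f = fopen("/proc/meminfo", "r");
        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "MemAvailable: %ld kB", &kib) == 1)
                break;
        fclose(f);
        return kib;
    }

    int main(void)
    {
        for (;;) {
            long avail = mem_available_kib();
            if (avail >= 0 && avail < 200 * 1024)   /* arbitrary 200 MiB floor */
                fprintf(stderr, "low memory: %ld KiB available\n", avail);
            sleep(1);
        }
    }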


a) and b) are not really a problem (you don't need solutions to be universal to be useful), c) is solvable (e.g. pause any program trying to access unavailable memory), and d) is trivial: implement it as a daemon that has pre-allocated (and locked) sufficient memory.


PSI and cgroups should make that possible for typical situations.
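For example, memory pressure is exposed under /proc/pressure/memory on kernels built with PSI. A minimal sketch that just dumps it (a real daemon would register a PSI trigger and poll() it instead of parsing averages):

    /* Read the PSI memory pressure file. Output is typically two lines,
     * "some avg10=... avg60=... avg300=... total=..." and "full avg10=...";
     * high "full" values mean the whole system is stalling on memory. */
    #include <stdio.h>

    int main(void)
    {
        char buf[512];
        size_t n;
        FILE *f = fopen("/proc/pressure/memory", "r");
        if (!f) {
            perror("/proc/pressure/memory");
            return 1;
        }
        n = fread(buf, 1, sizeof(buf) - 1, f);
        buf[n] = '\0';
        fclose(f);
        fputs(buf, stdout);
        return 0;
    }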


Indeed, found something in this vein https://github.com/hakavlad/nohang/issues/58


BPF?

Berkeley Packet Filter?

14 years on here and I'm kinda alarmed by HN lately; it's gone from people reading 20% of the time to 1%. The article is inscrutable, and literally the entire set of 30 comments is:

- Firefox memory usage.

- why decide dynamically?

- why not just ask the user?



