> If you can't on your particular Unix, I'd actually say that your Unix is probably not letting you get full use out of your RAM.
What use is "full use" if your system live locks!? This is similar logic to that of overcommit--more "efficient" use of RAM in most cases, with the tiny, insignificant cost of your processes randomly[1] being killed?
What happens in practice is that people resort to over provisioning RAM anyhow. But even then that's no guarantee that your processes won't be OOM-killed. We're seeing precisely that in a serious production environment--the OOM killer shooting down processes even though there's plenty (e.g. >10%, tens of gigabytes) of unallocated memory because it can't evict the buffer cache quickly enough--where quickly enough is defined by some magic heuristics deep in the eviction code.
[1] And before you say that it's not random but can be controlled... you really have no idea of the depth of the problem. Non-strict, overcommit-dependent logic is baked deep into the Linux kernel. There are plenty of ways to wedge the kernel where it will end up shooting down the processes that are supposed to be the most protected, sometimes shooting down much of the system. In many cases people simply reboot the server entirely rather than wait around to see what the wreckage looks like. This is 1990s Windows revisited--reboot and move along, unreliability is simply the nature of computers....
> This is 1990s Windows revisited--reboot and move along, unreliability is simply the nature of computers....
Ugh, so much this. Where I work (until the 16th) we've moved from an environment of stability, where problems are investigated and fixed and stay fixed, to our acquirer's environment where things break randomly at all times, and it's not worth investigating system problems because nothing will stay fixed anyway. #totallynotbitter
This is that part of being professional where you want to just keep working until your last day, or come in with your feet kicked up and do nothing with the "what are they going to do, fire me?" attitude. If there's severance, I'd go with the former. If everything is settled and it's just waiting out your time, try the latter?
Well, Linux isn't supposed to live lock under memory pressure. It's supposed to OOM kill the process that's causing the most problems. The bug (discussed in the mailing list thread the post links to) is that the Linux kernel isn't properly detecting live lock conditions and activating OOM-killer. The reason seems to be that for SSD users paging out file-backed memory happens quickly enough that the kernel still detects it as activity and doesn't know the system has locked up.
Arguably we currently have three options to prevent livelocking (short of fixing the kernel). All three have significant cons.
1. Have a big swap partition. Historically people have recommended not to have swap at all because (arguably) the kernel was too swappy and would swap out stuff that was needed. And also because some people prefer misbehaving processes to get OOM-killed instead of slowing the system to a crawl on old HDDs. I was in the latter camp, but I'm now experiencing this bug too (nothing gets OOM-killed, the system locks up instead) and considering reenabling swap.
2. Use the new memory pressure information the kernel provides with a userspace tool like earlyoom to kill misbehaving processes before the pressure is significant enough to slow the system. I tried this one out, but earlyoom repeatedly killed the X server under low memory conditions. On a desktop system, X is likely the parent process of everything you want to run, but this still might be a bug?
3. Disabling overcommit entirely. This supposedly breaks a bunch of userspace programs, and even assumptions made in the kernel itself. Worse still, it still results in processes probably killing themselves anyway when memory requests fail, but it doesn't have the advantage (on a correctly working Linux system) that the worst behaving process on the system gets killed instead of whoever happened to request too much memory.
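For option 2, the memory pressure information is exposed (on kernels built with PSI support) as a small text file, /proc/pressure/memory. Here is a minimal sketch of how a userspace tool might read it. The function names and the 10% threshold are assumptions for illustration, not what any particular daemon does.

```python
def parse_psi(text):
    """Parse the contents of a PSI file such as /proc/pressure/memory.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns e.g. {"some": {"avg10": 0.0, ...}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def under_pressure(text, full_avg10_limit=10.0):
    """True if all tasks were stalled on memory for more than the given
    percentage of the last 10 seconds. The limit is an assumed value."""
    psi = parse_psi(text)
    return psi.get("full", {}).get("avg10", 0.0) > full_avg10_limit


# Made-up sample in the format the kernel uses.
SAMPLE = """some avg10=35.20 avg60=12.10 avg300=3.00 total=123456789
full avg10=22.75 avg60=8.40 avg300=2.10 total=98765432
"""

if __name__ == "__main__":
    print(under_pressure(SAMPLE))
```

A real tool would read the file in a loop and still has to decide which process to kill, which is where the X server problem above comes from.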
Just do sysctl vm.panic_on_oom=1
Because what are you going to OOM-kill, some non-important process? Why would you run non-important processes in the first place :) And random OOM-killing is plain crazy. Just reboot and start anew.
> And random OOM-killing is plain crazy. Just reboot and start anew.
Are you arguing that killing 100% of the running programs and losing all the data is superior to killing less than 100% of the programs?
> Because what are you going to OOM-kill, some non-important process? Why would you run non-important processes in the first place.
Because I misclicked my mouse. It's very common to cause a disaster by mistake, for example by accidentally opening a 10 GiB file in a web browser or a text editor. In this case, the OOM algorithm will always kill Firefox without affecting others, and it's precisely what I need. Killing Firefox (due to a webpage with runaway JavaScript) while keeping my unsaved text editor running is desirable, even if it's not guaranteed, but statistically much better than killing the computer.
I often use Alt + Sysrq to manually activate the OOM killer, in case the kernel doesn't automatically detect memory exhaustion fast enough. And it works pretty nicely for me. I found that, in the worst case, nearly the entire desktop will be killed (this rarely occurs, often it simply kills the offending process), but I can still restart my desktop in a minute, instead of spending five minutes to reboot the hardware.
My hunch regarding overcommit is that Linux should sort out this situation, making disabled-overcommit a first-class scenario.
We (application developers) will follow and adjust our programs to correctly handle malloc() failure -- after all it's quite easy to fix that even in existing applications.
One thing that's needed is efficient ways to ask for, and release, memory from the OS. I feel like Linux isn't doing so well on that front.
For example, Haskell (GHC) switched from calling brk() to allocating 1 TB virtual memory upfront using mmap(), so that allocations can be done without system calls, and deallocations can be done cheaply (using madvise(MADV_FREE)). Of course this approach makes it impossible to observe memory pressure.
Many GNOME applications do similar large memory mappings, apparently for security reasons (Gigacage).
It seems to me that these programs have to be given some good, fast APIs so that we can go into a world without overcommit.
>We (application developers) will follow and adjust our programs to correctly handle malloc() failure -- after all it's quite easy to fix that even in existing applications.
Can you elaborate on why this is easy? It seems really difficult to me. Wouldn't you need to add several checks to even trivial code like `string greeting = "Hello "; greeting += name;`, because you need to allocate space for a string, allocate stack space to call a constructor, allocate stack space to call an append function, allocate space for the new string?
Even Erlang with its memory safety and its crash&restart philosophy kills the entire VM when running out of memory.
>Haskell (GHC) switched from calling brk() to allocating 1 TB virtual memory upfront
The choice of 1TB was a clever one. Noobs frequently confuse VM for RAM, so this improbably large value has probably prevented a lot of outraged posts about Haskell's memory usage.
For many low-level languages, it's simply a matter of finding all malloc()s, checking the return value, and failing as appropriate. That can mean "not accepting this TCP request", "not loading this file", "not opening the new tab".
Or in the worst case, terminating the program (as opposed to letting Linux thrash and freeze the computer for 20 minutes); but most programs have some "unit of work" that can be aborted instead of shutting down completely.
Adding those checks is some effort of plumbing, sure, but not terribly difficult work.
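To make the "unit of work" idea concrete, here is a rough sketch in Python, where a failed allocation surfaces as MemoryError rather than a NULL return. The `handle_request` and `failing_allocator` names are hypothetical, and with overcommit enabled the exception may never fire in practice, which is rather the point of this thread.

```python
def handle_request(payload_size, allocate=bytearray):
    """Handle one unit of work; reject it if memory cannot be obtained."""
    try:
        buffer = allocate(payload_size)
    except MemoryError:
        # Abort only this request; the rest of the program keeps running.
        return "503 out of memory, request rejected"
    # ... real work on `buffer` would go here ...
    return f"200 processed {len(buffer)} bytes"


def failing_allocator(size):
    """Hypothetical stand-in that simulates an allocation failure."""
    raise MemoryError
```

The allocator is passed in only so the failure path can be exercised in a test, which is the part that tends to get skipped.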
> Wouldn't you need to add several checks
In the case of C++, I'd say it's even easier, because a failed allocation through operator new throws std::bad_alloc (malloc() itself still just returns NULL), and you can handle it "further up" conveniently without having to manually propagate the failure up like in C.
> Even Erlang with its memory safety
Memory safety is quite different from malloc-safety (out-of-memory-safety) though. In fact I'd claim that the more memory-safe a language is (Haskell, Java, or Erlang as you say), the higher the chance that it doesn't offer methods to recover from allocation failure.
Theoretically, sure. If you've got your process trees setup properly, anytime a process ran into a failed allocation, you could just kill that process and free its memory. And if an ets table insertion fails allocation, kill the requestor and the owner of the table.
The problem is, I know for sure my supervision trees aren't proper; and I have doubts about the innermost workings of OTP --- did the things they reasonably expected to never fail get supervised properly? Will my code that expects things like pg2 to just be there work properly if it's restarting? How sad is mnesia going to be if it loses a table?
I'm much happier with too much memory used, shut it all down and start with a clean slate.
You can disable the oom killer if you think its heuristics are misfiring.
> OOM killer shooting down processes even though there's plenty (e.g. >10%, tens of gigabytes) of unallocated memory because it can't evict the buffer cache quickly enough
Is that due to dirty pages? Have you tried tweaking the dirty ratio to get dirty files flushed sooner? I believe there were also some improvements in the default behavior towards the end of the 4.x series that are supposed to result in more steady flushing.
> You can disable the oom killer if you think its heuristics are misfiring.
You can't completely disable the OOM killer. Some in-kernel allocations will trigger an OOM reap and kill regardless. Too many parts of Linux were written with the assumption of overcommit.
Heck, as this issue shows even when there's technically free memory the OOM killer can still kick in.
I had heard about these issues before but never saw that in practice--at least not often enough that I looked closely enough. I knew Google and Facebook and others have been relying on userspace OOM killers for years to improve reliability and consistency. But only after having recently joined a DevOps team at a large enterprise with large enterprise, big data customers did I begin to see the depth and breadth of the problem.
Before getting into the hell-hole of helpless despair that is DevOps I primarily did systems programming. And, for the record, I always made sure to handle allocation failure in those languages where I could (C, Lua, etc) because as a software engineer it was always clear to me that, much as with security, an over reliance on policy-based mechanisms invariably resulted in the worst possible QoS. (Of course, in Linux even disabling overcommit is no guarantee that page fault won't trigger an OOM kill, but there are other techniques to improve reliability under load.)
> Is that due to dirty pages?
Yes, people have tweaked and turned those knobs and reduced incidence rate by maybe 20-30%. Yet nobody is sleeping better at night.
Computing shouldn't be a black art. The irony is that overcommit and loose memory accounting more generally was originally intended to remove the necessity for professional system administrators to babysit a system to keep it humming along. But now we've come completely full circle. And it was entirely predictable.
And before anyone says that cloud computing means you increase reliability by scaling horizontally, well "cloud scale" systems is exactly what are being run. But unless you massively over provision, and especially when certain tasks, even when split up across multiple nodes, can take minutes or even hours, an OOM kill incident can reverberate and cascade. Again, ironic because the whole idea of loose memory accounting is supposedly to obviate the need to over provision.
> What use is "full use" if your system live locks!?
By "full use" it doesn't mean "get to the point of live-locking". It means if the heuristic is sufficiently conservative so as to prevent live-locking, it will also necessarily prevent usage patterns that wouldn't live-lock.
If you refuse to drive without wearing a seatbelt, you're not making full use of the vehicle. The conservative heuristic "always wear a seatbelt" will protect you in a crash, but it will also necessarily inconvenience you even in usage patterns where you wouldn't crash.
Safety and reliability improvements are very rarely free. It's my opinion, and I assume wahern's, that Linux should be reliable by default, even if this reduces efficiency in the non-failure case.
A seat-belt is a really bad comparison because there's almost zero cost to enabling it.
A better comparison is a car that will only go to 100 km/h on a 120 km/h road. It's safer, less accident prone and more energy efficient and you will never need to brake because of hitting the speed limit, but you're not making full use of the car.
> A seat-belt is a really bad comparison because there's almost zero cost to enabling it.
Watch some documentaries about the time wearing seatbelts became mandatory and listen to people bitch how inconvenient, uncomfortable and dangerous (what if the car catches fire and the seatbelt gets stuck???) they are...
People still drive without seatbelts today, and still run into their windshield during minor accidents and die. But when they do, they will quietly be considered “at fault for” or “complicit in” their own death. They chose the convenience of “no seat belt” (no swap) over the reduction to “risk of death” (oomkiller)
Their friends will still blame anything and anyone other than them for the circumstances leading up to their death, but in unguarded moments, they’ll say “I wish they’d worn a seatbelt”. They blame the person that died for failing to take an obvious step of self-preservation, even if they don’t realize it.
Swap paging files have not reached this level of awareness, but we’re certainly a lot closer than we used to be to understanding as a community why modern system design requires paging files to operate in a dynamic memory allocation, generic work task, variable usage and specification environment. If you don’t, you’ll end up crashing into the overcommit windshield someday.
Is overcommit good? No more or less good than cars that can go 150mph (as their speedometers universally claim). They’re both design choices that no one has seriously objected to, and that means that seatbelts and swap files are necessary.
Yes, because something like an engineering margin or extra capacity for unpredictable situations is a bad thing. /s
People do not learn. No safety margin means things will fail so you need to expect failure.
Even if you do expect failure, there is a chance of unexpected kind of failure. Failures tend to be much more costly than the extra capacity.
And that's in predictable systems, not something as unpredictable as say an ssh server.
Well, the issue really is that applications are not being designed correctly.
To create reliable applications you need to design them to work within limits of memory allocated to them. Unattended applications (like RDBMS or application container) should not allocate memory dynamically based on user input.
By definition, a reliable application will not fail because of external input. Allocating memory from the OS cannot be done reliably unless that memory was set aside somehow. If the memory was not set aside then we are in an overprovisioning situation, and this means we accept that a process wanting to allocate memory might fail to get it.
So the solution (the simplest but not the only one) is to allocate the memory ahead of time for the maximum load the app can experience and to make a limit on the load (number of concurrent connections, etc.) so that this limit is never exceeded.
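A minimal sketch of that approach: allocate every buffer once at startup and treat an empty pool as the load limit. `BufferPool` and `accept_connection` are hypothetical names, and a real server would size the pool from its configured maximum connection count.

```python
class BufferPool:
    """Fixed set of buffers allocated once at startup."""

    def __init__(self, count, size):
        # All memory the application will ever use for connections is
        # obtained here, before any load arrives.
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self):
        """Return a buffer, or None when the load limit has been reached."""
        return self._free.pop() if self._free else None

    def release(self, buf):
        """Hand a buffer back so a later connection can reuse it."""
        self._free.append(buf)


def accept_connection(pool):
    """Admit a connection only if a preallocated buffer is available."""
    buf = pool.acquire()
    if buf is None:
        return None  # shed load instead of allocating more memory
    return buf
```

Refusing the connection is a known, testable condition, unlike an allocation failure somewhere deep in the request path.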
Some systems do it halfway decently. For example Oracle will explicitly allocate all its memory spaces and then work within those.
The OS really is a big heap of compromises. It does not know anything about the processes it runs but it is somehow expected to do the right thing. We see it is not behaving gracefully when memory runs out but the truth is, the OS is built for the situation where it is being used mostly correctly. If memory is running out it means the user already made a mistake in designing their use of resources (or didn't think about it at all) and there is not really much the OS can do to help.
> Unattended applications (like RDBMS or application container) should not allocate memory dynamically based on user input.
This is incredibly easy to say, and incredibly hard to do for any application of even moderate complexity. Getting people to accurately tune thread counts or concurrency limits is painful when working with diverse workloads, and getting a guaranteed upper bound on memory usage would be a lot harder. It's much more common to apply higher level techniques such as concurrency limits and load shedding to avoid OOMs and other resource starvation issues.
The other thing we can do is reduce the impact of OOMs, and accept that they exist as a tail event. There is software running on your system that doesn't have hard memory limits, and you will always have to consider OOMs and other similar failure cases when ensuring your system is resilient. As long as we can prevent those tail events becoming a correlated failure across our fleet, we can get a pretty reliable system.
> This is incredibly easy to say, and incredibly hard to do for any application of even moderate complexity.
I agree. But this is what it takes to build really reliable applications.
For example, MISRA rules (C and C++ standards for automotive applications) forbid dynamic memory allocation completely after the application has started. They also, if I remember well, forbid anything that could make it impossible to statically calculate stack requirements, like recursion.
This is to make sure it is possible to calculate memory requirements statically.
Sure, but not all applications can reasonably be designed like that.
It makes little sense for Excel to pre-allocate 4GB of memory to account for the user wanting to create a gigantic analysis in their spreadsheet over the next weeks, if all that the user wanted was to make a small sum of 100 narrow rows.
There is no problem for Excel to allocate memory dynamically for rows and columns as the user edits the sheet as long as it will also handle an allocation failure gracefully. That generally means the user input events call functions that fail and become no-ops until memory can be allocated again.
That's how we used to write software until Linux ended up running everywhere, and that was the only way to write software for platforms without an MMU. As recently as a decade ago mobile phones, before Android, often had strict memory management. Each allocation could fail, and these failure paths had to be tested.
Explicitly committing after a fork() would just make fork() cost a bit more but it would work. The fork() call would fail if there's no memory to duplicate all anonymous pages for the child process. But once it would succeed the child process would know it won't cause an OOM merely by writing into a memory location.
That's fine if you actually want to fork, but it's pretty awful if your real goal is to immediately exec. What I would suggest in that case (with MMU) is a limit on how many pages the new process can dirty before it execs. Then you only need enough memory to provide that limit. And it shouldn't be too hard to statically prove that the limit is sufficient. (The parent process would be suspended in the meantime so it can't cause any pages to be duplicated.)
The fork+exec is sort of a special, common case already. There have been attempts to explicitly implement that because even the current fork() has turned out to be too slow or too convoluted to implement, as witnessed by things like vfork() and posix_spawn(). We could just have fork() and fork_and_exec() syscalls separately.
The other common case is to actually fork() a child process. Even then it would suffice to just reserve enough physical pages to make sure the child won't OOM-by-write, but only copy pages when they're actually being written to so you wouldn't be copying a gigabyte of memory even if you never used most of it.
The kernel would allow applications to allocate at most physical mem size + swap size - any memory reserved for kernel. Thus you could still use MMU to push the least used pages into swap, and you could run more or larger programs that would fit in your RAM, but when memory is low these programs would always fail at the point of allocation, not memory access.
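The accounting described here can be sketched as a toy model. `CommitAccountant` is a hypothetical name and this is not how any real kernel implements it; it only shows where the failure would surface: at reserve() or fork() time, never at first write.

```python
class CommitAccountant:
    """Toy model of strict commit accounting (no overcommit)."""

    def __init__(self, ram_pages, swap_pages, kernel_reserved_pages):
        # Upper bound on what applications may commit in total.
        self.limit = ram_pages + swap_pages - kernel_reserved_pages
        self.committed = 0

    def reserve(self, pages):
        """Commit pages up front; fail here rather than on a later write."""
        if self.committed + pages > self.limit:
            return False
        self.committed += pages
        return True

    def release(self, pages):
        """Return previously committed pages to the pool."""
        self.committed -= pages

    def fork(self, parent_anon_pages):
        """fork() reserves a full copy of the parent's anonymous pages,
        even though the actual copying can still happen lazily."""
        return self.reserve(parent_anon_pages)
```

With 1000 pages of RAM, 500 of swap and 100 reserved, a process holding 800 pages cannot fork, which is the cost the fork+exec special case above is trying to avoid.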
Interesting. What counts as dynamic memory allocation in MISRA? Can you ship your own malloc serving memory from a global static char buffer? What about statically sized free lists? Or stdio functions (fopen/printf/etc) that use malloc internally?
And this is not enough, because there are heuristic optimistic allocations and you will get OOM killed under pressure. On Linux, Windows and FreeBSD alike. Probably also OS X.
(The user is at fault but they do not know why.)
vm.overcommit_memory should have never been a default... It's hard to change it back.
On top of this is the potential for so-called swap death by thrashing.
I worked briefly on the firmware for a range of enterprise storage products that had the same sort of constraints. To ensure total knowledge of memory use by the system at all times, dynamic allocation was not allowed. All buffers, queues, everything that could be wanted were allocated at the beginning of time. There was no possibility of running out of memory. There was the possibility of turning down an IO request if the system got flooded, but that wasn't going to be because a malloc failed, it was going to be because a known condition had been reached and no more requests could be accepted until the system cleared itself down a little.
Computer resources were managed similarly - the most important and most CPU-heavy resources would be tied to a core and given exclusive use of that core.
The only caveat here is that doing this all takes engineering effort and fairly deep knowledge of your OS. If you don't have mission-critical applications like this to write it's probably overkill.
(One kinda-cool side effect of the memory all being pre-allocated was that you could just browse the memory space of the process, everything was at a predictable address even without debug symbols.)
My understanding is that this requires "arcane knowledge" only because most frameworks and developers simply don't care. All new languages and frameworks are built to supposedly make things easy and fun, and counting your objects or calculating your buffers isn't exactly fun.
Making things dynamic and hiding underlying memory layout is seen as a tool to write software faster but it disconnects developers from understanding what actually happens and makes it even more difficult to write reliable software.
If writing software the correct way was prevalent this would not be arcane knowledge but common sense.
There is a possibility that you could write software at a high level and have the compiler and tool chain compute all the static allocation at build time. It could even save the plan as an output to be reused. You could run an optimiser to optimise cache and memory usage.
If a human can do it, then an advanced tool chain could do it.
The problem with that is that you're pushing the resource management problem onto the administrator (who now has to statically partition their system's resources) and the database (which has to reimplement a lot of the resource management that the OS would do, like I/O management and I/O caching).
There are many reasons why this is not a good trade-off for the vast majority of use cases. Unless you are running something like a traditional Oracle deployment where you're just giving over the entire host to the database anyway and the database is enough of a cash cow that they can afford to re-implement much of the functionality of an operating system.
It generally makes a lot more sense to let the OS manage the buffer pool using the available free memory on the system (this is what Postgres does, for example).
It also makes more sense for the vast majority of deployments to allow process memory consumption to be dynamic and deploy the appropriate limits and monitoring to keep things healthy - e.g. by terminating malfunctioning processes and reconfiguring and migrating applications as needed.
This doesn't meet some traditional ideals about how programs should be written, but actually seems to result in better systems in practice.
I also have the heretical belief that for the vast majority of applications, the best way to handle malloc() returning NULL is to abort the process.
The main reason is that the error handling code has a high probability of being buggy because allocating memory is pervasive in most code and it's unlikely that you're going to have tests that provide adequate coverage of all the interesting code paths. The consequences of incorrect behaviour are often worse than the consequences of a process crash.
The downside to this is that if your application ever needs less than the configured amount of memory then that memory cannot be used by other applications. This is a big reason why java is such a huge memory hog even if the application itself isn't demanding nearly as much memory.
I believe that Linux has all the tools to solve this in practice:
1. When you allocate memory, but do not write into it yet, those pages are not mapped in RAM and thus don't occupy actual space.
2. When you're done with a specific page of memory, you can madvise(MADV_FREE) it, which means that the kernel can discard these pages from RAM and use it for caches and buffers. But you still hold on to the virtual memory allocation, so you can just start writing into the page again when you need more memory and the kernel will map it again.
If I understand all that correctly, you can have your allocator work in such a way that it keeps a large reserve of preallocated memory pages, but the corresponding amount of RAM can be used for caches and buffers when it doesn't need everything. An interesting question would be how that scenario appears in ps(1) and top(1), i.e. whether those MADV_FREE'd pages would count towards RSS.
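A small sketch of that pattern using Python's mmap module (3.8 or later, on a Unix-like system). The function name is made up, and the code only shows that the mapping stays usable after MADV_FREE; it does not answer the RSS question.

```python
import mmap

PAGE = mmap.PAGESIZE
SIZE = 16 * PAGE


def reuse_after_madv_free():
    """Write to an anonymous mapping, hand the pages back with MADV_FREE,
    then write again. Returns (advice_applied, byte_read_back)."""
    # MADV_FREE applies to private anonymous memory, so request MAP_PRIVATE
    # explicitly (Python's default for mmap is a shared mapping).
    region = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        region[:] = b"\x01" * SIZE  # touch every page so it is backed by RAM
        applied = False
        if hasattr(mmap, "MADV_FREE"):
            try:
                # Tell the kernel it may discard these pages when it wants.
                region.madvise(mmap.MADV_FREE, 0, SIZE)
                applied = True
            except OSError:
                pass  # the running kernel does not support the advice
        region[0] = 0x2A  # reuse the same virtual address, no new mmap()
        return applied, region[0]
    finally:
        region.close()
```

An allocator built this way keeps its virtual reservation and only tells the kernel which parts it currently does not care about.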
I'd like to add that RHEL5 was more than tolerable in this aspect. Whatever kernel series it lived through, it seemed logical. We knew when RHEL5 systems started to swap that was our sign to A) look for long term usage growth that is beginning to bump some threshold somewhere B) someone turned a knob they shouldn't have C) a recent code change has caused a problem in the application stack
Then RHEL6 came along, and it swapped all the time. Gone was our warning. The stats shows tens of gigabytes of cache engaged. WUT? How do we have tens of gigabytes of memory doing nothing? Before you could finish that thought, OOM killer was killing off programs due to memory pressure. WTF? The system is swallowing RAM to cache I/O, but couldn't spare a drop for programs? ...I could go on, but simply put, RHEL6 was garbage. And really I mean the RHEL6 kernels and the way that Red Hat tweaked whatever they did for it.
RHEL7 was a little better, but still seeing echoes of the ugliness of RHEL6. RHEL5 was just a faded pleasant dream.
The last 3 Fedoras on the other hand, the memory management seems like we're finally digging ourselves out of the nightmare. That nightmare lasted almost a full decade.... sheesh
What am I missing here? Why can't, as others have mentioned in other HN comments elsewhere, the OOM killer just get invoked when there's less than X amount of RAM left, and kill the highest-offending process? In my case, I would prefer that to anything else. Why does this page or that page matter?
There's no reason the OOM killer can't be made more aggressive, and there are user-space implementations of that behavior. I use the "Early OOM Daemon"[0], which is packaged in Debian. I had problems with my system locking up under memory pressure before, but so far earlyoom has always managed to kill processes early enough to prevent this.
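The core check in such a daemon is simple. This is a rough sketch of the idea rather than earlyoom's actual code: parse /proc/meminfo and compare MemAvailable against a threshold. The function names and the 10% default are assumptions for illustration.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_is_low(text, min_available_percent=10.0):
    """True when MemAvailable falls below the given share of MemTotal."""
    info = parse_meminfo(text)
    available = 100.0 * info["MemAvailable"] / info["MemTotal"]
    return available < min_available_percent


# Made-up sample in the /proc/meminfo format: 5% of memory available.
SAMPLE = """MemTotal:       16000000 kB
MemFree:          200000 kB
MemAvailable:     800000 kB
"""
```

The hard part is not this check but choosing a victim once it trips.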
Because simplistic policies like this can never satisfy everyone.
On a server the process using the most memory might be the mail server so killing it is a bad idea... unless it normally never uses more than 2GB and for some reason it’s eating 14GB at the time.
You have to define quality of service, set relative priorities within those categories, be able to correlate memory usage with load to detect abnormal spikes vs normal spikes, etc. For example a daemon’s memory usage is probably correlated with its open socket count.
That in turn relies on a system much smarter than init scripts that understands the system is under memory load so re-starting a daemon that was just killed might not be appropriate (which gets into policy questions too).
For interactive use things are even more complicated; you need a mechanism for the active application being used by the user to donate its ultra-high priority to any daemons or child processes it is talking to. If image_convert is running because I initiated it as the user then it should get more memory (and CPU/IO). If it is running because my desktop window manager is refreshing its icon cache then it should get its priority smashed hard - if the system is under pressure or burning battery it should even be killed and prevented from restarting temporarily. Who is going to do all the work in the kernel, userspace libraries, then get all the apps to properly tag their threads and requests with the correct priority?
tl;dr: setting policy is hard and anyway it becomes the herding cats problem.
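To illustrate the mail server example with made-up numbers: ranking by absolute usage and ranking by deviation from normal usage pick different victims. Both functions are hypothetical and each is just one of many possible policies, which is exactly the problem.

```python
def biggest_process(processes):
    """Simplistic policy: kill whatever uses the most memory right now."""
    return max(processes, key=lambda p: p[1])[0]


def most_abnormal_process(processes):
    """Alternative policy: kill whatever is furthest above its usual size."""
    return max(processes, key=lambda p: p[1] / p[2])[0]


# (name, current usage in GB, typical usage in GB); illustrative values only.
PROCESSES = [
    ("database", 20.0, 18.0),
    ("mail", 14.0, 2.0),
]
```

The second policy needs a baseline per process, and someone has to collect and maintain that.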
I think that all processes are needed on the typical server, monitoring included. If one often runs out of RAM, buy more physical or virtual RAM. Furthermore we probably have more than one server if we really care about the service they offer. One server crashes, another one spins up, the others cope with the load meanwhile.
The option to kill processes is more interesting on personal machines, where sometimes adding RAM is not even an option. My anecdotal experience with a laptop without swap since 2014 (16 GB, then 32) is that I got near the limit a couple of times. I wouldn't mind if Linux killed any of the browsers, emacs (which is tiny nowadays), Thunderbird. They can recover. Maybe kill even databases. Leave me XOrg, the desktop environment and a terminal to investigate the problem.
You do understand that no amount of RAM will fix a buggy or leaky application? That buying more ram has weeks of latency or is altogether not possible by hardware or politics?
If you know you want to keep bare Xorg alive with no window manager, taskbar or desktop, there's oom_adj magic knob, accessible by systemd.
And memory cgroups.
But on a shell server, you actually want to kill that Xorg but preserve background tasks.
I understand that an out of control application can't be made to behave by any amount of RAM. But it's out of control, so if it dies it's (kind of) OK. I saw memory leaks in production and the usual workaround is rebooting the application or the OS once per day. Sometimes more than that. And then warming up the caches etc.
On a shell server there is no Xorg running, not even on machines with a partial or complete X11 installation because (for example) they need fonts to generate PDFs. I just checked a few of them.
On a desktop/laptop Xorg is the last thing I want to hang or be killed because it makes it harder to check what went wrong. I'll look into oom_adj, thanks.
I ran a script to display the oom_score_adj of the processes running on my laptop. It's 0 for all of them except the Blink-based browsers (including Slack). For those processes it's either 200 or 300. It means that they'll be the first ones to go. Apparently their developers played nice with the system.
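The script itself isn't shown above, but something along these lines would do it on Linux. This is a sketch with a hypothetical function name; `proc_root` is a parameter only so it can be pointed at a test directory.

```python
import os


def oom_score_adjustments(proc_root="/proc"):
    """Map pid -> (command name, oom_score_adj) for every readable process."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # process exited meanwhile or is not readable
        result[int(entry)] = (name, adj)
    return result
```

Filtering the result for non-zero values shows which processes have volunteered to go first.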
Because you're not truly out of memory at that point. There are still pages that can be evicted.
To invoke the OOM killer sooner you have to use heuristics to determine when a request could be satisfied in theory but not practically due to page thrashing.
The oom killer only kicks in sometimes, e.g. when programs make truly egregious allocation requests.
The benefit of having swap is that it turns things into a soft degradation since it's much easier for the system to start with swapping out rarely used pages. The gradual loss of performance makes it easier for the human to intervene compared to the cliff you encounter when it starts dropping shared code pages.
> The oom killer only kicks in sometimes, e.g. when programs make truly egregious allocation requests.
This is a myth. Allocation failures happen at least as much on small allocations as on big ones. In fact, I see OOMs every day and the vast majority of the time the trigger was a small allocation. For example, the kernel trying and failing to allocate a socket buffer.
And that's really the root of the issue. You have a giant application with a 200GB committed working memory set doing important, critical work; and it gets shot down because some other process just tried to initiate an HTTP request. It's a ludicrous situation. And people defending Linux here by saying the same problem exists everywhere else are wishful apologists--the situation is absolutely not the same everywhere else.
Even setting aside the issue of strict memory accounting--which, BTW, both Windows and Solaris are perfectly capable of doing, and do by default--Linux could still do dramatically better. Clearly there's some level of unreliability people are willing to put up with for the benefits of efficiency, but Linux blew past that equilibrium long ago.
E.g. == for example, other cases are permitted.
What I am saying is that it only kicks in under some circumstances, not necessarily when one wants it to.
> Because you're not truly out of memory at that point. There are still pages that can be evicted.
No, you are truly out of RAM at that point - the amount of RAM all processes need exceeds the amount of memory the system has, and the user has indicated that no swapping should be done.
Now, if the kernel truly wanted to handle this case gracefully by swapping disk-backed files, I think it should also tell the process scheduler about this, and enter a special mode where only processes whose code is currently resident would be allowed to run, until they hit a hard wait. This might prevent the thrashing behavior in many cases (assuming processes don't interact too much with memory-mapped files). Otherwise, every context switch initiated by the scheduler is likely to cause another page fault event.
> the amount of RAM all processes need exceeds the amount of memory the system has
The word need does the heavy lifting here. Strictly speaking it does not. The CPU only needs a few data and instruction pages at a time to execute code. The former may, the latter frequently will be backed by memory-mapped files.
If your program contains huge unicode lookup tables, megabytes of startup code and so on then it is perfectly reasonable to discard those pages. It would be wasteful to keep those resident, especially on memory-constrained devices. Not having swap is not the same as not wanting paging to happen.
What the human wants is totally different from what the kernel needs to keep chugging along (at glacial pace). Bridging the gap is what is discussed further downstream in the mailing list linked by the article, and it's only possible based on previous work (PSI) that was added fairly recently.
Why is the kernel deciding if the memory is still needed or not? The kernel can only know how it is used, not why it is used. It can guess the purpose with heuristics, but only the applications know exactly why the memory is used. The correct solution is to disable overcommit, which will force programmers to make use of that superior knowledge to free memory when they no longer need it.
We're in this crazy situation because somebody decided to sacrifice reliability to increase efficiency. Efficiency is easier to measure so they got away with it. Fixing the live-lock problem makes things slightly better, but it's not solving the underlying problem. Andries Brouwer's classic comment remains as relevant as ever:
Why can't the process itself decide how to handle what should be a null pointer return from malloc or mmap et al?
In fact, the Firefox and Chrome example is the bane of my existence. The recovery feature itself takes a good amount of space. However efficient it may be, it's no use for private sessions. I'm aware of tab suspender add-ons but never tried them. You can't tell me to clean up the ship, so instead you burn it down and let it sink in the ocean? Is it my messy browsing (400 tabs can grow to 8 GB quickly, and it should always remain snappy too, no eager unloading) or is that just symptomatic of a let-it-crash Darwinism? It is indeed just as impossible for me to memorize all the pages, 90% Wikipedia articles and related content, nothing important I'm sure. But it is utterly ridiculous that a GIF which, rendered, would take hundreds of megabytes is a surefire way to crash a session that had been skimming the edge of 200 MB of free resident memory for weeks. I had resorted to polling 'free' frequently for display in a taskbar plugin. But sometimes overnight something would hog memory and not free it: either a system cron job, an aberrant JavaScript (though NoScript was rarely "temporarily" disabled) or a memory leak (same thing?).
Of course a browser with leaks could not be expected to handle oom gracefully.
> Why can't the process itself decide how to handle what should be a null pointer return from malloc or mmap et al?
At the point where the kernel knows that it should be returning a null pointer, you're probably already operating in a severely degraded state. You will be scanning the page table constantly, looking for pages to reclaim, wasting tons of CPU. You will have reclaimed all of the reclaimable pages you can reclaim, so io utilisation will be through the roof - calling a random function will cause a disk read to page the executable into memory. Your system will not be functional. And that's ignoring the reality that modern MMUs don't actually let the kernel know when all the memory is in use.
If you want to handle memory pressure better for a given application, shove the application in a cgroup and use a user space oom killer. It's not possible for a program to react gracefully to a system OOM.
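As a rough sketch of the check a user space OOM killer might run against such a cgroup: the code below assumes cgroup v2, where a cgroup directory contains `memory.current` (bytes in use) and `memory.max` (a byte limit or the literal string `max`). The function names and the 0.9 threshold are arbitrary choices for illustration, not anything a particular tool uses.

```python
import os


def cgroup_memory_usage(cgroup_dir):
    """Return (current_bytes, limit_bytes or None) for a cgroup v2 directory."""
    with open(os.path.join(cgroup_dir, "memory.current")) as f:
        current = int(f.read().strip())
    with open(os.path.join(cgroup_dir, "memory.max")) as f:
        raw = f.read().strip()
    # cgroup v2 writes the literal string "max" when no limit is configured.
    limit = None if raw == "max" else int(raw)
    return current, limit


def over_threshold(cgroup_dir, fraction=0.9):
    """True when usage exceeds `fraction` of the cgroup's memory limit."""
    current, limit = cgroup_memory_usage(cgroup_dir)
    if limit is None:
        return False  # unlimited cgroup: nothing to compare against
    return current > fraction * limit
```

A daemon could poll `over_threshold()` for the application's cgroup and pick a process to signal before the kernel's own OOM killer gets involved; which process to pick is a policy decision left out here.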
> At the point where the kernel knows that it should be returning a null pointer, you're probably already operating in a severely degraded state.
That's only currently the case. It doesn't need to be. As per your example, you could just as easily provide an upper bound to the page table scan, and return NULL when enough pages can't be found in 100us.
> It's not possible for a program to react gracefully to a system OOM.
A program should never receive a system OOM; the fact that Linux's memory accounting is so bad that the system itself can run OOM is the problem. It is perfectly possible for a system kernel to execute in bounded memory, and never let programs claim parts of that. Linux just isn't designed that way.
You could def provide an upper bound, though I imagine that'd just cause more people to be annoyed on this thread, complaining that linux doesn't give them memory when there's tons available. And to be honest, if you want this behavior you can basically get it already; check your mem.pressure and react to that. Linux isn't going to give you a null pointer, but you can def react to memory pressure.
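For anyone wanting to try this, a small sketch of reacting to memory pressure follows. It assumes "mem.pressure" refers to the PSI memory file (`/proc/pressure/memory`, or `memory.pressure` inside a cgroup v2 directory), which contains `some` and `full` lines with `avg10`, `avg60`, `avg300` and `total` fields. The function names and the threshold of 10 percent are illustrative assumptions.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        # Each field looks like "avg10=0.00" or "total=12345".
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def under_pressure(text, threshold=10.0):
    """True when tasks were stalled on memory for at least `threshold`
    percent of the last 10 seconds (the "some" avg10 figure)."""
    psi = parse_psi(text)
    return psi.get("some", {}).get("avg10", 0.0) >= threshold
```

A program could read the file periodically, pass its contents to `under_pressure()`, and drop caches or refuse new work when it returns true.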
> It is perfectly possible for a system kernel to execute in bounded memory
Is it possible while maintaining anything like the feature set that linux provides currently?
The only other option is failing mmap and malloc as well as disabling overcommit. And even that doesn't prevent death by swapping.
There's no real API to inform applications about low memory conditions.
Although I am not a fan of Apple, their Memory Pressure messages on iOS are really useful to prevent this. You get warned as a programmer to clean up your memory if possible, otherwise iOS shuts you down.
Also AIX has SIGDANGER (best signal name ever). There is also "oomd", I think like the iOS thing, for Linux, in user space. Not sure how useful any of these things really are but wondering if FreeBSD should get one...
Yes, but as a result, memory-intensive applications (usually games) crash all the time, especially on older devices, which the App Store won't let you filter out.
What are you going to do anyway, just start deallocating the memory your application needs to actually work, so that you can display an "I'm sorry, we're out of memory" message? Might as well crash then.
The amount of memory available also varies wildly from user to user and you can't put a memory requirement on an application.
So what do people do? Disregard the issue. Even the best games on the App Store have tons of one star reviews by people with older devices.
The flip side is that applications that are prone to crash are pressured to optimize for relatively quick and painless restart.
I routinely see systems these days with no swap or very little. It is like swap has become a faux pas. That is especially strange given the SSDs available on the machines. Gradual degradation of service vs. the service suddenly disappearing and/or stalling, or heavy overprovisioning, and still ...
What also comes to mind, while not generic swap, is a kind of edge case, a modern version of swap, i.e. extending the virtual memory space onto flash storage: Facebook's replacement of some RAM with NVM https://research.fb.com/wp-content/uploads/2018/03/reducing-...
When installing a new OS recently, I struggled to find a definitive view on what to do with swap. There's no data about it, just superstition, and the thread about this yesterday was full of people saying nobody uses it.
Maybe I can invoke Cunningham's law? I use 16GB on an NVMe SSD, which is equal to the size of my RAM.
I've been told that 90s-era manuals for SAP systems recommended swap size = 2x RAM size. There's a story circulating in the office that someone remembered this rule in 2012-ish when setting up a bunch of fresh systems with 2 TB RAM each, causing the storage support team to wonder why the fresh filer was already full after just a few days.
And there is Cassandra DB, which advises deactivating swap. I am wondering what happens on servers running several Java services with Cassandra and without swap: once memory pressure gets high, even if only due to system caches, will it stall for an eternity as well?
At some amount of active memory use Linux will grind to a halt. This happens with or without swap. It's a problem of how it handles low memory situations.
But if you have swap, this point comes much later. Often many gigabytes of unimportant data can be moved out of memory. If you can swap out 6 gigabytes without causing problems, and you only needed an extra 4 gigabytes of ram, then swap saves your day with almost no downside.
Your system would grind to a halt anyway; swap is simply giving the kernel more options to deal with the issue. Without swap the kernel has no options when it comes to unreclaimable pages.
I didn't enable swap for the first 6 months, every time I ran out of memory the system froze. Now I use swap and only the offending applications do actually freeze. There is a small problem though, one application in particular permanently captures the damn mouse in fullscreen mode so if it freezes I can't interact with the rest of the system with the mouse but this is purely a bug in the application, not a problem with the operating system.
Well, it's more of a design problem of X11. Under Wayland, the compositor could offer a Secure Attention Key like Ctrl-Alt-Del that reclaims input focus from a fullscreen application. As far as I'm aware, such facilities do not exist in X11. When an application grabs the keyboard, it's grabbing the entire keyboard.
Yes, the problem is that X11's login/lock screen is secured by grabbing the keyboard and mouse; taking away input focus from the login screen will in many cases allow you to bypass it entirely.
Such facilities actually do exist in Xorg, they are just disabled by default. But you can configure the server with a shortcut key to release input grab.
I would venture a guess that you probably saw something like, for example, a large JVM being partially swapped to HDD. In a situation like that it isn't swap per se which is at fault; it is the fact that the JVM's memory access pattern, randomly jumping all over, is extremely unsuitable for HDD-based swap.
Actually, there is a critically bad interaction between the way that garbage collection algorithms work and the LRU heuristic for swap. Tracing garbage collection algorithms work by doing a graph traversal that traces all the live pointers starting from the "root set" (ie, pointers in registers and on the stack) to find all the live objects.
Now, note that (a) the root set are the hottest objects in memory, and (b) they are the ones that gc algorithms touch first. As a result, when you are doing a gc, the LRU (least-recently-used) heuristic for managing swap ends up putting all the hottest data in the program onto disk. This is pretty much the pessimal thing for the kernel to do, and can tank performance by triple-digit factors.
However, this is entirely fixable!
The gc and the kernel need to cooperate to figure out what is okay to push to disk or not. Matthew Hertz, Yi Feng and Emery Berger have a really nice PLDI 2005 paper, Garbage Collection without Paging, which shows how you can make gc performance hundreds of times better with just a little kernel assistance.
However, because kernel developers do not program in garbage collected languages (they program in C, because that's what the kernel is written in), they don't understand garbage collection and are hostile to kernel changes to support it. Emery Berger told me they tried and failed to get the Linux kernel to support their changes, and people who worked on .NET told me they had similar experiences with the Windows kernel team.
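The bad interaction between tracing and LRU described above can be illustrated with a toy simulation. This is a deliberately simplified model, not how any real collector or kernel works: it assumes one object per page, a breadth-first mark phase that touches each object exactly once, and a pure LRU eviction policy. The names `mark_order` and `lru_victims` are made up for the sketch.

```python
from collections import deque


def mark_order(graph, roots):
    """Breadth-first trace from the root set.

    `graph` maps each object to the objects it points to. Returns the
    live objects in the order the mark phase touched them.
    """
    order = list(dict.fromkeys(roots))  # de-duplicate, keep root order
    visited = set(order)
    queue = deque(order)
    while queue:
        obj = queue.popleft()
        for child in graph.get(obj, ()):
            if child not in visited:
                visited.add(child)
                order.append(child)
                queue.append(child)
    return order


def lru_victims(touch_order, n):
    """Under pure LRU the least recently touched pages go first, which
    after a mark phase means the first `n` objects the trace visited."""
    return touch_order[:n]
```

With a graph such as `{"r1": ["a", "b"], "r2": ["c"], "a": ["d"]}` and roots `["r1", "r2"]`, the first two eviction candidates after a collection are the roots themselves, which is exactly the hot data the program will need as soon as it resumes.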
I feel that the problem is ill-posed: "At what amount of free RAM should we start freeing RAM by killing processes?"
Maybe one should solve the problem: "At what eviction-induced latency should the OOM killer be invoked." Thanks to the infrastructure around latencytop, the latency might be available already.
Of course, the never-ending dilemma of what process to kill is still there.
Preferably none. Saner APIs like Android will call low memory signals and start swapping out applications both ahead of time and as needed.
It also has a relatively smart memory manager to detect unexpected vs expected spikes as well as slow leaks.
POSIX has no such API. It was designed in a simple time.
Can you point to documentation on how Android deals with low-memory situations? My understanding is that:
(A) They moved the problem to user-space.
(B) Apps are supposed to constantly save the state they care about and expect to be killed whenever Android sees fit (or even at every app switch for developers).
The particular issue of locking up a system because there are too many tabs open in Firefox or Chrome should be fixed in the browser, because the browser is the program that is taking up all of the memory and is in the best position to know how to recover a lot of memory quickly while minimizing the loss of work for the user. One of the comments suggested having the kernel send a 'memory pressure' signal to all processes, and a process could catch the signal and do garbage collection or drop other storage. But even without such a feature, a program could do polling to determine that there is memory pressure, or do a check when a user asks to open a new tab, etc.
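A sketch of the polling idea, for instance as a check made when the user asks to open a new tab: it assumes a Linux `/proc/meminfo` with a `MemAvailable` field reported in kB. The function names and the 512 MiB floor are hypothetical and not taken from any browser.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])  # values are in kB where a unit is given
    return info


def should_refuse_new_tab(meminfo_text, min_available_kib=512 * 1024):
    """True when MemAvailable is below the chosen floor."""
    available = parse_meminfo(meminfo_text).get("MemAvailable")
    if available is None:
        return False  # field missing (very old kernel): do not block the user
    return available < min_available_kib
```

The browser would read `/proc/meminfo`, pass the text in, and on a true result unload background tabs or warn the user instead of allocating more.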
My experience for a long time has been that Linux is only usable with earlyoom. Running without it is a guaranteed way to end up having to power-cycle my system, often multiple times per day.
I managed to achieve this once, fresh from the world of Windows - missed breaking out of a loop while exploring Python(?) on Ubuntu 10 or so. Had to power cycle my damn machine.
That's because the whole Android API and ecosystem was designed to make it possible.
You cannot get away with that with POSIX applications. They have zero or bad session support. There's no global always-available database to help you (Android has 4), nor an event scheduling bus. (Dbus is a joke, it does not allow sending an event to a non-existing endpoint to be delivered later.)
Every application using the standard mechanisms will get relaunched and activated as needed if it got shut down. It will receive a Bundle with state it managed to save and with original launch Intent. It can access an sqlite database to be made available via a restartable ContentProvider. SharedPreferences are also stored. Etc.
In Linux world, there are no such de facto standards and what is most common is utterly broken.
Same in Windows (only registry is persistent, it's not meant for data storage) and in OS X.