It always seemed odd to me that page reclaim was completely ignorant of CPU/IO priority of the accessing processes.
If I run a backup job with a low priority, especially before cgroups existed to let me scope the task's memory use, there's nothing preventing it from displacing everything else from the page cache as it churns through an entire filesystem, even if everything else in the page cache was put there by equal or higher priority processes.
It seems like priority should at least hint reclaim to some degree, rather than reclaim considering only age and activity levels... like the page struct should record the highest priority process accessing its contents, and not let lower priority processes evict that page during reclaim unless pressure is high enough.
> like the page struct should record the highest priority process accessing its contents
I'm not aware of any architecture where the MMU knows anything about task scheduling priority when marking pages as dirty or accessed, so there's a hardware limitation there. A close approximation might be to look at the task's scheduling priority (at the kernel level, threads and processes are both tasks) when faulting in a page, use it to set a score on the page, and decrement the score whenever the eviction algorithm finds the page untouched. I guess it would then evict the first page found with a score of zero, or, after examining N pages, evict the lowest-scored page found. (One could also use weighted reservoir sampling to get closer to fairness, so high priority tasks don't always starve low priority tasks.)
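A minimal sketch of that scoring scheme, with invented names and an assumed nice-to-score mapping (none of this reflects how the Linux reclaim path is actually structured):

    #include <stdbool.h>
    #include <stddef.h>

    struct page_meta {
        unsigned int score;  /* seeded from the faulting task's priority */
    };

    /* On fault: seed the score from the faulting task's nice value,
     * mapping nice -20..19 onto a score of 39..0 (an assumed scale). */
    static void seed_score(struct page_meta *pg, int nice)
    {
        pg->score = (unsigned int)(19 - nice);
    }

    /* On a reclaim scan: decay untouched pages. Evict the first page
     * whose score has reached zero, or, after examining scan_limit
     * pages, the lowest-scored untouched page seen. */
    static struct page_meta *pick_victim(struct page_meta **pages, size_t npages,
                                         size_t scan_limit,
                                         bool (*referenced)(struct page_meta *))
    {
        struct page_meta *lowest = NULL;
        for (size_t i = 0; i < npages && i < scan_limit; i++) {
            struct page_meta *pg = pages[i];
            if (referenced(pg))
                continue;              /* recently touched, leave it alone */
            if (pg->score == 0)
                return pg;             /* fully aged out, evict immediately */
            pg->score--;               /* decay once per untouched scan */
            if (!lowest || pg->score < lowest->score)
                lowest = pg;
        }
        return lowest;                 /* may be NULL if everything was touched */
    }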
In the case of multiple generations of LRUs, presumably one could use the I/O priority of the faulting task to determine the initial generation of a newly faulted-in page. This would be a pretty minimal change (and might already be implemented, with that detail left out of the LWN article) and wouldn't require any extra state in the page struct.
Though, maybe you get close enough to the same effect simply by taking scheduling priority into account when faulting pages in.
Some university OS courses must have some rough kernel simulators where these sorts of policy changes can be simulated, right? Or is it easier to actually just modify a Linux or *BSD kernel and run some benchmarks?
I'm x86-centric, but the MMU doesn't make decisions about which pages to evict; it just handles mapping virtual addresses to physical addresses based on the page table, and marks pages in the page table as dirty/accessed.
The OS would need to manage priority information to evict lower priority pages. That may be part of what's done with posix_fadvise information, but I haven't dug deeply into that.
With this multi-generational LRU, you could probably add a memory priority to a process, and use that to decide which generation to put pages in when a process needs them. If a low (memory) priority process's pages always go into the oldest generation, they would be considered for reclaim much more often than those of a high priority process, whose pages get put in the youngest generation.
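A rough sketch of that placement rule, assuming an invented per-process priority field and scale (the real multi-generational LRU code works differently):

    #define MAX_NR_GENS 4  /* generation 0 = oldest, MAX_NR_GENS - 1 = youngest */

    struct task_mem {
        int mem_prio;      /* hypothetical per-process memory priority, 0..100 */
    };

    /* Pick the initial generation for a newly faulted-in page: pages of
     * low-priority processes start in the oldest generation (considered
     * for reclaim soonest), high-priority ones start in the youngest. */
    static int initial_generation(const struct task_mem *t)
    {
        return (t->mem_prio * (MAX_NR_GENS - 1)) / 100;
    }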
I know the MMU doesn't make eviction decisions, but it does mark the "dirty" and "accessed" bits for pages.
The GP was suggesting keeping track of the scheduling priorities of tasks that were "accessing its [the page's] contents", not just keeping track of the highest priority task mapping a given page. This would require the MMU, as it's setting the dirty and accessed bits, to also keep track of the maximum priority that has set either the dirty or accessed bit.
Edit: from the GP's reply, it sounds like by "accessing" they really just meant mapping, not accessing in the sense of the accessed and dirty bits.
The dirty and accessed bits are per page table structure, so they are per-process. So you can at least take process priority into account, if not thread. In theory you could take thread priority into account by observing and resetting the dirty/accessed bits every context switch, but this would be unlikely to be worth it.
> In theory you could take thread priority into account by observing and resetting the dirty/accessed bits every context switch, but this would be unlikely to be worth it.
That would most likely be terrible (as you suggest). You might have better luck giving each thread a copy of the process page table and keeping them in sync somehow (which also sounds terrible, but might be more workable).
I assumed it'd be maintained when the process mapped pages into its address space, not at page fault time. Then if the process's priority increased, I guess it would have to walk the mapped pages to bump their priorities too, for completeness.
Not that I spent much time thinking about the implementation details or seriously exploring how to go about making it happen. Having to touch all mapped pages when a process changes its priority sounds painful, so it probably requires more sophistication to make things optimal.
Ahh, by "accessing its[the page's] contents", I thought you meant keeping track of the priorities of the tasks actually making the read and write operations that cause the MMU to set the dirty and accessed bits.
It sounds like by "accessed" you weren't referring to the page access bit set by the MMU, but rather just to having the page mapped in the address space.
The backup process should be using drop-behind via fadvise(POSIX_FADV_DONTNEED); that way it wouldn't clobber the page cache. borg backup uses that, for example.
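A minimal sketch of the drop-behind pattern in C (error handling trimmed; the chunk size and cumulative-drop strategy are just one way to do it):

    #include <fcntl.h>
    #include <unistd.h>

    /* Read a file sequentially, telling the kernel after each chunk that
     * we won't need the pages read so far, so they can be dropped from
     * the page cache instead of displacing hotter data. */
    static int scan_file(const char *path)
    {
        char buf[1 << 16];
        off_t done = 0;
        ssize_t n;

        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        while ((n = read(fd, buf, sizeof buf)) > 0) {
            /* ... hash or copy buf for the backup ... */
            done += n;
            /* Drop everything read so far; only whole, clean pages in
             * the range are actually evicted. */
            posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
        }
        close(fd);
        return n < 0 ? -1 : 0;
    }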
That would require maintaining a priority for every page in the cache, and even per process: if a process using a cached page exits, the priority should be adjusted. Too complicated. The simpler, better option is for the backup process to say "do not cache this". That option is already implemented in the kernel and available to userspace.
Hmm yes, unwinding the priorities when processes exit is complicated. But I wonder if simply leaving the priority at its highest level reached would be harmless; it really depends on how much influence the heuristic has.
It doesn't strike me as particularly wrong for a page to still be classified as elevated priority when the higher priority process that classified it as such no longer exists. At least not for file-backed pages, which are really what I'm interested in, in this context. Chances are good the file gets accessed again in a similar fashion, so the priority stickiness could help keep it around.
The "do not cache this" approach is quite heavy-handed and can be problematic if you want some caching for things like producing deltas, or in the case of a restore from backup reassembling from myriad delta files, without blowing out the page cache.
As corbet mentioned, the cgroups approach does provide a solution to constrain page cache effects.
That's explained in the second paragraph of the article:
> Chances are that pages that have been accessed in the recent past will be useful again in the future, but there are exceptions. Consider, for example, an application that is reading sequentially through a file. Each page of the file will be put into the page cache as it is read, but the application will never need it again; in this case, recent access is not a sign that the page will be used again soon.
In a similar vein, I once found a sharp decrease in cache misses when I started proactively evicting outbound messages from cache after sending, in a system passing messages between CPUs in circular buffers. (The reasoning generalizes, but I would be skeptical that the results do without measuring.)
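An x86 sketch of that flush-after-send idea, using clflush to evict a just-sent message's cache lines; names and buffer layout are invented, and whether this helps depends entirely on the workload, so measure first:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64

    /* After the message has been published to the ring buffer, evict its
     * cache lines so they stop competing with hotter data. clflush
     * invalidates the line across the coherence domain, so this only
     * pays off if neither side will touch the message again soon. */
    static void flush_sent_message(const void *msg, size_t len)
    {
        const char *p = (const char *)((uintptr_t)msg & ~(uintptr_t)(CACHE_LINE - 1));
        const char *end = (const char *)msg + len;

        _mm_mfence();                  /* order prior stores before the flushes */
        for (; p < end; p += CACHE_LINE)
            _mm_clflush(p);
    }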