Linux with “memory folios”: a 7% performance boost when compiling the kernel (kernel.org)
238 points by marcodiego on June 15, 2021 | 158 comments



Now consider that there's work going on to enable Linux to be compiled with profile-guided optimization with Clang[0], the DAMON patchset that enables proactive reclamation with "32% memory saving with only 1.91% runtime overhead"[1], and the performance improvements achievable with the futex2 system call[2].

Linux's future seems bright with regard to performance.

[0] https://lkml.org/lkml/2021/1/11/98

[1] https://lore.kernel.org/lkml/20210608115254.11930-1-sj38.par...

[2] https://lkml.org/lkml/2021/4/27/1208


Linux needs a standard set of benchmarks that are vaguely representative of things users use Linux for. Phone, laptop and a few server use cases.

It then needs someone to do a big parameter tuning to select optimal settings.

Too many decent algorithms don't make it into the kernel because there are too many tunables, and the ones that do typically aren't well tuned for anyone's use case.

Even big projects like Ubuntu typically don't change many tunables in the kernel.


> Linux needs a standard set of benchmarks that are vaguely representative of things users use Linux for.

Have you seen the Phoronix test framework + benchmarks? They use common tools, but have done some good work on making the tests repeatable and accessible to anyone else.

https://github.com/phoronix-test-suite/phoronix-test-suite/


Linaro is also doing a lot in that space (at least for ARM): https://www.linaro.org/os-build-and-test/


Just a minor nitpick: Phoronix is one guy.


Correct, but I like British grammar, where companies are plural :-)


The test suite is open source, I'm sure Michael would appreciate some help :)


they can be singular


[flagged]


Only if "They" is their given name and not a pronoun. If used as a pronoun, you can say "they are tall" in the singular:

https://apastyle.apa.org/style-grammar-guidelines/grammar/si...).


Seems like a violation of the strict aliasing rule.


How so?


Because you use a "multiple" of pointers to refer to a single entity. (It's a C joke.)


The page you link to presents very ideological content - not philological content, as one might expect.

Not everyone will agree with those guidelines; they are far from universally accepted.


Think about it this way: we're playing Guess Who, and you haven't asked about gender yet. You ask, "Do they have dark hair?". It's obviously a question about a singular person, and that's the correct way to structure that question in English.


Certainly. But I referred to the presented guidelines, collectively - not to "singular they". I wrote and intended that _those linked guidelines_ are "not universal".


Singular they in English usage dates back 600-700 years. People can disagree with those guidelines all they want - they will have to deal with the fact that people choose to use singular they and they can get worked up over it or get used to it.


Do you feel the same way about generic he? It too has a long history.


Yes, generic he can be used in the singular as well.


It is clearly valid. I personally don't care if others use "he" as a generic, but seeing that some people don't feel included when it is used, I increasingly use "they". Not least when I don't know someone's preference, because I consider the most inclusive option to be the most polite.


I don’t feel included by they.


You objectively are, and given the choice of whom to care about, I have absolutely no interest in catering to people upset that others are included.


There’s no difference objectively between generic he and generic they. Both cover everyone. You just prefer one to the other. That’s fine, so do I.


There is an objective difference between generic he and generic they: in many contexts you cannot tell whether the "he" was intended to be generic or not.


I did not write anything about "singular they". I wrote against the legitimacy of those linked guidelines, which incidentally include "singular they".

The poster I replied to stated (as I remember) that some use of language is made legitimate by some group laying out some guidelines. That some fairly random group («not philologists») lays out guidelines is maybe an "affiliation pass" within their group, but it is not universally valid.

You need grounds, good grounds. Check those guidelines...


How so philological? Language changes all the time due to practice. People don't need to justify their speech with philological arguments.


Yes, although in the case of singular they, its usage goes back to Chaucer, so it’s more established than most practices in modern English.


I believe they do. Practice must have a reasonable reference. People are free to contribute to the language - but with some competence.


Where's the ideological content?

Some people: "let's use, of already-established pronouns, the one that doesn't make any assumptions about people"

You: "no, fuck you, other people's interpretation of what I say is strictly their problem, I'm gonna stick with 'he' no matter what other people feel or request, because I can"

The only ideology I can see here is the absolutist sort of "nobody can even suggest to me that I change a thing" or "I can't possibly cause offense if I don't mean to, so I don't need to think about word usage."


I don’t understand how it doesn’t make assumptions about people. In the case of a generic antecedent, sure that’s fine and well established historically. But in this case there’s a known antecedent, Michael Larabel.

I don’t know what Michael’s preferred pronouns are, but isn’t the original poster in this chain assuming it’s they/them and aren’t they more likely, statistically speaking, to be he/him?

Why does it somehow not count as misgendering when you they/them someone that prefers he/him or she/her?


> Why does it somehow not count as misgendering when you they/them someone that prefers he/him or she/her?

Because one appears to make an assumption about gender, whether or not you intended it to be taken as a generic, while the other objectively is generic.

As far as options go, they/them minimises assumptions. Yes, some people might still take offence, but given that most of the people who take offence at that do so because they oppose inclusiveness of others, I'm perfectly fine with not respecting their choice in the matter.


> but given that most of the people who take offence at that do so because they oppose inclusiveness of others

You have some evidence of this or just the unshakable certainty of the self righteous?


I go by experience. There may be exceptions, but I've yet to see one, so I don't particularly care if there's a large pool of exceptions outside the horizon of my personal experience.

If you are an exception and have a good reason for taking offence that doesn't involve excluding others, do tell.


No, but "They are tall" is, which is grammatically plural but semantically singular.


Is it grammatically plural? In "You are tall", "you" can be singular, and then so should "are" be, if I am not mistaken.

(I'm not a native English speaker)


“You” was exclusively plural (singular equivalent was “thou”), but its meaning shifted to be either singular or plural a few hundred years ago. So “are” was indeed a plural-only conjugation until that shift happened.

Now “you” is shifting again: in formal English it is still singular or plural, but in colloquial American English — at least where I’m from — it can only be singular. The plural form is “you guys” or “y’all”, depending on dialect. (The “guys” in this pronoun shouldn’t be confused with the noun “guys”, which is usually only used for boys and men; “you guys” is used to refer to groups of any gender).


Exactly. There are loads of examples of such shifts in other languages; for example, in Portuguese the formal "you" is conjugated in the grammatical third person despite being semantically second person, arising from people being formally addressed in the third person ("Does the Right Honourable gentleman agree that...").


Same is true in Spanish, where "usted" ultimately derives from "vuestra merced".

For a completely different example, "on" in formal French means something like "one" in English: a pronoun for a general, unspecified person; as such, it takes singular conjugations. However, in colloquial speech, it means "we" and has almost completely replaced "nous" as a subject pronoun.

Thus: "il est" (he is); "on est" (we are, colloquial language); "nous sommes" (we are, standard language)


Is "You art tall." a grammatically valid sentence? Obviously not; don't be silly.


Without knowledge of preferred pronouns, "they" seems appropriate.


As I read it, the point was not about the wrong pronoun but about the fact that the suite is developed by one person and not a large team and that the burden of benchmarking Linux performance should maybe not rest on one person alone.


You missed his point. There was no wrong pronoun, this is what he is claiming.


viraptor 1 : 0 the west.


...and people like to pretend this kind of nonsense hasn't penetrated into every social space.


It unfortunately still hasn't penetrated into the most old school ones, apparently. Whatever you are talking about anyway.


If by "old school" you mean "everywhere outside of Silicon Valley", then maybe. Coming from a pretty conservative society, I read these sorts of discussions with mild amusement. I may have formed a mistaken impression of the US society, but it seems to me that it tends to bounce from one extreme of political or social views to another.


I would not know, I live in Europe ;-) Extrema do exist here though.


A single data point does not an argument make


But Michael tends to use "we" in his writings.


One of the things that struck me recently is that working out what a computer is actually spending its time doing is now almost a Master's thesis to do properly.


Depending on the level of analysis you can even get a PhD. See for example https://core.ac.uk/download/pdf/18462805.pdf


Not surprising at all, sadly.


I found that out the hard way when I was trying to profile my own code.

If you're using a few optimized libraries and designed your code for high-speed parallel execution, the compiled thing becomes almost impossible to trace in a practical manner.


Didn't perf work? What difficulties did you encounter?


Used perf record for getting CPU efficiency numbers, and it worked very well; however, to be able to reliably trace the code, I had to compile it in debug mode with no optimizations and run with a single thread. That combination increased the execution time considerably.

Some of the Matrix and solver code coming from Eigen is heavily optimized for SIMD operations and using it with -O0 is just painful.

Moreover, I had to verify its memory sanity and used Valgrind for that. Using a full size problem meant it had to churn for days to finish execution.

Having deadlines doesn't always help in this stuff.


I found perf pretty crummy compared to VTune, and even VTune is not fantastic at scale.


I think perf and Valgrind are a fantastic combination. While Valgrind has a big overhead, it can make a lot of things visible without instrumenting the code.

I also think perf is phenomenal for its scope. It shows performance metrics for the code without any processor dependency or instrumentation.


On the positive side of things, the notion that performance is multivarious and means different things in different contexts has really sunk in with the general tech demographic, and we don't see much of the "one big number that sums it all up" type of stuff. Think the GHz race or FPS measuring contests.


Both with the GHz race and FPS measuring contests, we've gotten to the point of substantially diminishing returns - GHz has stopped growing in the same way, and reaching the limits of human perception with FPS is generally readily achievable, so it loses interest as a contest.

I suspect your broader point stands - certainly "performance is multivarious and means different things in different contexts" has sunk in for me. But I suspect the actual difficulty of reasoning about performance deters most people from doing it beyond big O notation. The fact that I'm aware that performance is complicated doesn't drive me to understand the complexity, I mostly just say "good enough" and move on.


> reaching the limits of human perception with FPS

Nothing can hit 200fps at 4K, which you'd need to reach low enough latencies to be imperceptible.


I wonder if someday you could make a differentiable kernel and do gradient descent in kernel-parameter space.


I believe Google already does this for setting their server's tunables.


At Google's scale, you don't even need it to be differentiable. You can just do simulated annealing or so.
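
To make that concrete, here is a minimal, purely illustrative sketch of black-box tuning with simulated annealing in C++. The tunables (named after vm.swappiness and vm.dirty_ratio), their search ranges, and the score() function are placeholders standing in for "apply the settings and run a representative benchmark"; this is not a description of any real tuning system.

  // Illustrative sketch only: black-box tuning of two hypothetical kernel
  // tunables with simulated annealing. score() stands in for "apply the
  // settings, run the representative benchmark suite, return a number".
  #include <algorithm>
  #include <cmath>
  #include <cstdio>
  #include <random>

  struct Params {
      double swappiness;   // stand-in for vm.swappiness, searched in 0..100
      double dirty_ratio;  // stand-in for vm.dirty_ratio, searched in 1..60
  };

  // Placeholder objective: higher is better. A fake smooth landscape with
  // an optimum at (10, 20), so the example runs without a real benchmark.
  static double score(const Params& p) {
      return -std::pow(p.swappiness - 10, 2) - std::pow(p.dirty_ratio - 20, 2);
  }

  int main() {
      std::mt19937 rng(42);
      std::normal_distribution<double> step(0.0, 5.0);
      std::uniform_real_distribution<double> coin(0.0, 1.0);

      Params cur{60, 40};
      double cur_score = score(cur);

      for (double temp = 50.0; temp > 0.1; temp *= 0.95) {
          // Propose a random neighbour and clamp it to the valid range.
          Params next = cur;
          next.swappiness  = std::clamp(next.swappiness + step(rng), 0.0, 100.0);
          next.dirty_ratio = std::clamp(next.dirty_ratio + step(rng), 1.0, 60.0);

          double next_score = score(next);
          // Always accept improvements; accept regressions with a probability
          // that shrinks as the temperature drops (the "annealing" part).
          if (next_score > cur_score ||
              coin(rng) < std::exp((next_score - cur_score) / temp)) {
              cur = next;
              cur_score = next_score;
          }
      }
      std::printf("best found: swappiness=%.1f dirty_ratio=%.1f\n",
                  cur.swappiness, cur.dirty_ratio);
  }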


Right, problem solved. Now we just need to identify the few representative use cases. Is it people using a phone as, er, a phone, with actual voice communication? Or video conferencing? Or those who play games casually or not-quite-so-casually? Or video streaming? Or just for texting, where you'd rather prioritize security and battery runtime?

And that's just the phone. I could come up with many more questions regarding desktop and server usage as I'm more familiar with those.

I'm afraid there is no one optimal set of configuration options. Not even two or three.


A few server use cases? It's used for like 99% of all server use cases.


There are already plenty of benchmarks


Tuning is highly overrated in software performance; e.g. compiler optimizations don't help that much, and searching over their parameters helps even less. I think it comes from people wishing they could change their software without actually needing to learn how to change it.


It is overrated when you are tuning the wrong thing, but if it is the bottleneck in a significant process for your use case then it is often very significant, especially at scale. For someone running a compute cluster for their own needs or as a cloud service, a couple of percent improvement like this represents a massive gain in throughput or a saving on the need for extra kit (or power costs - there may even be a small environmental benefit).

Of course, on the scale of you or me, such optimisations may be little more than a curiosity most of the time. A process that normally takes 100 minutes (a video transcode, perhaps) now taking 95, followed by the machine sitting idle while it waits for us to look back and see the result, is not really benefiting from the improvement.

> without actually needing to learn how to change it

This is certainly a problem sometimes, but not an argument for tuning being overrated in general. It usually comes down to tuning the wrong thing, like someone playing with MySQL engine & kernel IO parameters to eke out a fraction of a percent bonus when they could improve index structures, or fix queries with no sargable predicates, or both in unison, to get benefits measured in orders of magnitude.

It can be overrated in the case of average gamers tweaking their hardware to get a few extra FPS on top of the many tens they already get, but again, this not being worth the time for some, even many, doesn't mean it can't make a huge difference to a pro gamer or someone using old kit where "a few extra" is a relatively large gain.


That's quite a bad take. You can bank on PGO+LTO of a C++ server being good for 20% throughput easily, and not compiling for k8-generic when you have a modern machine also makes a big difference. I would agree that just flipping compiler flags might not get you very far, but there are key inputs that you shouldn't leave on the table.


By "not that much" I mean orders of magnitude. 20% is reasonable, especially for C++ that needs a lot of inlining.

But a more modern language would benefit from defining some optimizations like inlining as mandatory, the same way tail calls are (and of course GCC has this.) Then there isn't a chance of deploying a build with extra-slow behavior.


A performance improvement of 20% would be a massive win for something as large and widely used as the Linux kernel.

That would average out into a few percentage points of performance improvement in every application across the globe which runs on linux / android. Merging a patch like that in linux would be like quietly reaching into every device across the planet and giving them a small, free CPU improvement and drop in power usage. Not game changing for any individual user, but huge in aggregate.


And if you run a lot of servers it means you need fewer servers. These kind of changes can literally save millions.


Which is why the organizations which stand to save millions by building their kernels with PGO already do it.


Some do. I bet lots of companies miss out on massive cost savings like this simply because nobody at the company has the skills and access to improve things.

I heard a story - years ago Google hired an engineer who happened to be an expert from a previous life in video codecs. He wasn’t working on YouTube at Google, but out of interest he pulled up the YouTube source code to see what it did. They were just using the defaults for some of the encoding parameters. He tweaked a few of the encoding parameters and in doing so saved Google millions of dollars per year in compute/storage/network traffic. A few hours of his work was probably more beneficial for Google than years spent in his primary role.

And if tuning opportunities like this abound at Google, you know they’re everywhere. It’d be much better if Linux distributions simply shipped kernels which are already compiled with PGO, with a reasonable profile based on normalish use.


That could've happened because of pointless detuning - x264 comes with good defaults for everything, but then ffmpeg on top of it used to set everything to "off" no matter what it was, and they were probably using the ffmpeg settings.

The correct answer would've been for ffmpeg to not ship that way.

Anyway, I am an expert on video codecs just like that guy is, and I said what I said ;)


Probably, but there are many smaller outfits that probably don't because they have a server fleet of "only" a few dozen servers or so. The cost savings for those won't be in the millions, but having to purchase ~5% fewer servers is still a pretty good win.


Most smaller orgs are wasting resources on poor utilization. Tighter code just gets those orgs even worse utilization.


Counterexamples:

Running your DB with the default parameters.

Not tuning Linux kernel parameters depending on your workload (10 Gbit+ NICs, lots of connections, ...).

There are reasons why those knobs are there, as there are always trade-offs, and having them adjust automatically is very hard.

But maybe that was what you're referring to.


That highly depends on your problem. I have some scientific code that gets more than a factor 10 improvement on runtime just by turning on some compiler flags (mainly -O3 and -march, but some others as well).


And to illustrate the point more, I have found that -Os often produces faster code than -O3. Not by factor 10, but still clearly measurable.


Getting better performance with -Os over -O is some kind of edge case and people shouldn't generally expect that. -Os has disastrous consequences for C++ algorithms because it refuses to inline functors, so while you may rightly believe that a C++ std::sort is slightly faster than a C qsort, due to superior opportunities to optimize the C++ code, with -Os you'll find that std::sort is an order of magnitude slower. Definitely pays to check the result with a full-scale benchmark.


Interesting claim. I had to try this, and with the Xcode version that I have installed, std::sort with a simple lambda as the comparison function gives about the same result with -O3 and -Os. I would not call this disastrous, but of course opinions differ. Interestingly, qsort is significantly faster with -O3 than with -Os, but of course nowhere near std::sort.
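
For reference, a rough sketch of the kind of comparison being described, using a lambda comparator as above; build it once with -O3 and once with -Os and compare the timings. Results will depend heavily on the compiler and standard library, so treat the numbers as anecdote, not proof.

  // Sketch of the std::sort-with-lambda vs. qsort comparison discussed above.
  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <cstdlib>
  #include <random>
  #include <vector>

  // qsort's comparator is always called through a function pointer.
  static int cmp_int(const void* a, const void* b) {
      int x = *static_cast<const int*>(a);
      int y = *static_cast<const int*>(b);
      return (x > y) - (x < y);
  }

  template <typename F>
  static double time_ms(F&& f) {
      auto t0 = std::chrono::steady_clock::now();
      f();
      auto t1 = std::chrono::steady_clock::now();
      return std::chrono::duration<double, std::milli>(t1 - t0).count();
  }

  int main() {
      std::vector<int> a(1000000);
      std::mt19937 rng(1);
      for (int& x : a) x = static_cast<int>(rng());
      std::vector<int> b = a;  // identical input for both sorts

      // std::sort with a comparator the compiler can inline...
      double t_std = time_ms([&] {
          std::sort(a.begin(), a.end(), [](int x, int y) { return x < y; });
      });
      // ...vs. qsort, which cannot inline the comparison.
      double t_q = time_ms([&] {
          std::qsort(b.data(), b.size(), sizeof(int), cmp_int);
      });

      std::printf("std::sort: %.1f ms, qsort: %.1f ms\n", t_std, t_q);
  }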


Results will vary for small vs. large programs. I've seen catastrophic space optimizations that out-of-lined very small methods like std::vector::at, because the call was one or two bytes smaller than the inlined version. Lambdas are inlined even with -Os because they don't have names or multiple callers and can't be made smaller by out-of-lining. A functor class, or any function with multiple call sites, could trigger the problems with -Os.


OK, I'm not continuing with this. Just note that I didn't write that -Os would always or even usually be faster. I could try to come up with an example where -O3 produces a huge loop preamble for a loop that's iterated once, or just generates enough cache misses to be overall slower, but I don't care enough.


I also have a similar scientific code base which gets 10-15x wall-time improvement just by enabling compiler optimizations.


Not a scientific codebase, but ffmpeg seems to benefit massively from optimization.

I use it to encode my DVD and Blu-ray TV shows/movies to HEVC. Doing so reduces the file size of DVDs by roughly 80% and of Blu-rays by 60%.

The downside is that the encoding is an incredibly CPU-intensive process. Hardware encoders like Nvidia's NVENC or Intel's Quick Sync look absolutely terrible and are a non-starter for archival storage.

On a stock Fedora XFCE install, I would get roughly 0.5 FPS for a 1080p Blu-ray file (29.97 FPS at 1920x1080).

A Gentoo Linux installation with a global -O3 -march=native as well as LTO, PGO and Graphite enabled globally boosts that from 0.5 to roughly 1.3 FPS. Still slower than realtime, but an absolutely massive improvement.


I suspect -march=native is the only thing doing any work here. The other optimizations are as likely to find compiler bugs as they are to improve things, once you get off well-tested paths.


I think it's the opposite. Well-annotated code that can use SIMD instructions accelerates tremendously with -O3. -march=native -mtune=native generally improves things if you can saturate the cores with instructions, which can be measured by running perf record and looking at IPC & instruction retirement numbers.

When I was using Eigen in my code, the biggest performance boost came from -O3. -march and -mtune made minimal improvements on the systems I've run benchmarks on.


ffmpeg and associated projects do not rely on autovectorization, which pretty much only works on naive scientific code. They write their SIMD in assembly and don't need to care about compiler settings.


If you're compiling hand tuned assembly, -march and -mtune probably will have more effect when compared to compiling and optimizing C/C++ code.

OTOH, I'd like to underline that heavily optimized scientific code and libraries are neither naive (in terms of algorithmic complexity/implementation) nor straightforward :D

E.g.: This is how Eigen configures its internal vectorization parameters: https://gitlab.com/libeigen/eigen/-/blob/master/Eigen/src/Co...


That's an amazing compression ratio, how is the quality?


ffmpeg recommends a CRF of 23 to preserve quality 1:1 for H.264 -> H.265.

I use a probably-overkill CRF of 20 just to be safe. But everything looks absolutely perfect, even blown up on my 65-inch (1080p) TV.


Can you share your complete parameters, so I can do the same thing for my DVD archives & other stuff?

No hard feelings if you don't want to though :)


Sure! This is the script I wrote for it. I use MP4 containers so I can import them into MAGIX (formerly Sony) Vegas, but it will iterate over an entire folder of MakeMKV files and convert them: https://dpaste.com/DBPA9C59M


Thank you. It's greatly appreciated. :)


I don't know if you would consider it tuning, but a low-latency kernel is a massive UX improvement on consumer devices, where you don't want the UI to stutter and lock up under heavy load, and don't mind paying (could be wrong, off the top of my head) something like a 5% throughput penalty.


We just improved the performance of one copying tool by a factor of 2.9 by tuning the size of a memory buffer. (BTW, a shout-out to both flame graphs and hyperfine - excellent tools for profiling and benchmarking respectively.)


Tuning kernel/process priorities has made a big difference in my audio code. Tweaking the GPU and USB IRQ handler priorities let me do realtime audio together with realtime OpenGL data visualization and screen recording, where before the audio would skip or the visualization would hang.


As someone with a limited knowledge of kernel stuff, does this mean Linux is likely to significantly outperform other comparable kernels like FreeBSD and OpenSolaris? Or are those other kernels keeping pace?


Considering the absolute dominance in HPC for a few years now, I don't think there is competition against Linux at the moment in terms of performance. The last time a non-Linux system appeared in the TOP500 list was in June 2017.


I suspect that's more due to "popularity" or richness of the ecosystem than performance necessarily.


Isn't cost also a factor at play here?


There is some more in-depth discussion of the core folio idea on LWN at https://lwn.net/Articles/849538/ from a previous iteration of the patch set.


Pretty good explanation.

I don't quite follow why it's all being done as one huge set of patches -- rather than first merge the groundwork and then all the conversions.


> I don't quite follow why it's all being done as one huge set of patches -- rather than first merge the groundwork and then all the conversions.

It is. The patchset submitted contains only 33 patches, while the full patchset contains about 200 patches.


A while ago I switched my main rendering machine to Linux because Blender rendered roughly 10% faster on it back then.

This is equivalent to saving ~16 hours when you have a week of render time.


This was a 7% perf boost compiling the kernel, which does a lot of small file IO and memory allocations and thus stresses the MM code in the kernel.

Rendering is much more 'pure cpu' work, so you most likely won't see much difference there due to this work.


You'd think so, but it's just not true. Linux has a substantial performance advantage over Windows for rendering, which is (partly) why all the large render farms use Linux instead of Windows.

The most sensible hypothesis I've heard about it is that THP really helps with large rendering loads, and Windows doesn't do that yet.


For reference - THP -> Transparent Huge Pages.
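
As an aside, when THP is configured in "madvise" mode an application can opt in per mapping. Below is a minimal Linux-only sketch; madvise(MADV_HUGEPAGE) is only a hint, so the kernel is free to ignore it.

  // Sketch: back a large anonymous mapping with transparent huge pages
  // via madvise(MADV_HUGEPAGE). Linux-specific; the call is only a hint.
  #include <sys/mman.h>
  #include <cstddef>
  #include <cstdio>
  #include <cstring>

  int main() {
      const std::size_t len = std::size_t(1) << 30;  // 1 GiB working set

      void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) {
          std::perror("mmap");
          return 1;
      }
      if (madvise(p, len, MADV_HUGEPAGE) != 0) {
          std::perror("madvise");  // e.g. THP disabled in this kernel
      }

      std::memset(p, 1, len);  // touch the memory so it is actually faulted in
      munmap(p, len);
      return 0;
  }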


From my point of view, from an alternative life in the game development subculture, I am quite sure the free beer weighs much more than a couple of hours.


I'm not entirely sure what that means, but as a person running Linux who has been giving early-access feedback to a few indie game devs, the consensus among them seems to be "you don't support Linux because you want to reach a larger market, you support Linux because you'll get good bug reports and basically free QA".


I am not speaking about indies, rather about traditional businesses.

As for bug reports, well https://twitter.com/bgolus/status/1080213166116597760


I'm not saying you're wrong, but these really aren't the same subspecies of indie devs that we're talking about here, hahaha. I'm talking about much smaller teams with a much smaller budget and audience. Like, teams of one to five people.

In my experience interacting with them, most of these small indie dev teams use Unity or Unreal or Godot these days. Outside of graphic design, the graphics don't really cause big issues any more; having a team of a handful of enthusiastic but often slightly inexperienced programmers figure out the flaws in their own game logic does.


Yeah, however those handful of enthusiasts are surely not in the "why all the large render farms use Linux instead of Windows" bucket.


I believe GP is referring to the 10% boost they observed, not a hypothetical 7% boost as in the article. (7×24×0.1 = 16.8, while a 7% speedup would result in ~12 hours.)


Ah, right, I thought GP assumed they would see another 7% on top of the previous 10%.


I think the 16 hr figure is 10% of 168 hours = 1 week of computation, not the 7% referred to in the title.


Imagine if you have a whole render farm.


One thing I've been mulling over recently is that many containers like vector in C++ for example, have almost no state.

That is to say, we at most might have a bit of logic to tune whether we do *1.5 or *2 on realloc, but why not more?

There must be patterns we can exploit in common use cases to be sneaky and do less malloc-ing. Profile guided? Runtime? I might have some results by Christmas, I have some ideas.

Food for thought: Your container has a few bytes of state to make decisions with, your branch predictor has a few megabytes these days.


The size-increase factor for vector is a compromise between performance and wasting memory. It's also a fairly hot code path, so you don't want to run some complicated code there to estimate the 'optimal' factor.

About the best you can do, if you know beforehand roughly how big it will be, is to reserve that capacity with std::vector::reserve().


Malloc is at best a few hundred cycles, that's quite a lot of work if you can keep your data in L1. If you do a bit of work now you can save reallocs later.

Think bigger than changing the coefficient, there probably is no optimal factor.


Isn't that basically what slab/pool type allocators help with or am I being an idiot?


> About the best you can do, if you know beforehand roughly how big it will be, is to reserve that capacity with std::vector::reserve().

But be careful not to call reserve() in a loop that adds data to a vector - it will allocate exactly what you request and not do exponential growth.
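
A tiny sketch of the pitfall; the exact reallocation pattern is implementation-defined, but reserve() only guarantees capacity of at least what you ask for, so the first version can defeat geometric growth:

  #include <vector>

  // Anti-pattern: reserving before every push_back asks for exactly one
  // more slot each time, so the implementation may reallocate (and copy)
  // on nearly every iteration instead of growing geometrically.
  void fill_slow(std::vector<int>& v, int n) {
      for (int i = 0; i < n; ++i) {
          v.reserve(v.size() + 1);  // only guarantees capacity >= size() + 1
          v.push_back(i);
      }
  }

  // Better: reserve once up front when the final size is roughly known,
  // then let push_back use its normal growth strategy.
  void fill_fast(std::vector<int>& v, int n) {
      v.reserve(n);
      for (int i = 0; i < n; ++i) {
          v.push_back(i);
      }
  }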


std::vector would mostly benefit from an improved allocator interface, where it requests N bytes, but the allocator can give more than that and report the actual value.



For example, folly's vector performs this optimization: https://github.com/facebook/folly/blob/master/folly/docs/FBV...
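
Roughly, the idea looks like the sketch below. It uses glibc's malloc_usable_size() as a stand-in for a richer allocator interface; folly's real implementation is different, this is just to illustrate "ask for N, keep whatever the allocator actually handed back".

  #include <cstdlib>
  #include <malloc.h>  // malloc_usable_size (glibc-specific)

  // Sketch of a growable buffer that records the real usable size of the
  // allocation, so later growth up to that size needs no reallocation.
  struct RawBuffer {
      char* data = nullptr;
      std::size_t capacity = 0;

      void grow_to_at_least(std::size_t n) {
          if (n <= capacity) return;              // already big enough
          void* p = std::realloc(data, n);
          if (!p) std::abort();                   // keep the sketch short
          data = static_cast<char*>(p);
          // The allocator may have rounded the request up to a size class;
          // record what we really got instead of just n.
          capacity = malloc_usable_size(p);
      }
  };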


While some heuristics would be nice if they improve the situation, a lot of apps still leave performance on the table by not estimating the capacity well. You don't have to be very clever about growth if you know that this vector will always have 3 elements and that one will have N that you can estimate from data size.


Please don't give the C++ people any more ideas, my compile times are already bad enough!


This could actually make your compiler a lot faster; compilers are really hard to maintain an efficient memory strategy in, due to all the moving parts.


It's surprising that we have stuck with 4 kB pages on x86 since the 386, even though computers have ~10,000x as much memory now (4 MB -> 32 GB).


I wrote a thesis on this in 2008 (Transparent large-page support for Itanium Linux, https://ts.data61.csiro.au/publications/theses_public/08/Wie...) and Matthew Wilcox was already involved in the area then. I admire his persistence, and I certainly have not kept up with the state of the art. Itanium probably had the greatest ability to select page sizes, more than any other architecture (?). On x86-64 you really only have 2 MB or 4 kB to work with in a general-purpose situation. It was difficult to show the benefits of managing all the different page sizes and, as this notes, of re/writing everything to be aware of page sizes effectively. Those who had really big workloads that benefited from huge pinned mappings didn't really care that much either. It made the work hard to find traction at the time.


There is a series of page sizes that steps in increments of 5 bits per width. 8k/256k/8M therefore have regular join points in the address space, the first of which is at 8G.

Does superpage management get easier when each superpage is composed of only 32x of the next size down? When I first stumbled on this idea, it seemed like it would have many more opportunities for forming intermediate-sized superpages.


Linux still has a lot of assumptions baked into the page size. Power9 and some aarch64 systems have 16kB pages, but occasionally you run into some corner cases - for example, you can't mount a btrfs partition created on a x86 machine on a power9 one because the btrfs page size must be >= the mmu page size.


64K pages, actually. Also, POWER9 and aarch64 are perfectly capable of running with 4K pages, but not everything does. My desktop POWER9 with Fedora is a 64K page system, but Void PPC runs the same hardware with a 4K page.


That's the fault of btrfs (ext4 or xfs handle these cases just fine), and even btrfs is getting support for it in the latest releases


While you may be technically correct, I guess from a practical point of view OP is correct in that page size still matters if you're trying to run Linux on "unconventional" platforms.


RHEL used 64k page size on aarch64 for a while. I believe we have switched back to 4k. It caused some problems, from memory:

* Blow-ups in various kernel data structures. There was some virtio code which was allocating N pages per driver queue.

* Problems with GPUs, either the driver or the firmware assumed 4k pages. (Edit: This actually affected Power, not ARM, but the issue is caused by page size: https://lists.fedoraproject.org/archives/list/devel@lists.fe...)

* Filesystems make assumptions about page size versus block size.

* Processes generally take more RAM, with RAM wasted because of internal fragmentation.


> RHEL used 64k page size on aarch64 for a while. I believe we have switched back to 4k.

Last I checked, at least CentOS 8 (which should be the same as RHEL 8) is still using 64k pages (search at https://git.centos.org/rpms/kernel/blob/c8/f/SOURCES/kernel-... for CONFIG_ARM64_64K_PAGES=y).

And AFAIK, to access the maximum amount of physical memory in AARCH64 (52-bit physical addresses, instead of 48 bit physical addresses), you must use 64k pages. Since RHEL is normally used on servers, it makes sense to want to be able to access huge amounts of physical memory; that's probably the true reason (or even the sole reason) RHEL uses 64k pages on AARCH64.


AMD64 supports only 48-bit addresses, and is still the most popular server ISA.

  You have: 2^48 byte
  You want: tebibyte
          2^48 byte = 256 tebibyte
          2^48 byte = (1 / 0.00390625) tebibyte
how many servers do you have with more than 256 TiB of RAM?


57-bit virtual address support was specced and had software support already in ~2016; 48 bits is not a hard architecture limitation. See e.g. https://www.kernel.org/doc/html/latest/x86/x86_64/5level-pag...


It's not only RAM that goes in that 256 TiB; memory-mapped I/O uses it too. You probably don't have terabytes of video card RAM, but maybe some flash devices offer the whole thing memory-mapped?


Good points. Re the last one: you of course also save some by having fewer pages and their metadata. Wonder what the space-optimal page size would be taking these opposing factors into account.


I bet this would also screw masses of user code making assumptions about optimal size, or simply designed/optimized for a smaller size. LMDB was the first thing that came to mind


Indeed, the M1/A14 on mobile has larger pages which lets them have more effective TLB coverage with a smaller cache. In some applications this can boost performance by double digit % (which you can simulate by enabling large pages on x86).


A lot of peripherals have a 4 kB address space, so it becomes complicated if you change to 16 kB, as you'll be mapping in more than you bargained for.


You can also do large pages: 2MB or 1GB or whatever the obscenely large page size is for 5-level paging on the latest systems.

2MB vs 4kB isn't quite the same ratio as 4MB -> 32GB, but it's still a lot fewer pages to cache in the TLB, and it's not too big to manage when you need to copy on write or swap out (or compress with zram) and whatever else needs to be done at the page level.
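
For the explicit (non-transparent) variant, here is a Linux-only sketch; it assumes a hugetlb pool has been reserved beforehand (e.g. via vm.nr_hugepages), otherwise the mmap simply fails.

  // Sketch: explicitly requesting huge pages for a mapping, as opposed to
  // the transparent (THP) path. Requires pre-reserved hugetlb pages.
  #include <sys/mman.h>
  #include <cstddef>
  #include <cstdio>

  int main() {
      const std::size_t len = 64 * (std::size_t(1) << 20);  // 64 MiB

      void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED) {
          std::perror("mmap(MAP_HUGETLB)");  // commonly: no hugetlb pool configured
          return 1;
      }
      // With the default 2 MiB huge page size, each 2 MiB of this mapping
      // costs one TLB entry instead of 512 entries for 4 kiB pages.
      munmap(p, len);
      return 0;
  }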


The other day I was wondering what would happen if all operating systems stopped developing new features and only optimised for a week or two. How much time and electricity could be saved?

If you add up an optimisation of just a nanosecond in, say, OpenSSH, how much would that do globally?


I used to be a kernel developer at Apple, starting in 2006. Internally, every alternate major release was exactly this. All common paths were identified, and most of the dev time on the release was spent on optimizing those features to hit a goal for the path, e.g. moving 100 files in Finder should not take more than x ms.


The upgrade from Leopard to Snow Leopard on a plastic MacBook just made everything better. Things were faster, smoother, and you could run more things at the same time without killing the machine. It was the perfect OS. Then when Lion came around, it was the exact opposite, it felt terribly buggy, and made everything clunky and worse. At least that's my memory of these OS updates.

Makes me wonder whether this alternate major release cycle was a good idea. If you delay all feature development for a year, you'll get a barrage of features once the performance-only OS version is out the door, and there's not enough time to do all of them properly, so you get buggy and slow versions.

Maybe doing performance improvements and feature development at the same time would have been the better choice? How is it being done at Apple nowadays?


I do not know; Apple is a place where most practices are need-to-know only.


> If you add up an optimization of just a nanosecond in like openSSH, how much would that do globally?

I believe optimizations like that at a global scale will not have any impact.

Let's say that this nanosecond will be saved trillions of times a day, resulting in minutes to an hour a day saved globally.

* Not a single user will notice.

* In 99.99% of cases the CPU will not be fully pegged, and thus that one nanosecond of compute will not be used to do something else at all.

* CPU throttling isn't that fast, so you won't even save that much power.

If we bump it up by six orders of magnitude to a millisecond, all of that remains true, even though you are potentially saving hundreds of years of computing time a day. Extremely small gains distributed across a very large number of machines don't tend to be as impactful as you would hope on a global scale.

This is not to say that small gains are worthless. Many small gains added together can be substantial.


If you make a bicycle one second faster over 40 kilometers, the user does not notice. But the user does win the race by 1 second instead of losing it. That is, things can be of enormous utility even when the user doesn't notice.


It's not the throughput, but the fact that when the CPU is idle it is sleeping, saving power. This is true on mobile, but not for servers, as internally they'll poll their Ethernet PHYs.


Very optimized projects are out-competed by well-factored but less optimized projects, because the latter can add features faster. That's why we are where we are today.


In the new technology sector this is certainly true. Find your product-market fit and then optimize when you become mature and scale up.

But aren't we talking about extremely mature kernel code here? My impression is that all kernel distributions in wide use are optimized, but they are general-purpose software. The degree to which you may optimize software is constrained by the diversity of use cases you must support.


A great deal of software has room to be made an order of magnitude faster with no negative impact on maintainability.


> There does not appear to be a way to tell gcc that it can cache the result of compound_head()

Isn't this what __attribute__((pure)) [0] is for?

[0] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attribute...


As I understood it, pure means a function's output depends only on its input:

  y = f(x); z = f(x)  implies  y = z

What they want is something different:

  y = f(x)  implies  y = f(y)

This means that if you give something the head of a list of pages, it won't try to go to the head again and again; it knows it's already there.

The 'folio' idea, as I understand it, is roughly an alias for the existing 'page' structure, but the code knows it is already at a head AND that it should do the work on the whole list, not only on the head.
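
A simplified illustration of the difference, using a stand-in for the real compound_head() (this is not the kernel's actual code):

  // Simplified stand-in for the page/compound_head() situation.
  struct page {
      struct page* head;  // points at the head page of a compound page
                          // (a head page points at itself in this model)
  };

  // __attribute__((pure)) lets GCC assume the result depends only on the
  // argument and readable memory, so repeated calls with the same argument
  // can sometimes be merged:  head(p); ... head(p);  ->  one call.
  __attribute__((pure))
  struct page* head(struct page* p) {
      return p->head;
  }

  // What the folio work wants to express is stronger: idempotence,
  //   head(head(p)) == head(p)
  // i.e. once you hold a head page there is nothing left to look up.
  // There is no function attribute for that; instead the folio type encodes
  // it: a function taking a folio knows statically it already has the head.
  struct folio;  // conceptually, "a page known to be a head page"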


It sounds like the quoted benchmark is for XFS, so other filesystems may be different?



