I've had bad luck with transparent hugepages on my Linux machines (utcc.utoronto.ca)
134 points by ingve on Feb 1, 2023 | 75 comments



THP are a quick win for people who have little control over how malloc behaves. Like Python and numpy, for instance.

I did a lot of HPC modelling ranging from hundreds of GiB to TiB-sized RAM servers and THP was an instantaneous win over not using it. Later on I experimented with LD_PRELOAD and libhugetlbfs, and while it did stabilise things even more and reduced time spent in the page table, it was nowhere near several factors 'better' than THP.
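
For anyone who wants to try that route, it looks roughly like this (a sketch: the binary name is a placeholder, and a hugepage pool has to be reserved first, e.g. via vm.nr_hugepages):

  $ sudo sysctl -w vm.nr_hugepages=1024
  $ HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./my_model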

A large part of this really boils down to your performance needs. Deterministic modelling -> fairly stable and reproducible malloc patterns -> THP will probably work OK.

If your memory usage spikes a lot then THP is probably more hassle than it's worth. The kernel will spend an eternity thrashing and kicking the page table and trying in vain to clean up after itself. I can see why people feel THP sucks under those conditions!
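
For reference, the knobs people usually reach for when that happens are the sysfs toggles (needs root, and doesn't persist across reboots):

  $ cat /sys/kernel/mm/transparent_hugepage/enabled
  $ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
  $ echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag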

If anything, the real crime here is how awful Linux is at HPC without a lot of tuning and careful tweaking. Throw NUMA into the mix, file caches and the OOMkiller and it feels like it never moved out of the 90s. Combine it with poorly-configured hypervisors and your performance will yo-yo and you'll spend an eternity trying to figure out why (ask me how I know...)


I agree about huge pages in HPC, but regarding Linux -- if you want high performance (at least on a system designed for differing workloads), surely you should expect to do performance engineering. Whether or not you should be doing HPC on such a kernel is another question, but alternatives haven't caught on, perhaps because of the way applications which they need to support are written.

If OOMkiller kicks in it suggests a resource manager issue (not accounting memory). Then, you can't blame the kernel for NUMA, which you presumably want for performance; I don't understand the difficulty people have with pinning processes after many years of it being necessary. There's something to be said for userspace filesystems too, like the venerable PVFS. The answer to hypervisors should be don't do that.


I don't mind doing the engineering. But penetrating the veil of some of these things, ill-defined as they are, with work on certain areas being sporadic and taking place over years or decades, makes it especially hard.

In fact it's the _absence_ of choice that hurts more than any surplus of choice would.

As for OOMkiller: I think we're both wise enough to know that experimentation is a large part of what any team that consumes gobs of RAM does. So talk of a "resource manager issue" is all well and good in prod, when you should have a reasonable handle on what's used by what and for how long. Less so when you're scaling a model --- say a Monte Carlo model --- to a larger number of simulations in development and you're testing things. When you're paying an awful lot of money for hardware you start to count (or you should, if you respect your company's money) the costs of things like this.

Regardless, the OOMKiller will indeed reap stuff; and more often than not, it'll pick something it shouldn't (it's probabilistic, and wrong as often as it is right), and that can cause headaches.

As for NUMA: I'm not blaming the kernel for anything :-)

And hypervisors are sometimes a given, and not a choice. We don't all get to pick our hardware, nor what hardware is made available to us.


Definitely there's a lack of shared wisdom about tuning parameters (along with other things in HPC, at least in the circles I see), and too much churn with Linux v. userland (e.g. cgroups).

Actually, I was forgetting the stupidity of the memory cgroup invoking the OOMkiller, rather than giving ENOMEM, if you purely use the cgroup for memory accounting -- a sensible-looking change from OpenVZ was rejected. That means there's probably no indication about what's happened unless the job is correlated with the syslog message. As I don't get to do that stuff any more, I haven't looked for a way to hook it now. There's obviously a window in which it can fail, but the resource manager can track the job's PSS, as well as at least ulimiting data and stack. You certainly don't want the job to start paging, if you have swap -- that definitely wastes the resource. If you really don't want to limit jobs' memory use, the resource manager can at least adjust the oom_score.
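
As a concrete illustration (cgroup v2, with a made-up cgroup name and a placeholder PID), capping a job, checking whether the cgroup OOM killer was what hit it, and the oom_score knob mentioned above:

  $ echo 64G > /sys/fs/cgroup/myjob/memory.max
  $ cat /sys/fs/cgroup/myjob/memory.events    # oom / oom_kill counters
  $ echo 500 > /proc/$JOB_PID/oom_score_adj   # make the job the preferred victim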


When you say:

  the real crime here is how awful Linux is at HPC without a lot of tuning and careful tweaking
To what are you making the comparison? Are the *BSDs better? HP-UX? AIX? Windows? What's the competition in this space that makes Linux look bad in HPC?

Edit: Added "in HPC" at the end for clarification.


I’m curious if GP has any positive examples too, but I did want to mention that being “the best of the awful” doesn’t make one not “awful”.


Typically being 'best of awful' means you have both a difficult problem and a very small dataset to work with where the solutions are not easily generalizable. You have to look at each and every HPC workload as a custom application where very small changes in your input and environment could have massive behavior changes in the application performance.


I don’t disagree with you, but that doesn’t negate GP’s point or mine.


I don't think there has to be a better, existing alternative to say something is terrible. AFAICT, this is just generally a hard space to get right, especially by default, on a system that will power both a data-center-sized supercomputer and a system that is barely more than a microcontroller.


The competition is naturally UNIX-based OSes, given the historical background of such systems.

Even the workloads that use GNU/Linux are most likely running heavily customized versions provided by IBM, HP and co.

https://www.ibm.com/high-performance-computing

https://www.hpe.com/us/en/compute/hpc.html

I also remember that for a while IBM's xl compilers were used quite often, not sure if that is still the case.

See https://www.top500.org/


The competition probably shouldn't be full Unix, and in Blue Gene, for instance, it wasn't on the compute nodes.

I don't know exactly what the current Cray environment is like -- it used to be rather odd -- but most HPC systems run normal EL-ish distributions. Summit is RHEL8, and mostly uses GCC, not XL according to https://gcc.gnu.org/wiki/cauldron2022talks?action=AttachFile...


Sure, but they are mostly UNIX-inspired, if you will, like anything else.


Single-user, single-program doesn't seem very Unix-y to me: https://en.wikipedia.org/wiki/CNK_operating_system There was also Plan 9, but I don't know if it was ever actually used: https://doc.cat-v.org/plan_9/blue_gene/


There has been an ongoing line of argument claiming that HPC workloads need no kernel, or very little of one, and that sharing resources doesn’t make sense at that grain size.

This makes sense to me


Conversely we found in a microbenchmark the other day that allowing THP more than doubled the speed.

Note glibc lets you turn THP on and off per process, which is pretty useful for benchmarking whether it helps or hinders performance.

  $ hyperfine ' nbdkit -U - data "1 * 10737418240" --run exit '
  Benchmark 1:  nbdkit -U - data "1 * 10737418240" --run exit
    Time (mean ± σ):   3.658 s ±  0.049 s    [User: 0.406 s, System: 3.242 s]
    Range (min … max):    3.576 s …  3.713 s    10 runs

  $ hyperfine ' GLIBC_TUNABLES=glibc.malloc.hugetlb=1 nbdkit -U - data "1 * 10737418240" --run exit '
  Benchmark 1:  GLIBC_TUNABLES=glibc.malloc.hugetlb=1 nbdkit -U - data "1 * 10737418240" --run exit
    Time (mean ± σ):   1.655 s ±  0.007 s    [User: 0.299 s, System: 1.350 s]
    Range (min … max):    1.643 s …  1.666 s    10 runs


We achieved more than 3 times speedup using "on-demand" transparent huge pages in ClickHouse[1] for a very narrow use-case: random access to a hash table that does not fit in the L3 cache but is not much larger.

But there was a surprise... more than 10 times degradation of overall Linux server performance due to increased physical memory fragmentation after a few days in production: https://github.com/ClickHouse/ClickHouse/commit/60054d177c8b...

It was seven years ago, and I hope that the Linux kernel has been improved. I will need to try "revert of revert" of this commit. These changes cannot be tested by microbenchmarks, and only production usage can show their actual impact.

Also, we successfully use huge pages for the text section of the executable, and it is beneficial for the stability of performance benchmarks because it lowers the number of iTLB misses.

[1] ClickHouse - high-performance OLAP DBMS: https://github.com/ClickHouse/ClickHouse/
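
A quick way to see how much of a running process (ClickHouse or anything else) actually sits on huge pages, on a reasonably recent kernel -- $PID is a placeholder:

  $ grep -E 'AnonHugePages|ShmemPmdMapped|FilePmdMapped' /proc/$PID/smaps_rollup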


This is why the main feature of a "hugepage aware allocator" is deciding when and how to deallocate, to avoid fragmentation or exploding a collapsed page. https://google.github.io/tcmalloc/temeraire.html


In what version(s) does that tunable work? I don't see evidence of it in RHEL8 or Debian 11 (where /lib64/ld-linux-x86-64.so.2 --list-tunables doesn't work).


glibc 2.34 and above. That version isn't in any RHEL (will be in RHEL 10).
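
A quick way to check what a given box has (the --list-tunables switch itself only exists on newer glibc, which is why it fails on the distros mentioned above):

  $ ldd --version | head -1
  $ /lib64/ld-linux-x86-64.so.2 --list-tunables | grep glibc.malloc.hugetlb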


Transparent hugepages seem mostly useless, but standard hugepages can be a lifesaver. I have a postgres server with 1TB RAM, and without using hugepages the machine would regularly fill the page table causing an OOM. With a suitable number of 1GB hugepages assigned, it is rock solid.


A database server seems like the ideal use case for (real) huge pages: a single application with a low-ish number of threads on a machine with a huge amount of RAM. Lots of page table misses with normal sized pages, but very little memory wasted from huge pages and no memory contention from other software.

Not coincidentally, Microsoft's primary use case for 1GB pages on Windows is MS SQL server.


> Microsoft's primary use case for 1GB pages on Windows is MS SQL server.

It is automatically enabled, but only if you have the "Lock pages in memory" privilege assigned to the database engine service account...

... which the default SYSTEM account has enabled by default...

... but not if you set up a typical cluster with a domain user account.

So in other words, the scenario where you would want this enabled the most -- large expensive enterprise clusters -- is where it is accidentally disabled with no warning.

I use this as a free 10-20% performance boost. Add in a few other similar tuning settings and I can get a 50% perf improvement on just about any "enterprise" cluster without having to get clever.

Turbo buttons are fun to press.


> Add in a few other similar tuning settings

Have you ever written up some of the things you do to improve performance, or is that something you're unable to share?


The bare list of safe "turbo buttons" is below. Just google them for the specifics and the benefits:

- Enable the "Lock pages in memory" privilege.

- Enable the "Perform volume maintenance" privilege.

- Enable the "Ad hoc query optimisation" setting.

- If on-prem, set power management to "High Performance", ideally in the hardware BIOS, but failing that, at the hypervisor level.

- Upgrade the OS and DB engine version if this is compatible with the apps.

- Run sp_Blitz and implement any critical recommendations.

The below recommendations apply to cloud-hosted servers only:

- Upgrade to the latest gen DB-optimised VM sizes. In Azure I use Ebds_v5 and Eads_v5.

- Use the "temporary storage" SSD volume for tempdb.

- Consolidate storage. Stripe a single large logical volume across multiple 1 TB disks. NEVER use the 1990s-era many-small-volumes scheme of having separate data, log, tempdb, and templog drives!

- Move the DB VM and the App VMs into a single "Proximity Placement Group" or the equivalent construct.

In my experience the above list generally yields a total performance improvement of anywhere from 40% to 300% with minimal risk.

Sprinkling some light tuning on top, such as judiciously creating a handful of indexes with a high predicted performance benefit, can take you even further.

Past that, you have to get elbow deep into the schema, which is typically only possible with "in house developed" databases.



And lose a source of income? This isn't how business is done in MS lands.


I'd assume the biggest turbo buttons are already on stackoverflow. Some may depend on specific database usage patterns.


Would love to hear about the other tuning settings.



> machine would regularly fill the page table causing an OOM

That's quite interesting. How big must the page table be on your server to directly or indirectly cause an OOM?

A 4-level page-table data structure can address 512^4 pages. A single page table can contain 512 entries. And each page-table entry is 64 bytes large. Unless I missed something, this means that the upper-bound memory consumption of the page-table data structure is 512^4 * 64B, which is 4TB of RAM, so I guess it is theoretically possible to OOM on a 1TB machine.

What I don't understand is how huge pages would help you mitigate this problem, because a 2MB huge-page entry will essentially be made up of 512 entries from a single page table. Linux calls them compound pages. And given that all those entries will be present and in use for that particular huge page, the upper-bound memory consumption will remain the same as with 4K pages.

It has been recognized that compound pages might be wasteful, because all those 511 entries are basically pointing to a head entry [1], so unless you're running a recentish kernel on your system, you wouldn't see this advantage.

[1] https://lwn.net/Articles/839737/


This is a machine which is dedicated to postgres. For the particular workload (heavy update) and filesystem (ZFS), it appears from experimentation that optimal performance is achieved by dedicating most of the RAM to postgres' shared buffers, and not really relying on the filesystem cache at all. (As an aside, that might be unusual, the "standard" advice seems to be to set shared_buffers to a smaller value and to rely on the OS cache.)

So, I have about 95% of the RAM assigned to postgres shared buffers. Without hugepages enabled, this was fine when the database started up, but as (presumably) the buffer cache became fragmented over time, the size of the page tables grew to a point where the machine ran out of RAM. I do have vm.overcommit_memory = 2 (never overcommit), but even so the OOM killer was still invoked because, well, the machine had no RAM left. I can't remember exactly how big the pagetables had gotten in this situation, but I think it was a few tens of GBs.

I'm actually using 1GB hugepages, but in all honesty I'm not sure how Linux deals with them or how postgres claims them. From /proc/meminfo it looks like the page table is quite small, with the "Hugetlb" and "DirectMap1G" numbers both around 1TB.
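
For reference, the fields mentioned above all come from the same place (output omitted since it's machine-specific):

  $ grep -E 'PageTables|Hugetlb|DirectMap1G' /proc/meminfo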


Yeah this mirrors my experiences.

I, too, find the file cache to be a significant performance impediment on high-RAM systems. Worse, its behaviour is non-deterministic (or at least hard to reason about.)

It leads to page table fragmentation, and that is the ultimate performance destroyer. Huge pages are a godsend in this case, but you do need a program that is properly optimised to handle them. At least most RDBMSes do this reasonably well.


If you assign 95% of the memory to postgres, then yes, it is realistic for the page tables to grow by several GBs under long and heavy workloads, so the machine will eventually run into an OOM, especially if you use vm.overcommit_memory = 2 (never overcommit).

Mitigating the OOM by switching to hugepages may suggest that you're running a kernel with the optimized page-table hugepage handling, thanks to which there are fewer page-table entries and consequently the whole page table ends up being smaller. Currently, I have no other explanation.


I only looked curiously at the relevant Linux code. But my understanding is that compound pages don't show up in the page table of a process. Yes, there's an associated data structure with each physical page, but that's separate from per process tracking.

The problem with the page table and postgres is that they're separately maintained for each process, even if all processes share the same shared memory area.

With 1GB or 2MB pages that's not too bad. But with 4k pages you can easily end up with dozens to hundreds of MB for each connection. Wasted memory + wasted cycles (tlb misses).
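
A rough way to see this on a running server is to sum VmPTE across the backends (a sketch, assuming they are all named postgres):

  $ for p in $(pgrep -x postgres); do grep VmPTE /proc/$p/status; done \
      | awk '{s+=$2} END {print s " kB of page tables"}'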


There's a new Linux feature called Shared Page Table Entries that allows processes to share these

It was requested by the Oracle DB team

https://lwn.net/Articles/919143/

I recognize your username so you'd know better than most whether Postgres would be able to use this or not though


I did vaguely follow the effort - last time I checked nothing seems to have been merged yet. Did it get merged since?

I'd expect it to benefit postgres substantially.


It doesn't appear to be merged yet in any of the 6.2 RC's from what I can see =/

I am eagerly awaiting its release as a DB enthusiast and lifelong Postgres user


A compound page is just an implementation detail of how the page-table data structure represents pages larger than 4K. And this is reasonable to me, because in order to create a mapping for a 2MB page, the kernel needs to group 512 consecutive 4K pages, and per my understanding this is exactly what the kernel will do. For a 1GB page it will have to group 512 page tables, each containing 512 4K entries. There is no separate page-table data structure that the kernel maintains for 2MB or 1GB pages. There's only one.

However, you're right that multiple processes maintaining separate page tables, each containing mappings to the same physical memory regions, can be wasteful.

Still, due to how huge pages are represented in the page-table data structure, I don't see what difference it would make to switch to 2MB or 1GB pages, unless you're running a recentish kernel (~3 years) that has been specifically optimized for exactly this case, i.e. to reduce the memory footprint of the huge-page representation in page tables. Specifically, a single 2MB huge page is now represented by 8 page entries (512 bytes) instead of 512 page entries (2MB). And there was another patch proposing to cut the 8 page entries down to only 2.

In any case, that's a dramatic cut in memory footprint compared to how it used to be, which is why I believe this is the actual reason OP sees the difference.


Oh, so what I had assumed was some kind of page table fragmentation was in fact due to the number of postgres clients increasing, meaning that the number of page table entries increased due to duplication between processes?

That would make sense, in retrospect: the OOMs almost always occurred at the point of highest client load.


Interesting - would you be able to expand a bit on why hugepages makes a difference for Postgres?

I have only the most basic knowledge of hugepages - picked up on it being a necessity for stable performance in VMs, but don't understand why it would matter for Postgres as well.

And do you end up having to configure Postgres to use hugepages, or does it pick them up automatically?


It's the difference between configuring the kernel to pre-allocate contiguous memory for huge-page allocation vs. _transparent_ huge pages, which rely on the kernel's compaction process.

PostgreSQL needs to be reconfigured to use them properly. The shared_buffers setting needs to be adjusted to allocate in [huge page size] units.


I appreciate that, thanks for the explanation. I understand the difference between hugepages and transparent huge pages, I just don't understand how not using hugepages was impacting the parent's Postgres, or causing OOM errors.

@menaerus actually has a great comment on this that digs deeper into it, looking forward to seeing what comes out of that thread as well.


I gave a more detailed response on the post by 'menaerus, but to answer your question on config directly, you first need to enable the appropriate number of hugepages with a kernel boot parameter, e.g.

    hugepagesz=1G hugepages=940
and then add these config lines to postgresql.conf:

    huge_pages = on
    huge_page_size = 1GB
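
A quick sanity check after the restart, to confirm the pool was reserved and picked up:

    $ grep -E 'HugePages_(Total|Free)' /proc/meminfo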


I wrote a post some time ago on how to measure the perf impact and pitfalls of THP. I hope it helps: https://alexandrnikitin.github.io/blog/transparent-hugepages...


Have you tried turning off THP and just using -XX:+UseLargePages? It always seems to do same-or-better for me and I don't have to worry about some other process accidentally stealing my (precious) huge pages.
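
For the record, that looks something like the following (heap sizes and the jar are placeholders, and the explicit hugepage pool has to be reserved beforehand, e.g. via vm.nr_hugepages):

  $ java -XX:+UseLargePages -Xms8g -Xmx8g -jar app.jar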


Besides ZFS, VMware Workstation on Linux also interacts badly with transparent huge pages: https://unix.stackexchange.com/questions/679706/kcompacd0-us...


Had that problem too, and it's not the first time VMware was hit.

My "tune for vmware" script has grown to three lines over the years:

   echo never > /sys/kernel/mm/transparent_hugepage/defrag
   echo 0     > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
   echo 1     > /proc/sys/vm/compaction_proactiveness # default 20


4k pages are too small and 2 MB pages too large. I wish we x86 users had 64 kB pages.

4 kB pages reduce CPU performance due to frequent TLB misses and make virtual memory perform worse due to the number of faults.


On Power ISA the most common page sizes are 4K and 64K. My Raptor Talos II has 64K pages in Fedora.

That said, enough useful "workstation" things assume 4K and break on 64K that I'm hopeful aarch64's use of 16K pages will get people to realize that not everything uses the same page size. I see transparent hugepages as trying to have your cake and eat it too.


What things typically fail? (Nothing I've noticed on an HPC system running a few desktop-ish things under x2go.)


For a while, Chromium was the biggest offender, and some places may still ship incompatible Electron binaries. I don't use Chrome but I'm told this is now fixed. Hangover also needs 4K, and some kernel modules assume it. Those are all off the top of my head, but overall it's starting to get better.


Even 1MB pages might be desirable for modern hardware. It would be really interesting to see some charts run against different types of workloads, e.g. games, databases, compiles, etc. More interesting still if TLP (Transparent Large Pages) could be set to a given size per... cgroup, maybe?


Different applications would certainly benefit from different page sizes.

Use 1G pages, and boom, TLB misses and faults are just gone. But definitely not a good default choice!

Any choice is a tradeoff. The larger the page size, the more physical memory is going to be wasted. On the other hand, I guess with the current NVMe hardware the effect on swap might not be too bad.

64 kB pages would mean the number of TLB misses and faults are just one sixteenth of the current status quo. Diminishing returns and all.


On AArch64 you can have multiple combinations of mapping sizes:

4KiB/2MiB/1GiB

16KiB/32MiB

64KiB/512MiB/4TiB


Is this a per-CPU model choice? Or are there CPUs that allow selecting the page sizes (at boot time presumably)?


The latter.


Pretty sure that is exactly what GP is referencing when he explicitly wishes it for x86.


Yes indeed.


I have recently been studying HPC configs, and the docs I follow most (AMD, SUSE) recommend disabling THP by default, a performance impact I was able to replicate myself. On the other side, standard hugepages bring a lot of advantages; I set them up during kernel boot but didn't test properly, and am currently using 4GB.

If you really want to bypass page latency, I recommend buying cheap persistent memory. Not all filesystems support it yet (ext4 and XFS?), but with the -o dax mount option you don't use pages at all. If the data is temporary, tmpfs with hugepages seems to be quite common usage, but I also found that simulating non-existent persistent memory with the kernel memmap argument and mounting a filesystem with dax brings similar or even faster performance in most of my cases. You do still have to script the mkfs/mount at each boot, though.
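
A sketch of that memmap trick (the 4G region at a 16G offset is arbitrary; adjust it to your RAM layout):

  memmap=4G!16G    # kernel boot parameter, carves /dev/pmem0 out of RAM
  $ mkfs.xfs /dev/pmem0
  $ mount -o dax /dev/pmem0 /mnt/pmem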


Transparent hugepages don't work well with VMware Workstation. Every time I install Workstation on a Linux machine, I have to disable hugepages after khugepaged pegs 100% of a CPU core and drags the VM into a molasses-like state.

I believe the consensus is that Workstation makes lots of memory allocations and fragments the memory, causing the defragmenter to kick in.

VMware says won't fix, lol.


> In the normal default kernel settings, this only happens for processes that use the madvise(2) system call to tell the kernel that a mmap()'d area of theirs is suitable for this.

And everything makes sense now. I was having terrible terrible problems with Java and transparent huge pages. It only affected Java. I think this is because Java was the only thing we were running that would pre-allocate a large chunk of address space for the heap like this. I ended up disabling THP.

However, on a more recent install, the problem magically went away, and I got the impression that the problem was fixed. The article is recently-written, but they're not using an old Linux install, are they?
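
For anyone debugging something similar, the active mode is the bracketed value here (output illustrative):

  $ cat /sys/kernel/mm/transparent_hugepage/enabled
  always madvise [never]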


There's a lot of benefits to going to 1g-pages, but treating them as 256 4k-pages seems almost always to be a mistake; All I ever hear is horror stories. Does anyone have a story of "good luck" with transparent hugepages?


Surely you mean 262,144 4k pages?


Yes indeed.


So the presence of the khugepaged task in the process list gives users something to blame, but it sounds like this person’s system was just out of memory and was trying to do direct reclaim in that task. The use of ZFS is sort of a red flag. That thing is full of leaks.

In future cases it would be useful to look at the stats in /sys to see what it is doing. Under non-OOM conditions it’s not easy for it to use a lot of CPU because it defaults to scanning only small runs of pages and waiting ten seconds between scans.
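
Concretely, the counters worth checking before blaming khugepaged (names vary a little across kernel versions):

  $ grep -E '^(thp_|compact_)' /proc/vmstat
  $ cat /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
  $ cat /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed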


THP can boost performance for apps that actually allocate massive amounts of memory under specific conditions, but it's also had some issues with memory leaks specific to THP that are hard to spot. Have these been fixed? I've not seen any mention of it. I have always disabled THP on everything, especially machines with high memory pressure.

On systems that I entirely engineer and control I over-provision memory, so it's less of an issue, but on systems that I have provisioned for others I could not prevent them from using all the memory, and that is where THP would get people into trouble. I've tried to encourage people to use cgroups with memory limits, to no avail. Our work-around was to set a fairly high min_free based on a formula suggested by Red Hat and set the overcommit ratio to 0 to create a safety buffer zone.

I should add this was during the 3.x and 4.x kernel versions. I have not done any performance testing with THP on 5.x and rather just disable it out of habit.


They’re not leaks, it’s just that the slop at the end of a present page is potentially 512 times more than it would be with 4K pages.


The default setting of max_ptes_none is also problematic.

On a stock kernel, it's 511. TCMalloc's docs recommend using max_ptes_none set to 0 for this reason: https://github.com/google/tcmalloc/blob/master/docs/tuning.m...

(Disclosure: I work on TCMalloc and authored the above doc.)
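
Checking and lowering it looks like this (needs root; 511 is the stock default mentioned above):

  $ cat /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
  511
  $ echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none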


As I commented on another huge pages item, you don't always want to allocate memory with malloc -- specifically, you may want to put your Fortran arrays on the stack. Note that in current Debian the THP default is "always". Also, "PageTables" in /proc/meminfo hasn't been mentioned yet.
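
For completeness, that's just:

  $ grep PageTables /proc/meminfo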


I was really hoping someone would have made a smart joke about “well what did you expect to happen when you cover your computer in transparent sheets of paper.” But alas, there was no joke to be found here.

But to get back on topic, I’ve also had some bad luck with these but that was mostly due to me making the mistake of using them for a database (I didn’t RTFM).


Yeah, it seems the "transparent" part is not a very good idea in the end. I made a basic 2MiB page allocator recently, and I know I will add explicit 1GiB/512GiB (4-level page tables) support at some point in the future. But since that is rare, I may go "brutal" with 2MiB page contiguity scanning at allocation time; the alignment constraint lightens that scanning a lot.


Almost amusing to see this article in 2023. It must be more than 15 years since I battled THP problems with large JVM heaps, after RH enabled the feature in RHEL. To be honest I thought that since everyone switched to Ubuntu, and they had THP disabled by default, the whole thing had been lost to the mists of time.


I see there is some sort of "page merging" process going on here, which screams "defragmentation / garbage collection" class of problems, all of which seem to have bad edge cases that crop up not infrequently.

Is that the basic problem?


With our application, THP enabled manifests as excessive system CPU. I have a nice graph of a busy system where it dropped from 30% to 5% immediately.
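
A handy way to attribute that system CPU, for anyone seeing similar symptoms (needs the sysstat package for pidstat):

  $ pidstat -u -p "$(pgrep -d, 'khugepaged|kcompactd')" 5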







