Disable transparent hugepages (nelhage.com)
178 points by wheresvic3 on Nov 28, 2017 | 60 comments



Please be aware that the article describes a problem with a specific implementation of THP. Other operating systems implement it differently and don't suffer from the same caveats (though any implementation will of course have its own disadvantages, since THP support requires making various tradeoffs and policy decisions). FreeBSD's implementation (based on [1]) is more conservative and works by opportunistically reserving physically contiguous ranges of memory in a way that allows THP promotion if the application (or kernel) actually makes use of all the pages backed by the large mapping. It's tied in to the page allocator in a way that avoids the "leaks" described in the article, and doesn't make use of expensive scans. Moreover, the reservation system enables other optimizations in the memory management subsystem.

[1] https://www.cs.rice.edu/~druschel/publications/superpages.pd...


Hey, thanks for being a FreeBSD dev. Every time I've been on a FreeBSD system my reaction has been "this is kinda weird, but really nice." (Especially compiling from ports.) The fact that y'all are connected with academic communities who can solve these problems in more principled ways is really wonderful, as are the BSD/Solaris attitudes of "hey let's wall off things that don't need to interact" and "the kernel is the most important part, but it's not the only thing."


It's worth pointing out that the FreeBSD implementation (on AMD64) only promotes 4kB pages to 2MB pages and doesn't transparently promote to 1GB pages.

Given alc@ was an author on the paper (and the paper's FreeBSD 4.x implementation supported multiple superpage sizes), I'm not really sure why FreeBSD's pmap doesn't have support for 1GB page promotions.


F5/LineRate did it but it got NACKed in a fairly underhanded and unfortunate way on the mailing lists :/ https://github.com/Seb-LineRate/freebsd/commits/seb/stable-1...


That patch set does not implement transparent creation of 1GB mappings. It also contains dubious things like this, which make me think the branch was a WIP: https://github.com/Seb-LineRate/freebsd/commit/66a8d3474d410...

The only mailing list thread I see regarding this is here, and it doesn't seem particularly underhanded to me: https://lists.freebsd.org/pipermail/freebsd-hackers/2014-Nov...


All the technical critique seems fair, but it seemed like they (both as an individual and as a company) were a first-time contributor and no outreach was really done to pull them in further. I guess LineRate imploded within F5, so there could have been structural problems inside that prevented them from making a fully baked contribution anyway.


Can you clarify what you mean by "underhanded and unfortunate way?" The thread I can see on freebsd-hackers@ had a little feedback, including some very valid critiques (the code was developed against 9.x, then ported to 10.x without testing, at a time when 11 was CURRENT) and the original author just didn't follow up at all:

https://lists.freebsd.org/pipermail/freebsd-hackers/2014-Nov...

But maybe I'm missing something?

See also: https://lists.freebsd.org/pipermail/freebsd-hackers/2013-Sep...


I've had a really bad run-in with transparent hugepage defragmentation. In a workload consisting of many small-ish reductions, my programme spent over 80% of its total running time in pageblock_pfn_to_page (this was on a 4.4 kernel, https://github.com/torvalds/linux/blob/v4.4/mm/compaction.c#...) and 97% of the total time in hugepage compaction kernel code. Disabling hugepage defrag with echo never > /sys/kernel/mm/transparent_hugepage/defrag led to an instant 30x performance improvement.

There's been some work to improve performance (e.g. https://github.com/torvalds/linux/commit/7cf91a98e607c2f935d... in 4.6), but I haven't checked whether it fixes my workload.


Did you try allocating hugepages statically at startup? This will also remove the fragmentation.
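
For anyone who hasn't done that before, here's a minimal sketch of the application side of it, assuming huge pages have already been reserved via vm.nr_hugepages (the size and error handling are illustrative only):

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      size_t len = 2UL * 1024 * 1024;  /* one 2MB huge page */

      /* MAP_HUGETLB asks for an explicitly reserved huge page
         (from the vm.nr_hugepages pool), not a THP-managed one,
         so it isn't subject to khugepaged or on-demand compaction. */
      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED) {
          perror("mmap(MAP_HUGETLB)");  /* fails if no huge pages are reserved */
          return 1;
      }

      memset(p, 0, len);  /* touch it so it's actually backed */
      munmap(p, len);
      return 0;
  }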


The algorithm was implemented in a big data framework that handles the allocations, so I would have needed to significantly adapt its memory subsystem to change this. I've talked to the authors, though, and it's not easy to change. Easier to disable transparent hugepage defrag, especially when there's a paper deadline to meet :)


So glad this is on the front page of HN. A good 30% of perf problems for our clients are low-level misconfigurations such as this. For databases: huge pages good, THP bad.


Not to mention that there was a race condition in the implementation which would cause random memory corruption under high memory load. Varnish Cache would consistently hit this. Recently fixed:

https://access.redhat.com/documentation/en-us/red_hat_enterp...


Agreed. Found this to be a problem and fixed it by switching it off three years ago. It seems to be a bigger problem on larger systems than on small ones. We had a 64-core server with 384GB RAM, and running too many JVMs made khugepaged go into overdrive and basically cripple the server entirely - unresponsive, getting 1% of the work done, etc.


I stumbled upon this feature when some Windows VMs running 3D accelerated programs exhibited freezes of multiple seconds every now and then. We quickly discovered khugepaged would hog the CPU completely during these hangs. Disabling THP solved any performance issues.


KVM?


VMware 12.5.x


<on-the-clock>Do you mind opening a support ticket for this with VMware? You can't be the only person seeing this, and it'd be great for us to check for this specifically when dealing with mystery-meat "bad perf in XYZ VM" bugs.</on-the-clock>


Bad advice... The following article is much better at actually measuring the impact:

https://alexandrnikitin.github.io/blog/transparent-hugepages...

The conclusion in particular is noteworthy:

> Do not blindly follow any recommendation on the Internet, please! Measure, measure and measure again!


I do not agree much with this conclusion. If you can't measure very well, the safe bet is to disable THP: it can improve performance by some percentage in certain use cases, but it can totally destroy others. So when there is not enough information, the potential gain/loss ratio is terrible. I would say "blindly disable THP", unless you can really undertake costly, use-case-specific measurement and prove to yourself that THP is beneficial in your case.


It's much worse than that, though, because this isn't a case of measuring throughput with, then without, and seeing which is best. Rather, your application is sailing toward a submerged iceberg that, when it hits (could be next week), will stall your process, and potentially the entire box, for 60 seconds.

And it doesn't print a message like "yeah I stalled your box for the last 60 seconds in order to shuffle deckchairs around, sorry" in syslog.

So you pull your hair out trying to figure out why your nice stable service all of a sudden sets off Nagios at 2am for no obvious reason, every week or two.


As a counterpoint, consider that random recommendations from the internet can easily get outdated.

So apparently, transparent hugepages have some issues in their current implementation that can cause big performance losses in some cases. Seems to me like that's a bug, and I see no reason why that bug couldn't be fixed in the future.

By following random recommendations, you get into situations where the underlying problem has been fixed for ages, but people still cargo-cult some workaround that actually makes things worse with the new implementation.


If you can't measure (very well?), how would you know whether the improvement exists in a certain use case?


Indeed. If you can't really measure the difference, then I'd say setting it either way probably doesn't matter anyway.


More like: if you can't measure the difference, then definitely turn it off, because if it is on there is a non-zero chance of significant instability events in your future.


If I'm understanding their comments correctly, it's because the downside isn't just a possible non-improvement or decrease in performance; it's instability and unpredictable behavior. I worry that it could translate into those vague and difficult-to-reproduce "the application is weird/slow" reports.

Of course you could profile and measure performance to determine whether the warning is applicable, but is that something I should be doing for every part of the stack? I should, but should I prioritize it over x, y, or z?


If you apply this same reasoning consistently to every other performance optimization, I doubt you'll be left with a quick computer.


"Don't change any defaults unless you can do extensive use-case-specific measurements" is probably the best way to keep a computer quick, I would say.


Quick! Download that registry cleaner that promises you to get your machine up to lightning speed!


I would also say the same if you host a Ruby or Python app, or anything using forking really.

Similar to the issues you had with Redis, the kernel's change to enable THP by default totally destroyed CoW sharing for forked Ruby processes, despite Koichi Sasada's change to make the GC more CoW-friendly. Without disabling THP, a single run of GC marking can cause the entire heap to be copied for the child.


I feel the same: you should only use these kinds of performance improvements if you MUST, not just to gain speed willy-nilly. Speed always comes at a price, and if it's not needed, then it's not needed. Faster is not always better!


Do not blindly follow any recommendation on the Internet, please! Measure, measure and measure again!

It's also important to measure in your actual use-case, and not just with benchmarks that seem "close enough"; I know it sounds odd, but I've seen others adjust settings and then prove that it worked with a benchmark that they claim is "representative", when in reality they didn't actually improve anything because that "representative benchmark" differed from the real use case in precisely the way that would not respond to the adjustment.

Blindly following "best practices" is bad enough, but "proving" that the changes work with crucially-different benchmarks is worse; and when it's some expensive consultant doing such things, I think it may even approach fraud.


I agree, with the caveat that in the case of THP you should disable it by default, and then measure to prove it’s worth enabling. Or even better, set the setting to ‘madvise’ and let applications decide whether they want huge pages or not.
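
For reference, the opt-in under ‘madvise’ mode looks roughly like this from the application side (just a sketch; the region size is arbitrary):

  #include <stdio.h>
  #include <sys/mman.h>

  int main(void) {
      /* A large, long-lived anonymous mapping is the kind of region
         worth promoting to huge pages. */
      size_t len = 64UL * 1024 * 1024;
      void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }

      /* With /sys/kernel/mm/transparent_hugepage/enabled set to "madvise",
         only regions explicitly flagged like this are eligible for THP. */
      if (madvise(p, len, MADV_HUGEPAGE) != 0)
          perror("madvise(MADV_HUGEPAGE)");

      return 0;
  }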

It baffles me that THP became enabled by default (is it? I think it’s only a default on RHEL distros?). It really screws up many assumptions that applications might make about memory behavior (like the page size). In the majority of cases, THP is a bad, bad idea, and I think anyone with perf or devops experience will agree.

Do you want to impose GC-like pause characteristics on all processes on your box? And possibly double, triple, or 10x your memory usage? Enable THP then.


> I think it’s only a default on RHEL distros?

Since "enabled=always" is the kernel default value, anything that uses a stock kernel (example: Arch family) or has to build its own (Gentoo) will probably have it enabled by default.

I just checked, and my Gentoo and Manjaro systems have it set to "enabled=always".


It's enabled by default because it actually works fine in most cases. It has issues with certain workloads (databases, Hadoop, etc.) where you'll do much better if you allocate a Huge Pages region (not to be confused with THP) in advance.

Anyway, they recently added a new "defer" mode for defragmentation, so THP doesn't try to defragment (the main cause of the slowdown) upon allocation and instead triggers it in the background (via kswapd and kcompactd). This is now set to be the default. I think it is available in RedHat/CentOS 7.3+.


Depends what you mean by "works fine". IMHO any feature that can send the kernel off into a tens-of-seconds dream state underneath my process is just unforgivable and totally broken. Definitely good to hear that this is being done out of band now.


I guess what I said is that most of the time latency is not as important as throughput. And in those scenarios it generally works fine.

The best of both worlds, though (although it requires more manual work), is pre-allocating HugePages in advance and then letting the application use them (if the application supports it) or going through libhugetlbfs (if it doesn't).

Edit: changed hugetlbfs to libhugetlbfs, so it's easier to find how to do it with man libhugetlbfs


> is it? I think it’s only a default on RHEL distros?

I can confirm that neither my Ubuntu nor my Debian servers have it set to "always"; they're either "madvise" or "never".


FWIW, Debian 9 (stretch) has it set to "madvise", but my Debian unstable machine has it set to "always". Looking further, I can see that /boot/config-4.12.0-1-amd64 has:

  # CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
  CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
and /boot/config-4.13.0-1-amd64 has:

  CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
  # CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
So this is a recent change.

Edit: The Linux kernel source says the default is always (in mm/Kconfig), and that's been true since 2011.

The debian package changelog says the change occurred in 4.13.4-1:

  * thp: Enable TRANSPARENT_HUGEPAGE_ALWAYS instead of
    TRANSPARENT_HUGEPAGE_MADVISE
The reason is not given in the changelog itself, but it's given in the git log of the debian packaging:

As advised by Andrea Arcangeli - since commit 444eb2a449ef "mm: thp: set THP defrag by default to madvise and add a stall-free defrag option" this will generally be best for performance.

https://anonscm.debian.org/cgit/kernel/linux.git/commit/debi...

Edit 2: The mentioned commit (444eb2a449ef) dates back to 4.6, so presumably, at least some performance issues with transparent huge pages may be gone since that version of the kernel.


Interesting. I'm running Debian unstable, and recently my system would sometimes lock up under heavy memory pressure. I'm using VirtualBox, which has its own kernel module, so I can't be sure Linux itself is to blame, but the timing seems to coincide with when I switched to that kernel version. Maybe transparent hugepages uncovered a VirtualBox bug or even a kernel bug. And I care about worst case performance more than average performance, so I just now set it to "never".


On Ubuntu 14.04/amd64 I see:

  $ cat /sys/kernel/mm/transparent_hugepage/enabled
  [always] madvise never

I can also confirm that my Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-101-generic x86_64) droplet has the same setting.


The author opened a bug in Ubuntu about this here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1703742

It is in the process of being fixed (it should be set to madvise).


Quite curious, I'm running Ubuntu 16.04.3 LTS (4.11.0-14-generic x86_64) and have "madvise".


The parent said "droplet" so I assume Digital Ocean. Unless you've installed the host yourself, from scratch, you can't be sure the option hasn't been changed.

Makes me think that your setting is a default and his was set by Digital Ocean.


No, I have another Ubuntu 16.04 machine at home - same kernel version, same settings. He must have installed kernel 4.11 manually, because linux-image-generic currently pulls kernel 4.4.0-101-generic on 16.04; settings depend on kernel version.


Yes, I find it interesting that DO would change random kernel flags.


I suspect settings actually depend on kernel version: 4.4 -> 'always', 4.11 -> 'madvise'

Perhaps there are issues with 'madvise' in kernels prior to 4.11, so they chose 'always' rather than 'never'.


> Bad advice...

That is exactly the reason I wrote the post! That advice is based on a specific use case, bug, or outdated kernel. The jemalloc case (the Digital Ocean post) is a good example: it just doesn't (didn't) know about THP https://github.com/jemalloc/jemalloc/issues/243

I can only repeat it: "Measure, measure and measure again!"


Conclusion: If you have 256 GB of RAM then keep transparent hugepages enabled!


Transparent hugepages cause a massive slowdown on one of my systems. It has 64GB of RAM, but it seems the kernel allocator fragments under my workload after a couple of days, resulting in very few >2MB regions free (as per /proc/buddyinfo) even with >30GB of free RAM. This slowed my KVM boots down dramatically (10s -> minutes), and perf top looked like the allocator was spending a lot of cycles repeatedly trying and failing to allocate huge pages.

(I don't want to preallocate hugepages because KVM is only a small part of my workload.)


Shouldn't huge pages be used automatically if you malloc() large amounts of memory at once? Wouldn't that cover some of the applications that benefit from it?


I think the more fundamental issue is that transparent huge pages can break the assumed (key word) interface between the application and the system, and this can manifest in very non-obvious ways. Only the application knows the workload characteristics for using the pages it has reserved, and THP messes with assumptions you might be making otherwise. This turns it into a pessimization, overall.

For example, if I map a bunch of pages of memory, use some, and then set MADV_DONTNEED on just a few of those pages afterwards (so they can be given back to the system temporarily), this will only work if I know the entire page is unused. If a page magically gets 512x larger under my feet, it's possible it will "coalesce" with another page -- into a huge page -- that can't be given back, because some of the (now much larger) page is still needed. This is what happens with `jemalloc` + `redis`, where what looks like a memory leak is actually a failure to give back pages to the operating system, because small pages coalesce into huge ones automatically, defeating MADV_DONTNEED.
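
A rough sketch of that pattern in C (the sizes and offsets are arbitrary, just to illustrate the shape of the problem):

  #include <stdio.h>
  #include <sys/mman.h>

  #define MB (1024UL * 1024)

  int main(void) {
      /* The allocator grabs a big arena and hands out small chunks of it. */
      char *arena = mmap(NULL, 16 * MB, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (arena == MAP_FAILED) { perror("mmap"); return 1; }

      for (size_t i = 0; i < 16 * MB; i++)  /* the application touches all of it */
          arena[i] = 1;

      /* Later the allocator frees one 4kB chunk in the middle and tells the
         kernel it may reclaim that page... */
      if (madvise(arena + 3 * MB, 4096, MADV_DONTNEED) != 0)
          perror("madvise(MADV_DONTNEED)");

      /* ...but with THP enabled, khugepaged can fold that region back into a
         2MB huge page because its neighbours are still in use, so RSS never
         drops the way the allocator expects -- the jemalloc+redis "leak". */
      return 0;
  }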

Whether or not malloc(3) uses THP "under the hood" in this case is more an implementation decision (there are many malloc variants), but it just punts the problem down the road. Ultimately it manifests as a violation of the "unspoken protocol" between the kernel and application when managing memory mappings.

All that said, there are some posts elsewhere in this thread pointing out that FreeBSD's Huge Page implementation avoids some of these deficiencies (possibly as a trade for some of its own), so there's clearly still room for better implementations: https://www.cs.rice.edu/~druschel/publications/superpages.pd...


malloc() is higher level than what the kernel does.

At the lower syscall level you move the BRK address, which is the highest memory address you're allowed to use. By default this is just after the statically initialized memory.

malloc() is just a library that manages this memory for you.

Linux has no idea whether you will use the memory you just allocated; usually this happens dynamically: when you access a memory region for the first time, it is backed by physical memory for real.
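
A tiny example of the break moving without anything being backed yet (uses the deprecated-but-still-present sbrk(2) wrapper, just for illustration):

  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
      void *before = sbrk(0);      /* current program break */
      sbrk(1 << 20);               /* grow the heap by 1MB, roughly what
                                      malloc() does for small allocations */
      void *after = sbrk(0);
      printf("break moved from %p to %p\n", before, after);
      /* None of that 1MB is backed by physical pages until it is written. */
      return 0;
  }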


Modern allocators use mmap(2).


They do? Quite interesting. Though it shouldn't really change the underlying mechanism: only pages the process has touched get mapped.


I may add: if you mmap(2) a memory segment (even a very large one, say 1 TB), nothing happens with regard to page mapping. It is not uncommon to see Java Tomcat processes allocating north of 80 GB, but only a small percentage of that is actually used.
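
A quick illustration of the gap between the mapping and the resident set (assuming a 64-bit system and default overcommit settings):

  #include <stdio.h>
  #include <sys/mman.h>

  int main(void) {
      size_t len = 1UL << 40;  /* reserve 1TB of address space */
      char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }

      /* Nothing is backed by physical memory yet: VIRT is huge, RSS is tiny. */
      p[0] = 1;           /* first touch faults in one page...    */
      p[1UL << 30] = 1;   /* ...and this touch faults in one more */

      /* Only the touched pages (4kB each, or 2MB if THP kicks in) become
         resident; compare VIRT vs RSS in top(1) to see the difference. */
      return 0;
  }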


Well, the page tables are initialized. Those aren't totally free, especially if a large mapping uses 4k pages.


Brendan Gregg's presentation at re:Invent today reflected this advice. Netflix saw good and bad perf so switched back to madvise.


Good article, though as other posters suggest, only use it if you absolutely must, and measure/test the results for any issues!


What's the recommendation on a desktop for gaming / browsing / compiling with 32GB of RAM?


Leave it on (or whatever the default is).

The issue happens on specific workloads (databases, Hadoop, etc.) and (this is often not mentioned) after the system has been running uninterrupted for quite a while. The slowdown comes from the fact that those workloads cause memory to become fragmented, and the kernel then tries (unsuccessfully) to defragment the memory on each allocation.

Since the workload you mentioned looks like it is for a workstation that won't be running a database 24/7 over months/years, you are very unlikely to run into it.



