Please be aware that the article describes a problem with a specific implementation of THP. Other operating systems implement it differently and don't suffer from the same caveats (though any implementation will of course have its own disadvantages, since THP support requires making various tradeoffs and policy decisions). FreeBSD's implementation (based on [1]) is more conservative and works by opportunistically reserving physically contiguous ranges of memory in a way that allows THP promotion if the application (or kernel) actually makes use of all the pages backed by the large mapping. It's tied in to the page allocator in a way that avoids the "leaks" described in the article, and doesn't make use of expensive scans. Moreover, the reservation system enables other optimizations in the memory management subsystem.
Hey, thanks for being a FreeBSD dev. Every time I've been on a FreeBSD system my reaction has been "this is kinda weird, but really nice." (Especially compiling from ports.) The fact that y'all are connected with academic communities who can solve these problems in more principled ways is really wonderful, as are the BSD/Solaris attitudes of "hey let's wall off things that don't need to interact" and "the kernel is the most important part, but it's not the only thing."
It's worth pointing out that the FreeBSD implementation (on AMD64) only promotes 4kB pages to 2MB pages and doesn't transparently promote to 1GB pages.
Given alc@ was an author on the paper (and the paper's FreeBSD 4.x implementation supported multiple superpage sizes), I'm not really sure why FreeBSD's pmap doesn't have support for 1GB page promotions.
All the technical critique seems fair, but it seemed like they (both as an individual and as a company) were a first-time contributor and no outreach was really done to pull them in further. I guess LineRate imploded within F5, so there could have been structural problems inside there that prevented them from doing a fully baked contribution anyway.
Can you clarify what you mean by "underhanded and unfortunate way?" The thread I can see on freebsd-hackers@ had a little feedback, including some very valid critiques (the code was developed against 9.x, then ported to 10.x without testing, at a time when 11 was CURRENT) and the original author just didn't follow up at all:
I've had a really bad run-in with transparent hugepage defragmentation. In a workload consisting of many small-ish reductions, my programme spent over 80% of its total running time in pageblock_pfn_to_page (this was on a 4.4 kernel, https://github.com/torvalds/linux/blob/v4.4/mm/compaction.c#...) and 97% of the total time in hugepage compaction kernel code. Disabling hugepage defrag with echo never > /sys/kernel/mm/transparent_hugepage/defrag led to an instant 30x performance improvement.
The algorithm was implemented in a big data framework that handles the allocations, so I would have needed to significantly adapt its memory subsystem to change this. I've talked to the authors, though, and it's not easy to change. Easier to disable transparent hugepage defrag, especially when there's a paper deadline to meet :)
So glad this is on the front page of HN. A good 30% of perf problems for our clients are low level misconfigurations such as this.
For databases:
huge pages - good
THP - bad
Not to mention that there was a race condition in the implementation which would cause random memory corruption under high memory load. Varnish Cache would consistently hit this. Recently fixed:
Agreed. Found this to be a problem and fixed it by switching it off three years ago. Seems to be a bigger problem on larger systems than small systems. We had a 64-core server with 384GB RAM, and running too many JVMs made khugepaged go into overdrive and basically cripple the server entirely - unresponsive, getting 1% of the work done, etc.
I stumbled upon this feature when some Windows VMs running 3D accelerated programs exhibited freezes of multiple seconds every now and then. We quickly discovered khugepaged would hog the CPU completely during these hangs. Disabling THP solved any performance issues.
<on-the-clock>Do you mind opening a support ticket for this with VMware? You can't be the only person seeing this, and it'd be great for us to check for this specifically when dealing with mystery-meat "bad perf in XYZ VM" bugs.</on-the-clock>
I do not agree much with this conclusion. If you can't measure very well, the safe bet is to disable THP: it can improve performance by some percentage in certain use cases, but it can totally destroy other use cases. So when there is not enough information, the potential gain/loss ratio is terrible. I would say "blindly disable THP", unless you can really undertake costly use-case-specific measurement activities and are able to prove to yourself that in your use case THP is beneficial.
It's much worse than that though, because this isn't a case of measuring throughput with, then without, and seeing which is best. Rather, your application is sailing toward a submerged iceberg that, when it hits (could be next week), will stall your process, and potentially the entire box, for 60 seconds.
And it doesn't print a message like "yeah I stalled your box for the last 60 seconds in order to shuffle deckchairs around, sorry" in syslog.
So you pull your hair out trying to figure out why your nice stable service all of a sudden sets off Nagios at 2am for no obvious reason, every week or two.
As a counterpoint, consider that random recommendations from the internet can easily get outdated.
So apparently, transparent hugepages have some issues in their current implementation that can cause big performance losses in some cases. Seems to me like that's a bug, and I see no reason why that bug couldn't be fixed in the future.
By following random recommendations, you get into situations where the underlying problem has been fixed for ages, but people still cargo-cult some workaround that actually makes things worse with the new implementation.
More like: if you can't measure the difference, then definitely turn it off, because if it is on there is a non-zero chance of significant instability events in your future.
If I'm understanding their comments correctly, it's because the downside isn't just a possible non-increase (or decrease) in performance; it's instability and unpredictable behavior. I worry that it could translate into those vague and difficult-to-reproduce "the application is weird/slow" reports.
Of course you could profile and measure performance to determine if the warning is applicable, but is that something I should be doing for every part of the stack? I should, but should I prioritize that over x, y, or z?
I would also say the same if you host a Ruby or Python app, or anything using forking really.
Similar to the issues you had with Redis, the kernel change to THP on by default totally destroyed CoW sharing for forked Ruby processes, despite Koichi Sasada's change to make the GC more CoW friendly. Without disabling THP, a single run of GC marking can cause the entire heap to be copied for the child.
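A minimal sketch of that interaction (my own illustration, not the actual Ruby GC change; the 64 MiB size and the smaps check are just stand-ins for demonstration):

    /* Sketch: why a few scattered writes after fork() can copy far more
     * memory once the parent's heap has been collapsed into 2 MiB
     * transparent huge pages. Illustrative only. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    int main(void) {
        size_t len = 64UL << 20;  /* 64 MiB standing in for a Ruby heap */
        char *heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (heap == MAP_FAILED) { perror("mmap"); return 1; }
        memset(heap, 1, len);     /* fault everything in; with THP "always",
                                     khugepaged may back this with 2 MiB pages */

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: one write per 2 MiB, roughly what GC marking does when
             * it touches headers scattered across the heap. With 4 KiB pages
             * each write copies 4 KiB; if the range was collapsed into huge
             * pages, each write copies a full 2 MiB, so CoW sharing with the
             * parent is largely lost. Compare AnonHugePages and Private_Dirty
             * in /proc/self/smaps to observe this. */
            for (size_t off = 0; off < len; off += 2UL << 20)
                heap[off] = 2;
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }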
I feel the same: you should only use these kinds of performance improvements if you MUST, not just to gain speed willy-nilly. Speed always comes at a price, and if it's not needed, then it's not needed. Faster is not always better!
Do not blindly follow any recommendation on the Internet, please! Measure, measure and measure again!
It's also important to measure in your actual use-case, and not just with benchmarks that seem "close enough"; I know it sounds odd, but I've seen others adjust settings and then prove that it worked with a benchmark that they claim is "representative", when in reality they didn't actually improve anything because that "representative benchmark" differed from the real use case in precisely the way that would not respond to the adjustment.
Blindly following "best practices" is bad enough, but "proving" that the changes work with crucially-different benchmarks is worse; and when it's some expensive consultant doing such things, I think it may even approach fraud.
I agree, with the caveat in the case of THP to disable it by default. And then measure to prove it’s worth enabling. Or even better, set the setting to ‘madvise’ and let applications decide whether they want huge pages or not.
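For reference, this is roughly what the 'madvise' route looks like from the application side (a sketch assuming Linux + glibc; the arena size is made up):

    /* Sketch: opting a single mapping in to THP when
     * /sys/kernel/mm/transparent_hugepage/enabled is set to "madvise". */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;   /* a 1 GiB arena the app knows is hot */
        void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) { perror("mmap"); return 1; }

        /* Ask the kernel to back this one mapping with huge pages if it
         * can; everything else in the process stays on 4 KiB pages. */
        if (madvise(arena, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        /* ... use arena ... */
        munmap(arena, len);
        return 0;
    }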
It baffles me that THP became enabled by default (is it? I think it's only a default on RHEL distros?). It really screws up many assumptions that applications might make about memory behavior (like the page size). In the majority of cases, THP is a bad, bad idea, and anyone with perf or devops experience will agree with this, I think.
Do you want to impose GC-like pause characteristics on all processes on your box? And possibly double, triple, or 10x your memory usage? Enable THP then.
Since "enabled=always" is the kernel default value, anything that uses a stock kernel (example: Arch family) or has to build its own (Gentoo) will probably have it enabled by default.
I just checked, and my Gentoo and Manjaro systems have it set to "enabled=always".
It's enabled by default because it actually works fine in most cases; it has issues with certain workloads (databases, Hadoop, etc.) where you'll do much better if you allocate a Huge Pages region (not to be confused with THP) in advance.
Anyway, they recently added a new "defer" mode for defragmentation, so THP doesn't try to defragment (the main cause of the slowdown) upon allocation and instead triggers it in the background (via kswapd and kcompactd). This is now set to be the default. I think it is available in RedHat/CentOS 7.3+.
Depends what you mean by "works fine". IMHO any feature that can send the kernel off into a tens-of-seconds dream state underneath my process is just unforgivable and totally broken. Definitely good to hear that this is being done out of band now.
I guess what I said is that most of the time latency is not as important as throughput. And in those scenarios it generally works fine.
The best of both worlds, though (although it requires more manual work), is pre-allocating HugePages in advance and then letting the application use them (if the application supports it) or going through libhugetlbfs (if it doesn't).
Edit: changed hugetlbfs to libhugetlbfs, so it's easier to find how to do it with man libhugetlbfs
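Roughly what the explicit route looks like when the application does it itself rather than via libhugetlbfs (a sketch; it assumes the admin has already reserved huge pages, e.g. with sysctl vm.nr_hugepages):

    /* Sketch: using pre-reserved Huge Pages (the explicit kind, not THP)
     * directly via MAP_HUGETLB. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 512UL << 21;  /* 1 GiB, i.e. 512 x 2 MiB huge pages */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");  /* fails if nothing is reserved */
            return 1;
        }
        /* ... hand buf to the database buffer pool / allocator / etc. ... */
        munmap(buf, len);
        return 0;
    }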
FWIW, Debian 9 (stretch) has it set to "madvise", but my Debian unstable machine has it set to "always". Looking further, I can see that /boot/config-4.12.0-1-amd64 has:
# CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS is not set
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y
and /boot/config-4.13.0-1-amd64 has:
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
So this is a recent change.
Edit: The linux kernel source says the default is always (in mm/Kconfig), and that's been true since 2011.
The debian package changelog says the change occurred in 4.13.4-1:
* thp: Enable TRANSPARENT_HUGEPAGE_ALWAYS instead of
TRANSPARENT_HUGEPAGE_MADVISE
The reason is not given in the changelog itself, but it's given in the git log of the debian packaging:
As advised by Andrea Arcangeli - since commit 444eb2a449ef "mm: thp: set THP defrag by default to madvise and add a stall-free defrag option" this will generally be best for performance.
Edit 2: The mentioned commit (444eb2a449ef) dates back to 4.6, so presumably, at least some performance issues with transparent huge pages may be gone since that version of the kernel.
Interesting. I'm running Debian unstable, and recently my system would sometimes lock up under heavy memory pressure. I'm using VirtualBox, which has its own kernel module, so I can't be sure Linux itself is to blame, but the timing seems to coincide with when I switched to that kernel version. Maybe transparent hugepages uncovered a VirtualBox bug or even a kernel bug. And I care about worst case performance more than average performance, so I just now set it to "never".
The parent said "droplet" so I assume Digital Ocean. Unless you've installed the host yourself, from scratch, you can't be sure the option hasn't been changed.
Makes me think that your setting is a default and his was set by Digital Ocean.
No, I have another Ubuntu 16.04 machine at home - same kernel version, same settings. He must have installed kernel 4.11 manually, because linux-image-generic currently pulls kernel 4.4.0-101-generic on 16.04; settings depend on kernel version.
That is exactly the reason I wrote the post! That advice is based on a specific use case, bug, or outdated kernel. The jemalloc (Digital Ocean post) case is a good example; it just doesn't (didn't) know about THP https://github.com/jemalloc/jemalloc/issues/243
I can only repeat it: "Measure, measure and measure again!"
Transparent hugepages cause a massive slowdown on one of my systems. It has 64GB of RAM, but it seems the kernel allocator fragments under my workload after a couple of days, resulting in very few >2MB regions free (as per /proc/buddyinfo) even with >30GB of free RAM. This slowed down my KVM boots dramatically (10s -> minutes), and perf top looked like the allocator was spending a lot of cycles repeatedly trying and failing to allocate huge pages.
(I don't want to preallocate hugepages because KVM is only a small part of my workload.)
Shouldn't huge pages be used automatically if you malloc() large amounts of memory at once? Wouldn't that cover some of the applications that benefit from it?
I think the more fundamental issue is that transparent huge pages can break the assumed (key word) interface between the application and the system, and this can manifest in very non-obvious ways. Only the application knows the workload characteristics for using the pages it has reserved, and THP messes with assumptions you might be making otherwise. This turns it into a pessimization, overall.
For example, if I map a bunch of pages of memory, use some, and then set MADV_DONTNEED on just a few of those pages afterwards (so they can be given back to the system temporarily), this will only work if I know the entire page is unused. If a page magically gets 512x larger under my feet, it's possible it will "coalesce" with another page -- into a huge page -- that can't be given back, because some of the (now much larger) page is still needed. This is the case of what happens with `jemalloc` + `redis`, where what looks like a memory leak is actually a failure to give back pages to the operating system, because small pages coalesce into huge ones automatically, defeating MADV_DONTNEED.
Whether or not malloc(3) uses THP "under the hood" in this case is more an implementation decision (there are many malloc variants), but it just punts the problem down the road. Ultimately it manifests as a violation of the "unspoken protocol" between the kernel and application when managing memory mappings.
All that said, there are some posts elsewhere in this thread pointing out that FreeBSD's Huge Page implementation avoids some of these deficiencies (possibly as a trade for some of its own), so there's clearly still room for better implementations: https://www.cs.rice.edu/~druschel/publications/superpages.pd...
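For anyone who wants the MADV_DONTNEED scenario spelled out, here's a small sketch (illustrative only; whether the memory actually stays resident depends on kernel version and khugepaged behaviour, as the jemalloc issue linked elsewhere in the thread discusses):

    /* Sketch: an allocator gives back one small page, but THP coalescing
     * can mean the process's RSS doesn't drop as expected. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4UL << 20;   /* 4 MiB allocator arena */
        char *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (arena == MAP_FAILED) { perror("mmap"); return 1; }
        memset(arena, 1, len);    /* use it; THP may collapse this range
                                     into 2 MiB pages */

        /* The allocator frees one 4 KiB page in the middle and tells the
         * kernel it can take it back. If that page now sits inside a 2 MiB
         * huge page that is still partly in use (or gets re-collapsed by
         * khugepaged later), the memory the allocator meant to return may
         * effectively stay resident - which is what looks like a leak in
         * the jemalloc + redis case. */
        if (madvise(arena + 4096, 4096, MADV_DONTNEED) != 0)
            perror("madvise(MADV_DONTNEED)");

        return 0;
    }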
malloc() is higher level than what the kernel does.
At the lower syscall level you move the brk address (the program break), which is the highest memory address you're allowed to use for the heap. By default this is just after the statically initialized memory.
malloc() is just a library that manages this memory for you.
Linux has no idea if you will use the memory you just allocated, usually this happens dynamically; when you access a memory region for the first time, it is allocated in memory for real.
I may add that if you mmap(2) a memory segment (even a very large one, say 1 TB), nothing happens with regard to page mapping. It is not uncommon to see Java Tomcat processes allocating north of 80 GB. But only a small percentage of this is actually used.
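A tiny demo of that lazy behaviour (illustrative; MAP_NORESERVE is just there so the 1 TB reservation isn't refused by overcommit accounting):

    /* Sketch: reserving a huge virtual range costs (almost) nothing until
     * the pages are actually touched. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 40;  /* 1 TiB of virtual address space */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Only now is physical memory allocated, one page per first touch. */
        for (size_t i = 0; i < (1UL << 20); i += 4096)
            p[i] = 1;            /* faults in ~1 MiB out of the 1 TiB */

        getchar();               /* pause here: compare VmSize vs VmRSS in
                                    /proc/<pid>/status */
        return 0;
    }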
The issue happens on specific workloads (databases, Hadoop, etc.) and (this is often not mentioned) after the system has been running uninterrupted for quite a while. The slowdown arises because those workloads cause memory to become fragmented, and then the kernel tries (unsuccessfully) to defragment memory on each allocation.
Since the workload you mentioned looks like it is for a workstation that won't be running a database 24/7 over months/years, you are very unlikely to run into it.
[1] https://www.cs.rice.edu/~druschel/publications/superpages.pd...