Tracking Java native memory with JDK flight recorder (morling.dev)
136 points by mfiguiere 10 months ago | 32 comments



Author of the post here, nice to see it being discussed. For folks interested in learning more about JFR, here are a few other posts on the topic:

* https://www.morling.dev/blog/finding-java-thread-leaks-with-...: Discusses how to find thread leaks with JFR and JFR Analytics, a project I've created for querying recordings with SQL

* https://www.morling.dev/blog/towards-continuous-performance-...: Discusses how to use JFR for continuous performance testing, by means of asserting "proxy metrics" such as allocation rates and IO

* https://www.morling.dev/blog/rest-api-monitoring-with-custom...: Discusses how to create your own application-specific JFR events
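To give a flavor of what an application-specific event looks like, here's a minimal sketch using the standard jdk.jfr API (event name and fields are illustrative, not from the post):

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// A minimal custom JFR event: extend jdk.jfr.Event, annotate the fields,
// then begin()/commit() around the work you want to measure.
@Name("demo.RestCall")      // hypothetical event name
@Label("REST Call")
public class RestCallEvent extends Event {
    @Label("Path")
    String path;

    @Label("Status Code")
    int statusCode;

    public static void main(String[] args) {
        RestCallEvent event = new RestCallEvent();
        event.begin();
        // ... handle the request ...
        event.path = "/orders";
        event.statusCode = 200;
        event.commit();     // recorded only if a recording is running
        System.out.println("event committed");
    }
}
```

Committing is cheap when no recording is active, so events like this can stay in production code paths.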


> the one thing which NMT does not report, despite what the name might suggest, is any memory allocated by native libraries, for instance invoked via JNI

If you're using glibc, then malloc does have information about that, and provides ways to read it, so it's a shame this isn't exposed. It would be quite helpful in the face of suspected native library memory leaks.


jemalloc + memleak and perf work pretty well in that case. I think you could do something similar with tcmalloc.


If we are talking replacing the libc allocator, then something like heaptrack is worth mentioning.

https://github.com/KDE/heaptrack


For summary views:

    $ jfr view native-memory-reserved rec.jfr
and

    $ jfr view native-memory-committed rec.jfr
https://x.com/ErikGahlin/status/1736530559231201484


debugging native calls in itself is also painful. I have switched to using async-profiler (https://github.com/async-profiler/async-profiler) instead of JFR for most of my use cases.

A. it tracks native calls by default
B. it can track wall time as well
C. you can have neat interactive flamegraphs


I can't believe we need to use async profiler tbh.

It should all be in jfc/jmc


I hear there are plans to fix that.

In the meantime you can get async profiler to output in JFR format and combine it with a separate JFR recording.


Got any links by chance?


In the past, with Java 8, JFR wasn't free to use in commercial production environments, so we built our own profilers. It's nice that with the NFTC license, Oracle allows you to use JFR in production from Java 11 onwards. But they can revoke the NFTC at any time, so it is still wise to develop and use open-source or private alternatives to JFR. Besides, JFR still doesn't profile everything well, and you still need to build infrastructure for integrated distributed profiling and tracing.


What's a real-world use case for this? Been doing big enterprise Java for a decade and never ran into a scenario where I would need this. Don't get me wrong, I think there's most certainly value in this, but usually, doing plain ole Spring Boot or the dreaded Adobe Experience Manager, things just "work". So just honestly curious what problem this would help me solve? I'm guessing if you develop a platform you need performance and need to find weak points in your platform? Or is this for building out tools like Dynatrace / New Relic?


I've worked in applications where a lot of IO is required. If performance is something your application cares about then you'll probably end up using direct ByteBuffers which are off heap and you'll likely want to set a sensible value for: -XX:MaxDirectMemorySize

However if this value ever does get exceeded, you need some way of tracking down what allocations happened prior to your OutOfMemory exception.

Though the above article implies the sampling rate is once a second (I guess there's some cost to increasing that rate). Usually you won't be allocating direct memory regularly, since it's expensive to allocate and deallocate relative to heap memory, so you really want to capture ALL allocations and deallocations. As such, a sample-based approach is not ideal, since it can miss data between samples.
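A minimal sketch of the failure mode described above (class name and sizes are illustrative): direct allocations count against -XX:MaxDirectMemorySize rather than -Xmx, and exceeding that limit throws an OutOfMemoryError.

```java
import java.nio.ByteBuffer;

// Run with a small direct memory cap to see the limit kick in, e.g.:
//   java -XX:MaxDirectMemorySize=16m DirectLimitDemo
public class DirectLimitDemo {
    public static void main(String[] args) {
        // Off-heap allocation: lives outside the GC-managed Java heap.
        ByteBuffer ok = ByteBuffer.allocateDirect(8 * 1024 * 1024);
        System.out.println("allocated " + ok.capacity() + " bytes off-heap");
        try {
            // With the 16m cap above, this second allocation exceeds the limit.
            ByteBuffer tooBig = ByteBuffer.allocateDirect(32 * 1024 * 1024);
            System.out.println("allocated " + tooBig.capacity() + " bytes off-heap");
        } catch (OutOfMemoryError e) {
            // This is the error you then need to trace back to its allocations.
            System.out.println("direct memory limit exceeded: " + e.getMessage());
        }
    }
}
```

Without the flag, both allocations succeed; the point is that tracking down which code path performed the allocations before the error is exactly the hard part.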


> I've worked in applications where a lot of IO is required. If performance is something your application cares about then you'll probably end up using direct ByteBuffers which are off heap and you'll likely want to set a sensible value for: -XX:MaxDirectMemorySize

It's also worth noting that if your Xmx is larger than 32 GB, you can't use CompressedOOPs, which is a bad deal. Off-heap memory lets you skirt that limitation while still allocating >32GB.


> If performance is something your application cares about then you'll probably end up using direct ByteBuffers

What made you decide to go this route, instead of pre-allocating a pool of buffers (possibly thread-local) and recycling them?


Pre-allocating up front would also have been valid way of doing things.

For the specific use cases I have worked on though, I'm not sure it would have had any additional performance benefits. The specific class of systems I've worked on usually have only a few sockets which open at the start of a day and stay connected until the end of a day. Pre-allocating would make the socket opening process faster but that is not usually the part of an application life cycle that needs optimising. If there were lots of sockets opening and closing or the speed of opening and closing needed optimisation, then what you suggested would be a good idea.
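The pre-allocate-and-recycle idea can be sketched roughly like this (class name, buffer size, and the ThreadLocal holder are all illustrative choices, not from the thread):

```java
import java.nio.ByteBuffer;

// One direct buffer per thread, allocated up front, cleared before each reuse.
public class BufferPool {
    private static final int BUF_SIZE = 64 * 1024;

    private static final ThreadLocal<ByteBuffer> POOL =
        ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(BUF_SIZE));

    // Borrow the calling thread's buffer, reset for a fresh write.
    public static ByteBuffer acquire() {
        ByteBuffer buf = POOL.get();
        buf.clear();
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer a = acquire();
        a.putLong(42L);
        ByteBuffer b = acquire();          // same thread -> same buffer, cleared
        System.out.println(a == b);        // true
        System.out.println(b.position());  // 0
    }
}
```

As the comment notes, this mainly pays off when buffers are acquired and released frequently; with a handful of long-lived sockets, allocate-once achieves the same thing.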


thanks this is helpful!


I’ve encountered native memory issues in those “enterprise” apps. Lots of frameworks set an upper bound, and if you load one library too many you can run out of JIT code cache, leading to constant compilation, which tanks performance. More ways to get insights can’t hurt.


That's a good question. I can't imagine that you'd need to use this to track normal garbage collected memory. There were pretty good tools already in place for that.

I guess what you'd want to use this for, is when your application is directly allocating memory, e.g. via direct byte buffers. That's not something you'd do in an enterprise application, it's more something you'd need for high performance image processing, or maybe for some extremely high performance web server.


Using offheap memory allocation in Java is generally a bad idea. Java is a GC'd language. If you need to manage memory manually, use a language designed for that. The lengths people go to NOT learn another language (while having to learn stuff like this and do primitive memory management in your application code - which itself may use the GC if you're not careful) amazes me.


It’s the same as using “unsafe” in rust, it shouldn’t be necessary by default, but when you do need it, you can have a well-defined part that safely encapsulates all the logic of dealing manually with memory. The end result will be a safe program with the assumed memory/performance-bottleneck of doing it naively solved.


>Using offheap memory allocation in Java is generally a bad idea.

Memory mapped files and direct buffers do work well. I'd never consider it a bad idea; technically both would be freed if there are no strong references to them (not much different than allocating massive arrays).

Consider that the JDK standard library has to use native memory for any IO, or even a trivial zip compression. ByteBuffers were introduced in 1.4 (around 20 years back) to allow operations outside the managed heap; the tools are available to anyone, e.g. implementing zstd with direct buffers is not hard.
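For the memory-mapped-file case mentioned above, a minimal sketch (file name and sizes are illustrative): FileChannel.map hands back an off-heap MappedByteBuffer backed by the OS page cache.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("demo.bin");
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Map 4 KiB of the file read-write; this memory is off-heap.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putLong(0, 42L);     // write goes straight to the mapped region
            buf.force();             // flush dirty pages to disk
            System.out.println(buf.getLong(0)); // 42
        }
    }
}
```

Like other direct buffers, the mapping is released only when the buffer becomes unreachable (or the process exits), which is part of why allocate-once/reuse is the usual pattern.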


Unfortunately, you do pay a fairly large price for bytebuffers, even off-heap ones.

I saw a pretty jaw-dropping speedup moving from bytebuffers to the new foreign memory stuff in the marginalia search index code. Most operations run about 50-100% faster, even discounting the rigmarole needed to deal with mmapping >2 GB files. I haven't dug too deep into why this is, so I'm not entirely sure; it may have something to do with being able to declare non-shared memory ranges in the arena allocator, which saves a bunch of synchronization maybe?

I can't wait for this stuff to leave experimental. Having explicit lifecycle control is such a quality-of-life boon for dealing with off-heap memory.
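The explicit lifecycle control being referred to looks roughly like this (a sketch using the foreign memory API, which was a preview in JDK 21 and finalized in JDK 22; the class name and sizes are illustrative):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ArenaDemo {
    public static void main(String[] args) {
        // A confined arena: usable only from the owning thread,
        // which lets the runtime skip cross-thread synchronization.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(1024); // off-heap
            segment.set(ValueLayout.JAVA_LONG, 0, 42L);
            System.out.println(segment.get(ValueLayout.JAVA_LONG, 0)); // 42
        } // all segments freed deterministically here, no GC involved
    }
}
```

Unlike a direct ByteBuffer, whose native memory is released only when the buffer object is collected, every segment allocated in the arena is freed the moment the try-with-resources block exits.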


> Unfortunately, you do pay a fairly large price for bytebuffers, even off-heap ones.

I'd disagree. Heap-based bytebuffers - I don't consider them interesting, they are just byte arrays. The direct ones suffer from unmapping (munmap on Linux) - it's a slow process that has to flush the TLB. So the only sane way to use them is allocate once/reuse. Anything else I'd consider a programmer error.

mmap on large files indeed does suck (a bit) w/ ByteBuffers as it'd require an array of them.

Personally I have been using direct buffer since 1.4, so I guess I am also quite used to them as well.


Right, I'm not talking about mapping or allocation costs or anything surrounding the access, just the impact of addressing off-heap bytebuffers, which seems quite a bit slower than what you get with the new foreign memory API... for whatever reason.


You mean using direct buffers in Java - that could be, if the JIT fails to remove the bounds checks.

It should never be an issue in native code, of course. But yes - it's possible that it happens with Java code. In that case you'd need to print the assembly and look at the generated code.

Some libraries have 'switched' to straight Unsafe use (which has no bounds checks, of course), so there is that.


This makes zero sense. I'll believe it when someone proves that (warmed up) direct byte buffer access is slower than Unsafe.


I'm not comparing against Unsafe, this is JDK21. No unsafe anymore.


> I saw a pretty jawdropping speedup moving from bytebuffers to the new foreign memory stuff

That's just a sign of a bad benchmark. Off-heap BBs and foreign memory are both native (non-JVM) heap. They are the same thing.


While I fully agree that as a rule you shouldn't have to do manual memory management in Java, the functionality is there and useful in exceptional situations. I've seen integrations where there's no '100% pure Java' implementation available, so you have to choose between picking a battle-tested native library or reimplementing from scratch in Java (which sometimes isn't even possible with closed-source systems).


It is called having your cake and eating it too, in terms of productivity, IDE tooling, and library ecosystem - for the same reasons ML frameworks use Python with bindings to C++ libraries, instead of being 100% written in C++.


been a big fan of flight recorder since it was part of JRockit & Mission Control, IIRC. great to see it making something of a comeback.


Here's some more NMT usage without flight recorder: https://blog.arkey.fr/2020/11/30/off-heap-reconnaissance/



