> typical advice in Java-land is to have a young generation in the 5-10 GiB rang...

yminsky · on April 11, 2015

Here's an example of someone from Linked-In describing a "low latency" setup with a 6GiB heap.

https://engineering.linkedin.com/garbage-collection/garbage-...

I've heard similar things from folks at Twitter, IIRC. But I do find the whole thing kind of mysterious, I have to admit. I'd love to learn that I was wrong.

bjourne · on April 11, 2015

I don't think that person knows what he is talking about. Either you pick high-throughput or low-latency. You don't get both. They got the young generation collection pause down to 60 ms which is completely crap. :) The gc for Factor which I've been hacking on has a young generation pause of 2-3 ms (depending on object topology, generation sizes and so on). But the young generation is only 2mb so the number of gc pauses is significantly higher.

chipsy · on April 11, 2015

Although it's probably not at all what that author meant, there is a way in which throughput and latency have a kind of fractally-layered relationship. For example, if you have a distributed computation that requires all parts to be finished before any result is returned, your throughput is dependent on the average latency for each part being consistently low. If one of them gets stuck it becomes a huge bottleneck, and speeding up the rest won't matter. And at the lowest levels of optimization a similar concern arises when CPU throughput is ultimately concerned with avoiding unnecessary latency such as cache misses.

For modern systems, latency seems to be increasingly important, as the fast paths have all grown various asynchronous aspects. For some systems 60ms pause might be fine, for others 2ms might be way too much. It's definitely a "what scale are you thinking about" kind of thing.

bjourne · on April 11, 2015

What the author (and I) meant by "low-latency" is a gc were the work is very fairly distributed among the mutators allocation requests. The gc with highest throughput just allocates memory and never frees it. But then you sacrifice other properties such as extremely high memory usage..

tmd83 · on April 12, 2015

I think what he meant is an application that is serving a lot of requests at low latency which would be the ideal scenario for a api/cache server so not much of crap. And I'm not sure why you are comparing a 6G new gen to a 2MB new gen. Do you mean to say that a 70ms GC for a 6G heap is too low? It is fairly possible to hit those or even lower range for a heap of that size depending on the data. I have even heard of people hitting even lower GC pause though I myself haven't been able to do that personally.

istvan__ · on April 11, 2015

In fact low latency and high-throughput are usually best friends. You cannot maintain high throughput if your operations are taking longer. Also have a look to this picture. Using G1, doing few hundred MB/s.

http://postimg.org/image/gms25ibnl/

the8472 · on April 11, 2015

Throughput is generally measured in the fraction of CPU cycles spent on GCing. The Parallel Old Gen collector is more efficient in that regard (i.e. provides more compute-throughput) than CMS, Zing or G1, but it is not concurrent and thus you have longer STW pauses compare to the latter collectors.

The concurrent collectors trade some throughput for latency by occupying additional threads for concurrent work. Due to the additional synchronization actions (more atomic instructions, read/write barriers, additional cleanups during shorter STW pauses) they are less efficient in overall CPU cycles spent.

So it certainly is a tradeoff.

Of course a collector that can burn more memory bandwidth and CPU cycles will generally lead to lower pause times, so in that sense increasing throughput of the collector is good for latency. But increasing the collector's throughput leaves less resources for your actual application, decreasing effective throughput.

istvan__ · on April 11, 2015

I don't think that you are wrong, most of the JVM users are not forced to look into how GC works unless they are exposed to extreme high scale and load like LinkedIN, Twitter, etc. It is not uncommon to roll with 6GB+ heap. GC gets in your way if this high scale meets with low latency requirements for p99 (and above) latency, again which these guys care about a lot.

I think the best approach to GC based memory management is what Erlang does, extremely limited scope, no global locking and tiny GC time. I am not entirely familiar how the OCaml VM works, just started to play with the environment. Also, my understanding is that OCaml is not for highly concurrent systems. Anyways it is kind of offtopic here.

The summarize:

- JVM GC details are extremely important for high throughput low latency systems at scale, as far as I know the G1 GC is used for predictable low latency compactions, and I can verify that with my experiments, having 10ms GC pauses

- I think the Erlang approach is superior to garbage collected systems, but it requires no shared memory across your threads (or in the Erlang case processes), so the GC scope is tiny (and few other niceties in BEAM)

More on Java G1: http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Ru... More about Erlang's GC here: http://prog21.dadgum.com/16.html

dzderic · on April 11, 2015

A large young gen makes sense since it reduces the frequency of minor GCs but doesn't affect the duration of them, since the running time of a young gen GC is only proportional to the amount of live objects.

bjourne · on April 11, 2015

Yes it does affect the duration of them. The larger nursery you have, the more gc roots you will need to trace due to write barriers from older generations. So yes you are right that the duration is dependent on the number of live objects at gc time, but the number of live objects is also dependent on the size of the nursery.

(and whoever is down-voting me, maybe you can explain why I'm wrong?)

legoviking · on April 11, 2015

My understanding is that an entry in the card table is set when an object in young generation is allocated that is referenced by something in the old generation. An entry in the card table corresponds to a 512 byte segment of memory in the old generation. Thus, the cost imposed by this would be based on how many distinct 512 byte segments of the old generation reference any objects in the young generation.

If you have a web service that mostly consists of some baseline of long-lived objects and many short-lived objects used for fulfilling requests, I would expect to have relatively few GC roots. At that point, if you assume that you have a consistent request rate, I would expect the number of reachable objects in the young generation to remain constant regardless of the size of the young generation, and the number of GC roots should also remain constant. Based on that, increasing the young generation size would then decrease the frequency of young generation garbage collection, reduce the probability of survivors getting promoted to old generation, and have no effect on the time it takes to do young generation garbage collection. There certainly applications that have different behavior when the old generation is less static, but I would think for this use case the new generation size should be as big as it can be.

If something I've said is incorrect or incomplete, I'm anxious to know. There are relatively few well-written explanations of how Java garbage collection works, so it is difficult to have confidence regarding it without a lot of practical experience as you have said.

bjourne · on April 11, 2015

That's a good explanation of why I'm wrong. Basically you are hoping to reach an equilibrium situation in which 0% of the allocated memory in nursery are true survivors. Because if the true survival rate was higher than 0%, then the larger the nursery size the longer the duration between collections and the higher the number of objects that are true survivors.

If you had a perfect situation like that, with a giant nursery, you wouldn't even need to gc anything. When the nursery is full, just start over from address 0 and you can be confident that when the new objects starts overwriting the old that the old will already be unreachable from the object graph.

You never reach that situation in reality. Even in a simple web server some request handling thread might do something innocuous like setting a key in a cache hash somewhere leading to the hash being full and needing to be reallocated. That would dirty mark one card. And again, the longer the duration, the more of these "freak" events you get. Or there may be a string somewhere that keeps track of the current date and when it ticks over from "July 31st, 2015" to "August 1st, 2015" it triggers a reallocation because the last string is one character longer.

It may be that having a large nursery is a good trade-off because for many loads it's the same cards being marked over and over again. That may outweigh the increased frequency of tenured generation collections (memory isn't free so you must take the space from somewhere).

benjiweber · on April 11, 2015

Not an expert but my experience/basic understanding

The old generation gets at lot more expensive as it gets bigger, and I think requires at least some stop the world with all the collectors in hotspot.

New generation collections often remain quick as the size grows as long as most objects are dying young. Increasing the size of new also gives more opportunity for objects to die before being promoted (if you have lots of objects that live just long enough to be promoted it can be a good strategy to increase size of new). New can be collected concurrently.