Building a lower-latency GC

jordwalke · on April 10, 2015

Sounds like amazing work is happening here.

For real time user applications such as games, newsfeeds, and mobile apps that have visual/audio feedback, I'm certain that the most important feature is the ability to have applications themselves control the process of collecting and the time constraints to operate within. It is possible to have a collector that is less efficient, but be entirely perceived to be more responsive as long as the wasteful or less efficient work is done at times when it does not matter - and this tradeoff would be gladly welcomed.

So much of UI application time is idle. Even during fluid animations, the majority of the frame time goes wasted. But then something like Objective-C's ARC frees large trees of memory, or perhaps Java's GC collects, and causes your frame time to exceed 16ms. For UI applications, there are periods of time (during animations) where you must hit 16ms deadlines, and the difference between 1ms and 15ms is nothing, but the difference between 16ms and 17ms is everything. UI application animations are like train schedules. If you miss one train by only a microsecond, you still have to wait the entire time until the next train arrives. You don't get points for almost making the train. Furthermore, only the application can know the train schedule. It isn't something the language runtime can possibly know, so to do this right, the language must anticipate this and accept input from the application frameworks.

Then there are other times when we are not performing an animation and we know that we could block a thread for as much as 50ms, without having any perceived delay experienced by the user. The latency constraints for starting a continuous interaction are larger than the constraints for continuing a continuous interaction. So in this case, our application still knows the train schedule, it's just a different train schedule that allows for more time to kill. If applications could tell the GC about this, it might decide that it's a good time to perform a major collection.

I've found that many of what people consider to be performance problems in UI applications aren't problems of efficiency, they're problems of scheduling.

jordwalke · on April 10, 2015

Another anecdote that has less to do with garbage collectors, but still shows the importance frame alignment:

I had been developing a UI interaction for a mobile app, and had been experiencing the strangest stutters while dragging. For the longest time, I blamed the garbage collector (which I realize is the lazy way out - no one will argue with you). However, I eventually realized that the exact same code that executed while dragging could be played on a screen-refresh-driven event and would be as smooth as butter. What was even stranger, is that my two year old iPhone 5 would execute this dragging interaction much more smoothly than my brand new iPhone 6+. Something was going on here clearly.

So I took the garbage collector out of the equation, and just measured the raw hardware events for screen refreshes and touch events and I found that touches may be coming in at a steady 60fps (just like the screen), but they were aligned in the middle of the frame. This was only happening on my higher end device, not on the lower end device which explains why my iPhone 5 felt better than my 6+. This is really bad because if your handler takes a mere 9ms, it will end up boarding the next train, along with the next* touch handler. So zero passengers hop on the first one, and two passengers hop on the next one. What we really wanted was one passenger per train in order to create smooth interactions/animations. What looked like a performance issue, is actually a simple alignment issue! What's worse, is that if you had a naive "fps counter", that must measured how many event handlers completed per second on average, it wouldn't drop below 60fps!

So not only is it important to be able to schedule work (GC or anything really during available frame time), it's important that we properly align this work with the screen refresh rate, both of which are things that can only be done above the language layer.

cnp · on April 10, 2015

Great explanation

chongli · on April 11, 2015

the difference between 1ms and 15ms is nothing, but the difference between 16ms and 17ms is everything. UI application animations are like train schedules. If you miss one train by only a microsecond, you still have to wait the entire time until the next train arrives.

This is why we need a better display protocol. Why are we still treating our displays like synchronous, scanline CRTs? The display should have a simple buffer into which we can do random reads/writes at any time and the affected pixels will be updated immediately. This would solve all of these problems (and eliminate video tearing)!

throwawaymylife · on April 11, 2015

Yes, this would be much better than the status quo, and hopefully technologies like "gsync" will bring something like it to us soon. But this approach will still result in microstutter from frames being displayed at somewhat random times, that may or may not be perceptible (I've never used such a display). Obviously a bit of microstutter is preferable to tearing or missing frames entirely under heavy loads, but really, a synchronous refresh rate would be ideal, if only our process schedulers were designed to give strict priority, more reliable vblank timing, and some degree of safety from preemption to the foreground UI process.

Akkuma · on April 11, 2015

IIRC, Nvidia G-Sync and AMD's Freesync essentially accomplish what you're talking about.

throwawaymylife · on April 11, 2015

Three comments:

This approach can still cause perceptible latency if the user tries to input something while the GC you were trying to hide is ongoing.

For really intense graphical applications, like games, there is never a resting period where you can afford to spend a frame or more of time collecting garbage.

And for simpler graphical applications where you could perhaps get away with this, it's kind of pathetic that programmers today (or rather, their programming languages) even have trouble hitting 60 FPS on modern machines, when you look back at what people were able to accomplish in the past on hardware several orders of magnitude less powerful.

jeffreyrogers · on April 11, 2015

> For really intense graphical applications, like games, there is never a resting period where you can afford to spend a frame or more of time collecting garbage.

Yeah, it's really interesting to hear graphics programmer's opinions on garbage collection. I forget if it was Jon Blow or Casey Muratori who said something like (paraphrasing): "The bottom line is that GC advocates just have different standards for performance."

And while not a GC related thing, but more a performance related one. I find it absurd that when I use the Windows computers at school it takes me upwards of 30 seconds to open a document in Word.

jordwalke · on April 11, 2015

>> This approach can still cause perceptible latency if the user tries to input something while the GC you were trying to hide is ongoing.

With all due respect, I think you're missing the point on this one. For non-games, there are times when blocking for n frames will be perceptible and times when it will not be and we know exactly when these are. Garbage collection (such as ARC) has nothing to do with that particular fact. It doesn't matter what you are doing during those times - you could be performing longer running computation, or garbage collection tasks (such as ARC deallocing large trees, or the JVM). The way this is handled in applications is by either offloading long running tasks to another thread of finding a way to distribute the time across several event loops (whether intentional or not). I'm proposing that the same techniques that we apply to other parts of our apps be applied to garbage collectors.

Also, that threshold for a continuous animation, is absolutely less than the threshold for perceptible instantaneous latency and that is determined by biology, not me. That means you have more time to work with when there is no animation ongoing but one could begin as the result of a tap, (or typing). Now, the OS/platform does consume a large part of this threshold - but that's a different story.

That's a good point about games never having an entire resting frame. But that brings me back to the first desirable GC feature - to provide a way for a GC to use the remaining frame time of thousands of frames, so that it can perform its work incrementally in the remaining time of each individual frame, without ever missing a display link deadline. I think both are important (sub frame yielding to incremental tasks like collections, and yielding entire frames (or two) to longer blocking tasks for applications that can withstand it). Games might not be able to take advantage of the later, but the majority of non-game apps that you have installed on your phone would certainly.

Some would argue that "for application types X there is no available frame time". I would say that is almost always false. Otherwise application type X would drop frames on any machine that was slightly less performant.

throwawaymylife · on April 11, 2015

I do think your logic holds for many combinations of users and infrequently-used applications. If your "brain loop" works like: press a button, then wait to see the result and completely mentally parse the screen before pressing anything else, then yes, there will be a few dozen frames after any action in which the program can safely run a garbage collector. This is the way we all operate when we first start using a new program. But then, for many users, once they've become familiar enough with that program, they no longer need to think about most of the screens, and can rely entirely on muscle memory to navigate.[1] At this point, any additional latency is certainly noticed. The best example would be typing; at least on HN, the vast majority of us can touch type, and any pauses between typing and characters appearing on screen (such as the incessant pauses I experience with Firefox and iOS Safari) are incredibly annoying. And since typing is an important part of almost any application, that basically necessitates a never-ending "continuous interaction," I don't think you can discount the importance of a highly interactive feedback loop for everyday programs.

I don't really see how you were discussing incremental GC in your original post, but yes, incremental GCs designed to scan their heaps in small bursts do exist and are common in embedded languages.

I wish that modern OSes and programming paradigms had better support for signals/"software interrupts", so that, even if I don't think garbage collecting everything is the right strategy for interactive applications, at least you could say "let me run my GC/other background task up until exactly the moment that the next frame starts," instead of having to manually parcel out work into tiny units while poling a timer.

[1] This is a big part of why I prefer "tacky" 90s Windows and UNIX UIs to modern animation-heavy ones. Once I've "learned" the program, I'm not even consciously looking at the UI, so the superfluous animations just force me to add delays of my own to my muscle memory, instead of letting me work as quickly as my muscles can move. Programmers tend to get this right for keyboard-driven interfaces, but they tend to underestimate our ability to use muscle memory for mouse and touch-driven interfaces.

jordwalke · on April 11, 2015

>> I do think your logic holds for many combinations of users and infrequently-used applications.

Do you consider the Facebook app infrequently used? Or virtually every other app in the iOS app store that has a UIControl embedded inside of a scroll view such that highlighting of elements is intentionally delayed? There is certainly a few frames worth of work that can be done while no interruptable animations are occurring and while no interaction is occurring. If you don't believe me, then I suggest you do what I often do: use a very high speed camera and examine your casual usage of an app. You'll find that you are much slower than you think you are.

panic · on April 10, 2015

The latency constraints for starting a continuous interaction are larger than the constraints for continuing a continuous interaction.

I think this is true in general, but there are common cases where you do notice the delay. It's easy to notice a few frames of delay when you start scrolling a web page, for example (which often happens with pages using synchronous touch events). If you want a truly great user experience, you need to make sure your GC never adds more than a few ms of latency. Ideally you'd collect every frame so the latency is consistent.

jordwalke · on April 10, 2015

A few frames on top of the platform/OS/hardware latency may begin to cross into the threshold of perceptions for initiation of continuous movement. If your hardware already has 30ms of latency, then a few additional frames makes approximately 70ms. What you actually see with your eyes when your program consumes 30ms at the start of a scroll may be much larger.

However, you already have a few pixels (not time) of intentionally introduced slop in a scroll view. I'd like the ability to play with all of these constraints (including slop pixel amount), remaining frame time, acceptable latency for initiation of gestures, in order to reach what I consider to be a great user experience.

dada78641 · on April 11, 2015

> If you miss one train by only a microsecond, you still have to wait the entire time until the next train arrives.

This is an apt observation and completely true, except now I can't help but imagine my frames getting stuck in between the closing train doors.

kazinator · on April 10, 2015

> False promotions

> ... When you do hit the end, you do a minor collection, walking the minor heap to figure out what's still live, and promoting that set of objects to the major heap.

I somewhat deal with that issue here:

http://www.kylheku.com/cgit/txr/tree/gc.c

When make_obj exhausts the available FRESHOBJ_VEC_SIZE in the freshobj array (nursery), it first tries to shake things out by doing a partial GC. The partial GC, during the sweep phase, will rebuild the freshobj array with just those gen 0 objects that are reachable. The sweep phase of the incremental GC marches through the freshobj array, and a second index is maintained where live objects are copied back into the array. At the end, that index becomes the new allocation pointer from that array.

Only if the array is still full after this do we then have to do a full gc, and that's when any "false promotion" will happen.

One problem with this approach is that this freshobj nursery becomes a miniature memory arena in itself; as it gets close to full, these partial garbage collections get more frequent. They are not that cheap.

Hmm, it occurs to me I can address that with a one-liner:

    diff --git a/gc.c b/gc.c
    index 6832905..7a0ee1c 100644
    --- a/gc.c
    +++ b/gc.c
    @@ -173,7 +173,7 @@ val make_obj(void)
           malloc_delta >= opt_gc_delta)
       {
         gc();
    -    if (freshobj_idx >= FRESHOBJ_VEC_SIZE)
    +    if (freshobj_idx >= FRESHOBJ_VEC_SIZE / 2)
           full_gc = 1;
         prev_malloc_bytes = malloc_bytes;
       }

I.e. after doing an ephemeral GC, if at least half of the nursery space isn't free, then set the flag for a full gc. (That will happen right away if the nursery has no more objects to satisfy the current request, or else the next time gc is triggered.)

This has been a useful HN submission since it got me thinking about something, leading to some findings in a different project. :)

bjourne · on April 10, 2015

> typical advice in Java-land is to have a young generation in the 5-10 GiB range, whereas our minor heaps are measured in megabytes.

Where does that advice come from? Of what I've seen typical suggested values for the young generation in the jvm and clr is 10-20mb. A young generation measured in gigabytes would defeat the idea of dividing the heap into generations in the first place.

Also, unless you have profiled the gc and know exactly in which steps of the process most time is spent, then you are fumbling in the dark.

yminsky · on April 11, 2015

Here's an example of someone from Linked-In describing a "low latency" setup with a 6GiB heap.

https://engineering.linkedin.com/garbage-collection/garbage-...

I've heard similar things from folks at Twitter, IIRC. But I do find the whole thing kind of mysterious, I have to admit. I'd love to learn that I was wrong.

bjourne · on April 11, 2015

I don't think that person knows what he is talking about. Either you pick high-throughput or low-latency. You don't get both. They got the young generation collection pause down to 60 ms which is completely crap. :) The gc for Factor which I've been hacking on has a young generation pause of 2-3 ms (depending on object topology, generation sizes and so on). But the young generation is only 2mb so the number of gc pauses is significantly higher.

chipsy · on April 11, 2015

Although it's probably not at all what that author meant, there is a way in which throughput and latency have a kind of fractally-layered relationship. For example, if you have a distributed computation that requires all parts to be finished before any result is returned, your throughput is dependent on the average latency for each part being consistently low. If one of them gets stuck it becomes a huge bottleneck, and speeding up the rest won't matter. And at the lowest levels of optimization a similar concern arises when CPU throughput is ultimately concerned with avoiding unnecessary latency such as cache misses.

For modern systems, latency seems to be increasingly important, as the fast paths have all grown various asynchronous aspects. For some systems 60ms pause might be fine, for others 2ms might be way too much. It's definitely a "what scale are you thinking about" kind of thing.

bjourne · on April 11, 2015

What the author (and I) meant by "low-latency" is a gc were the work is very fairly distributed among the mutators allocation requests. The gc with highest throughput just allocates memory and never frees it. But then you sacrifice other properties such as extremely high memory usage..

tmd83 · on April 12, 2015

I think what he meant is an application that is serving a lot of requests at low latency which would be the ideal scenario for a api/cache server so not much of crap. And I'm not sure why you are comparing a 6G new gen to a 2MB new gen. Do you mean to say that a 70ms GC for a 6G heap is too low? It is fairly possible to hit those or even lower range for a heap of that size depending on the data. I have even heard of people hitting even lower GC pause though I myself haven't been able to do that personally.

istvan__ · on April 11, 2015

In fact low latency and high-throughput are usually best friends. You cannot maintain high throughput if your operations are taking longer. Also have a look to this picture. Using G1, doing few hundred MB/s.

http://postimg.org/image/gms25ibnl/

the8472 · on April 11, 2015

Throughput is generally measured in the fraction of CPU cycles spent on GCing. The Parallel Old Gen collector is more efficient in that regard (i.e. provides more compute-throughput) than CMS, Zing or G1, but it is not concurrent and thus you have longer STW pauses compare to the latter collectors.

The concurrent collectors trade some throughput for latency by occupying additional threads for concurrent work. Due to the additional synchronization actions (more atomic instructions, read/write barriers, additional cleanups during shorter STW pauses) they are less efficient in overall CPU cycles spent.

So it certainly is a tradeoff.

Of course a collector that can burn more memory bandwidth and CPU cycles will generally lead to lower pause times, so in that sense increasing throughput of the collector is good for latency. But increasing the collector's throughput leaves less resources for your actual application, decreasing effective throughput.

istvan__ · on April 11, 2015

I don't think that you are wrong, most of the JVM users are not forced to look into how GC works unless they are exposed to extreme high scale and load like LinkedIN, Twitter, etc. It is not uncommon to roll with 6GB+ heap. GC gets in your way if this high scale meets with low latency requirements for p99 (and above) latency, again which these guys care about a lot.

I think the best approach to GC based memory management is what Erlang does, extremely limited scope, no global locking and tiny GC time. I am not entirely familiar how the OCaml VM works, just started to play with the environment. Also, my understanding is that OCaml is not for highly concurrent systems. Anyways it is kind of offtopic here.

The summarize:

- JVM GC details are extremely important for high throughput low latency systems at scale, as far as I know the G1 GC is used for predictable low latency compactions, and I can verify that with my experiments, having 10ms GC pauses

- I think the Erlang approach is superior to garbage collected systems, but it requires no shared memory across your threads (or in the Erlang case processes), so the GC scope is tiny (and few other niceties in BEAM)

More on Java G1: http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Ru... More about Erlang's GC here: http://prog21.dadgum.com/16.html

dzderic · on April 11, 2015

A large young gen makes sense since it reduces the frequency of minor GCs but doesn't affect the duration of them, since the running time of a young gen GC is only proportional to the amount of live objects.

bjourne · on April 11, 2015

Yes it does affect the duration of them. The larger nursery you have, the more gc roots you will need to trace due to write barriers from older generations. So yes you are right that the duration is dependent on the number of live objects at gc time, but the number of live objects is also dependent on the size of the nursery.

(and whoever is down-voting me, maybe you can explain why I'm wrong?)

legoviking · on April 11, 2015

My understanding is that an entry in the card table is set when an object in young generation is allocated that is referenced by something in the old generation. An entry in the card table corresponds to a 512 byte segment of memory in the old generation. Thus, the cost imposed by this would be based on how many distinct 512 byte segments of the old generation reference any objects in the young generation.

If you have a web service that mostly consists of some baseline of long-lived objects and many short-lived objects used for fulfilling requests, I would expect to have relatively few GC roots. At that point, if you assume that you have a consistent request rate, I would expect the number of reachable objects in the young generation to remain constant regardless of the size of the young generation, and the number of GC roots should also remain constant. Based on that, increasing the young generation size would then decrease the frequency of young generation garbage collection, reduce the probability of survivors getting promoted to old generation, and have no effect on the time it takes to do young generation garbage collection. There certainly applications that have different behavior when the old generation is less static, but I would think for this use case the new generation size should be as big as it can be.

If something I've said is incorrect or incomplete, I'm anxious to know. There are relatively few well-written explanations of how Java garbage collection works, so it is difficult to have confidence regarding it without a lot of practical experience as you have said.

bjourne · on April 11, 2015

That's a good explanation of why I'm wrong. Basically you are hoping to reach an equilibrium situation in which 0% of the allocated memory in nursery are true survivors. Because if the true survival rate was higher than 0%, then the larger the nursery size the longer the duration between collections and the higher the number of objects that are true survivors.

If you had a perfect situation like that, with a giant nursery, you wouldn't even need to gc anything. When the nursery is full, just start over from address 0 and you can be confident that when the new objects starts overwriting the old that the old will already be unreachable from the object graph.

You never reach that situation in reality. Even in a simple web server some request handling thread might do something innocuous like setting a key in a cache hash somewhere leading to the hash being full and needing to be reallocated. That would dirty mark one card. And again, the longer the duration, the more of these "freak" events you get. Or there may be a string somewhere that keeps track of the current date and when it ticks over from "July 31st, 2015" to "August 1st, 2015" it triggers a reallocation because the last string is one character longer.

It may be that having a large nursery is a good trade-off because for many loads it's the same cards being marked over and over again. That may outweigh the increased frequency of tenured generation collections (memory isn't free so you must take the space from somewhere).

benjiweber · on April 11, 2015

Not an expert but my experience/basic understanding

The old generation gets at lot more expensive as it gets bigger, and I think requires at least some stop the world with all the collectors in hotspot.

New generation collections often remain quick as the size grows as long as most objects are dying young. Increasing the size of new also gives more opportunity for objects to die before being promoted (if you have lots of objects that live just long enough to be promoted it can be a good strategy to increase size of new). New can be collected concurrently.

amelius · on April 11, 2015

It would be extremely useful if GC solutions were built in a more language agnostic way. This would allow language designers to pick a GC that suits their needs, instead of reinventing the wheel every time.

mjn · on April 11, 2015

This is discussed now and then, but it's difficult to get all the architectural pieces to align to make it really work. A GC needs some kind of cooperation with the language, so you need to agree on common interfaces, which is a bit of a coordination problem.

One of the goals of the now-defunct C-- project was to do that by providing a low-level target language for compilers, which also had a portable set of GC hooks. But it didn't reach that goal before running out of funding/developers. The most popular current target language filling similar space is the LLVM IR. LLVM has some goals regarding enabling GC via standardized hooks, but not rising to the level of actually providing a language-agnostic GC: http://llvm.org/docs/GarbageCollection.html#goals-and-non-go...

Another option is to plug in a collector like Boehm (http://www.hboehm.info/gc/) or MPS (http://www.ravenbrook.com/project/mps/). Boehm or the conservative mode of MPS sidestep some of the integration/architecture issues, with various accompanying downsides.

benjiweber · on April 11, 2015

Perhaps we could have some sort of virtual machine, which could allow different languages that target it to share garbage collection solutions? ¬_¬

amelius · on April 11, 2015

I was thinking of a solution that is a little more orthogonal (modular). The JVM for example, is overdesigned for this purpose. One problem it has, is that it doesn't support proper tail call optimizations, and this affects most functional languages targeting this virtual machine. The GC of this VM, however, would be perfectly suitable for these languages.

edwintorok · on April 12, 2015

Since you brought up Java here is some information that I could find about real-time garbage collection:

* JamaicaVM's RTGC seems to be the closest to OCaml's GC, including the 'GC does more work when you allocate more', although the benefit is that threads that don't allocate don't have to run the GC which doesn't apply to OCaml (don't know about Multicore OCaml): https://www.aicas.com/cms/sites/default/files/Garbage.pdf http://www.aicas.com/papers/jtres07_siebert.pdf It seems they provide a tool (static analysis?) to determine the worst-case allocation time for an application

* FijiVM's approach: http://janvitek.github.io/pubs/pldi10b.pdf

* on eliminating pauses during GC and compaction, although with quite invasive changes to the read barrier: http://www.azulsystems.com/sites/default/files//images/wp_pg... http://www.azulsystems.com/sites/default/files/images/c4_pap...

Once you have a GC with predictable behaviour you also need a standard library with predictable behaviour (which you have with Core_kernel, right?), and Javolution has an interesting approach, they annotate the public API of the library with time constraints and the compiler can give some warnings: http://javolution.org/apidocs/javolution/lang/Realtime.html http://javolution.org/apidocs/javolution/lang/Realtime.Limit... http://javolution.org/ I don't know how this works in practice, it'll probably just warn if you try to use a LINEAR time function when you claim your function should be CONSTANT time, or if you put things inside loops. Not sure if it'd be worth trying out this approach for Core_kernel, but with OCaml annotations and ppx it might be possible to automatically infer and check:

* annotate stack depth of function (none, tail-recursive, linear-recursive, tree-recursive)

* annotate worst-case time complexity (guaranteed const (no loops, or just constant length ones, no recursion, no allocation); const (like guaranteed but with allocations); log (manual annotation); linear (recursion with decreasing list or loop); quadratic (one nested loop); unknown)

* I/O blocking behaviour (although you do this by hiding the blocking functions in Async.Std already)

AceJohnny2 · on April 10, 2015

(for OCaml)

chc · on April 11, 2015

What language would you naturally expect Jane Street to be creating a garbage collector for?

AceJohnny2 · on April 11, 2015

While I totally knew what I was clicking through, I don't expect everyone to know what Jane Street is famous for.

Know your audience.

jim_greco · on April 10, 2015

Just don't GC :)

nulltype · on April 11, 2015

I hear redis is the new heapness.