> In the fast path, allocation of memory doesn't need to be any more than 5 or 6...

pcwalton · on Feb 14, 2018

Yeah, the canonical solution for multithreaded GC is the third option (TLABs). The TLS overhead is annoying, but on some architectures you can get away with burning a register to save the TLS load. It might well be worth it on AArch64, with its 32 GPRs...

(TLABs are the recommended solution for multithreaded malloc implementations like jemalloc and tcmalloc as well.)
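To make the TLAB idea concrete, here is a minimal sketch in C11 of a thread-local bump allocator. It is not taken from any real collector: the names `tlab_alloc`, `tlab_refill`, and `TLAB_SIZE` are made up for illustration, and `malloc` stands in for whatever a real runtime would use to carve a fresh buffer out of its heap. The cursor and limit live in `_Thread_local` storage, which is where the TLS load mentioned above comes from.

```c
#include <stddef.h>
#include <stdlib.h>

enum { TLAB_SIZE = 64 * 1024 };  /* hypothetical buffer size, multiple of 8 */

/* Per-thread bump region [tlab_cursor, tlab_limit). Each access goes
   through TLS unless the compiler keeps the values in registers. */
static _Thread_local char *tlab_cursor;
static _Thread_local char *tlab_limit;

/* Slow path: fetch a fresh buffer. A real GC would take it from its own
   heap (under a lock or CAS) and later reclaim the old one; malloc is
   only a stand-in here, and the old buffer is simply abandoned. */
static int tlab_refill(void) {
    char *buf = malloc(TLAB_SIZE);
    if (buf == NULL) return 0;
    tlab_cursor = buf;
    tlab_limit = buf + TLAB_SIZE;
    return 1;
}

/* Fast path: round up to 8 bytes, compare against the limit, bump.
   No atomics are needed because no other thread touches this buffer. */
void *tlab_alloc(size_t size) {
    if (size == 0 || size > TLAB_SIZE) return NULL;  /* large objects go elsewhere */
    size = (size + 7u) & ~(size_t)7u;
    if (tlab_cursor == NULL || (size_t)(tlab_limit - tlab_cursor) < size) {
        if (!tlab_refill()) return NULL;
    }
    void *p = tlab_cursor;
    tlab_cursor += size;
    return p;
}
```

Two consecutive calls on the same thread return adjacent addresses, and the common case is a load, an add, a compare, and a store.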

cwzwarich · on Feb 14, 2018

AArch64 has a dedicated register (TPIDR_EL0) for TLS.

pcwalton · on Feb 14, 2018

Didn't know that, thanks!

sitkack · on Feb 15, 2018

https://www.semanticscholar.org/paper/Hierarchical-PLABs%2C-...

Tarean · on Feb 14, 2018

What is the TLS overhead? On x64 you presumably could spare two registers, one for the current heap pointer and one for the end of the heap. How could you implement this more efficiently with a single-threaded version?

Not sure if LLVM has an equivalent of global register variables, though.

MHordecki · on Feb 14, 2018

#3 is a thread-local allocation buffer, which is what HotSpot is using. https://shipilev.net/jvm-anatomy-park/4-tlab-allocation/

MaxBarraclough · on Feb 14, 2018

I must be missing something. From the article:

> the allocation rate goes down at least 5x, and time to execute goes up 10x! And this is not even starting to touch what a collector has to do when multiple threads are asking for memory (probably contended atomics)

...

> For Epsilon, the allocation path in GC is a single compare-and-set - because it issues the memory blocks by pointer-bumps itself.

So both the TLABs and the Epsilon GC are doing pointer-bumps, right? Why does TLAB outperform no-TLAB even in the single-thread case? Just because the pointer-bump instructions are inlined?
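For comparison with a thread-local bump, here is a rough sketch in C11 of what a shared pointer-bump with a compare-and-set could look like. This is an illustration only, not HotSpot's or Epsilon's actual code; `shared_alloc`, `nursery`, and `NURSERY_SIZE` are hypothetical names. It shows that even with a single thread and no contention, every allocation on the shared path performs an atomic read-modify-write, while a thread-local buffer needs only plain loads and stores. Whether that alone explains the measured gap in the article is not something the sketch can settle.

```c
#include <stdatomic.h>
#include <stddef.h>

enum { NURSERY_SIZE = 1 << 20 };  /* hypothetical size, multiple of 8 */

static _Alignas(16) char nursery[NURSERY_SIZE];
static _Atomic size_t nursery_top;  /* offset of the next free byte */

/* Shared bump allocation: every thread races on nursery_top. */
void *shared_alloc(size_t size) {
    if (size == 0 || size > NURSERY_SIZE) return NULL;
    size = (size + 7u) & ~(size_t)7u;
    size_t old = atomic_load_explicit(&nursery_top, memory_order_relaxed);
    for (;;) {
        /* Nursery exhausted: a real collector would trigger a GC here. */
        if (size > NURSERY_SIZE - old) return NULL;
        if (atomic_compare_exchange_weak_explicit(
                &nursery_top, &old, old + size,
                memory_order_relaxed, memory_order_relaxed)) {
            return nursery + old;
        }
        /* CAS failed: `old` was refreshed with the current top; retry. */
    }
}
```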

aidenn0 · on Feb 14, 2018

Aside from the contention overhead, per-thread nurseries also have a fringe benefit on SMP platforms with exclusive L1 caches: short-lived data is unlikely to be shared, so having threads allocate from different cache lines prevents thrashing the lines across cores.

CyberDildonics · on Feb 14, 2018

Why would any allocator split a cache line to multiple threads, and why would a program use dynamic memory allocation for something so small it would fit in the L1 cache?

aidenn0 · on Feb 14, 2018

If your dynamic allocator is a pointer-increment (as it is in most copying generational GCs) and you have a single global nursery, then two threads allocating one after the other get adjacent locations in memory.

On SBCL, it's pretty normal to dynamically allocate a cons-cell, which is the size of two pointers. The garbage collector pays zero cost for any garbage in the nursery when it is run, so the performance overhead versus stack allocation is approximately zero.

I've seen language implementations where there is no stack at all (it's one of the most straightforward ways to implement Scheme). CHICKEN Scheme works this way, for example (it cleverly uses the C stack as the nursery, since functions never return in CHICKEN and since a copying nursery allocation is just a pointer-increment, which is what a C stack allocation is).
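The adjacency point about a single global nursery can be checked with a little arithmetic. The sketch below assumes a 64-byte cache line (common on x86-64, but not universal) and 16-byte cons cells (two 8-byte pointers); the helper names are made up for illustration. With one shared bump pointer, allocations 0 and 1 land at offsets 0 and 16, inside the same line. With per-thread nurseries starting at different bases, they land in different lines.

```c
#include <stddef.h>

enum { CACHE_LINE = 64 };  /* assumed cache line size in bytes */

/* Index of the cache line that contains a given byte offset. */
static size_t line_of(size_t offset) { return offset / CACHE_LINE; }

/* Returns 1 if two nursery offsets fall in the same cache line. */
int same_line(size_t a, size_t b) { return line_of(a) == line_of(b); }

/* Offset of the n-th fixed-size allocation from a bump allocator whose
   region starts at `base`. */
size_t nth_offset(size_t base, size_t n, size_t size) {
    return base + n * size;
}
```

For example, `same_line(nth_offset(0, 0, 16), nth_offset(0, 1, 16))` is 1 (two threads sharing one nursery), while `same_line(nth_offset(0, 0, 16), nth_offset(65536, 0, 16))` is 0 (two threads with separate nurseries 64 KiB apart).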

chrisseaton · on Feb 14, 2018

What was the TLS overhead in practice? On most architectures it's just indirecting through one register, isn't it?

maximilianburke · on Feb 14, 2018

I can't remember the exact numbers; I last worked on the project in 2011.

If I recall, though, because our compiler generated DLLs, we couldn't use the TLS fast-path options; instead we had to use the TLS API (i.e., TlsAlloc/TlsGetValue on Win32), which slowed things down a bit.