Concurrent Memory Deallocation in the Objective-C Runtime

pron · on June 6, 2015

That's a cool GC algorithm, but it has two big drawbacks:

1. It works well only if the number of GCed data structures -- i.e. the number of entry/exit points -- is relatively small (in this case, just the method cache).

2. It prevents inlining of the data structure's access functions, which adds quite a bit of overhead to each access (though not as much as two fenced CASes).

It can, however, be very useful for hot code swapping (provided there's no inlining or you have a JIT that can un-inline).

pcwalton · on June 6, 2015

I think the way to look at this is more of as a synchronization technique as opposed to a general-purpose GC algorithm.

I don't understand the objection about inlining--could you explain? You could in principle not use function boundaries to delimit the critical section: a range of PC addresses within a function would do just as well.

pron · on June 6, 2015

> a range of PC addresses within a function would do just as well

Yes, but you'll need to know where those ranges are (i.e. you'll need some feedback from the compiler), and then be very careful that no tool injects any instructions into, or moves code around your executable/library.

pcwalton · on June 6, 2015

Fascinating. I had never considered sending signals to stop all threads and check their PCs as a synchronization technique. But it works, and it brings down the uncontended lock acquisition and release time down to zero in the case where no synchronization needs to be performed.

Now I'm trying to think of other places to use this trick. :)

lgg · on June 6, 2015

It isn't actually stopping the threads. The thing to understand is that you don't need to ensure that all threads are outside of a critical section at the time you free the cache, what you need is the weaker guarantee that all threads have to have exited any critical sections they were in at the time of the pointer swap before you can call free.

You don't have to do anything heavy weight like stop all the threads, instead you can just pull their PCs out and check them. Sure, they may have moved on (and even back into the critical section) by the time you call free, but it doesn't matter because they'll be using the new cache buffer. The downside is that you might get false positives for the critical section, but you can either run the dealloc in a loop or defer it and try again later.

asveikau · on June 6, 2015

No, you pretty much have to stop them. If you observe the PC of other threads at a point in time while they are running, that is no guarantee that they aren't about to enter the unsafe region immediately after you make the check.

Your suggestion sounds akin to replacing a mutex acquire with an "is the lock held?" [like trylock with 0 timeout, then immediate release]. Observing that it's unheld at time t makes no guarantee it won't be held by the time of your next instruction. It therefore becomes a meaningless check, not useful at all for real synchronization.

[PS: mentioned this on HN before, but my favorite "observe the PC as part of a synchronization primitive" hack was this one from Linux on armv5: http://lwn.net/Articles/314561/]

[PPS: How much does objc_msgSend() do inline and how much is external calls? This PC hack seems like it could have huge holes if some of its critical work is done in a non-inlined function.]

dilap · on June 6, 2015

Hmm? No I don't think you have to stop them.

Basically you have:

    cache = <newval>
    if ok() {
        free(<oldval>)
    }

The problem you have is if anyone is still using <oldval> at the point you free it. So:

    // maybe someone here gets a reference to <oldval> (1) 
    cache = <newval>
    // from this point onward no one can reference <oldval>
    if ok() {
        free(<oldval>)
    }

So ok() only has to check if threads have a reference to <oldval> that they are still using.

Imagine ok() pauses all threads. It sees if any threads threads are in BAD=[PC_BAD_START, PC_BAD_END]. If yes, return false, if no, return true.

Now imagine PC doesn't pause the threads. What can happen?

A thread that was in BAD leaves BAD. That's fine.

A thread that wasn't in BAD enters BAD. But that will use <newval>, so that's fine too.

That thread that was in BAD leaves BAD, and then reenters. That's also fine.

So there's no problem.

(re your question, I believe objc_msgSend is hand optimized assembly that doesn't make any function calls; if it did, you'd just have to make sure to include those functions in the range of bad PC addresses.)

asveikau · on June 6, 2015

> So ok() only has to check if threads have a reference to <oldval> that they are still using.

Right. The problem with doing this while other threads are running is that ok() can return true, correctly so for its point in time, then immediately after ok() returns another thread could enter objc_msgSend() while you are inside free(). Maybe ok() ran on thread A while thread B was right at unrelated function foo()'s "call objc_msgSend" instruction. The check is OK at a point in time, but perhaps by the time you enter free() thread B did an unsafe read.

> A thread that wasn't in BAD enters BAD. But that will use <newval>, so that's fine too.

You can't make guarantees that it will see newval. Whether or not it does depends on timing. For example, maybe the guy who calls ok() gets a page fault or is on a very busy CPU with lots of preemption happening. That will alter timing in the direction of this being unsafe. Cache coherence may also be an issue here.

dilap · on June 6, 2015

You have to be atomically reading and writing cache (which you can do). So then after

    cache = <newval>

...any future reads of cache will use <newval> (or <newer-val>, but not <oldval>).

That's what makes the trick work w/o having to pause threads.

asveikau · on June 6, 2015

> cache = <newval>

As written, that is not a guarantee that all cores will see <newval> immediately. It's very CPU-specific but you may need memory fences to achieve this.

Further, in my opinion it's kind of playing with fire.

Edit: Also, it is my impression reading the article that <oldval> is actually a shared list of old caches to be freed (gOldCachesList). That makes it a lot more complicated than your example snippet and leaves more potential for nasty synchronization problems.

ridiculous_fish · on June 6, 2015

Strictly speaking you're right. However it's not necessary that other cores see the new store immediately: it's only necessary that they see it before the memory is freed, which occurs after a substantial delay sufficient to prevent these out-of-order issues on other CPUs. This is the technique by which memory barriers are elided on the read side (but not the write side).

asveikau · on June 8, 2015

Yes, you kind of do need the read barrier, otherwise you get exactly the problem I described with the stale cache.

You can't "kinda" have synchronization just as a design for a mutex can't be a "kinda" lock... You may get away with it for a while, but like I said, playing with fire.

vitalyd · on June 7, 2015

This is both clever and terrifying :). How do they prevent compiler from scheduling the load early, observing, and using the stale cache pointer? Do they have at least a compiler fence on the read side?

bdonlan · on June 7, 2015

The read-side code in question is hand-written assembly code, isn't it? And on the hardware side, observing pc will probably result in an actual memory barrier happening anyway (even if no signal is sent internally the kernel must still suspend the thread to capture register state, and then it needs a barrier to communicate that result back to the asking thread).

vitalyd · on June 7, 2015

Yes, perhaps that part is in the hand written assembly portion that's mentioned.

ridiculous_fish · on June 7, 2015

Nothing prevents using a stale cache pointer. The stale cache still contains correct data, so using it is benign, up until it gets deallocated.

A stale cache pointer may become wrong if a method implementation is replaced, e.g. by a category, but replacing a method while calling it in another thread is inherently racey.

vitalyd · on June 7, 2015

By stale I meant holding a ref to a cache about to be deleted.

lgg · on June 7, 2015

Yes, obviously you may need barriers or fences depending on the exact architectural details, and as another commenter mentioned you may be able to elide the read barrier if you know what is going on.

This all about optimizing things so the read side doesn't have to do any synchronization. The actual implementation takes locks in functions that mutate the cache, it is just done in such a way that read side always sees consistent data without taking a lock. Since the write side has locks it can safely queue stuff for deletion is actually pretty trivial.

As for playing with fire... the core dispatch routine of your runtime is exactly the place you want to complicated things that exploit innate knowledge of your architecture. It is a small piece of code that is incredibly hot and will get a lot of scrutiny.

comex · on June 7, 2015

Indeed. One nitpick though: you can't get around pausing the thread briefly or else waiting for it to hit an interrupt, because there's no way to observe another CPU core's registers without its cooperation.

lgg · on June 7, 2015

True, I should have said it is not concurrently stopping all the threads. If you actually need the register state of a thread that is currently running on another processor it may be stopped to get it (though if you ask for the state of a thread that is not currently executing you may get its saved register state out of the kernel without having to actually pause its execution, of course being able to tell what is and is not actually running on other CPUs at that granularity is not really feasible).

dilap · on June 6, 2015

As written, it's psuedo-code. ;) Yup it's tricky and platform specific, but the point is it can be done (or at least that's my understanding -- I do not have first hand knowledge).

pcwalton · on June 6, 2015

How do you get the PC of a thread in a POSIX userland without sending it a signal? ptrace(PTRACE_GETREGS), I guess?

lgg · on June 7, 2015

Totally platform specific. On OS X you generally use thread_get_state().

rayiner · on June 6, 2015

Clozure CL and I believe SBCL do this in their GCs to allow threads to be stopped for GC preemptively while dealing with sequences that must be atomic (like allocation). Clozure calls it PC lusering: http://ccl.clozure.com/manual/chapter17.1.html

pron · on June 6, 2015

It's a take on kernel-mode RCU.

1amzave · on June 7, 2015

Yes, this essentially just a form of RCU. Note however that RCU isn't just for the kernel anymore though: http://urcu.so/

sbahra · on June 7, 2015

Concurrency Kit (http://concurrencykit.org) also has ck_epoch. Performance comparison to the URCU implementations available at http://concurrencykit.org/presentations/ebr.pdf

pron · on June 7, 2015

Userspace RCU algorithms are experimental, quite brittle, and/or have little in common with RCU. They're certainly not usable as a production-ready scalable synchronization solution.

sbahra · on June 7, 2015

What makes you say this?

RCU refers to a wide array of reclamation implementations and some URCU implementations do indeed have a lot in common with kernel-space RCU (which also has a myriad of implementations).

Speaking of liburcu in specific, the "brittleness" is a function of your workload and your selection of the URCU implementation. The various implementations have different trade-offs, and it is definitely possible to livelock write-side or degrade read-side if you make the wrong design decisions. For example, there are major differences between signal-based URCU and QSBR URCU. No reclamation mechanism (whether one backed by RCU or something like hazard pointers) is a silver bullet and each will suck in fantastic ways with the right workload.

As far as userspace RCU algorithms in general being "experimental", plenty of people are using RCU or RCU-like mechanisms in user-space for production systems as a scalable synchronization solution. For example, the Concurrency Kit library (http://concurrencykit.org) has an RCU-like system called ck_epoch, and it sees at least billions of transactions in production a day without fail, and even supports freestanding environments.

1amzave · on June 7, 2015

I haven't actually used it myself and am certainly willing to believe it's immature, but how do you mean it has "little in common with RCU"? Comparing personnel between that library and this set of people --> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.... there's certainly at least some authorship in common...

pron · on June 7, 2015

RCU basically works by having a critical section only within which a reader may access the data structure. To know when to GC the old copy of the data, you wait until all readers leave the critical section. If you're in the kernel, this can be very efficient: you disable all interrupts within the CS, so you know a thread is never preempted while in it. This means that no matter how many threads there are, only those currently running on the available cores (say, there are 8 of them) can be in the CS, so no matter how many threads there are, you only need to wait for 8, and you know that that waiting period is going to be short because there are no interrupts.

So the efficiency of the algorithms stems from disabling interrupts within the reader's critical section, which can only be done in kernel mode (otherwise you could do really bad things). A similar algorithm in userspace will need to find another mechanism to wait for all readers to exit the CS, which may or may not be efficient. They may share then name "RCU", but not its desirable qualities.

There are, AFAIK, several different algorithms calling themselves userspace RCU. One userspace RCU algorithm I've seen essentially protects the deallocation with a read-write, which isn't known for terrific scalability, but I don't know, they may have come up with something better.

Garbage collection is the Achille's heel of all nonblocking data structures. As far as I know, there is no algorithm more efficient than a general purpose GC just yet, which is both generally applicable and efficient.

Common authorship, BTW, says little. Most concurrent algorithms out there were invented by a handful of people.

sbahra · on June 7, 2015

1) Efficiency of algorithms stems from disabling interrupts

The efficiency stems from having little to zero synchronization cost for readers in a read-mostly workload or workload that just isn't update-intensive. In addition to that, note that a deferral interface is available so writers basically never have to block until it is actually safe to free memory. URCU and ck_epoch all share those desirable qualities.

2) "RCU algorithm I've seen ... protects with a read-write"

The synchronize portion of RCU and RCU-like systems is almost always heavy-weight. Applications that adopt URCU can typically afford this.

3) "no algorithm more efficient than a general purpose GC"

All the blocking schemes are magnitudes more efficient but the trade-off is that it isn't generally applicable. However, they're still applicable for plenty of workloads. We are starting to see more GC mechanisms adopt SMR-like techniques so that may change.

4) "Common authorship, BTW, says little"

Not in this case.

If you would like to learn more about the performance trade-off with the various mechanisms, I suggest http://www.rdrop.com/users/paulmck/RCU/hart_ipdps06.pdf and the URCU paper as a start. I've some additional numbers on something similar at http://concurrencykit.org/presentations/ebr.pdf

Paul also has a good introduction up at https://queue.acm.org/detail.cfm?id=2488549

pron · on June 7, 2015

> The efficiency stems from having little to zero synchronization cost for readers in a read-mostly workload or workload that just isn't update-intensive.

That's the advantage of any copy-on-write concurrency mechanism. The "magic" of RCU (unless you define any copy-on-write algorithm as RCU) is the indifference of the algorithm complexity to the number of threads involved -- only the number of cores. I am not aware of userspace algorithms that display this property, but I may be wrong about that.

> All the blocking schemes are magnitudes more efficient but the trade-off is that it isn't generally applicable.

But that's where things get tricky. Magnitudes more efficient how? Total throughput? Maximum latency? I can certainly believe the latter but I doubt the former. Hazard-pointer techniques basically are how modern GCs work, with the difference of when the scans are triggered.

sbahra · on June 7, 2015

The magic of RCU definitely is the safety guarantee on read-side with little to no overhead and an easy to approach API. Traditional copy-on-write systems do not provide this guarantee as there would be synchronization and / or blocking on the fast path (barriers / atomic operations / locks).

Cores versus threads is a detail of the execution environment which may affect RCU implementation but not the actual magic. For example, a lot of work was necessary to get RCU to scale to many hundreds of cores in the kernel (http://lwn.net/Articles/305782/ is one example).

For ck_epoch, we had to implement a proxy collection-like pattern for workloads involving thousands of concurrent events that required longer-lived hazardous references (thanks to John Esmet for this work)...this will hit upstream as ck_epc and is sitting in a pull request on the github page (http://github.com/concurrencykit/ck).

re:Efficiency, magnitudes more efficient on read-side in all ways (including throughput and latency) assuming that delete frequency is not too heavy and that you can afford write-side blocking. Consider that the read-side has few to none atomic operations on the fast path, and if so, is typically on exclusively written memory and it does not block in any fashion. This means readers scale and are very fast. You might enjoy the paper I linked to which exposes some of the performance characteristics (it isn't exhaustive, it doesn't explore real-world implications on memory usage as update frequency increases, for example, but it's a great start). Paul also has some refreshed performance analysis in his ACM Queue article. If you couple this with specialized concurrently readable data structures (like http://backtrace.io/blog/blog/2015/03/13/workload-specializa...), you get super sexy performance for readers.

You might also be interested in "passive reader-writer locks", which provide similar costs on the fast path to RCU for TSO architectures while still providing read-write lock semantics...but at the cost of exceptionally more expensive write-side synchronization (and user-space support requires some kernel modifications).

pron · on June 7, 2015

> re:Efficiency, magnitudes more efficient on read-side in all ways (including throughput and latency)

More efficient than what? RCU with a general-purpose GC? I thought that's what we were talking about.

In our in-memory database (written in Java), we use what you call RCU (I had no idea any copy-on-write without any synchronization on the read side is called RCU) where we can for read-mostly data, but the problem is that in some data structures, nodes may be referenced by two references (or more), so since we don't have DCAS, we can't do the atomic replacement of the node (in some circumstances we can forgo the transactional update of on one of the references, so we can do RCU). Another problem is that some writes must block reads to ensure transaction isolation. Instead, to reduce the cost of reads updating a read-write lock, we use optimistic locking that also ensures a reader requires no memory writes, but may retry due to a concurrent writer.

sbahra · on June 7, 2015

More efficient than using garbage collection, if you're working with a language where you can operate in an unmanaged mode (barring FFI). I do expect this gap to start closing based on some literature some friends have shared with me...

With most managed languages I've played with, I haven't found a compelling reason to use RCU-like mechanisms (except if it involves some FFI boundary or embedded language in unmanaged system).

What you're doing sounds like a seqlock pattern (see ck_sequence) or generation counter, not RCU. Fundamental to all RCU implementations is "grace period detection" (detection of the point in which it is safe to free memory) occurring without any read-side delays. With RCU, a reader never has to retry due to RCU itself.

RCU is really special in an unmanaged language, but I'm not sure it's that interesting for developers working in a managed language (unless they happen to be solving safe memory reclamation).

pron · on June 8, 2015

The optimistic locking I mentioned is what we use when RCU is inappropriate (due to multiple pointers).

But just to get the definitions straight: what you call RCU is a copy-on-write combined with a GC algorithm. What would you call a copy-on-write (with no synchronization on the reader's end whatsoever) that uses a general-purpose GC?

Also, I don't understand how hazard pointer GC can be more efficient than a general-purpose GC, given that modern GCs work based on the same principle, only much more refined (e.g. HotSpot creates a stack map containing the location of each reference in every thread's stack instead of a hazard pointer list). Of course compacting collectors (usually young-generation) don't do any reclamation work at all, instead they only work to copy the live objects, so it all comes down to the question what do you have more, live objects or dead objects? But in any case, simple hazard pointer mechanisms seem like a crude, rather old, non-compacting GC technique. I don't see how they can beat a good general-purpose GC.

sbahra · on June 8, 2015

Not comparing to HP/PTB/LFRC, just RCU and similar schemes. You're right in that the first class are easy to beat given appropriate workload.

RCU is being conflated, the (Hart) paper I linked to will addresses your questions (sorry phone).