Lock-free programming for the masses

RossBencina · on June 15, 2016

The library is called "reagent". It's for Ocaml. From what I gather skimming the post, the operations are: "swap" (synchronous rendezvous exchange similar to CSP channels maybe) and "upd" (similar to CAS). With composition operators (sequential, conjunction, disjunction). The implementation aggregates composed operations into a transaction. It looks as though the operators would let you implement something along the lines of Go's select() for example.

From the post:

> The key idea behind the implementation is that the reagent transaction executes in two phases. The first phase involves collecting all the compare-and-swap (CAS) operations necessary for the transaction, and the second phase is invoking a k-CAS operation (emulated in software)

> Reagents are less expressive than STM, which provides serializability. But in return, Reagents provide stronger progress guarantee (lock-freedom) over STM (obstruction-freedom)

Interesting.

kaeluka · on June 15, 2016

It's based on the (fantastic!) paper by Aaron Turon: www.ccs.neu.edu/home/turon/reagents.pdf

vardump · on June 15, 2016

> second phase is invoking a k-CAS operation (emulated in software)

Sounds like a normal CAS performed on a pointer.

naasking · on June 15, 2016

It synchronizes multiple CAS operations an ensures they all execute "atomically" upon k-CAS completion.

novaleaf · on June 15, 2016

as a 'expert' multithreading programmer, i can confidently say that most developers should stay away from fine-graned multithreading (generally speaking of course). You will have much greater productivity gains and less debugging headaches (race conditions!) If you architct your program to be coarse-grained (modules running asynchronously, but each module not multithreaded) or even better, the entire app being single threaded and just fork it to scale horizontally.

MaulingMonkey · on June 15, 2016

Even if you have a legitimate need to get fine-grained, for me the most common use of lock free programming is to implement the underlying infrastructure for task based systems, where the individual tasks desperately try to stay away from multi-threading hazards - by e.g. keeping shared data immutable (no need for 'lock-free programming'), and keeping mutated data task-local (not shared between threads, and thus again no need for 'lock-free programming'.)

Forking / multi-process architectures are of course the logical extension of this - where you're making the entire program state task local, just to be sure. Rule 37: There is no 'overkill.' There is only 'open fire' and 'I need to reload.'

programLyrique · on June 15, 2016

This library is built upon multicore Ocaml. We've spoken about multicore OCaml for a long time, and apparently, the last tentative of it is being developed. When will multicore Ocaml land into the stable version of the compiler?

kcsrk · on June 15, 2016

(I work on multicore OCaml) Multicore OCaml has been steadily under development, with the latest updates here: https://ocaml.io/w/Multicore. While I can't promise when multicore lands in trunk, a stable-alpha version installable through opam should be ready in the next few weeks.

Animats · on June 15, 2016

It's time for better CPU support for atomic operations. "Backoff, retry, blocking and signalling" is a hack. What we really need is machine level support for short memory transactions. This would be an extension of fence hardware.

What's needed is a CPU instruction to fence a critical section and lock up two cache lines while you manipulate them from one CPU, with other CPUs locked out. Transactions should have a hardware instruction cycle limit, so you can't lock up the other CPUs for very long. Then you could do swaps, sets, appends to lists, and such with straightforward code.

This locking would be local to the two cache lines - unless two CPUs are trying to hit the same two cache lines, there's no delay. Why two cache lines? That lets you do moves and swaps, and maybe some list and queue manipulation. Those can all be completed in a few instruction cycles.

pizlonator · on June 15, 2016

How is this any better than using a high-speed adaptive lock?

I would claim that it's strictly worse, since it limits you to only two cache lines, doesn't have adequate interaction with the OS scheduler, and doesn't scale gracefully for longer critical sections.

davidtgoldblatt · on June 15, 2016

I assume interrupts would be disabled or deferred during this section, so you get stronger guarantees on worst-case behavior; hence the bound on the number of cycles a thread is allowed to execute for.

On one of the less common unixes (Sun's maybe? I don't totally remember), you could set a bit in some thread-specific part of memory to indicate that you were uninterruptible. If the OS saw that bit was set, it would set a "you need to yield" bit in the same part of memory and then allow the thread to continue (and penalize threads that abused the feature).

As a practical matter, I suspect that the scheme proposed wouldn't have advantages over a careful use of HTM.

I also suspect that the "lock two cachelines" thing is harder to implement than it sounds. The way Intel implements atomic operations that span cachelines now involves waiting for a interconnect quiescence period (taking O(microseconds)). What's proposed here is strictly more general.

pizlonator · on June 15, 2016

Disabling interrupts prevents you from using interrupt-unsafe logic inside the critical section, which is impractical in most cases. It would mean, for example, that you wouldn't be able to touch memory that had been swapped. It also doesn't actually improve performance. When people do this, it's because they want to edit process-local data structures. Unless you're in kernel, you probably don't want this.

As a practical matter, HTM is dead. It's 2x slower than locking in the common cases: uncontended critical section or contended critical section that has a race. It also prevents you from doing effects outside of memory (it's just guaranteed to revert to locks in that case, so you pay all of the overhead of HTM and all of the overhead of locks).

Interesting about the difficulty of locking two cache lines. To me the important thing to keep in mind when talking about concurrency alternatives to the status quo is: nobody has proved that contended-but-not-racy critical sections are common. According to my data, they are unicorns. This means that from a performance standpoint, any proposed concurrency protocol that is different from locking must justify how it plans to beat locks on the two things that they are super good at, which are uncontended critical sections (with locks you pay one or two CASes on cache lines you probably already own) and contended-and-racy critical sections (with locks you pay for some CASes but the lock will make threads contend as quietly as possible to allow the lock owner to proceed quickly). If a proposal is worse than locks on the two most common kinds of critical sections, then that's silly. AFAICT, all of these trasactions-or-better-CAS approaches would only help in the case of contended-but-not-racy critical sections, which are too rare to matter. The two cache line thing definitely sounds like it will be more expensive than locking in the uncontended case, and I don't see how it will help in the contended-but-racy case.

EDIT: I said "contended-but-racy" when I meant "contended-but-not-racy" in one place, above. Fixed.

davidtgoldblatt · on June 16, 2016

> Disabling interrupts prevents you from using interrupt-unsafe logic inside the critical section, which is impractical in most cases. It would mean, for example, that you wouldn't be able to touch memory that had been swapped.

Presumably the semantics of the "lock these two cachelines" instruction trap when one of the cachelines you're trying to lock isn't present, rather than when you perform an operation on that cacheline subsequent to the locking.

> It also doesn't actually improve performance. When people do this, it's because they want to edit process-local data structures. Unless you're in kernel, you probably don't want this.

I don't totally follow; off the top of my head, lock-free reference counting and doubly-linked list removal both get significantly simpler. There's lots of algorithms that get much less complicated and dodge more atomic ops if you have access to CAS-2, and this is strictly more powerful.

To be clear, I don't think this instruction is a good idea; I just don't think it's obviously bad.

> As a practical matter, HTM is dead. It's 2x slower than locking in the common cases: uncontended critical section or contended critical section that has a race. It also prevents you from doing effects outside of memory (it's just guaranteed to revert to locks in that case, so you pay all of the overhead of HTM and all of the overhead of locks).

HTM is already showing wins in some domains, and it's only going to get faster relative to other concurrency primitives.

I'm not totally sure what you mean by "contended and racy" and "contended but not racy"; by "racy" do you mean cacheline ping-ponging? Certainly that will never be cheap. I think most of the desire for lock-freedom comes less from fast-path cycle reduction as much as protection from the scheduler or some very slow process. There's also plenty of situations in which data structures are touched mostly by one thread, but periodically need contention management. The overhead of locking in the single-threaded case can be substantial even if the lock acquisition always succeeds; on my machine a non-atomic CAS is about 6x faster than an atomic one even if the CAS always succeeds (and this was with no stores in between the operations; a more realistic example would have a deeper write-buffer).

pizlonator · on June 16, 2016

> Presumably the semantics of the "lock these two cachelines" instruction trap when one of them is paged out, rather than deferring the trap until you write to one of the cachelines you've locked.

Right, I didn't think of that.

> I don't totally follow; off the top of my head, lock-free reference counting and doubly-linked list removal both get significantly simpler. There's lots of algorithms that get much less complicated and dodge more atomic ops if you have access to CAS-2, and this is strictly more powerful.

I was talking about disabling interrupts and not lock-freedom in general or CAS-2. I agree that CAS-2 would make those things simpler if it was also fast enough. Sorry about the confusing context switch.

> HTM is already showing wins in some domains, and it's only going to get faster relative to other concurrency primitives.

Can you define what "some domains" is? I think it's important to be specific.

> I'm not totally sure what you mean by "contended and racy" and "contended but not racy"; by "racy" do you mean cacheline ping-ponging?

"Contended-and-racy" means that if you took the lock away, you'd have a data race. It means that if you used a transaction then the race detection that aborts a transaction would conclude that there had been a race and abort. Contended-and-racy is exactly the case where TM is not a speed-up, pretty much by definition.

If contended-but-not-racy critical sections were common, then it would make sense to pay some base cost for executing in a TM critical section since it would increase concurrency. But most contended critical sections are also racy. TM won't give you a speed-up in a racy section, so you end up paying all of the cost without getting any of the benefit.

> I think most of the desire for lock-freedom comes less from fast-path cycle reduction as much as protection from the scheduler or some very slow process.

Yeah, and also deadlock avoidance. And that lock-free algorithms are often faster in the uncontended and contended-and-racy cases. I'm a big fan of lock-free algorithms based on conventional word CAS.

But that's different than TM or 2-cache-line-CAS. Lock-free algorithms based on conventional word CAS tend to be faster than locks in the uncontended case. TM, and probably 2-CAS, is slower than locks. It's easy to justify handling uncommon scheduling pathologies, or using exotic approaches for avoiding deadlock, if it also gives you a speed-up. Not so much if it's slow like TM.

> The overhead of locking in the single-threaded case can be substantial even if the lock acquisition always succeeds; on my machine a non-atomic CAS is about 6x faster than an atomic one even if the CAS always succeeds (and this was with no stores in between the operations; a more realistic example would have a deeper write-buffer).

It's true that the uncontended cost of locks is dominated by the uncontended cost of CAS. But this cost is much smaller than the unconteded cost of HTM, and I suspect that the unconteded cost of CAS is also lower than the uncontended cost of 2-cache-line-CAS. If the uncontended cost of 2-cache-line-CAS is more than 2x more than the uncontended cost of CAS, then probably using a lock is better.

gpderetta · on June 15, 2016

Transactions compose and TM is potentially easier to use.

Also some algorithms necessarily require locking multiple mutexes (think for example of a wait-for-any operation). HTM in principle would allow implementing them more efficiently.

pizlonator · on June 15, 2016

Transactions don't compose with I/O or any OS side-effects that the TM system isn't in the loop on, while locks compose fine with those things. One module that uses locks is free to call another module that uses locks so long as the relationship is one-way like user->kernel, so locks compose also, just with different constraints. In short, transactions don't compose with I/O while locks don't compose with cyclic relationships. So, the statement that "transactions compose" is false in both directions: it's false because transactions don't compose in general, and it's false because it implies that locks don't, while they in fact do.

The only empirical evidence about transactions being easier to use is actually contrary: even with the availability of HTM, it's not used nearly as widely as locks, because if you enable HTM for every lock then everything runs a lot slower (or doesn't run at all).

gpderetta · on June 16, 2016

Under the proposed C++ TM, synchronised blocks can do i/o. They are not transactional, but do compose.

If you have two locked data structures and you want an operation to manipulate both atomically you need to expose the locks. And if you want the operation itself to compose with others... well get hairy.

TM is widely used I believe by least Haskel and Closure programmers, even with non hardware accelerated versions because of the ease of use.

HTM is another story, it's availability is still relatively restricted. Time will tell.

pizlonator · on June 16, 2016

Closure is indeed TM-based, but that's not an A/B test of ease-of-use. I don't think there exists evidence to suggest that transactions are easier to use.

I think you are confused about the meaning of composability. The synchronized/atomic TM proposal for C++ is roughly single-global-lock. This means that you can still deadlock if you use synchronized to protect IPC with another process, which then does an IPC call back to you, which you service on another thread that also uses synchronized. Whenever these transactional proposals fall back on locking, they immediately bring in the pitfalls of locking.

The C++ TM proposal is a joke. There's no good reason to use concurrency if you won't get a speed-up, and the current design of synchronized is sure to make it more difficult to get speed-ups.

gpderetta · on June 15, 2016

Basically you are asking for DCAS (atomic compare exchange on two memory locations). There is a reason the last architecture (IIRC) to support it was the 680x0, the required extensions to the coherence protocol are hard to get right without deadlocking the machine and even harder to make it perform.

AMD internally had a design of a restricted HTM that would have guaranteed exclusive access to up to 7 cachelines (using an ad-hoc arbiter), enough to implement ana atomic and dequeues between two lists; they ended up documenting a variant that would only have guaranteed best effort and and are yet to implement anything so far.

Interestingly Intel CPUs do lock 2 cachelines in the case of an unaligned RMW spanning two cachelines. It takes thousands of clock cycles to complete and stalls the whole system; it exists only for compatibility with legacy software and it's use is strongly discouraged.

sbahra · on June 16, 2016

It reverts to a bus lock in this split transaction case, not cache line locking (thus stalling whole system).

MaulingMonkey · on June 15, 2016

> What's needed is a CPU instruction to fence a critical section and lock up two cache lines while you manipulate them from one CPU, with other CPUs locked out. Transactions should have a hardware instruction cycle limit, so you can't lock up the other CPUs for very long. Then you could do swaps, sets, appends to lists, and such with straightforward code.

This is a perfect description... of non-atomic code. There's nothing atomic about mutual exclusion of cacheline access over multiple CPU instructions. As a bonus, we know how to lock a mutex without limits on instruction cycles etc. - this prevents any weird code geneneration limitations on JITs and compilers, meaning you can write normal straightforward code. You are, of course, recommended to lock for as short a time as possible where it makes sense as you say - so you don't lock up the other CPUs for very long.

With this in mind... what exactly is your requested change to the status quo? Is it for cheaper mutex (un)locking instructions in non-contested scenarios? Or is it for other cores hitting the same cacheline as your "atomic" operation to do something different than what they currently do? Currently, common approaches include:

- wait for the other operation to complete (aka spinlock) - switch to another, more productive task (aka block / context switch) - speculatively execute code and then throw away the result if there was a conflict (e.g. STM - you can do such things with existing atomic operations, although hardware support for conflict detection might be nice)

Or maybe it's to lock by cacheline instead of explicitly named mutexes? This is easy enough to do with current tech - have an array of, say, 256 mutexes - and then lock the mutex at index size_t(address)/64%256. Maybe choose a prime number of buckets to reduce collisions - not an option if such a mechanism was baked into the silicone.

EDIT: Or maybe you just want some instructions to discourage the OS from preempting your thread while holding inner locks?

hacknat · on June 15, 2016

What happens in a preemption? An OS would have to manage this on a per-process level to ensure that a lock doesn't stick around on a preemption. Now the OS has to remember locking info per-process. Also, on resumption of a process, an OS would need a way of locking, but bypassing the lock to set the registers that have been manipulated, without other CPUs interfering.

Your idea is nifty, but it would be a nightmare to implement.

RossBencina · on June 15, 2016

Are you aware of Intel's Transactional Synchronization Extensions (TSX)? Initially it was buggy, but it has been functional since Broadwell F0 stepping in 2014 (According to Wikipedia).

https://en.wikipedia.org/wiki/Transactional_Synchronization_...

naasking · on June 15, 2016

See the LtU post [1] for the paper backing this project.

[1] http://lambda-the-ultimate.org/node/5237

dwarman · on June 16, 2016

I always thought "lock free" meant no hidden pauses in execution - a write is a write and a read is a read. The primitives described here will block. How is that different in effect from a lock, albeit one implemented down in the library rather than explicitly by the user? My lock free mechanisms never blocked. And they're used in thousands of emedded devices without failures.

asb · on June 15, 2016

There was a little bit of discussion when this was previously on HN: https://news.ycombinator.com/item?id=11893911 (linking for completeness - I'm really glad this has seen more attention after resubmission!)

deadgrey19 · on June 15, 2016

This sounded so promising until the OCaml part. The problem is, the masses don't program in OCaml. They program in C# or Java, or C++. So this is really lock-free programming for OCaml, which is a lot less relevant.

wolfgke · on June 15, 2016

In the original paper (see https://news.ycombinator.com/item?id=11908821) it is impplemented in Scala.

deadgrey19 · on June 15, 2016

How is that different? Neither Scala nor OCaml rank in the top 10 programming languages according to several metrics/organisations [1-5]. Keeping in mind that programming language use is a heavily skewed distribution, OCaml and Scala have almost zero usage relative to the big ones. So, again I must ask, how is this relevant to the masses?

[1] http://www.tiobe.com/tiobe_index

[2] http://pypl.github.io/PYPL.html

[3] http://spectrum.ieee.org/computing/software/the-2015-top-ten...

[4] http://redmonk.com/sogrady/2016/02/19/language-rankings-1-16...

[5] http://www.codingdojo.com/blog/9-most-in-demand-programming-...