Hacker News new | past | comments | ask | show | jobs | submit login
Building a lock free continuous ring buffer in Rust (ferrous-systems.com)
200 points by Argorak on June 4, 2019 | hide | past | favorite | 109 comments



Looking at this blog-post more carefully... I'm not convinced that this ring buffer is actually correct unfortunately.

> buffer.write.store(buffer.write.load() + write_len)

This is... an atomic load, followed by an add, followed atomic store.

A TRUE lock-free queue would be buffer.write.AtomicAdd(write_len), (which is a singular, atomic add).

There are many other issues here, but this is the most egregious issue I was able to find. The whole thing doesn't work, its completely non-safe and incorrect from a concurrency point of view. Once this particular race condition is solved, there's at least 3 or 4 others that I was able to find that also need to be solved.

EDIT: Here's my counter-example

    write = 100 (at the start)
    write_len = 10 for both threads.

    |-----------------------------------|
    |   Thread 1      |   Thread 2      |
    |-----------------------------------|
    | write.load (100)| write.load (100)|
    | 100+write_len   |                 |
    | write.store(110)| 100+write_len   |
    |                 | write.store(110)|
    |-----------------------------------|

    write = 110 after the two threads "added" 10 bytes each
Two items of size 10 were written to the queue, but only +10 bytes happened to the queue. The implementation is completely busted and broken. Just because you're using atomics doesn't mean that you've created an atomic transaction.

It takes great effort and study to actually build atomic transactions out of atomic parts. I would argue that this thread should be a lesson in how easy it is to get multithreaded programming dead wrong.

---------

For a talk that actually gets these details right... I have made a recent submission: https://news.ycombinator.com/item?id=20096907 . Mr. Pikus breaks down how atomics and lock-free programming needs to be done. It takes him roughly 3.5 hours to describe a lock-free concurrent queue. He's not messing around either: its a dense talk on a difficult subject.

Yeah, its not easy. But this is NOT an easy subject by any stretch of the imagination.


The article is discussing an exclusive-writer, exclusive-reader ring buffer. Interleaved writers is intentionally not considered.

> A common strategy to reduce the amount of coordination that needs to happen between the two threads (writer, reader) is to associate each coordination variable (pointer) with a single thread that has exclusive write access to it.


> reduce the amount of coordination

Interesting premise, but it doesn't hold up to scrutiny.

1. They have three shared variables.

2. Those three shared variables are being falsly shared in that data-structure.

    struct ContiguousAsyncBuffer {
      buf: *mut u8,
      len: usize,
      read: AtomicUsize,
      write: AtomicUsize,
      watermark: AtomicUsize,
    }
Look at that struct, all of the atomics are inside the same 64-byte cache line!! That's the worst sharing possibility imaginable!

3. Therefore: the dumb-and-simple "Mutex / Spinlock" would actually have less sharing (share only a ONE variable: the mutex), without any false-sharing of various atomic variables.

---------

I kid you not: a "lock/unlock" queue probably performs better than the atomic thing they did here (and the atomic-thing they're doing doesn't even work correctly).


You keep on stating that the their queue doesn't work but you have yet to provide any correct arguments or evidence that it doesn't work.

The only argument you provided was simply incorrect as the article is only concerned with a single reader and writer.

You are right that adding some padding might improve performance.


Using a spinlock / mutex would improve performance if we're only talking about 2 threads. That's the main problem. Spinlocks are also much, much, much easier to think about.

Frankly, the only reason why this implementation "works" is because they're taking advantage of SeqCst, and the article spends absolutely no time talking about why SeqCst is the cornerstone of this implementation.

They actually don't need to use atomics at all when restricted to 2 threads. They just need the seq-cst. Atomics entirely are irrelevant in the 2-thread case.

----------

In any case, code like:

> buffer.write.store(buffer.write.load() + write_len)

This is horribly, horribly, horribly bad code. This is bad, really bad, no one should be encouraged to write code like this.

Yeah, it ends up being correct because of SeqCst and the very specific assumptions of the blog post. But is that really a victory worth talking about?

This kind of code should NOT be encouraged. The reason why its "correct" is incredibly subtle and complicated. Its incredibly difficult for me to even put it to words why this seems like it'd work... without giving a ton of links to various memory-consistency talks and papers.


> the very specific assumptions of the blog post.

Those assumptions can be encoded by the borrow checker, so the user of the library doesn't have to be concerned about them. When writing a library, you can lean on all the assumptions you want, so long as you don't share that burden/responsibility with the user.

People use mutexes daily, most of those people likely have no knowledge about how the threading cooperation works in the kernel or the fact that the kernel assumes that you are in a user-space thread.

> This kind of code should NOT be encouraged.

Most lock-free code satisfies that constraint. Lock-free is like crypto: unless as a learning exercise, writing it should be avoided. It always requires many eyes and fine-tooth comb reviews.


I’d love to see a competing version with benchmarks.

I’m not saying you’re wrong. But your making a lot of strong arguments without any evidence. Is 3 atomics on the same cache line a legit problem? I don’t know! But you’re gonna have to prove negative behavior for me to integrate that knowledge into my workflow.


If there's contention between multiple CPU cores (and worse, sockets), 3 atomic ops on same cache line puts the max throughput to 3-15M operations per second per system. Probably around 10M on most 2-4 core CPUs. Multi CPU-socket systems will be slower.

That is, the point when the whole computer is on its knees due to cache coherency traffic between cores.

Note: these are just ballpark figures based on real life experience on modernish x86-64 systems. Your mileage may vary. Always measure actual performance. Etc.


> Is 3 atomics on the same cache line a legit problem?

This is a commonly known problem called false sharing. https://en.wikipedia.org/wiki/False_sharing

To be fair: I don't know how these particular atomics work, maybe they're all on their own cache lines, and 192 bytes are used up for those three atomics so that things are done properly (unlikely, but possible). But atomics in my experience usually assume the "external" programmer figures out the buffering issue to prevent false sharing.

2. These are atomics declared to be SeqCst memory ordering. That means the L1 (and probably L2) caches are flushed every time you read or write to them on ARM systems. (x86 has stricter guarantees and almost "natively" implements SeqCst, so you won't have as many flushes on x86). The cache must be flushed on EVERY operation declared SeqCst. Its absurdly slow, the absolute slowest memory model you can have.

3. Well-written spinlocks are absurdly fast. Benchmark them if you don't believe me. The assembly in x86 is

    Lock: 
    while (AtomicExchange(spinlock, true) != false) hyperthread_yield(); // loop until you grab false.
    acquire_barrier(); // Prevent reordering of memory

    Unlock: 
    release_barrier(); // Prevent reordering of memory. NOP on x86 but important in ARM / POWER9
    spinlock=false; // Don't even need an atomic here actually
Literally one assembly instruction to unlock, 3-assembly instructions to lock (the atomicexchange + loop) + hyperthread_yield if you wanna play nice with your siblings (reduces performance for you, but overall makes the whole system faster).

From a cache-perspective, the spinlock will play nice with L1 cache (if a thread unlocks, then locks again before a new thread comes in, it all stays local to L1 and no cross-thread communication occurs).

So pretty much every aspect of the hypothetical spinlock implementation is faster in my mind. I don't really need to benchmark to see which one is faster.

EDIT: At bare minimum: because the Spinlock will be properly written using Acquire / Release half-barriers, they will be ABSOLUTELY faster than SeqCst Atomics. You'll need to use Acquire/Release Atomics if you want to "keep up" with the Acquire/Release based Spinlocks.


A good spinlock, as Linux uses internally for example, is much more complicated. But, if you know there are only two relevant threads, you can get decent performance with a simple spinlock.

However, do not use AtomicExchange in a loop like this. When you are waiting, you always want to wait with the target cache line in the shared state, and AtomicExchange will force it to be exclusive. Performance will go down the drain. Instead, do something like:

    if (value looks good)
      AtomicExchange;
      Loop again if you lost the race;
    else
      Loop again;


if you expect your lock to uncontended, forcing an exchange might be better. You can then switch to the test and test and set pattern of failure.


> That means the L1 (and probably L2) caches are flushed every time you read or write to them on ARM systems. (x86 has stricter guarantees and almost "natively" implements SeqCst, so you won't have as many flushes on x86). The cache must be flushed on EVERY operation declared SeqCst. Its absurdly slow, the absolute slowest memory model you can have.

SeqCst is definitely the slowest memory model to use, but the bit about flushing caches doesn't sound right. SeqCst on a processor with a relaxed memory consistency model usually 'just' requires core-local operations on the load/store buffers - usually draining the buffers (but that is not strictly necessary). It isn't necessary to flush caches to enforce consistency if the memory system has coherent caches.

Fences for sequential consistency are certainly heavy on ARM and it will show up as poor performance, I'd be very surprised if any aarch64 processor flushed any cache on LDAR/STLR.

I'd appreciate a correction if I've misunderstood or if there are modern ARM processors that do flush L1 on LDAR/STLR/DMB ISH.


You bring up some good points, and have forced me to consult documentation to double-check a lot of intricate details.

I'm using the following as a reference: https://static.docs.arm.com/100941/0100/armv8_a_memory_syste...

> Fences for sequential consistency are certainly heavy on ARM and it will show up as poor performance, I'd be very surprised if any aarch64 processor flushed any cache on LDAR/STAR.

I agree. But LDAR / STLA are weaker than full sequential consistency. They're only acquire/release consistency.

So its the DMB ISH that I'm curious about. If the memory is never flushed out of L1 cache, how can you possibly establish a total sequential ordering with other threads?

EDIT: I had a bad example here originally. Lets use the SeqCst example from here instead: https://en.cppreference.com/w/cpp/atomic/memory_order#Sequen...

    std::atomic<bool> x = {false};
    std::atomic<bool> y = {false};
    std::atomic<int> z = {0};
     
    void write_x()
    {
        x.store(true, std::memory_order_seq_cst);
    }
     
    void write_y()
    {
        y.store(true, std::memory_order_seq_cst);
    }
     
    void read_x_then_y()
    {
        while (!x.load(std::memory_order_seq_cst))
            ;
        if (y.load(std::memory_order_seq_cst)) {
            ++z;
        }
    }
     
    void read_y_then_x()
    {
        while (!y.load(std::memory_order_seq_cst))
            ;
        if (x.load(std::memory_order_seq_cst)) {
            ++z;
        }
    }
     
    int main()
    {
        std::thread a(write_x);
        std::thread b(write_y);
        std::thread c(read_x_then_y);
        std::thread d(read_y_then_x);
        a.join(); b.join(); c.join(); d.join();
        assert(z.load() != 0);  // will never happen
    }
Variable x is being written to by "write_x". While variable y is being written to by "write_y" thread. Assume they're on different cache-lines.

* Thread C may see "X=True THEN Thread C THEN Y=True" (Formally: Thread A happened before Thread C happened before Thread B). In this case, Z=0.

* Thread D may see instead "Y=True THEN Thread D THEN X=True" (Formally: Thread B happened before Thread D happened before Thread A). In this case, Z=0.

This apparent contradiction is possible in Acquire-Release consistency, but not possible in Sequential Consistency. LDRA and STLR provide only acquire-release consistency.

So now lets ask: how do you implement sequential consistency?

There's no way for Thread A or B to do things correctly from their side. Thread A independently writes "X" in its own leisure, not knowing that Thread "B" is writing a "related" variable somewhere else called Y. There is no opportunity for A or B to coordinate with each other: they have no idea the other thread exists.

---------

To solve this problem, I imagine that an MESI cache will simply invalidate the cache (which "forcibly ejects" the data out of the core running Thread A and Thread B). Whatever order Thread A and Thread B create (X then Y... or Y then X) will be resolved in L3 cache or DDR4 RAM.

Now... perhaps there's a modern optimization that doesn't require flushing the cache? But that's my mental model of sequential-consistency. Its slow, but that's the only way I'm aware of that establishes the strong SeqCst guarantee.

---------

However, you are right to challenge me, because I don't know. If it makes you feel better, I'll "retreat" to the more easily defended statement: "Load/Store Buffers of the CPU Core need to be flushed every time you SeqCst load/store on ARM systems" (which should effectively kill out-of-order execution in the core).

The question I have is how does L3 cache perceive a total-ordering without the L1 cache flushing? I guess you've given me the thought that its possible for it to happen, but I'm not seeing how it would be done.


There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of. Such a system would perform so slowly as to be practically unusable.

At most there is usually a local dispatch stall while store buffers (not caches) are flushed and pending loads complete. In some scenarios, like a cache-line crossing atomic operation on x86, you might need additional non-local work such as asserting a lock signal on an external bus. There might be some other work such as ensuring that all arrived invalidation requests have been processed.

Still, you are talking in the range of 10s or maybe low 100s of cycles. Nothing like a cache flush which would probably be an impact of 10,000s of cycles or more (depending on how large the caches are and where the data to reload them comes from).


Exactly.

I just wanted to add that #StoreLoad fences (i.e. mfence on x86 for example), as far as I know do not usually actually flush the store buffer per-se. They just stall the pipeline (technically they only need to stall any load) until all stores prior to the fence have been flushed out by the normal store buffer operations, i.e. the store buffer is always continuously flushing as fast as possible all the time.

You didn't imply otherwise, but I wanted to clarify that because I have seen comments elsewhere and in code claiming that a fence would make a prior store visible faster (i.e. the fence was added to improve latency instead of being required for correctness), which I do not think it is the case, at least at a microarchitectural level (things are more complex when a compiler is involved of course).


Yes, that's right. As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but lets other ops through. I think allocation blocks when the first load/store arrives when the pipeline is in that state - so you can hide a lot of the cost of atomic operations by ensuring there is a long-as-possible series of non-load/store operations after them (of course, this is often not possible).

mfence used to work like that, but due to various bugs/errata was "upgraded" to now block execution of all subsequent instructions until it retires (like mfence) in addition to it's store draining effects. So mfence is actually a slightly stronger barrier than atomic operations (but the difference is only apparent with non-WB memory).

If you want to be totally pedantic, it may be the case that that mfence or another fencing atomic operation results in the stores being visible faster: because they block further memory access instructions, there can less competition for resources like fill buffers, so it is possible that the stores drain faster.

For example, Intel chips have a feature where cache lines targeted by stores other than the ones at the head of the store buffer can be fetched, so called "RFO prefetch" - this gives MLP in the store pipeline. However, this will be limited by the available fill buffers and perhaps also heuristics ramping back this feature when fill buffers are highly used even if some are available (since load latency is generally way more important than load latency).

So something like an mfence/atomic op blocks later competing requests and gives stores the quietest possible environment to drain. I don't think the effect is very big though, and you could achieve the same effect by for e.g., just putting a bunch of nops after the "key" store (although you wouldn't know how many to put).


> As far as I know, on modern Intel chips, atomic operations block both the store and load ports, but lets other ops through. I think allocation blocks when the first load/store arrives when the pipeline is in that state.

That's great to know. I suspected that was the case and they had moved from the stall the pipeline approach, but I had never tested it.


Sorry that should say "(like lfence)" not "(like mfence)".


> There is no "cache flushing" when barriers or other instructions are used to ensure seqcst on any system I am aware of.

Good to know. I've seen enough of your other posts to trust you at your word.

BTW: I'll have you know that AMD GPUs implement "__threadfence()" as "buffer_wbinvl1_vol", which is documented as:

> buffer_wbinvl1_vol -- Write back and invalidate the shader L1 only for linesthat are marked volatile. Returns ACK to shader.

So I'm not completely making things up here. But yes, I'm currently looking at some GPU assembly which formed the basis of my assumption. So you can mark at least ONE obscure architecture that pushes data out of the L1 cache on a fence / memory barrier!


GPU L1 caches are typically not coherent, so flushing them is necessary on GPU architectures.


Right, I could have been clearer that I was restricting my comments to general purpose CPUs.


As a minor note, LDAR and STLR are poorly named and are documented as providing multicopy atomicity in addition to acquire-release semantics and are the recommended way to implement C11 SeqCst.

Some of the below is almost certainly incorrect, apologies in advance.

Your example is more or less the Independent Read Independent Write (IRIW+syncs) litmus test. The short version is you don't need to flush any caches to get IRIW to always obey sequential consistency.

Originally, ARMv8 was non-multicopy atomic, but AFAIK there was never any non-multicopy atomic implementation, and a revised spec requires multicopy atomicity. This matters in terms of thinking about how sequential consistency for IRIW is implemented, but it works either way.

The core idea is that the DMB ISH ensures that the value that the first reads in threads c and d receive are visible to all threads. With multicopy atomicity this is free - e.g. if Thread C reads true from X, then all other threads will also read true.

(-> below indicates a sequenced before relationship)

So you get

Thread C: read X as true -> dmb ish -> read Y as false

Thread D: read Y as true -> dmb ish -> read X as false

However, you get some extra edges. If either thread reads X as false, then X as true is not visible to any thread. So you get

(Thread C reads y as false) -> (Thread D reads Y as true)

(Thread D reads x as false) -> (thread c reads x as true)

These are enough to produce a cycle that forbids the z=0 case.

Essentially, threads A and B have to make sure that once any thread sees its write, all threads see its write. (Or alternatively the dmb ish on thread C and D have to ensure that future memory accesses from that thread don't proceed until the prior read is consistent to all cores).


Thanks for the in-depth answer!

> Essentially, threads A and B have to make sure that once any thread sees its write, all threads see its write

Gotcha. So that's where cache snooping comes in and why that's sufficient to handle the case. I think I see it now. Again, thanks for breaking it down for me. So it appears the only thing necessary is for the DMB ISH to ensure that all cache-snoop messages are "processed" before proceeding.


> I don't really need to benchmark to see which one is faster.

I know that iterating an Array is radically faster than iterated a LinkedList. But I also know this because I’ve seen the difference first hand. Countless times over the years in fact. I even have a bit of a gut feeling for how much slower it can be.

This case I don’t have the same level of experience. I’m sitting in an airport on my tablet right now. So I can’t throw any quick benchmarks together.

Suffice to say I very much would like to see an actual for reals comparison. Because I haven’t dealt with this before and even if I’m convinced one way is faster seeing is believing! Performance benchmarks are almost always surprising.

Edit: from Reddit here is a different spsc-ringbuffer that may be more to your liking! https://github.com/polyfractal/bounded-spsc-queue


If you have spare time, the Pikus talks at CPPCON talk about the performance hits associated with various primitives.

https://www.youtube.com/watch?v=9hJkWwHDDxs

This is the one you want, with benchmarks associated with everything. Of course, benchmark on your own system too, and learn hardware performance counters (!!). This sort of stuff gets into nanosecond-level measurements, so learning to use the hardware profiler of your CPU really, really helps to see these details.

Honestly, because of how intricate the L1 / L2 caches can be on a modern system... learning hardware performance counters while you're measuring benchmark code / benchmark tests like this is helpful.


It needs to be slow, because each core will need to coordinate with all other contending cores.

No matter what, the system must ensure all cores in the system have a coherent view of the cache line in question. So there must be communication involved.

For example in MOESI protocol [0], it's not enough to have the cache line in exclusive state, you need to have it in owned state. So the core needs to broadcast to all other participating (contending) cores.

You can simply saturate CPU internal core to core ring buses. Even worse between CPU sockets (multiple separate physical chips), because there's rather limited capacity to carry the coherency traffic.

https://en.wikipedia.org/wiki/MOESI_protocol


> spinlock=false; // Don't even need an atomic here actually

No, you'll need a memory barrier there. On x86, sure you get away for free, because x86 memory model says stores are never reordered with other stores.

On practically all other systems you'll need equivalent of C++ atomic "memory_order::release" to prevent stores from being reordered with other stores.

Otherwise you're in for a surprise. Without a (half) barrier, it'll be buggy on ARM, POWER, MIPS, etc.

Edit: I see you added the barriers. Sneaky. :)


Yes, good point.

The obscure details are always hard to remember... that particular example is x86 specific.

I probably should edit in the memory barriers.

> Edit: I see you added the barriers. Sneaky. :)

You reminded me about them. Lol. Thanks for the reminder :-)


Actually, the barriers matter also because evil compilers like to mess with reordering loads and stores.

If you write:

  a = 5;
  b = 4;
... without any barriers compiler can actually generate code that performs "b = 4" first. Even if the barriers get compiled into nothing, they can still affect the generated code load/store order.

I know you know this, but just to make sure even those people reading this later who think their software is only going to run on x86 ever need to watch out.

I think std::atomic / std::memory_order are pretty useful abstractions to get these things right.


Hmm... are you sure your unlock barrier is in the right place? ;-)

Edit: I see you fixed that too.


Uhhh... now it is :-)

This stuff is always hard...


If you still wanted to fine tune it a bit further, you could have used a WeakAtomicExchange (?). Performs better on some systems, and a strong one is not required here.

On x86, it's of course same.


Ha! I think I'm finally correct against you :-)

"AtomicExchange" is just a swap. It is NOT a compare-and-swap, so there's no Weak vs Strong version. A pure-swap is probably faster to implement than a compare-and-swap in microcode, but I admit that I've never actually tested this theory.


Indeed, oops! Brains in pattern matching mode. :-)

Then again, I think XCHG (has implicit atomic LOCK, no need for LOCK prefix) is kinda x86 only thing.

> pure-swap is probably faster to implement than a compare-and-swap in microcode, but I admit that I've never actually tested this theory.

I don't think there's practical performance difference, because inter-core cache coherency traffic is going to dominate by far.


For others that are curious, there's an example of lock XCHG spin lock for amd64 here:

https://en.m.wikibooks.org/wiki/X86_Assembly/Data_Transfer

Ed: well technically x86 I guess - not sure if there's improvements to be made for amd64?


That example is based on "lock cmpxchg" (x86 compare-and-swap). Not on atomic swap, like what "xchg" based one [0] above does.

Someone else [1] also had a good point how "xchg" can cause more contention, because it generates stores. If loads were used for waiting instead, the cache line could have remained in much cheaper shared state on the waiting core.

But regardless benchmark on as many system types as you can get your hands on. No matter how well you know the architecture and you read on the internet, real software systems and CPUs are sometimes rather unpredictable beasts.

[0]: https://news.ycombinator.com/item?id=20098978

[1]: https://news.ycombinator.com/item?id=20099764


Hm, so the above:

while (AtomicExchange(spinlock, true) != false) hyperthread_yield(); // loop until you grab false.

Does do a cmp (the while construct) - but only the XCHG is atomic. While the other example does both CMP and XCHG as an atomic "unit"/op?


First of all, I think "AtomicExchange" will return previous value of "spinlock". First parameter ("spinlock" here) is actually a pointer to "spinlock", not the value itself. Second parameter is what should be swapped with the memory address.

So it atomically stores true to "spinlock" and returns whatever value it contained previously. While-loop is comparing against the previous value, what was swapped from memory.

Unless "spinlock" wasn't false in the first place, the loop can only terminate when the other thread eventually stores false to "spinlock". After that point, one of the waiting threads will suddenly swap that false into true and notice the previous value was false, meaning the lock was successfully acquired.

So until the other thread releases it by storing false to "spinlock", there'll be just endless swaps of true.

I wouldn't recommend using a spinlock like this, because it probably causes a lot of unnecessary and avoidable cache coherency traffic [0] between all of the contending CPU cores.

[0]: Unless, of course, CPUs contain some instruction stream pattern matching to detect this situation (swapping in same value the memory address contained before) and optimize the endless stores away. Wouldn't surprise me at all if such optimizations existed on modern x86 silicon...


Thank you. Still a little unclear on how the code maps to actual x86_64 assembler though. There's generally no "return" value - so I'm curious what the machine code would cmp on.

I suppose one puts the value to write in a registry, then if the XCHG succeed, the "return" value will be in the registry. And the only way to detect success is if the value has changed (which is OK, overwriting a value with the same value... Is a logical noop...).

I suppose I could try something like the above, or:

http://www.cplusplus.com/reference/atomic/atomic/exchange/

And look at the output...


See my quick completely untested idea above in thread how it'll probably be in x86[-64] assembler.


Note: moved this from thread leaf here, because it broke the page formatting.

Can't check any reference now, but I think the whole while loop is roughly:

  loop:
    mov eax, 1 ; "true"
    ; edi points to &spinlock
    xchg [edi], eax 
    ; eax is AtomicExchange retval
    test eax, eax
    jnz loop ; while(eax != false)
    ; we've acquired the lock
X86-64 version is same, except the address is probably in a 64-bit register, like RDI. Although this would also work in 64-bit mode, if the pointer just happens to be in the lower 4 GB.


> Well-written spinlocks are absurdly fast

They're fast if there's low contention. If you have high contention then they burn CPU cycles needlessly and can be significantly slower than something more clever like Windows' CriticalSection.


64 bytes is the false sharing boundary of modern Intel CPUs. Heavily contentious variables will produce different synch behavior if they are not in the same 64 byte segment. This is documented in Intel's CPU manuals.


> Using a spinlock / mutex would improve performance if we're only talking about 2 threads.

... but that wouldn't be lock-free would it?


> ... but that wouldn't be lock-free would it?

Why do people do lock free?

1. Lockfree is sometimes easier to understand -- This case it is not. They've added a "watermark" variable to the queue, so its more difficult to reason about now.

2. Lockfree can be faster -- This is the case most people hope for. But it is far more difficult to do in practice. Spinlocks are surprisingly fast.

There's kind of an "implicit" assumption that you're going lock-free because of benefits of #1 (easier) or #2 (faster). But in this case... they built a lock-free data structure that is HARDER to understand and very clearly slower.


> 2. Lockfree can be faster -- This is the case most people hope for.

it's not about being faster, it's about never returning control to the OS if you want to achieve soft-real time in one of your threads (a common application being audio processing where you have to consistently fit as much as you can do every 1.33 milliseconds for the more demanding users).


Spinlocks don't return to the OS.

Mutexes are spinlocks for roughly 1000ish cycles on both Linux and Windows. If you're in-and-out before the 1000 cycles are up, the typical mutex (be it a pthread_mutex or Windows CriticalSection) will never hit kernel code.

If you take more than 1000ish cycles, then yeah OS takes over as a heuristic. But most spinlock code won't take more than 50 cycles.


How many cycles do you think it will take if the other thread has locked the spinlock and then died or was scheduled out...


Both Linux and Windows detects these cases with pthread_mutex (Linux spinlock+mutex) and CriticalSection objects (Windows spinlock+mutex).

Can you tell me an OS where what you say is a legitimate concern?

To answer your question: about 4000 cycles on Windows if I'm remembering off the top of my head. I think that's the default. Roughly 1-microsecond.


Detects these cases... and does what?

What can the OS do for you if your consuming thread has for example taken a lock and is now stalled on a page fault triggering IO from spinning disk that could take seconds to finish? Nothing that I'm aware of - you just have to wait until it runs again and releases the lock.

That's why people write lock-free - they don't want a stall or death of the one thread to stall other threads. Deadlock, lovelock, priority inversion.

Read the Wikipedia page for a starting point of motivations for lock-free.


Then schedules out the other thread?

I mean, what happens if the writer thread stalls out in the lock free case? The reader thread runs out of things to do and is also going to be scheduled out (eventually).

Like, that's the OS's job. To remove tasks from the "runnable" state when they stall out.

> That's why people write lock-free - they don't want a stall or death of the one thread to stall other threads.

Erm, if you're saying that the reader thread will be highly utilized... that's not likely the case if the writer thread in this "lock free" implementation stalls out.

Lets say the writer-thread hits a blocking call and is waiting for the hard drive for ~10 milliseconds. What do you think will happen to the reader-thread? In the "lock-free" implementation talked in this blog.


> I mean, what happens if the writer thread stalls out in the lock free case? The reader thread runs out of things to do and is also going to be scheduled out (eventually).

Yeah it keeps running, exactly as you say! It can keep working on jobs already on the queue independently.

Are you arguing that you you personally don't think it's worth doing lock-free for various practical reasons? Or arguing that's never why anyone does it?

Because the latter is simply false - I've done it for the latter reasons professionally both in academia and industry so there's one data point for you, and the literature talks about this use-case extensively so I know other people do as well.


> Are you arguing that you you personally don't think it's worth doing lock-free for various practical reasons? Or arguing that's never why anyone does it?

No. Lock-free is great. When it is done correctly.

The code in this blog post was NOT done correctly. It only works in the case of 1-reader + 1-writer, so the "advantages" you're talking about are extremely limited.

You have to keep in mind the overall big picture. Lock-free code is incredibly difficult to write and usually comes with great cost to complexity. However... sometimes Lock-free makes life easier (okay, use them in that case). Sometimes, lock-free can be faster (okay... use them in that case... if speed is important to you).

But in many, many cases, locked data-structures are just easier to write, test, and debug.


> It only works in the case of 1-reader + 1-writer

this is what is needed 99.9% of the time. e.g. look at the number of users of this C++ SPSC queue : https://github.com/cameron314/readerwriterqueue

or of the Boost one : https://github.com/search?q=boost%3A%3Alockfree%3A%3Aspsc_qu...


Yeah, but the "queue-of-queues" is a more traditional structure familiar to people with lock-free algorithms.

I don't have time to do a serious code review of everything... but immediately upon looking at that code, its already clearly written with better quality: with acquire/release fences being used in (seemingly) the correct spot (just as an example).

Like, the real issue I have here is the quality of the code that is being demonstrated in the blog post. A lock-free of queues-of-queues implementation makes sense in my brain. What's going on in the blog-post ultimately doesn't make sense.


> Erm, if you're saying that the reader thread will be highly utilized... that's not likely the case if the writer thread in this "lock free" implementation stalls out.

That does not mean that the reader thread has nothing to do. e.g. the most common case is that your reader thread will produce an output, and your writer thread sends messages to change how this output is generated - but even if there's no message, the reader thread has to keep going at a steady rate.


There is a third reason to go lock free: You cannot take locks or wait.

I've had some cases in embedded devices and OS development where taking a lock is not possible.

One example might be read buffer in an embedded device that is on one end being fed from an interrupt routine and on the other being consumed by a normal processor routine.

The interrupt routine can happen at any time the normal processor routine happens and it cannot wait or yield to it since there is only one core and no task scheduling. It needs to write to the buffer now, regardless of what the consumer is doing.

Lock-free has some desirable properties, it can run concurrently on single-core machines in presence of interrupts and it's safely reentrant.


3. Lock-free cannot deadlock when one of the threads dies in the middle of an operation.

4. Lock-free will not stall all other threads if one of the threads stalls in the middle of an operation.


3. If your writer thread dies randomly, then your reader thread has nothing to do.

3 part 2: If your reader thread dies randomly, then your writer thread eventually fills the buffer and has nothing to do.

Both cases turn into a deadlock with this "lock-free" buffer. You can't afford to have either thread die, regardless of the "lock free" or "locked" implementations.

> 4. Lock-free will not stall all other threads if one of the threads stalls in the middle of an operation.

Same thing here. The queue will eventually fill up if there's a stall, causing the reader (and / or the writer) threads to stall as well.

So in practice, neither #3 nor #4 actually help you out.


> Same thing here. The queue will eventually fill up if there's a stall, causing the reader (and / or the writer) threads to stall as well.

There's an important difference between "queue full" and "queue not full, but locked". In the former case, you can either block (waiting for the queue to drain) or discard the data; in the later case, you must wait for the lock to be released before deciding. That is, if you allow data to be discarded when the queue is full, a lock-free queue can also be stall-free.

(It might appear that a queue using a lock can also be stall-free, since the amount of work it does while locked is bounded; however that fails once the thread holding the lock is preempted by either a context switch or a hardware interrupt. To really avoid stalls in that case, you have to either block preemption and interrupts, or perhaps you might be able to make it work with some priority inheritance scheme. And yeah, I know that the hardware can "stall" while reading or updating a shared variable, but the amount of work the hardware does in that case is bounded.)


> (It might appear that a queue using a lock can also be stall-free, since the amount of work it does while locked is bounded; however that fails once the thread holding the lock is preempted by either a context switch or a hardware interrupt. To really avoid stalls in that case, you have to either block preemption and interrupts, or perhaps you might be able to make it work with some priority inheritance scheme. And yeah, I know that the hardware can "stall" while reading or updating a shared variable, but the amount of work the hardware does in that case is bounded.)

If you absolutely need non-blocking locks, then Linux has pthread_mutex_trylock() and Windows has TryEnterCriticalSection(). (Remember: Linux pthread_mutexes are spinlocks in most cases. And Windows CriticalSections are also spinlocks in most cases).

> That is, if you allow data to be discarded when the queue is full, a lock-free queue can also be stall-free.

And in practice, how is that different from a locked queue that simply discards data when pthread_mutex_trylock() fails? In both locked and non-locking implementations, you are assuming that the programmer can live with discarded data (or at very least: nonblocking is more important than the discarded data).

Its trivial to write an implementation with pthread_mutex_trylock() to be stall-free when you can simply discard data.


> And in practice, how is that different from a locked queue that simply discards data when pthread_mutex_trylock() fails?

The difference is that it only discards data when the queue is full. If you discard data when a trylock fails, you'll randomly discard data just because the other side happened to be looking at the queue at that moment. In the common situation where the consumer is being fast enough to always drain the queue before it fills, it's the difference between not discarding any data versus discarding an unpredictable subset of the data.


> In the common situation where the consumer is being fast enough to always drain the queue before it fills

There is literally no application I'm aware of where you can hold this assumption, if only because of the existence of swap memory and the way OSes page tasks in and out.

If you accept data-loss in the case of "poor performance on the consumer", you are tacitly accepting data-loss whenever that thread hits the "wrong" memory address that happened to have been swapped to disk, and now has to wait 10ms+ before that task can continue.

Its effectively random data loss from the perspective of the user.

------------

EDIT: I had a tangent here, but I probably shouldn't try to write a lock-free queue in one try in a stream of consciousness post. I just deleted it all, I probably had a subtle bug somewhere anyway.

------------

Seriously: look at the code in the blogpost. I'm not 100% convinced its bug free. The consumer-thread is still interacting with the "write" and "watermark" variables, and those two variables are NOT being updated atomically by the producer queue.

I think they manage to squeak by due to SeqCst, but its a very difficult "proof". I'm not entirely sure I can "prove" that the blogpost's code is correct.

But hey, if you think the code is fine, convince me why the following is NOT a bug.

    === In the "producer" thread ====
    
    buffer.write.store(buffer.write.load() + write_len)
Mind you, I don't think its a bug. But its really difficult for me to prove it. So I'm curious, why are you confident in their methodology when the above is such a difficult line to grok? And there's not a single compare-and-swap talked about in the entire blogpost (so they're definitely not using the "standard" methodology... to lock-free code)


To be fair the watermark is there to deal with the DMA problem, where you need a sequential chunk of memory.

Why not just create a ping-pong buffer setup though would be another question if the goal is DMA


If there is exclusive writing and exclusive reading, then it either isn't lock free or it isn't correct.


That's only an issue if there are multiple producers. That code works fine when there's a single producer, since nothing else will modify the "write" value between the load and the store. Given that this is in Rust, it's easy to encapsulate this data structure in a way that makes the compiler enforce this invariant.


The article stipulates there is a single reader and a single writer. The actual API enforces this constraint.

https://docs.rs/spsc-bip-buffer/0.1.1/spsc_bip_buffer/fn.bip...


Just because you're using atomics doesn't mean that you've created an atomic transaction.

Raymond Chen of Microsoft has a much simpler example of that:

https://devblogs.microsoft.com/oldnewthing/20040915-00/?p=37...

Yeah, its not easy. But this is NOT an easy subject by any stretch of the imagination.

Having had to debug many difficult race conditions before, I can say that anything related to concurrency is extremely difficult to reason about, and needs to be done with a lot of careful thought.


Although dragontamer's complaints are mostly correct, I think he's being a little uncharitable and is largely missing the point.

Yes, writing lock-free code that's generic, that supports many-to-many reads/writes, and that's correct on all architectures is hilariously hard, and most people should not try to implement these from scratch at work. Other approaches can be more performant, and can have acceptable trade-offs for most use cases.

Are there issues with this one? Maybe. As dragontamer stated multiple times, this stuff is pretty hard to reason about.

HOWEVER, this ring buffer was designed to run on embedded systems. As usual, the constraints are often a little bit different when you're writing software for a microcontroller.

As an example, let's imagine you have periodic bursts of data coming in on a serial port at 2 Mbps, and your microcontroller is running at 32 MHz. Also, your serial port only has a 4 byte hardware buffer, and the cost of missing a byte is... I don't know, sending an endmill through the wall.

You can absolutely make this work, and you can even formally verify that you'll always meet timing, but you're going to have a really hard time doing this if you also have to analyze every other piece of code that will ever try to acquire that lock. A single-producer, single-consumer, lock-free queue with fixed message sizes can be implemented correctly in about 20 lines of C. It has a lot of annoying constraints, and you still have to use atomics correctly, and you still need a memory barrier, and don't even think about resetting the queue from either thread, and... (etc).

But if you're careful, a queue like this can make an otherwise impossible task relatively manageable. I can no longer count the number of times a construct like this has saved a project I was involved with.

Would it automatically run correctly and at full performance on a Xeon running Linux? Fuck no. But that's not the point.

The desirable quality of lock-free queues for embedded systems is correctness, not performance.


Can you give the 20 lines of example code? Would really love to see some modern atomics code.


Okay, not quite 20 lines, but something like this:

https://gist.github.com/ohazi/40746a16c7fea4593bd0b664638d70...

I've dressed it up to look like C11, but vendor specific macros from your microcontroller support library are more common. I think you can relax some of the memory ordering too, but I don't remember the details off-hand.

Also, sometimes you can have a modified version of queue_push / queue_pop where a DMA handles the memcpy and then you only have to update the pointers in the cleanup interrupt (along with configuring the next or next-next DMA if the queue isn't full / empty).


The atomic_signal_fence in queue pop/push seems suspicious. Unless you are doing something like asymmetric fences or actually synchronizing with signal handlers you probably want atomic_thread_fence. Still I'm not sure why you want a fence at all there. Publishing is done by the atomic_store to queue_rd and wr which already include (I'm being fast and loose here) the equivalent of a seq_cst fence.


The reason you want a barrier there is to make sure the memcpy and the pointer update don't get reordered so the other thread doesn't have a chance to access that block before it's ready.

You don't actually need any synchronization with the other thread, which is why I think this can work instead of atomic_thread_fence. But you could be right, anyone using this should check this carefully.

Edit: I think you're right. If it is needed, it should be a thread fence. I'm not 100% clear on whether the pointer update takes care of this, so it may not actually be needed.


The block is 'published' when the store to the write pointer is performed. That's a store release which will have an implicit barrier before the store.


Okay yeah, you're definitely right. I didn't realize that the stdatomic load/stores were by default seq_cst. The libraries I'm familiar with have macros for standalone atomic reads and writes, but you have to place any barriers manually. If you were using a library like that, you'd place them in between the memcpy and the pointer update, but here you get that for free.


>This is the story of how Andrea Lattuada (PhD student at ETH Zurich) and James Munns (from Ferrous Systems) designed and implemented (two versions!) of an high-perf lock-free ring-buffer for cross-thread communication. If any of those words look scary to you, don't fret, we'll explain everything from the basics.

This is not the message I want to see introducing such a complex topic. Writing correct lock-free code is incredibly hard, and you're not going to get it right or even understand it from a single post. I've done a bit of reading on this subject, and I'm not even sure that it's possible to write a correct lock-free queue (of which this is a variant) in a non-garbage collected language without some pretty substantial trade-offs regarding memory management.


Pikus does it. I made a post about it. https://news.ycombinator.com/item?id=20096907

The talk is 3.5 hours, about as long as you'd expect one to be for "writing a concurrent lockfree queue". Its an incredibly difficult topic, but I'm convinced that Pikus's implementation is actually correct.

Although, maybe there's a subtle bug somewhere. And Pikus is REALLLY stretching the definition of a queue to make his implementation work. (This queue may sometimes function like a stack, but that's okay because there's no way for the end programmer to figure it out without a global lock)

---------

EDIT: Oh, I guess you wouldn't consider it "truly lock free", because he has a global lock in the case of the rare situation of a full queue. He uses double-checked lock pattern for that case (and causes an expensive, but temporary, global lock on the data-structure) and basically says "This situation only happens once per hour, so its probably not bad".

So Pikus's implementation is "lock free mostly", but still falls back to a global spinlock in exceptionally rare corner cases.


Oh, that's actually a different lock-free talk from Pikus than I expected. I haven't it watched yet. Thanks!

I probably wrote my original comment a little too quickly. Yes, it's possible to write useful mostly lock-free data structures, but it's hard enough that I consider anyone claiming to have done so in a way that beats the equivalent locked structure (as you pointed out above) to be making an extraordinary claim that requires extraordinary evidence and serious consideration of the difficulty involved.


> I probably wrote my original comment a little too quickly.

Nah. Anyone who actually understands this subject recognizes that a truly lock-free concurrent queue is probably one of the most difficult problems in all of Comp. Sci.

Even if you're using absolutes, its basically warranted because of the difficulty of the problem. No one will blame you for talking about how hard this problem is, even with a bit of hyperbole being used


> This is not the message I want to see introducing such a complex topic.

I don't see the need for such gatekeeping. I think telling people over-and-over how "incredibly hard" something is will tend to discourage people from learning about that thing. I think that's bad.


Warning about complexity and gatekeeping are not the same thing.

First-year programmer says "I can't wrap my head around Python, I'll write my own language!"

A response of "You're an idiot, don't bother, people smarter than you have tried and failed" is gatekeeping, and shitty.

A response of "There is a lot about this area you do not understand, be aware that it is a lot harder than you seem to think it is" is cautionary, but not gatekeeping.


You have moved the discussion away from the original points. What started this is a sentence in the article encouraging readers to continue because the author will explain the concepts behind potentially new words. The poster I replied to complained about that sentence because it does not convey that the topic is complex.

But almost all important topics are complex, and all education has to start somewhere. My claim is that insisting we emphasize the complexity rather than assuring the understandability is gatekeeping because it discourages new entrants and promotes an aura of mystery around the subject.


There are plenty of hard lock free algorithms, but an SPSC ring buffer is not exactly rocket science and it is a good introduction on the topic.

Why do you think that a GC is required for a queue? It is certainly very helpful to implement complex node-based intrusive concurrent data structures, but for queues, even node based and even MPMC, as long as each message is produced by one thread and consumed by another one (i.e. the typical use case), single ownership works just fine.

edit: I see you are talking about hidden locks in the allocator. That's fair, but even GC at some point might need to hit the kernel to increase the heap size.


>I'm not even sure that it's possible to write a correct lock-free queue (of which this is a variant) in a non-garbage collected language.

I'm guessing because of the "magical free that doesn't lock" assumption in several papers?

There are some ideas in the Rust community (mostly centered around audio), most of them use garbage collection schemes that prevent either the producer or consumer thread from locking. It's tricky and easy to break.


Yeah, that's pretty much exactly it. Every "lock-free queue without garbage collection" I've seen is either broken, relying on hidden locks (most commonly in free), or implementing garbage collection while calling it something else.


I'm biased because I've worked on these things, but imo implementing garbage collection for a very limited set of data structures that are all deterministic or explicit in their deallocation is much easier to deal with than a general purpose GC for an entire program. It allows you to guarantee lock/wait free behavior on particular threads, which you don't get with a GC'd language as a whole (depending on the GC of course).


Well, then your mind should be blown as the lock free queue provided in this article works without either locks or garbage collection.

The trick here is that this queue has a fixed capacity so it can allocate everything upfront. You are right that infinite capacity queues can be a bit tricky to implement without violating your conditions. (In many cases you don't actually need an infinite capacity queue though ...)


No, the trick here is that they limit the queue to a single producer thread and single consumer thread. With only two threads operating on the queue, it's extremely unlikely that this is faster than a locked version of the same structure.


As long as it is using sequentially consistent writes is not going to be faster than a two lock queue, as, at least on x86, an uncontended sc write is exactly as expensive an uncontended spin lock acquire (and a spin lock release is very cheap).

On the other hand, a SPSC ring buffer can be implemented purely with acquire loads and release stores, and if one takes care of avoiding reading recently written cachelines when not necessary it can have higher throughput than a spin-locked queue.


Which performance is probably bad.


> [...] I'm not even sure that it's possible to write a correct lock-free queue (of which this is a variant) in a non-garbage collected language without some pretty substantial trade-offs regarding memory management.

There's no dynamic memory allocation going on in this particular lock-free queue, so I don't see how your concern is relevant here?


After reading the full article, they solve a vastly more restricted (although perhaps still useful) problem than the title implies. Their design assumes a single producer and single consumer and is trivially broken if that's not the case. I don't have time to write a convincing set of benchmarks to prove it, but I very much doubt that this is faster than a locking implementation of the same structure.


I can attest to the basic lock-free spsc queues being far faster than a basic spinlocking version of the same queue, on xeon-class X86 at least.

LOCK'ed instructions (one might use lock xchg or lock cmpxchg) perform a full memory barrier, which has quite a few performance implications if there are any outbound loads/stores. Further, if the cache line containing the spinlock is in anything except an exclusively-held state on the writing core (it will usually be on a different core or in a shared state), simply acquiring the lock will stall for ~25-30ns+.

On the other hand, a simply spsc lock-free queue can hit <10ns per message with a latency about equal to the core-core communication latency.


I implemented such a buffer in 2006. It was used for fast message passing from user space to the kernel. When a message was too large to fit at the end of the circular buffer, a zero-length message was placed into the remaining space; that indicated "wrap to the beginning". It's so obvious, it must have been implemented numerous times by others before me.

If I were to implement the same thing again, I would just split the messages and use scattered writes (struct iovec) to process them. I cannot remember exactly, but I think the only reason the messages were linear was just for the kernel thread to be able to write them out in a single call.

Oh wait, now I remember; I think the buffers were linear due to the production side: sprintf being used to produce some of the message payloads, and that requiring a linear buffer. But of course it's possible to support both wrapped messages and the "tail space not used" indicator.


I was wondering about having a special message to indicate the end of the buffer while reading the article, so thanks for pre-emptively answering my question!

Another thing I was wondering about is using a simple buddy allocator with the ring buffer being just a list of pointers. There would be a second ring buffer to send memory back to the producer to be freed, and you would statically allocate enough memory so fragmentation doesn't affect the particular application---you know each message has a limited lifetime, for instance. (The allocator would not be shared between different ring buffers.)

Then again, I don't see any benefits to doing this.


You can make wrapping writes look contiguous by mapping the same memory twice, one after the other. This is done in slice-deque such that the entire buffer can be viewed as a contiguous slice, regardless of the start/end positions.

https://crates.io/crates/slice-deque


This trick may break ordering and cache aliasing rules on many architectures. If you're dipping into this kind of thing, it needs per-architecture whitelisting.


I'm not sure about the ordering part, but about "cache aliasing rules" - can you give some details? This situation can be trivially set up via mmap, so if that breaks it would seem to be a problem?

I understand that aliasing is an issue when it comes to hardware design, but the practical need to support systems that allow this to occur means that all major architectures I'm aware of support transparent V->P aliasing, either by having their caches physically indexed, effectively physically indexed like VIPT or some strategy where V is used to look up the way in L1 but misses due to aliasing will be resolved in the L2 (i.e., aliasing "works" but is slow), or some other similar strategy.

I would be curious what systems don't work this way.

Note that I'm talking only about data - systems definitely have all sorts of rules about modification of instruction-containing pages.


I have also seen this trick called the magic ring buffer.


Also important, not using a power of two sized ring is leaving lots of optimization on the table.


Eh, maybe/maybe not. This implementation requires a family big boy CPU with a really MMU; I'd be willing to bet that most of the extra latency of the divides are hidden in the memory access latencies.


There are two implementations described in this blog post and one of them is specifically designed to run on microcontrollers.


CPUs are pipelined though so you can easily imagine many slots being operated on at once. Perhaps a better metric is the relative cost of the slot assignment against the the cost of the read/write operation. Not to mention the fact that elements may be in L1, in which case the penalty of additional cycles may actually be felt.


Reminds me of the LMAX architecture.

Slides of a talk I gave about LMAX

https://www.slideshare.net/Stephan.Schmidt/lmax-architecture...


What is considered the bible on lock free programming?


I don't think there is one, yet. 'Multiprocessor Programming': https://www.amazon.com/Art-Multiprocessor-Programming-Revise... has some algorithms, though.


The Art of Multiprocessor Programming. by Maurice Herlihy and Nir Shavit is an excellent start.


I'm guessing this still uses locks, but at the CPU/cache level instead of inside the algorithm.


I've always thought "lock free" was a misnomer for this kind of thing because they use multi-stage atomic instructions which obviously require some kind of implied locking. I'd probably call them "non-blocking".




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: