A lock-free ring-buffer with contiguous reservations (2019) (ferrous-systems.com)
221 points by simonpure 10 months ago | 115 comments



Oh hey, one of the authors of this post here (James), happy to answer any questions.

This post has been discussed here a couple times, but AMA :)

Edit: the most commented version of this post was the original:

https://news.ycombinator.com/item?id=20096946.

This is what I'm up to these days:

https://onevariable.com/


Has there been a formal specification/model that we can check? I love circular buffers but I'm curious how we know that the design is correct with respect to the stated properties.

I did a bit of link diving to the various blog posts and sites but haven't been able to find it. Would be nice, if it exists, to have it front and centre.


As far as I know, it hasn't been formally verified.

Andrea Lattuada (https://andrea.lattuada.me/) is more likely to have done work in his implementation of the algorithm than I did on mine.

I have run the testing with various dynamic analysis tools, but for sure this isn't formal verification. If someone is interested in doing it, happy to chat with them!


Shouldn't be that hard to at least model check it in TLA+ I would have thought (albeit potentially more complex if trying to account for weak memory).


Please excuse me if my question is dumb, I have little understanding of Rust. You declare your implementation "lock-free", but are using Atomic* Rust primitives at the same time. Those are using CPU atomic instructions, if I googled correctly. Those, in turn, are using the CPU's locking mechanism. Turns out that you just shifted locking from language to CPU, right?


It's a reasonable question, because lock-free is a confusing term.

Lock-free here does not mean "without any form of synchronization primitive". That is impossible. Some synchronization must occur.

Instead, it is a term of art that means:

1. Thread failure can't cause other threads to get blocked.

2. Forward progress in the algorithm is guaranteed.

Basically: threads can't block each other and death of threads does not screw things up.

https://en.wikipedia.org/wiki/Non-blocking_algorithm

CPU atomics often work by blocking (they have to) but they are non-interruptible instructions and there is a fixed maximum execution time.

(So you can still guarantee forward progress if you use them).
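For a concrete illustration of the distinction (my own sketch, not from the post): if a thread holding a mutex is descheduled or dies, everyone else queues up behind it; with an atomic read-modify-write, a stalled thread can never leave anything "held".

  use std::sync::Mutex;
  use std::sync::atomic::{AtomicU64, Ordering};

  // Blocking: if the holder stalls or dies inside the critical section,
  // every other thread waits (or sees a poisoned lock).
  fn bump_locked(counter: &Mutex<u64>) {
      *counter.lock().unwrap() += 1;
  }

  // Lock-free: the increment is a single bounded atomic instruction
  // (e.g. LOCK XADD on x86); no thread can block another.
  fn bump_atomic(counter: &AtomicU64) {
      counter.fetch_add(1, Ordering::Relaxed);
  }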


Yes, this is correct, and the most overlooked aspect and reason for the misnomer. Atomics imply locking at the CPU. Depending on the CPU, the lock happens either for the entire memory bus (pre Intel P6 on x86) or as part of snoop-disable bits on relevant cache lines in the snooping protocol.


It's not a misnomer, it's just a bad term of art. When it was coined folks were well aware that atomics locked at the CPU level.


“In discussing the question, he used to liken the case to that of the boy who, when asked how many legs his calf would have if he called its tail a leg, replied, ‘Five,’ to which the prompt response was made that calling the tail a leg would not make it a leg.”


The "locking" happening on the CPU level is very different from the software level locking, as it is guaranteed by the hardware to be bounded (technically bus arbiters are not guaranteed to complete in finite time, but they are astronomically unlikely to do so).


I used the same approach while designing a lock-free bounded broadcast log (as in a "RWLock<Vec<T>>"; an MPMC, append-only Vec). It's quite easy to do because it's bounded. However, I could not find a way to make it both unbounded and efficient.

Any ideas ?


It's very hard to beat locking when the queue needs to grow. It comes down to the usage statistics: you're going to grow more often as the app warms up, reach a steady state, and only need to grow under heavier-than-expected load. And in that case you probably aren't going to see performance dominated by time spent waiting to acquire a write lock.

The alternative is to dive into the literature on lock-free mpmc queues, which are kind of gnarly to implement. A lot of the literature handwaves ABA problems with stuff like "use hazard pointers and RCU" without correct pseudo code to help you.

That's why locking in unbounded queues is popular, imo. It's not actually inefficient, and the alternatives are arcane enough to avoid. No one is going to understand or trust the code anyway.

It's worth mentioning that "lock free" means "one task does not prevent other tasks from making progress", and in the case of a bounded queue, you can trivially accomplish that by busy-waiting when the queue is full. This isn't appropriate if you need consumers to receive events in the order in which they happen, but you can kind of fix that using an atomic counter (to paraphrase Mike Acton, if you have a safe counter, you have a safe queue) and timestamping events/sorting them as they're consumed.


> The alternative is to dive into the literature on lock-free mpmc queues, which are kind of gnarly to implement. A lot of the literature handwaves ABA problems with stuff like "use hazard pointers and RCU" without correct pseudo code to help you.

Yes and I do think they are fundamentally incorrect or they cannot be translated to the log use case.

I also agree that, performance-wise, the lock makes little difference. In fact my old benchmarks tended to show that locks were MORE EFFICIENT than atomics on ARM64 (MBP M1), for example. It's more like a fun little exercise, and also to confirm that I'm not completely dumb and that the problem is not solvable with the simple use of 2 atomics and a counter.


I'm no expert here, but I wonder if a linked list of bounded logs would work well.

So, everyone has a pointer to the current one, and there's an atomic pointer to the next.

When the log fills up, you allocate a new one and set the next pointer to it using compare_and_swap. If someone beat you to it, you can walk the list and add yours to the end so that the allocation isn't wasted.

This way, most of the time you're using the efficient bounded log.
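A rough Rust sketch of that scheme (mine, untested; the per-segment storage and write cursor are elided, and segments are assumed to live for the program's lifetime):

  use std::ptr;
  use std::sync::atomic::{AtomicPtr, Ordering};

  struct Segment<T> {
      slots: Vec<T>,                // fixed-capacity, append-only storage (details elided)
      next: AtomicPtr<Segment<T>>,  // link to the next bounded log, null if none yet
  }

  // Try to attach `new_seg` after `seg`; if another writer won the race,
  // follow their link and retry so the allocation isn't wasted.
  fn append_segment<T>(mut seg: &Segment<T>, new_seg: *mut Segment<T>) {
      loop {
          match seg.next.compare_exchange(
              ptr::null_mut(), new_seg,
              Ordering::AcqRel, Ordering::Acquire,
          ) {
              Ok(_) => return,
              Err(existing) => seg = unsafe { &*existing },
          }
      }
  }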


In the case of a bounded log, readers are expected to give an offset at which they want to perform the read (Kafka-like).

So the linked list would involve going through all the links and following end tail refs/pointers. It would make reading O(n) and that's a Nope.

However, you could imagine having a Vec index that contains a ref to all allocated inner-logs, and query the index first in order to obtain the buffers' location. That works, but then the index has to go through a lock (either a RWLock or a mutex) as the CaS operation isn't enough if we get to the end of the index and it needs to be reallocated. It's fine, and I think that's the most appropriate solution.

PS: In fact, there is a sweet spot where you'd like to have a semi-unbounded log. If your index is big enough to contain something like 4Md entries, you'd probably end up splitting the log into several pieces for archiving and performance purposes. Loading the log (fully or partially) from disk efficiently is then more important than having a truly unbounded log. Then you would not necessarily use a lock and could CAS in the index.


I am not sure! Most of the data structures I design are for embedded systems without allocators. On the desktop, I mostly defer to others.

I've used tokio's broadcast channel quite a bit before, but it is also bounded. After talking to Eliza from the tokio project, I'm fairly convinced that unbounded queues are a scary thing to have around, operationally :).

But again - this is a bit out of my actual expertise!


Not sure if it could be extended here, but I've seen a lock free hash map that supported lock free reallocation by allocating new space and moving each entry one by one, either when the entry is accessed or in a separate thread concurrently. Accessing an entry during the reallocation would check the new region first, and if not found check the old region. Entries in the old region would be marked as moved and once all entries were moved the old allocation could be freed.


For the (un)bounded logs, the whole concept relies on the fact that the log isn't going to move once allocated, and that references to an item will never be invalidated until the end of the program.


ELI5?

I see a lot of critique in the previous (2019) thread, but no summary in key points. What are the reasons that this is hard?

In multiprocessor systems, are memory writes not guaranteed to be sequential? e.g. can data be written out-of-order, after the synchronization bit is written?

Or is it more about the use case, that it is optimized for a minimal number of instructions, e.g. avoiding additional memory reads (e.g. by writing data to the buffer, instead of storing pointers)?

Or is it that you're trying to use constant memory (irrespective of number of threads)?

Because to me, it seems like a trivial problem to solve if you have sequential writes, store data separately from the queue, and can scale memory linearly with the number of threads.


ELI5: We have a shared chunk of memory, with one sender, and one receiver. We want this to work without an allocator.

Instead of pushing and popping one byte/thing at a time (inefficient, a lot of overhead), we want to have the ability to push or pop big chunks at a time.

We also don't want to pay to copy these chunks in or out, because we're going to have the hardware do it for us autonomously. This means we have to be very careful that the "one side" doesn't look or touch a chunk of memory currently being used by the "other side".

The ideal flow is that we have the CPU get some "writing space", the CPU asks DMA to fill it up, when DMA is done, the CPU marks it as "ready to read". Then at some later time, the CPU (maybe another core or thread) asks for some "ready to read space", the CPU then either uses it, or maybe asks DMA to copy it somewhere else. Then the CPU marks that data as "successfully read", and the space can be recycled the next time something wants to send.

The trick is how you coordinate access to shared memory between two CPUs, correctly, and with the least overhead, so neither side prevents the other from reading or writing, as long as there's a little slack in the line.
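In code, that flow looks roughly like this (pseudocode; the names are illustrative, not the exact bbqueue API, and `start_dma_write`/`process` are placeholders):

  // Writer side (CPU + DMA):
  let mut wgrant = producer.grant(64)?;  // reserve 64 contiguous bytes of "writing space"
  start_dma_write(wgrant.buf_mut());     // the hardware fills it autonomously
  // ... later, when the DMA-complete interrupt fires:
  wgrant.commit(64);                     // mark those bytes "ready to read"

  // Reader side (possibly another core or thread):
  let rgrant = consumer.read()?;         // get a contiguous "ready to read" slice
  let n = rgrant.buf().len();
  process(rgrant.buf());                 // use it in place, or point DMA at it
  rgrant.release(n);                     // mark it "successfully read"; the space is recycled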


Seems to me that the main critique was from someone who didn't understand the specified problem, and complained that this didn't solve a more general one.

(... and probably wasn't aware that the assumptions made in the specification can be encoded in the API signature in Rust. This wouldn't be possible in C, which is why C forces you to solve the harder problem or rely on your users to not accidentally misuse the data structure.)


There is this snippet of code in the article:

  if buffer.len.saturating_sub(buffer.write.load()) >= write_len {
    // not shown: check `read` to make sure there's enough free room
    buffer.watermark.store(buffer.write.load() + write_len);
    buffer.write.store(buffer.write.load() + write_len);
  }
I would have to look at the implementation to know for sure, but that part looks incorrect to me. Suppose that the writer has placed a watermark strictly before the end of the buffer, wrapped around, and is about to write a new message; meanwhile, the reader has not reached the watermark yet (and therefore, has not wrapped around), but it has made enough progress to leave room for the writer's new message. In that case, we have write <= write + write_len < read <= watermark < len. As written, it would seem that the snippet above would incorrectly update watermark, whose value is still needed by the reader.

It seems to me that watermark need not be atomic anyway: it is owned by the writer when read <= write, and by the reader when write < read. Whoever wraps around implicitly passes the ownership to the other thread. More precisely:

* In the initial state, and until the writer reaches the end of the buffer, read <= write, and the writer owns watermark, whose value is irrelevant. In particular, the reader makes sure not to overtake write, and knows that it must not access watermark.

* When the writer needs to wrap around, it first sets the watermark (possibly at len, if the hole is empty), then updates write. With its usual release memory ordering constraint, the atomic store to write transfers to the reader the ownership of the written messages, but also of watermark.

* Eventually, the reader notices that write has wrapped around, and then uses watermark to know where to stop reading. Meanwhile, the writer may fill the beginning of the buffer, carefully not overtaking read - 1, and never touching watermark.

* When the reader finally wraps around, the atomic write to read transfers to the writer the ownership of the consumed message slots, but also of watermark, and we are back in the first phase.

IOW, watermark has the same owner as the last element of the array.
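Ignoring the reserve/commit split the real implementation uses, the wrap step described above would look something like this (pseudocode in the style of the article's snippet):

  // Writer wrap: the hole at the end is too small, but there's room at the front.
  // read <= write here, so the writer still owns `watermark`; a plain store is fine.
  buffer.watermark = buffer.write.load();
  // The release store to `write` publishes the watermark along with the wrapped
  // position, handing ownership of the array's tail (and watermark) to the reader.
  buffer.write.store(0);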

What do you think? What have I missed?


Hi James! Does `https://onevariable.com/blog` have an rss/atom feed?


> The safe thing to do here is to always choose Ordering::SeqCst, "sequential consistency", which provides the strongest guarantees. ... This is often good enough in practice, and switching to a weaker Ordering is only necessary to squeeze out the last bit of performance.

If you're going to write lock-free algorithms using atomics, the least you can do is learn about ordering semantics and use the correct abstract semantics for your design's actual needs. It is much easier to do it at design time than to try to figure out if it is safe to relax SeqCst later. (This is one of the major flaws of C++ std::atomic's default SeqCst semantics.) If you aren't going to bother understanding ordering semantics, it is unlikely you can write a safe lock-free algorithm anyway. (It's really hard to do!)


Back then, there weren't as good references for explaining atomic ordering, and the blog post had gotten long enough. Mentioning SeqCst was a bit of a cop out, though neither Andrea nor I ended up using SeqCst past the initial impl anyway.

Today I would have just linked to https://marabos.nl/atomics/, Mara does a much better job of explaining atomics than I could have then or now.


Back then? Do you mean 2019? Or a different “then”? Because there was plenty of material in CS about this subject even in 2010. Java was wrestling with this twenty years ago, and databases long before that.


I think the hard part of it is that x86 only has one atomic ordering and none of the other modes do anything. As such, it’s really hard to build intuition about it unless you spend a lot of time writing such code on ARM which wasn’t that common in the industry and today most people use higher level abstractions.

By databases, do you mean those running on DEC Alphas? Cause that was a niche system that few would have had experience with. If you meant to compare in terms of consistency semantics, sure, but there are meaningful differences between database consistency semantics of concurrent transactions and atomic ordering in a multithreaded context.

Java’s memory model “wrestling” was about defining it formally in an era of multithreading and it’s largely sequentially consistent - no weakly consistent ordering allowed.

The c++ memory model was definitely the first large scale adoption of weaker consistency models I’m aware of and was done so that ARM CPUs could be properly optimized for since this was c++11 when mobile CPUs were very much front of mind. Weak consistency remains really difficult to reason about and even harder to play around with if you primarily work with x86 and there’s very little tooling around to validate that can help you get confidence about whether your code is correct. Of course, you can follow common “patterns” (eg loads are always acquire and stores are release), but fully grokking correctness and being able to play with the model in interesting ways is no small task no matter how many learning resources are out there.


Nit: x86 has acquire/release and seq_cst for load/stores (it technically also has relaxed, but it is not useful to map it to c++11 relaxed). What x86 lacks is weaker ordering for RMW, but there are a lot of useful lock free algorithms that are implementable just or mostly with load and stores and it can be a significant win to use non-seq-cst stores for this on x86
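A small illustration of where the win comes from (to my understanding, on x86-64 a Release store lowers to a plain MOV, while a SeqCst store needs an XCHG or MOV+MFENCE):

  use std::sync::atomic::{AtomicU32, Ordering};

  fn publish_release(flag: &AtomicU32) {
      flag.store(1, Ordering::Release);  // x86-64: ordinary MOV
  }

  fn publish_seqcst(flag: &AtomicU32) {
      flag.store(1, Ordering::SeqCst);   // x86-64: XCHG (or MOV + MFENCE)
  }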


I would have to imagine you mean x86-64 right? I would imagine 32bit x86 doesn’t have those instructions?

I’m also kind of curious if a lot of modern code compiled to x86 would see consistency issues running on old CPUs before TSO was formalized (like a p2 multiprocessor server).


32-bit x86 has many of the same instructions, including cmpxchg8b (in models dating to the 90s).


Indeed there is different code generated by seq_cst for stores. Though for loads it appears to be the same: https://godbolt.org/z/WbvEcM83q


Re: the godbolt example, note that release semantics are not meaningful for load operations.

> If order is one of std::memory_order_release and std::memory_order_acq_rel, the behavior is undefined.

https://en.cppreference.com/w/cpp/atomic/atomic/load


Yes, seqcst loads map to plain loads on x86.


X86 might but devices connected to it in embedded world have had to be very very aware of this stuff since the 90s.


Embedded devices did not necessarily use the C++ memory model, and definitely not in the 90s, and were highly likely in-order CPUs to boot, with no crazy compilers, so atomics didn't matter too much anyway (volatile was sufficient). They had a weaker memory model maybe, but at the same time multithreading on embedded did not really exist, as it was only being introduced into the industry with any real seriousness around that time (threading on Linux started to shake out around the mid 90s).


SMP systems were widely in use in the 1990s, but you're correct the dual core MIPS was 2003ish in embedded.


I meant 2019, and there weren't any materials that I would consider as clear and well defined as Mara's linked docs explaining the different orderings used by C, C++, and Rust (Relaxed, Release, Acquire, AcqRel, and SeqCst).

I'm very sure there were discussions and teaching materials then, but none (that I was aware of) focused on Rust, and something I'd link to someone who had never heard of atomic ordering before.


Chapter 7 doesn't test if performing loads on a reader thread makes a writer thread any slower to perform relaxed writes. Does a concurrent reader slow down writes or not (the LMAX Disruptor relies on variables with one writer and many readers, and claims it's fast), and does it depend on the CPU's cache coherence protocol?


> It is much easier to do it at design time

Is it? I always worked the second way (starting from seq_cst and then, when the core design matured enough and didn't change for a few months, trying to see what could actually be relaxed). I'd be very afraid that in the first case, you start with say relaxed semantics somewhere, then you change the design because the requirements changed, and now you have to go again through all the atomic operations to make sure the assumptions all still hold.


Back when this post was written, I would have agreed with you. But Mara's book makes a good case for this:

https://marabos.nl/atomics/memory-ordering.html#common-misco...

  More importantly, when reading code, SeqCst basically tells the reader:
  "this operation depends on the total order of every single SeqCst operation
  in the program," which is an incredibly far-reaching claim. The same code
  would likely be easier to review and verify if it used weaker memory ordering
  instead, if possible...
  
  It is advisable to see SeqCst as a warning sign.


If you change the design of a lock free algo you very likely have to go through all the atomic operations to make sure that all assumptions hold anyway.


> If you aren't going to bother understanding ordering semantics, it is unlikely you can write a safe lock-free algorithm anyway.

I think the implicit suggestion here is that the target audience for this abstraction is actually two separate developers:

1. A developer who doesn’t know much about locking semantics, but can write the simple-but-slow case correctly.

2. Another developer, much later on — probably a senior SRE type — who can revisit this code and optimize it with full understanding of the constraints.

(This might also be the same person, years later, with more experience under their belt.)

The benefit of the way this library is designed, is that the second developer doesn’t have to completely rewrite everything just to optimize it. The naive dev’s code has already been forced into an “almost but not quite” lock-free mold by the design of the library’s API surface.


I've never actually seen a production lock-free algorithm that uses SeqCst. I have a hard time even imagining an algorithm where SeqCst is the right choice.

It seems like SeqCst was chosen as a default to reduce the surprises and gotchas around lock-free programming. But lock-free programming is inherently tricky; if you're trying to avoid surprises and gotchas you should probably use a mutex.


It is the right choice whenever you need linearizability. I can't believe you've never seen a (correct) production lock-free algorithm impl that used SeqCst. Many lock-free algorithms require SeqCst for correctness. Here's a trivial example: hazard pointers. Any thread publishing its hazard pointer must use a StoreLoad barrier (equivalent to SeqCst) to ensure any GC thread scanning the publication list sees its hazard pointer, before it deallocates pointers in the limbo list that didn't appear in the scan. MemSQL actually wrote a blog post on a nasty bug in their database arising from their use of AcqRel for this operation instead of SeqCst: https://www.singlestore.com/blog/common-pitfalls-in-writing-....
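A minimal sketch of that publication step (mine; the retirement/scan side, which must also load the hazard list with SeqCst, is omitted):

  use std::sync::atomic::{AtomicPtr, Ordering};

  // Publish a hazard pointer for the object currently pointed to by `shared`.
  fn protect<T>(shared: &AtomicPtr<T>, my_hazard: &AtomicPtr<T>) -> *mut T {
      loop {
          let p = shared.load(Ordering::Acquire);
          // SeqCst store: this is the StoreLoad barrier -- the scanning thread
          // must be able to see the hazard before we trust the re-read below.
          my_hazard.store(p, Ordering::SeqCst);
          // If the pointer is unchanged, it cannot have been retired and freed
          // before our hazard became visible, so it is safe to dereference.
          if shared.load(Ordering::SeqCst) == p {
              return p;
          }
      }
  }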


This blog post leaves a lot to the imagination.

> Here’s the case that broke our stack:

  Thread 1, in preparation for a Pop operation, reads the head of the stack.
  Thread 1 writes the current head to its hazard pointer (using only release semantics, 
  which are weaker than sequentially consistent semantics).
  Thread 1 reads the head of the stack again.
  Thread 2 removes head from the stack, and passes it to the garbage collector thread 
  (using sequentially consistent memory semantics).
  The garbage collector scans the hazard pointers, and (because the assignment was not 
  done with sequentially consistent memory semantics) is not guaranteed to see thread 
  1’s hazard pointer pointing to the node.
  The garbage collector deletes the node
  Thread 1 dereferences the node, and segfaults.
My interpretation is that with release semantics for the store, the 2nd read (load) in Thread 1 is actually allowed to be reordered before the release store to the hazard pointer. But they are not very explicit about it.

> So if thread 2 removing the pointer happens first, thread 1 will see a different value on its second read and not attempt to dereference it.

Thread 1 will see thread 2's remove even with release semantics for that store -- the store has a data dependency on the first load; they cannot be reordered.

> If thread 1 writes to its hazard pointer first, the garbage collector is guaranteed to see that value and not delete the node.

Yeah, this must be it. Thread 1 fails to notice the GC happened while it was writing its HP because its second load actually happened before the HP store.

Folly's hazard pointer implementation uses a release store to update the hazard pointer (here: reset_protection()), but uses some sort of SeqCst barrier between the store and the 2nd load (with acquire semantics): https://github.com/facebook/folly/blob/main/folly/synchroniz...


Yes, the store to the HP entry must happen-before both the second load of the global pointer in the publishing thread and any load of the HP entry in another (GC) thread. (The first load + store + second load emulates an atomic memory-to-memory copy of the global pointer to the HP entry.)


You certainly need a #storeload to update the hazard pointer, but do you really need seq_cst? Is a total order of all updates really necessary? Wouldn't, say, an acq_rel exchange be sufficient?

I need to read that article, it seems interesting.


To set a HP on Linux, Folly just does a relaxed load of the src pointer, release store of the HP, compiler-only barrier, and acquire load. (This prevents the compiler from reordering the 2nd load before the store, right? But to my understanding does not prevent a hypothetical CPU reordering of the 2nd load before the store, which seems potentially problematic!)

Then on the GC/reclaim side of things, after protected object pointers are stored, it does a more expensive barrier[0] before acquire-loading the HPs.

I'll admit, I am not confident I understand why this works. I mean, even on x86, loads can be reordered before earlier program-order stores. So it seems like the 2nd check on the protection side could be ineffective. (The non-Linux portable version just uses an atomic_thread_fence SeqCst on both sides, which seems more obviously correct.) And if they don't need the 2nd load on Linux, I'm unclear on why they do it.

[0]: https://github.com/facebook/folly/blob/main/folly/synchroniz...

(This uses either mprotect to force a TLB flush in process-relevant CPUs, or the newer Linux membarrier syscall if available.)


Ah, yes it uses an asynchronous barrier trick. Basically it promotes the compiler barrier to a full barrier. It makes sense for throughput if one side is executed significantly more often than the other like in this case. The cost is latency spikes.


I don't yet understand how the other side promotes a compiler barrier to a full barrier, but I'll take your word for it and try to read more about it later. :-)


"promoting" is just a short-hand, what happens is a bit more complicated.

First of all, remember that the corresponding membar on the collector would only synchronize with the last membar executed on the mutator thread. So if the collector executes significantly less often than the mutator, all the executed mutator membars except the last one are wasted overhead. So ideally we want to elide all mutator membars except those that are actually needed.

What actually happens is that the collector thread remotely executes some code (either directly via a signal or indirectly via mprotect or sys_membar) on the mutator thread that executes the #StoreLoad on its behalf. Sending the required interprocessor interrupt is very expensive, but this is ideally offset by only doing it when truly required.

You can model[1] this as a signal handler executing on the mutator thread that issues an actual atomic_thread_fence to synchronize with the collector, while the mutator itself only need a atomic_signal_fence (i.e. a compiler barrier) to synchronize with the signal handler.

[1] even if this is not necessarily what happens when using mprotect or sys_membar.
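In Rust terms, the model is roughly: the mutator pairs a compiler-only fence with a real fence that is executed remotely on its behalf (sketch only):

  use std::sync::atomic::{compiler_fence, fence, Ordering};

  // Fast path (mutator): store the hazard, compiler barrier, re-load.
  // The CPU-level #StoreLoad is supplied remotely by the collector.
  fn mutator_fast_path() {
      // store to hazard slot ...
      compiler_fence(Ordering::SeqCst);
      // ... re-load the shared pointer
  }

  // What the remotely executed code (signal handler / mprotect / membarrier)
  // conceptually does on the mutator's behalf:
  fn remote_barrier() {
      fence(Ordering::SeqCst);
  }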


Thanks for explaining the details. In my application though (millions of TPS executing in an MVCC system) I just can't wait possibly tens of ms for membarrier(2) to return: way too much garbage could accumulate in the meantime. From my POV this isn't much better than EBR in terms of low/deterministic latency (it is better in terms of fault-tolerance, if you have out-of-proc clients, but I can reliably detect crashed clients anyway via a Unix domain stream socket and clean up their garbage for them).



Yeah, kinda, although this is pretty much the entire discussion on how the asymmetric fence works:

> The slow path can execute its write(s) before making a membarrier syscall. Once the syscall returns, any fast path write that has yet to be visible (hasn’t retired yet), along with every subsequent instruction in program order, started in a state where the slow path’s writes were visible.

(I've actually seen this blog post before, but did not remember this part in detail.)


Completely agree, though the more iconoclastic corollary that goes unspoken there is that putting The Final Word on memory ordering semantics into programming language standards was a terrible mistake.

Memory ordering is a hardware behavior. It needs to be specified at the hardware level, and hardware vendors have been very mixed on clarity. And more importantly, lockless algorithms (that rely on memory ordering control) are really, really hard, and demand clarity over all things. And instead we're crippling those poor programmers with nonsense like "sequentially consistent"[1] or trying to figure out what on earth "consume" means[2].

x86 does this pretty well with their comparatively simple model of serializing instructions. Traditional ARM ISAs did only a little worse by exposing the interface as read/write barrier instructions. Everyone else... meh.

But if you really want to do this (and you probably don't), do the analysis yourself at the level of ISA/architecture/hardware, cite your references, and be prepared to handle whatever portability work is needed on your own.

[1] A statement about desired final state, not hardware behavior!

[2] Nothing, on any hardware you will ever use. Don't ask.


On the contrary, fixing the memory model on a widely used language like C++ forced hardware vendors to get their act together and provide more rigorous memory model explanations. For example intel went from Processor Ordering to TSO, and arm started offering explicit acquire/release operations.

Java had the opportunity as well, but by initially only providing a stronger, mostly sequentially consistent MO, the hardware vendors managed to get away a little longer.


I don't think that's the case with Intel at all, the events are off by a decade at least; do you have a blog or something to cite there? I'd be curious to read it. And as for ARM, "explicit acquire/release" is objectively less informative and harder to reason about than the xxxSB instructions were (and yes, I've used both). ARM went backwards to accommodate C++'s nonsense.

Again, the language standard writers aren't remotely the experts here, the hardware designers are. That C++ invented its own metaphors instead of listening to the experts is a bug, not a feature.


The hardware designers were involved in the standardization process. I don't have citations at hand, I think most of the mailing lists where the discussion re the C++ MO happened have been lost, but (as a lurker trying to learn this stuff) I was following the process closely.

The question was, given PO, whether it was at all possible to recover sequential consistency on Intel either with mfence or a lock xchg, given the possibility of IRIW. Intel then updated their MO to exclude IRIW, de facto standardizing on TSO.

This was early 2000s. I think both ARM and IBM published revisions to their architecture clarifying details around the same time.

This spawned a set of academic papers that proved the correctness of the agreed mapping of the C++ memory model to those architectures.


> The hardware designers were involved on the standardization process.

That sounds cyclic then. You're saying that Intel's SDM was ambiguous[1] (which it was) and that it was sorted out as part of a standardization process. I'm saying that it doesn't really matter what the SDM said, it mattered whether or not you could reliably write lockless code on x86 using the hardware and docs available at the time, and you could. And further, I'm saying that the standard ended up making things worse by perpetuating arguments like this about what some buggy English text in an SDM said and not about actual hardware behavior.

[1] In ways that AFAICT didn't actually impact hardware. I found this, which is probably one of the papers you're citing. It's excellent work in standards-writing, but it's also careful to note that the IRIW cases were never observed on hardware. https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tpho...


It didn't impact hardware because intel hadn't taken advantage yet of the additional latitude offered by their original model. Then they closed the hole and they guaranteed no IRIW[1]. But in the meantime if your algorithm was susceptible to this reordering, there was no written guarantee that an mfence would fix it. But most importantly as the model was informal and not self consistent, there was no possibility to write formal proofs of correctness of an algorithm or run it against a model checker.

[1] in practice this means no store-forwarding from sibling hyper thread store buffers, something that for example POWER allows and is observed in real hardware.


C++ is a bug that can’t be fixed


Although, like many bugs, it's also a feature.


> A typical serial port configuration is "115200 8N1", which means 115,200 baud (or raw bits on the wire per second), with no parity, and 1 stop bit. This means that for every data byte sent, there will be 8 data bits, 1 unused parity bit, and 1 stop bit, to signal the end of the byte, sent over the wire. This means that we will need 40 bits on the wire to receive a 4 data bytes.

8N1 means there is 1 start bit, 8 data bits and 1 stop bit (10 bits total), not 8 data bits, 1 unused parity bit and 1 stop bit (also 10 bits total).


Yep, good catch! That's a whoops in my explanation. I don't work at FS any more, so I'm not sure I could PR that change, but you're definitely right :)


My favorite ring buffer structure is like the one described in https://www.gnuradio.org/blog/2017-01-05-buffers/

> It asks the operating system to give it memory-mappable memory, and map twice, “back to back”. Blocks then get called with pointers to the “earliest” position of a workload within this memory region – guaranteeing that they can access all memory they were offered in a linear fashion, as if they’d actually be dealing with hardware ring buffers.

It imposes limitations on hardware and OS support in a way, but I think it's pretty neat.

This also used by the kernel BPF ring buffer:

https://www.kernel.org/doc/html/latest/bpf/ringbuf.html

> ...data area is mapped twice contiguously back-to-back in the virtual memory. This allows to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory
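For the curious, a Linux-only sketch of that trick using the libc crate (memfd_create plus two MAP_FIXED mappings over one address-space reservation; error handling and unmapping omitted, and `len` must be page-aligned):

  use std::ptr;

  unsafe fn mirrored_ring(len: usize) -> *mut u8 {
      // Backing memory object of `len` bytes.
      let fd = libc::memfd_create(b"ring\0".as_ptr() as *const libc::c_char, 0);
      libc::ftruncate(fd, len as libc::off_t);

      // Reserve 2*len of contiguous virtual address space...
      let base = libc::mmap(ptr::null_mut(), 2 * len, libc::PROT_NONE,
          libc::MAP_PRIVATE | libc::MAP_ANONYMOUS, -1, 0);
      // ...then map the same pages into both halves, back to back.
      libc::mmap(base, len, libc::PROT_READ | libc::PROT_WRITE,
          libc::MAP_SHARED | libc::MAP_FIXED, fd, 0);
      libc::mmap((base as *mut u8).add(len) as *mut libc::c_void, len,
          libc::PROT_READ | libc::PROT_WRITE,
          libc::MAP_SHARED | libc::MAP_FIXED, fd, 0);

      // A write at offset i is now visible at offset i + len, so any window
      // of `len` bytes is contiguous in virtual memory.
      base as *mut u8
  }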


Unfortunately that means the chunks are only contiguous in virtual memory. So it won't work with the DMA use case mentioned in the article, which requires contiguous physical addresses.

But it's still a nice trick, I like it when people get creative with HW features.


Theoretically, the same trick could be used to double map addresses coming from an external master via an IOMMU.


It is unlikely that cheap microcontrollers where DMA is most helpful will have an IOMMU.


But the hardware only needs to see one copy of the duplicated memory and you can let it deal with the wraparound. The software can use the convenience of the double mapping.


AKA the Magic Ring Buffer. It's extremely convenient not having to deal with split messages, especially if you support variable-size payloads.


This is a cool trick, and iirc there are a few Rust crates that implement it, including slice-deque.

...but I think there are a few significant downsides, even in userspace:

* The most obvious one: you have to write `unsafe`, non-portable code. The Linux, macOS, and Windows implementations are totally different from each other.

* The setup and teardown of each ring is a bit expensive (few system calls). For a regular allocation, the malloc implementation typically caches that for you. Here you have to do your own pooling if you might be frequently creating them.

* Using whole pages, and two mappings per ring, is wasteful in terms of not only RAM (often no big deal) but also TLB space (which often turns into significant CPU usage). If you just allocate 32 64 KiB rings from the standard allocator, on x86-64 you might be talking about a single 2 MiB huge page mapping. If you do this instead, you're talking about 1024 4 KiB page mappings.


Any real, production ready ring buffer should be using unsafe. I would consider anything that doesn't to be a toy.

In Rust it's basically impossible to do this without MaybeUninit. You could use Option, but then you're paying a massive cost for a very easy to write and audit chunk of unsafe code.


I don't think it's useful to consider "uses `unsafe`" as a boolean. A one-line `unsafe` around `MaybeUninit::assume_init` isn't the same as an `unsafe` module per platform wrapping VM operations.

Also, it's not that crazy for a byte-oriented buffer to start with a `vec![0u8; N]` (cheap) and not need `MaybeUninit` at all. Probably doesn't buy you that much though; you still want to be careful to not leak previous bytes.

Also, you might be missing the point of my comment if you're responding to one word of "the most obvious [downside]" and not the other bullets...


It's not an attack on the wording, but the correctness of your first bullet point. `unsafe` is appropriate for the initialization of a ring buffer in Rust. That's true for using `mmap` or anything in "pure" Rust using the allocator API to get the most idiomatic representation (which can't be done in safe or stable Rust). It's not one line. It's also not platform dependent, the code is the same on MacOS, Linux, and Windows the last I tried it.

The rest of the bullet points are issues with scaling, which, sure, are valid. But if your bottleneck is determined by the frequency at which channels get created or how many exist, then I would call the architecture into question. A ring buffer is a heavy hammer for synchronization problems. It's appropriate in many applications, but not many times within the same application, in my experience.

This last month I've written a lock-free ring buffer to solve a problem and there's exactly one in an application that spawns millions of concurrent tasks.


> It's not an attack on the wording, but the correctness of your first bullet point. `unsafe` is appropriate for the initialization of a ring buffer in Rust. That's true for using `mmap` or anything in "pure" Rust using the allocator API to get the most idiomatic representation (which can't be done in safe or stable Rust). It's not one line. It's also not platform dependent, the code is the same on MacOS, Linux, and Windows the last I tried it.

We're not talking about the same thing then.

I'm talking about setting up a mirrored allocation, as in this code here: <https://github.com/gnzlbg/slice_deque/tree/master/src/mirror...> or this code here: <https://github.com/mmaroti/vmcircbuf/blob/743e1f3622641ee281...> or this code here: <https://github.com/kalamay/vmap-rs/blob/b8a5f9c819b4dd41a5b7...> It is absolutely platform specific, in three different crates implementing the same idea...

Yes, most ring buffer implementations feature a little bit of `unsafe` code. No, it doesn't make sense to say "I have a tiny amount of `unsafe` already, so adding more has no cost."

> But if your bottleneck is determined by the frequency at which channels get created or how many exist then I would call architecture into the question. ... This last month I've written a lock-free ring buffer to solve a problem and there's exactly one in an application that spawns millions of concurrent tasks.

A lot of applications or libraries are written to support many connections, and you don't necessarily know when writing the code (or even when your server accepts an inbound connection) if those connections will be just cycled very quickly or will be high-throughput long-lived affairs. Each of those probably has a send buffer and a receive buffer. So while it might make sense for your application to have a single ring buffer for its life, applications which churn through them heavily are common and completely valid.

Sometimes folks do go a bit crazy with this. I question whether this XML parser <https://github.com/tvbeat/rapid-xml/blob/7dbffab5a25487221b2...> really needed a mirrored ring buffer implementation here, and for small documents the cost of its setup more than outweighs the significant effort they put into making this parser fast with SIMD operations. But then again, they probably optimized it for large documents, and naturally it supports both...

> A ringbuffer is a heavy hammer to synchronization problems.

While the implementation in the ferrous-systems.com article is a "high-perf lock-free ring-buffer for cross-thread communication", cross-thread synchronization isn't the only use for ring buffers. They're great for connections' send and receive buffers, as mentioned above. None of the ring buffers in the crates I linked to are `Sync`; they're meant to be used by one thread at a time.


Ah, the VM trick! Have used it at current (and previous) places of work to great effect.


Yup. The other "official" name of it is The Magic Ring Buffer as a sibling comment mentioned.


> Contended writes from multiple threads on the same memory location are a lot harder for the CPU's cache coherence protocol to handle

FWIW, those are the same location according to most cache coherency protocols, since cache coherency generally works on the cache line level. You'd want to split the two contexts to their own cache lines.


Another cache optimization trick some ring-buffer implementations use is to keep a shadow copy of the read or write pointer to avoid frequently fetching the other context's cache line. The latest version of the read pointer is only needed when the writer catches up with their shadow copy and vice versa.
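A sketch of that caching on the producer side of an SPSC ring (my own illustration, using free-running indices):

  use std::sync::atomic::{AtomicUsize, Ordering};

  struct Producer<'a> {
      write: &'a AtomicUsize,  // shared index, written by the producer
      read: &'a AtomicUsize,   // shared index, written by the consumer
      cached_read: usize,      // shadow copy of `read`, private to the producer
      capacity: usize,
  }

  impl<'a> Producer<'a> {
      fn has_room(&mut self, n: usize) -> bool {
          let w = self.write.load(Ordering::Relaxed);
          // Check against the stale shadow first; only refresh (and pull the
          // consumer's cache line over) when the buffer looks full.
          if w.wrapping_sub(self.cached_read) + n > self.capacity {
              self.cached_read = self.read.load(Ordering::Acquire);
          }
          w.wrapping_sub(self.cached_read) + n <= self.capacity
      }
  }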


Yes this is absolutely crucial for performance.


Bipartite buffers are amazing and criminally underused. For those looking for C and C++ implementations, you can check out my libraries, lfbb and lockfree: https://github.com/DNedic/lfbb, https://github.com/DNedic/lockfree (although lockfree contains more data structures as well)


I tried to write a lock-free ring buffer with weak atomics. I haven't proved it right with TLA+ yet, but I started writing a model. I use tagging to avoid the ABA problem.

They're all on https://github.com/samsquire/assembly. I tried to write a disruptor with multiple consumers, then one with multiple producers, then one with multiple consumers AND multiple producers, inspired by LMAX Disruptor. (There are files for each of them and a table in the repo. It's not proven yet!)

The contention on the same memory address (the read/write index) is the thing that seems difficult to address.

One thing I've learned about thread safety:

I think if you have thread-owned values then you can be thread safe with a simple semaphore, providing that you have unique, DISTINCT values for each thread.

If you have two threads that have this in a hot loop in parallel:

  // thread 0                        
  if buffer[x].available == 1:
    // do stuff
    buffer[x].available = 0           

  // thread 1
  if buffer[x].available == 0:
    // do stuff
    buffer[x].available = 1
Due to causality, no matter the interleaving, thread 0 owns buffer[x].available and the body of the if statement when it is 1, and thread 1 owns buffer[x].available and the body of the if statement when it is 0.

The CMP is a cheap mutex with distinct valued memory locations.

Even though thread 1 is writing to buffer[x].available and thread 0 is writing to buffer[x].available, it doesn't matter because the causality is mutually exclusive. There is no interleaving of the buffer[x].available assignments because of the if statement.

The buffer[x].available = 0 will never run while buffer[x].available is equal to 0, so it cannot overwrite or race with thread 1 setting buffer[x].available to 1. So the two assignments cannot happen in parallel.

I need to write a TLA model to assert its safety.

If you have more than 2 threads, then you need different tokens to provide admissability to the if statement.

Remember to use compiler memory barrier

   asm volatile ("" ::: "memory");  
so you don't need volatile struct values.


The problem here is the

  [1] // Do stuff
  [2] buffer[x].available = y
There is no explicit or implicit ordering between 1 and 2, so the compiler or CPU can reorder them. You need a release barrier between the two.

Also while most CPUs preserve the control dependency, not all do (famously Alpha), and certainly not compilers. You would need a consume barrier, except that c++11 consume is only for data dependencies and unimplemented anyway.

Edit: with the correct barriers in place, you can prove correctness by similitude to two size 1 SPSC queues used to exchange a mutual exclusion token, with the added quirk that as the queues are never used at the same time, they can actually be physically colocated in memory.
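Concretely, with language-level atomics the corrected flag is an acquire load guarding the body and a release store handing the slot back (Rust sketch of one side; the other thread mirrors it with 0 and 1 swapped):

  use std::cell::UnsafeCell;
  use std::sync::atomic::{AtomicU32, Ordering};

  struct Slot {
      available: AtomicU32,
      data: UnsafeCell<u64>,  // stand-in for the real payload
  }
  unsafe impl Sync for Slot {}

  fn thread0_step(slot: &Slot) {
      if slot.available.load(Ordering::Acquire) == 1 {  // acquire: see thread 1's writes
          unsafe { *slot.data.get() += 1 };             // "do stuff"
          slot.available.store(0, Ordering::Release);   // release: publish before handing over
      }
  }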


Thank you for sharing your knowledge, gpderetta, appreciated. TIL.


> The buffer[x].available = 0 will never run while buffer[x].available is equal to 0 overwriting or causing a data race when setting buffer[x].available to 1.

In particular, because loads and stores of the same variable cannot be reordered out of program order. Once your algorithm involves other variables, you would (likely) need to be a little careful about loading/storing with acquire/release semantics to prevent reordering other accesses relative to this protocol.

> Remember to use compiler memory barrier

I would highly recommend using the language atomic types (and barriers if truly needed) instead of gcc inline assembly syntax.


Thanks for your reply. This subject is still new to me.

My understanding of that syntax is that it is a compiler memory barrier, not a CPU memory barrier because the asm block is empty (no sfence or mfence).


Hey, no problem.

> My understanding of that syntax is that it is a compiler memory barrier, not a CPU memory barrier because the asm block is empty (no sfence or mfence).

In C11, you can write compiler-only fences with atomic_signal_fence:

https://en.cppreference.com/w/c/atomic/atomic_signal_fence

(In practice, though, I think it is rare that you actually want a compiler-only fence. Instead, correct use of acquire/release operations prevents reorderings.)


Thank you loeg, I appreciate you and the information you brought. TIL.

I've been using a compiler fence to force reloads from memory, to prevent -O3 from optimising away my variables/structs being changed by other threads and keeping data in registers rather than reloading from memory each time. I saw volatile recommended against by the Linux kernel programmers.

such as my thread->running == 1 in my event loops for my threads.

https://www.kernel.org/doc/html/latest/process/volatile-cons...


> I've been using a compiler fence to force reloads from memory to prevent -O3 from optimising away my variables/structs changing by other threads

I would highly recommend using the language standard atomic primitives instead.


Regarding the contention, one thing that's important is to cache a local copy of the shared head and tail variables every time you access them. Then for subsequent operations you can first check the local cached copy to see if you can perform the read or write without needing to check the shared variables.


When you check available, you might have to do it as __atomic_load_n(&sender->realend, __ATOMIC_SEQ_CST), and use __atomic_store_n when setting available.

rather than just a plain load.


> In Andrea's implementation of the lock-free ring-buffer, spsc-bip-buffer, some of the orderings are relaxed for performance. This has the downside that it can introduce subtle concurrency bugs that may only show up on some platform (ARM, for example): to be a bit more confident that everything's still fine, Andrea has continuous integration tests both on x86 and ARM.

It might be worth testing/forking the library to test on Loom (https://github.com/tokio-rs/loom), which can model atomic orderings and check for concurrency errors to some degree (though I last used it years ago). TSAN might be able to check for ordering errors in visited execution traces (though I haven't tried using it in the past).


See also the Java LMAX Disruptor https://github.com/LMAX-Exchange/disruptor

I've built a similar lock-free ring buffer in C++11 https://github.com/posterior/loom/blob/master/doc/adapting.m...


I also wrote an LMAX Disruptor in Crystal: https://github.com/nolantait/disruptor.cr

Here is one in Ruby: https://github.com/ileitch/disruptor

Both languages are quite readable and I've used these to teach the concepts to beginners.


I have to say, after looking at various DMA hardware, I much prefer the scatter-gather list type to the ring type of DMA.

The entire need for a bip-buffer allocation for DMA then goes away. You can have a simple pool of fixed-sized blocks to throw at the DMA. Pretty easily done with a free list.

I do think the implementation here is cool though, and it's nice to see some work in this area.


To make scatter/gather go fast, you either spend a lot of effort caching descriptor lists for pinned buffers (because you expect to see them often), or heavily optimising your VM's v2p translation machinery, or some combination of the two.

And then you wind up discovering that you the driver writer aren't actually trusted and so you need to insert at least one if not several IOMMUs between the peripheral and the memory(ies) that they may access, managed by another software component in a different address space.

Then someone asks you to make all of this work for clients in VMs.

At which point you start wondering why you didn't just allocate a physically contiguous buffer at startup and copy to/from your client buffers using the CPU...

Sorry for sounding triggered... 8)


No need to apologize at all. I haven’t seen these issues. But I’ve worked with this setup without mmu involvement.


I was faced with the same constraints a couple of years ago and came up with an almost verbatim solution. It's an interesting problem where a lot of very subtle bugs can happen, it's good to know that I went with the same solution as people who put a lot more effort into making sure it is correct.


This is a great article! Very detailed, and it explains things at a low level.

For a more high-level implementation, I released a blog post yesterday about a ring buffer in Golang: https://logdy.dev/blog/post/ring-buffer-in-golang


That watermark is a simple, elegant idea.

I haven't really had the need for that kind of thing, in many years, but I like the idea.


Recently (2022) I designed a ring buffer for use as a 'flight recorder' type tracing system. (I.e., there is typically no reader; the writer needs to write over old records without blocking on any reader. If the reader is triggered, it flips a switch that disables the writer temporarily while the buffer contents are persisted.) In that design I subdivided the ring into several subbuffers (~8). Each subbuffer has its own equivalent of a watermark. That way, the valid portion of the ring always starts at the beginning of one of the subbuffers, and the writer could 'free' the next subbuffer worth of space trivially (without having to scan through old contents record by record). (Any write that did not fit in the current subbuffer was advanced to the start of the next one.)


TIL about BipBuffers. I've been struggling with a similar data structure, and to see it already has a name, and a better implementation than what I've been doing, is very welcome.


Aren't lock-free buffers usually just as expensive to use as locks, or more expensive?


No -- for a lock-free design with a single producer and consumer, the two sides typically write to independent regions of memory. With a lock, both have to write the same cache line to take and release the lock.


Not if your program needs to be realtime or near realtime safe. Locks are controlled by the OS typically, and can have non-deterministic latency.


Even ignoring mutexes and OS scheduling, plain spinlocks add contention that would not otherwise exist in a SPSC ringbuffer. https://news.ycombinator.com/item?id=39551575


Lock free has two advantages: the checking code can run in user mode and non-contested access is very cheap with just one instruction.

To do it correctly, locking needs to be done in the kernel, so obtaining a lock requires calling into the kernel, which is more expensive.

I think you meant the memory barrier for syncing cache is just as expensive as the lock version, which is true.


Obtaining an uncontested lock absolutely doesn't require calling into the kernel


How can you give hard guarantees of that on Windows, Mac, and Linux with the OS- and/or libc-provided locks?


If you really really really need such a guarantee, you implement your own.

Otherwise you inspect the implementation, but in 2024 a fast-pathed OS lock is table stakes.


Rust (which is what we're discussing here) actually doesn't promise this in general. But for the three operating systems you mentioned that is in fact what it delivers because as another commenter mentioned it's table stakes. If your OS can't do this it's a toy OS.

The Windows and Linux solutions are by Mara Bos (the MacOS one might be too, I don't know)

The Windows one is very elegant but opaque. Basically Microsoft provides an appropriate API ("Slim Reader/Writer Locks") and Mara's code just uses that API.

The Linux one shows exactly how to use a Futex: if you know what a futex is, yeah, Rust just uses a futex. If you don't, go read about the Futex, it's clever.


Because not doing that these days would be malpractice for an OS of that caliber.


No, where are you getting that information?


Awesome article! Bookmarked.


Sounds a little like LMAX architecture.



