Has there been a formal specification/model that we can check? I love circular buffers but I'm curious how we know that the design is correct with respect to the stated properties.
I did a bit of link diving to the various blog posts and sites but haven't been able to find it. Would be nice, if it exists, to have it front and centre.
As far as I know, it hasn't been formally verified.
Andrea Lattuada (https://andrea.lattuada.me/) is more likely to have done that kind of work on his implementation of the algorithm than I have on mine.
I have run the tests under various dynamic analysis tools, but that certainly isn't formal verification. If someone is interested in doing it, I'm happy to chat with them!
Please excuse me if my question is dumb, I have little understanding of Rust. You declare your implementation "lock-free", but you are using Rust's Atomic* primitives at the same time. Those use CPU atomic instructions, if I googled correctly, and those in turn use the CPU's locking mechanism. So it turns out that you just shifted the locking from the language to the CPU, right?
Yes, this is correct, and it's the most overlooked aspect and the reason for the misnomer. Atomics imply locking at the CPU. Depending on the CPU, the lock happens either for the entire memory bus (pre-P6 Intel on x86) or via snoop-disable bits on the relevant cache lines in the snooping protocol.
“In discussing the question, he used to liken the case to that of the boy who, when asked how many legs his calf would have if he called its tail a leg, replied, 'Five,' to which the prompt response was made that calling the tail a leg would not make it a leg.”
The "locking" happening at the CPU level is very different from software-level locking, as it is guaranteed by the hardware to be bounded (technically, bus arbiters are not guaranteed to complete in finite time, but failing to do so is astronomically unlikely).
I used the same approach while designing a lock-free bounded broadcast log (as in a "RWLock<Vec<T>>"; an MPMC, append-only Vec). It's quite easy to do because it's bounded. However, I could not find a way to make it both unbounded and efficient.
It's very hard to beat locking when the queue needs to grow. It comes down to the performance statistics: you're going to grow more often as the app warms up, reach a steady state, and only need to grow again under heavier-than-expected load. And in that case you probably aren't going to see performance dominated by time spent waiting to acquire a write lock.
The alternative is to dive into the literature on lock-free mpmc queues, which are kind of gnarly to implement. A lot of the literature handwaves ABA problems with stuff like "use hazard pointers and RCU" without correct pseudo code to help you.
That's why locking in unbounded queues is popular, imo. It's not actually inefficient, and the alternatives are arcane enough to avoid. No one is going to understand or trust the code anyway.
It's worth mentioning that "lock free" means "one task does not prevent other tasks from making progress", and in the case of a bounded queue you can trivially accomplish that by busy-waiting when the queue is full. This isn't appropriate if you need consumers to receive events in the order in which they happen, but you can kind of fix that using an atomic counter (to paraphrase Mike Acton: if you have a safe counter, you have a safe queue) and time-stamping events/sorting on which ones are consumed.
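Here's a minimal sketch of that busy-wait idea, for the single-producer/single-consumer case (names and layout are made up for illustration, not anyone's actual implementation):

    use std::cell::UnsafeCell;
    use std::sync::atomic::{AtomicUsize, Ordering};

    // Bounded SPSC ring: exactly one thread may call push() and one may call
    // pop(); that restriction is what makes the unsynchronized slot accesses sound.
    pub struct Ring<const N: usize> {
        slots: [UnsafeCell<u32>; N],
        head: AtomicUsize, // next slot to write (producer-owned)
        tail: AtomicUsize, // next slot to read (consumer-owned)
    }

    unsafe impl<const N: usize> Sync for Ring<N> {}

    impl<const N: usize> Ring<N> {
        pub fn new() -> Self {
            Self {
                slots: std::array::from_fn(|_| UnsafeCell::new(0)),
                head: AtomicUsize::new(0),
                tail: AtomicUsize::new(0),
            }
        }

        pub fn push(&self, value: u32) {
            let head = self.head.load(Ordering::Relaxed);
            // Busy-wait while the ring is full: the producer spins, but the
            // consumer can still make progress and drain a slot.
            while head.wrapping_sub(self.tail.load(Ordering::Acquire)) >= N {
                std::hint::spin_loop();
            }
            unsafe { *self.slots[head % N].get() = value };
            self.head.store(head.wrapping_add(1), Ordering::Release);
        }

        pub fn pop(&self) -> Option<u32> {
            let tail = self.tail.load(Ordering::Relaxed);
            if tail == self.head.load(Ordering::Acquire) {
                return None; // empty
            }
            let value = unsafe { *self.slots[tail % N].get() };
            self.tail.store(tail.wrapping_add(1), Ordering::Release);
            Some(value)
        }
    }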
> The alternative is to dive into the literature on lock-free mpmc queues, which are kind of gnarly to implement. A lot of the literature handwaves ABA problems with stuff like "use hazard pointers and RCU" without correct pseudo code to help you.
Yes, and I do think they are fundamentally incorrect, or at least that they cannot be translated to the log use case.
I also agree that, performance-wise, the lock makes little difference. In fact, my old benchmarks tended to show that locks were MORE EFFICIENT than atomics on ARM64 (MBP M1), for example. It's more of a fun little exercise, and also a way to confirm that I'm not completely dumb and that the problem is not solvable with the simple use of 2 atomics and a counter.
I'm no expert here, but I wonder if a linked list of bounded logs would work well.
So, everyone has a pointer to the current one, and there's an atomic pointer to the next.
When the log fills up, you allocate a new one and set the next pointer to it using compare_and_swap. If someone beat you to it, you can walk the list and add yours to the end so that the allocation isn't wasted.
This way, most of the time you're using the efficient bounded log.
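A rough sketch of that append path, assuming a `Segment` type for each bounded log (hypothetical names; the segment contents and memory reclamation are left out):

    use std::ptr;
    use std::sync::atomic::{AtomicPtr, Ordering};

    struct Segment {
        // ... the bounded log itself would live here, fixed once allocated ...
        next: AtomicPtr<Segment>,
    }

    // Try to install `fresh` as the next segment after `current`. If another
    // writer beat us to it, walk forward and retry at the new tail so the
    // allocation isn't wasted. Returns the segment that ended up installed.
    fn append_segment(mut current: &Segment, fresh: Box<Segment>) -> &Segment {
        let fresh = Box::into_raw(fresh);
        loop {
            match current.next.compare_exchange(
                ptr::null_mut(),
                fresh,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return unsafe { &*fresh },
                Err(next) => current = unsafe { &*next },
            }
        }
    }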
In the case of a bounded log, readers are expected to give an offset at which they want to perform the read (Kafka-like).
So the linked list would involve going through all the links and following the tail refs/pointers to the end. That would make reading O(n), and that's a nope.
However, you could imagine having a Vec index that contains a ref to every allocated inner log, and querying the index first in order to obtain a buffer's location. That works, but then the index has to go through a lock (either an RwLock or a Mutex), as a CAS operation isn't enough if we get to the end of the index and it needs to be reallocated. It's fine, and I think that's the most appropriate solution.
PS: In fact, there is a sweet spot where you'd like to have a semi-unbounded log. If your index is big enough to contain something like 4 billion entries, you'd probably end up splitting the log into several pieces for archiving and performance purposes. Loading the log (fully or partially) from disk efficiently is then more important than having a truly unbounded log. Then you would not necessarily use a lock and could CAS into the index.
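For what it's worth, a sketch of the index-behind-a-lock idea above (hypothetical types; the append path and error handling are omitted): readers take the read lock only long enough to find the chunk, and the chunks themselves never move once allocated.

    use std::sync::RwLock;

    const CHUNK: usize = 1024;

    struct Log<T> {
        // The index may be reallocated (under the lock) as the log grows, but
        // each boxed chunk stays at a fixed address for the life of the log,
        // so a real implementation could hand out stable references.
        index: RwLock<Vec<Box<[Option<T>; CHUNK]>>>,
    }

    impl<T: Clone> Log<T> {
        fn read(&self, offset: usize) -> Option<T> {
            let index = self.index.read().unwrap();
            let chunk = index.get(offset / CHUNK)?;
            chunk[offset % CHUNK].clone()
        }
    }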
I am not sure! Most of the data structures I design are for embedded systems without allocators. On the desktop, I mostly defer to others.
I've used tokio's broadcast channel quite a bit before, but it is also bounded. After talking to Eliza from the tokio project, I'm fairly convinced that unbounded queues are a scary thing to have around, operationally :).
But again - this is a bit out of my actual expertise!
Not sure if it could be extended here, but I've seen a lock free hash map that supported lock free reallocation by allocating new space and moving each entry one by one, either when the entry is accessed or in a separate thread concurrently. Accessing an entry during the reallocation would check the new region first, and if not found check the old region. Entries in the old region would be marked as moved and once all entries were moved the old allocation could be freed.
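If it helps, here is a sketch of just the read path of such a scheme, under assumptions that aren't in the comment above: slots are open-addressed AtomicU64s packing a nonzero 32-bit key and a 32-bit value, MOVED marks already-migrated entries, and insertion, the migration itself, and freeing the old table (which needs hazard pointers or epochs) are all omitted.

    use std::sync::atomic::{AtomicU64, Ordering};

    const EMPTY: u64 = 0;          // keys are assumed nonzero so 0 can mean "empty"
    const MOVED: u64 = u64::MAX;   // entry already migrated to the new table

    fn probe(table: &[AtomicU64], key: u32) -> Option<u32> {
        let mask = table.len() - 1; // table length is a power of two
        let mut i = (key as usize).wrapping_mul(0x9E37_79B9) & mask;
        for _ in 0..table.len() {
            match table[i].load(Ordering::Acquire) {
                EMPTY => return None,
                e if e != MOVED && (e >> 32) as u32 == key => return Some(e as u32),
                _ => i = (i + 1) & mask,
            }
        }
        None
    }

    // During a resize, lookups check the new region first and fall back to
    // the old one, exactly as described above.
    fn get_during_resize(new: &[AtomicU64], old: &[AtomicU64], key: u32) -> Option<u32> {
        probe(new, key).or_else(|| probe(old, key))
    }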
For the (un)bounded logs, the whole concept rests on the fact that the log isn't going to move once allocated, and that references to an item will never be invalidated until the end of the program.
I see a lot of critique in the previous (2019) thread, but no summary of the key points. What are the reasons that this is hard?
In multiprocessor systems, are memory writes not guaranteed to be sequential? e.g. can data be written out-of-order, after the synchronization bit is written?
Or is it more about the use case, i.e. that it is optimized for a minimal number of instructions, e.g. avoiding additional memory reads (by writing data into the buffer instead of storing pointers)?
Or is it that you're trying to use constant memory (irrespective of number of threads)?
Because to me, it seems like a trivial problem to solve if you have sequential writes, store data separately from the queue, and can let memory scale linearly with the number of threads.
ELI5: We have a shared chunk of memory, with one sender, and one receiver. We want this to work without an allocator.
Instead of pushing and popping one byte/thing at a time (inefficient, a lot of overhead), we want to have the ability to push or pop big chunks at a time.
We also don't want to pay to copy these chunks in or out, because we're going to have the hardware do it for us autonomously. This means we have to be very careful that the "one side" doesn't look or touch a chunk of memory currently being used by the "other side".
The ideal flow is that we have the CPU get some "writing space", the CPU asks DMA to fill it up, when DMA is done, the CPU marks it as "ready to read". Then at some later time, the CPU (maybe another core or thread) asks for some "ready to read space", the CPU then either uses it, or maybe asks DMA to copy it somewhere else. Then the CPU marks that data as "successfully read", and the space can be recycled the next time something wants to send.
The trick is how you coordinate access to shared memory between two CPUs, correctly, and using the least overhead, so neither side prevents the other from reading or writing, as long as there's a little slack in the line.
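Roughly how that flow looks in code with bbqueue (method names from memory, and the exact signatures have shifted between versions, so treat this as a sketch rather than gospel):

    use bbqueue::BBBuffer;

    // Statically allocated, so it works without an allocator.
    static BB: BBBuffer<1024> = BBBuffer::new();

    fn main() {
        let (mut prod, mut cons) = BB.try_split().unwrap();

        // Get some "writing space"...
        let mut wgr = prod.grant_exact(64).unwrap();
        // ...fill it (this is where DMA would write in the real use case)...
        wgr.buf().fill(0xAB);
        // ...and mark it "ready to read".
        wgr.commit(64);

        // Later, possibly on another core/thread: ask for "ready to read" space,
        let rgr = cons.read().unwrap();
        assert_eq!(rgr.buf().len(), 64);
        // use it (or hand it to DMA), then mark it "successfully read" so the
        // space can be recycled for the next send.
        rgr.release(64);
    }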
Seems to me that the main critique was from someone who didn't understand the specified problem, and complained that this didn't solve a more general one.
(... and probably wasn't aware that the assumptions made in the specification can be encoded in the API signature in Rust. This wouldn't be possible in C, which is why C forces you to solve the harder problem or rely on your users to not accidentally misuse the data structure.)
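For instance (hypothetical types, just to illustrate the point): a split that consumes the queue and hands out exactly one Producer and one Consumer, neither of which is Clone, makes the single-producer/single-consumer assumption a compile-time guarantee rather than a comment in a header.

    struct Queue { /* storage, read/write indices, ... */ }
    struct Producer { /* handle to the producer half */ }
    struct Consumer { /* handle to the consumer half */ }

    impl Queue {
        // Consuming `self` means the raw queue can no longer be touched
        // directly, and since neither half implements Clone or Copy, there is
        // at most one producer and one consumer by construction.
        fn split(self) -> (Producer, Consumer) {
            (Producer {}, Consumer {})
        }
    }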
if buffer.len.saturating_sub(buffer.write.load(Ordering::Acquire)) >= write_len {
    // not shown: check `read` to make sure there's enough free room
    buffer.watermark.store(buffer.write.load(Ordering::Acquire) + write_len, Ordering::Release);
    buffer.write.store(buffer.write.load(Ordering::Acquire) + write_len, Ordering::Release);
}
I would have to look at the implementation to know for sure, but that part looks incorrect to me. Suppose that the writer has placed a watermark strictly before the end of the buffer, wrapped around, and is about to write a new message; meanwhile, the reader has not reached the watermark yet (and therefore, has not wrapped around), but it has made enough progress to leave room for the writer's new message. In that case, we have write <= write + write_len < read <= watermark < len. As written, it would seem that the snippet above would incorrectly update watermark, whose value is still needed by the reader.
It seems to me that watermark need not be atomic anyway: it is owned by the writer when read <= write, and by the reader when write < read. Whoever wraps around implicitly passes the ownership to the other thread. More precisely:
* In the initial state, and until the writer reaches the end of the buffer, read <= write, and the writer owns watermark, whose value is irrelevant. In particular, the reader makes sure not to overtake write, and knows that it must not access watermark.
* When the writer needs to wrap around, it first sets the watermark (possibly at len, if the hole is empty), then updates write. With its usual release memory ordering constraint, the atomic store to write transfers to the reader the ownership of the written messages, but also of watermark.
* Eventually, the reader notices that write has wrapped around, and then uses watermark to know where to stop reading. Meanwhile, the writer may fill the beginning of the buffer, carefully not overtaking read - 1, and never touching watermark.
* When the reader finally wraps around, the atomic write to read transfers to the writer the ownership of the consumed message slots, but also of watermark, and we are back in the first phase.
IOW, watermark has the same owner as the last element of the array.
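A sketch of the writer's wrap-around step under that ownership argument (field names taken from the snippet above, everything else illustrative; `watermark` is kept atomic here only so the sketch avoids unsafe code):

    use std::sync::atomic::{AtomicUsize, Ordering};

    struct Buffer {
        len: usize,
        write: AtomicUsize,
        read: AtomicUsize,
        watermark: AtomicUsize, // could be a plain cell, per the argument above
    }

    impl Buffer {
        // Writer side: the next message doesn't fit before `len`, so wrap.
        // Precondition (not checked): read <= write, i.e. the writer currently
        // owns `watermark`.
        fn wrap(&self) {
            let write = self.write.load(Ordering::Relaxed);
            // 1. Set the watermark first: [watermark, len) is the hole the
            //    reader must skip (watermark == len means the hole is empty).
            self.watermark.store(write, Ordering::Relaxed);
            // 2. Then publish the wrap. The release store on `write` is what
            //    hands the reader ownership of the written messages and of
            //    `watermark`.
            self.write.store(0, Ordering::Release);
            // From here the writer fills the front of the buffer, careful not
            // to overtake `read - 1`, and doesn't touch `watermark` again
            // until the reader wraps and hands ownership back.
        }
    }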
This post has been discussed here a couple times, but AMA :)
edit: the most commented version of this post was the original:
https://news.ycombinator.com/item?id=20096946.
This is what I'm up to these days:
https://onevariable.com/