Although dragontamer's complaints are mostly correct, I think he's being a little uncharitable and is largely missing the point.
Yes, writing lock-free code that's generic, that supports many-to-many reads/writes, and that's correct on all architectures is hilariously hard, and most people should not try to implement these from scratch at work. Other approaches can be more performant, and can have acceptable trade-offs for most use cases.
Are there issues with this one? Maybe. As dragontamer stated multiple times, this stuff is pretty hard to reason about.
HOWEVER, this ring buffer was designed to run on embedded systems. As usual, the constraints are a little different when you're writing software for a microcontroller.
As an example, let's imagine you have periodic bursts of data coming in on a serial port at 2 Mbps, and your microcontroller is running at 32 MHz. Also, your serial port only has a 4 byte hardware buffer, and the cost of missing a byte is... I don't know, sending an endmill through the wall.
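(For scale: at 2 Mbps with the usual 10 bits on the wire per byte, a byte arrives roughly every 5 µs, which is about 160 CPU cycles at 32 MHz; with a 4 byte FIFO that leaves somewhere around 20 µs, call it 600-odd cycles, from the first byte landing before data starts falling on the floor.)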
You can absolutely make this work, and you can even formally verify that you'll always meet timing, but you're going to have a really hard time doing this if the buffer is guarded by a lock and you also have to analyze every other piece of code that will ever try to acquire that lock. A single-producer, single-consumer, lock-free queue with fixed message sizes can be implemented correctly in about 20 lines of C. It has a lot of annoying constraints, and you still have to use atomics correctly, and you still need a memory barrier, and don't even think about resetting the queue from either thread, and... (etc).
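Something along these lines, for concreteness (just a sketch, assuming C11 <stdatomic.h>, a power-of-two number of fixed-size slots, and exactly one producer and one consumer; the sizes and names here are made up for illustration, not drop-in production code):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>

    #define QUEUE_LEN 16u   /* number of slots, must be a power of two */
    #define MSG_SIZE   8u   /* fixed message size in bytes */

    static unsigned char queue_buf[QUEUE_LEN][MSG_SIZE];
    static atomic_uint queue_wr;   /* only ever written by the producer */
    static atomic_uint queue_rd;   /* only ever written by the consumer */

    /* Producer side, e.g. the UART RX interrupt. */
    bool queue_push(const unsigned char *msg)
    {
        unsigned wr = atomic_load(&queue_wr);
        unsigned rd = atomic_load(&queue_rd);

        if (wr - rd == QUEUE_LEN)          /* full */
            return false;

        memcpy(queue_buf[wr % QUEUE_LEN], msg, MSG_SIZE);
        /* Compiler-only fence: keeps the copy ahead of the index update.
           The default (seq_cst) atomic_store below already guarantees that
           ordering, so this is belt-and-suspenders. */
        atomic_signal_fence(memory_order_seq_cst);
        atomic_store(&queue_wr, wr + 1);   /* publish the slot */
        return true;
    }

    /* Consumer side, e.g. the main loop. */
    bool queue_pop(unsigned char *msg)
    {
        unsigned rd = atomic_load(&queue_rd);
        unsigned wr = atomic_load(&queue_wr);

        if (wr == rd)                      /* empty */
            return false;

        memcpy(msg, queue_buf[rd % QUEUE_LEN], MSG_SIZE);
        atomic_signal_fence(memory_order_seq_cst);
        atomic_store(&queue_rd, rd + 1);   /* release the slot */
        return true;
    }

Free-running indices plus a power-of-two slot count keep the full/empty checks correct across wraparound, and each index is written by exactly one side, which is the property that makes this safe without a lock.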
But if you're careful, a queue like this can make an otherwise impossible task relatively manageable. I can no longer count the number of times a construct like this has saved a project I was involved with.
Would it automatically run correctly and at full performance on a Xeon running Linux? Fuck no. But that's not the point.
The desirable quality of lock-free queues for embedded systems is correctness, not performance.
I've dressed it up to look like C11, but vendor-specific macros from your microcontroller support library are more common. I think you can relax some of the memory ordering too, but I don't remember the details off-hand.
Also, sometimes you can have a modified version of queue_push / queue_pop where a DMA handles the memcpy and then you only have to update the pointers in the cleanup interrupt (along with configuring the next or next-next DMA if the queue isn't full / empty).
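A rough sketch of that consumer-side variant, reusing the declarations from the sketch above; dma_start_copy() and dma_complete_irq() are hypothetical stand-ins for whatever your vendor's DMA API actually looks like, and where the copied-out message ends up is elided:

    /* Hypothetical vendor DMA call: copy len bytes from src to dst in the
       background and raise an interrupt when the transfer completes. */
    void dma_start_copy(void *dst, const void *src, unsigned len);

    static unsigned char rx_msg[MSG_SIZE];  /* destination for the current pop */

    /* Kick off a DMA copy of the oldest block, if there is one. */
    static void queue_pop_dma_start(void)
    {
        unsigned rd = atomic_load(&queue_rd);
        unsigned wr = atomic_load(&queue_wr);

        if (wr != rd)
            dma_start_copy(rx_msg, queue_buf[rd % QUEUE_LEN], MSG_SIZE);
    }

    /* DMA transfer-complete ("cleanup") interrupt: the block at queue_rd has
       been copied out, so release the slot and line up the next transfer. */
    void dma_complete_irq(void)
    {
        atomic_store(&queue_rd, atomic_load(&queue_rd) + 1);
        queue_pop_dma_start();
    }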
The atomic_signal_fence in queue_pop/queue_push seems suspicious. Unless you are doing something like asymmetric fences or actually synchronizing with signal handlers, you probably want atomic_thread_fence. Still, I'm not sure why you want a fence at all there. Publishing is done by the atomic_store to queue_rd and queue_wr, which already includes (I'm being fast and loose here) the equivalent of a seq_cst fence.
The reason you want a barrier there is to make sure the memcpy and the pointer update don't get reordered, so the other thread never has a chance to access that block before it's ready.
You don't actually need any hardware-level synchronization with the other thread (on a single-core micro the "other thread" is just an interrupt on the same core), which is why I think a signal fence can work instead of atomic_thread_fence. But you could be right; anyone using this should check this carefully.
Edit: I think you're right. If it is needed, it should be a thread fence. I'm not 100% clear on whether the pointer update takes care of this, so it may not actually be needed.
The block is 'published' when the store to the write pointer is performed. That's a store release which will have an implicit barrier before the store.
Okay yeah, you're definitely right. I didn't realize that the stdatomic load/stores were by default seq_cst. The libraries I'm familiar with have macros for standalone atomic reads and writes, but you have to place any barriers manually. If you were using a library like that, you'd place them in between the memcpy and the pointer update, but here you get that for free.
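In code, the difference between the two styles looks roughly like this (illustrative only, with the same declarations as the sketch further up; this version uses relaxed atomics plus a manually placed fence between the memcpy and the index store):

    /* Producer written in the "place the barriers yourself" style: relaxed
       atomic accesses plus an explicit fence, instead of relying on the
       ordering that the default (seq_cst) atomic_store provides. */
    bool queue_push_explicit(const unsigned char *msg)
    {
        unsigned wr = atomic_load_explicit(&queue_wr, memory_order_relaxed); /* our own index   */
        unsigned rd = atomic_load_explicit(&queue_rd, memory_order_acquire); /* consumer's side */

        if (wr - rd == QUEUE_LEN)
            return false;                           /* full */

        memcpy(queue_buf[wr % QUEUE_LEN], msg, MSG_SIZE);
        atomic_thread_fence(memory_order_release);  /* the manually placed barrier */
        atomic_store_explicit(&queue_wr, wr + 1, memory_order_relaxed);
        return true;
    }

With a vendor library you'd typically see a plain volatile write of the index and a barrier macro in place of the fence (on a Cortex-M that macro usually boils down to a DMB), but the shape is the same.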