>This is the story of how Andrea Lattuada (PhD student at ETH Zurich) and James Munns (from Ferrous Systems) designed and implemented (two versions!) of a high-perf lock-free ring-buffer for cross-thread communication. If any of those words look scary to you, don't fret, we'll explain everything from the basics.
This is not the message I want to see introducing such a complex topic. Writing correct lock-free code is incredibly hard, and you're not going to get it right or even understand it from a single post. I've done a bit of reading on this subject, and I'm not even sure that it's possible to write a correct lock-free queue (of which this is a variant) in a non-garbage collected language without some pretty substantial trade-offs regarding memory management.
The talk is 3.5 hours, about as long as you'd expect one to be for "writing a concurrent lockfree queue". It's an incredibly difficult topic, but I'm convinced that Pikus's implementation is actually correct.
Although, maybe there's a subtle bug somewhere. And Pikus is REALLY stretching the definition of a queue to make his implementation work. (This queue may sometimes function like a stack, but that's okay because there's no way for the end programmer to figure it out without a global lock.)
---------
EDIT: Oh, I guess you wouldn't consider it "truly lock free", because he has a global lock for the rare situation of a full queue. He uses the double-checked locking pattern for that case (which causes an expensive, but temporary, global lock on the data structure) and basically says "This situation only happens once per hour, so it's probably not bad".
So Pikus's implementation is "lock free mostly", but still falls back to a global spinlock in exceptionally rare corner cases.
Oh, that's actually a different lock-free talk from Pikus than I expected. I haven't watched it yet. Thanks!
I probably wrote my original comment a little too quickly. Yes, it's possible to write useful mostly lock-free data structures, but it's hard enough that I consider anyone claiming to have done so in a way that beats the equivalent locked structure (as you pointed out above) to be making an extraordinary claim that requires extraordinary evidence and serious consideration of the difficulty involved.
> I probably wrote my original comment a little too quickly.
Nah. Anyone who actually understands this subject recognizes that a truly lock-free concurrent queue is probably one of the most difficult problems in all of Comp. Sci.
Even if you're using absolutes, it's basically warranted because of the difficulty of the problem. No one will blame you for talking about how hard this problem is, even with a bit of hyperbole.
> This is not the message I want to see introducing such a complex topic.
I don't see the need for such gatekeeping. I think telling people over-and-over how "incredibly hard" something is will tend to discourage people from learning about that thing. I think that's bad.
Warning about complexity and gatekeeping are not the same thing.
First-year programmer says "I can't wrap my head around Python, I'll write my own language!"
A response of "You're an idiot, don't bother, people smarter than you have tried and failed" is gatekeeping, and shitty.
A response of "There is a lot about this area you do not understand, be aware that it is a lot harder than you seem to think it is" is cautionary, but not gatekeeping.
You have moved the discussion away from the original points. What started this is a sentence in the article encouraging readers to continue because the author will explain the concepts behind potentially new words. The poster I replied to complained about that sentence because it does not convey that the topic is complex.
But almost all important topics are complex, and all education has to start somewhere. My claim is that insisting we emphasize the complexity rather than assuring the understandability is gatekeeping because it discourages new entrants and promotes an aura of mystery around the subject.
There are plenty of hard lock free algorithms, but an SPSC ring buffer is not exactly rocket science and it is a good introduction on the topic.
Why do you think that a GC is required for a queue? It is certainly very helpful to implement complex node-based intrusive concurrent data structures, but for queues, even node based and even MPMC, as long as each message is produced by one thread and consumed by another one (i.e. the typical use case), single ownership works just fine.
edit: I see you are talking about hidden locks in the allocator. That's fair, but even a GC might at some point need to hit the kernel to grow the heap.
>I'm not even sure that it's possible to write a correct lock-free queue (of which this is a variant) in a non-garbage collected language.
I'm guessing because of the "magical free that doesn't lock" assumption in several papers?
There are some ideas in the Rust community (mostly centered around audio); most of them use garbage collection schemes that prevent either the producer or consumer thread from locking. It's tricky and easy to break.
Yeah, that's pretty much exactly it. Every "lock-free queue without garbage collection" I've seen is either broken, relying on hidden locks (most commonly in free), or implementing garbage collection while calling it something else.
I'm biased because I've worked on these things, but imo implementing garbage collection for a very limited set of data structures that are all deterministic or explicit in their deallocation is much easier to deal with than a general purpose GC for an entire program. It allows you to guarantee lock/wait free behavior on particular threads, which you don't get with a GC'd language as a whole (depending on the GC of course).
Well, then your mind should be blown, as the lock-free queue provided in this article works without either locks or garbage collection.
The trick here is that this queue has a fixed capacity so it can allocate everything upfront. You are right that infinite capacity queues can be a bit tricky to implement without violating your conditions. (In many cases you don't actually need an infinite capacity queue though ...)
No, the trick here is that they limit the queue to a single producer thread and single consumer thread. With only two threads operating on the queue, it's extremely unlikely that this is faster than a locked version of the same structure.
As long as it is using sequentially consistent writes, it is not going to be faster than a two-lock queue, as, at least on x86, an uncontended SC write is exactly as expensive as an uncontended spin lock acquire (and a spin lock release is very cheap).
On the other hand, a SPSC ring buffer can be implemented purely with acquire loads and release stores, and if one takes care of avoiding reading recently written cachelines when not necessary it can have higher throughput than a spin-locked queue.
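To make the acquire/release point concrete, here's roughly what such an SPSC ring buffer looks like in Rust. This is a sketch of the general technique, not the article's actual implementation; all type and method names are mine. The single-writer-per-index invariant (producer owns `tail`, consumer owns `head`) is what lets each side load its own index with a relaxed ordering.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal SPSC ring buffer sketch (illustrative, not the article's code).
// Indices grow monotonically and wrap via modulo, so `tail - head` is the
// number of elements currently in the queue.
pub struct SpscRing<T> {
    slots: Vec<UnsafeCell<Option<T>>>,
    head: AtomicUsize, // next slot to read; written only by the consumer
    tail: AtomicUsize, // next slot to write; written only by the producer
}

// Safe to share between the two threads because each index has exactly
// one writer, and slot accesses are ordered by the acquire/release pairs.
unsafe impl<T: Send> Sync for SpscRing<T> {}

impl<T> SpscRing<T> {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        SpscRing {
            slots: (0..capacity).map(|_| UnsafeCell::new(None)).collect(),
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    /// Producer side: returns Err(v) if the queue is full.
    pub fn push(&self, v: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed); // we own tail
        let head = self.head.load(Ordering::Acquire); // syncs with pop's release
        if tail.wrapping_sub(head) == self.slots.len() {
            return Err(v); // full
        }
        unsafe { *self.slots[tail % self.slots.len()].get() = Some(v) };
        // The release store publishes the slot write to the consumer.
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        Ok(())
    }

    /// Consumer side: returns None if the queue is empty.
    pub fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed); // we own head
        let tail = self.tail.load(Ordering::Acquire); // syncs with push's release
        if head == tail {
            return None; // empty
        }
        let v = unsafe { (*self.slots[head % self.slots.len()].get()).take() };
        // The release store tells the producer this slot is free again.
        self.head.store(head.wrapping_add(1), Ordering::Release);
        v
    }
}
```

Note there is no atomic read-modify-write anywhere on the hot path, only plain loads and stores with acquire/release ordering, which is exactly why this can beat a spinlock on x86.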
> [...] I'm not even sure that it's possible to write a correct lock-free queue (of which this is a variant) in a non-garbage collected language without some pretty substantial trade-offs regarding memory management.
There's no dynamic memory allocation going on in this particular lock-free queue, so I don't see how your concern is relevant here?
After reading the full article, they solve a vastly more restricted (although perhaps still useful) problem than the title implies. Their design assumes a single producer and single consumer and is trivially broken if that's not the case. I don't have time to write a convincing set of benchmarks to prove it, but I very much doubt that this is faster than a locking implementation of the same structure.
I can attest to basic lock-free SPSC queues being far faster than a basic spinlocking version of the same queue, on Xeon-class x86 at least.
LOCK'ed instructions (one might use lock xchg or lock cmpxchg) perform a full memory barrier, which has quite a few performance implications if there are any outbound loads/stores. Further, if the cache line containing the spinlock is in anything except an exclusively-held state on the writing core (it will usually be on a different core or in a shared state), simply acquiring the lock will stall for ~25-30ns+.
On the other hand, a simple SPSC lock-free queue can hit <10ns per message, with a latency about equal to the core-to-core communication latency.
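For the spinlock side of that comparison, here is a basic test-and-test-and-set spinlock sketch in Rust (illustrative, not from the article). The `compare_exchange` compiles to a LOCK'ed instruction on x86, which is the full barrier discussed above; the release store in `unlock` is just a plain mov, which is why the release side is so cheap.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Test-and-test-and-set spinlock sketch (illustrative only).
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        loop {
            // Spin on a plain load first, so the expensive atomic RMW
            // (a LOCK'ed instruction on x86, a full memory barrier)
            // only runs when the lock looks free.
            while self.locked.load(Ordering::Relaxed) {
                std::hint::spin_loop();
            }
            if self
                .locked
                .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    pub fn unlock(&self) {
        // Release is just a plain store on x86: very cheap, hence the
        // asymmetry between acquire and release noted above.
        self.locked.store(false, Ordering::Release);
    }
}
```

Even uncontended, every `lock` call pays for one such RMW, whereas the SPSC queue's hot path avoids RMWs entirely.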