There is a very good FIFO-ish queue which is much simpler than this. It would be interesting to compare their performance on real machines.
FIFO-ish queue
--
We allocate a fairly large array of pointers (make it a power of two if you need absolute speed). We'll call it a ring buffer.
Writers:
Each writer acquires a random per-thread index into the ring buffer.
A write is performed by CAS into the index, with the assumption that the buffer entry at index is NULL. CAS succeeds index++, job's a good'un. CAS fails index++ and try again.
Readers:
Each reader acquires a random per-thead index into the ring buffer.
A read is performed by first reading a pointer from the index, if the pointer is NULL index++ and try again. If the pointer is not NULL then CAS to NULL. CAS succeeds index++, job's a good'un. CAS fails index++ and try again.
--
This has a number of desirable properties, it is very memory locality friendly, all we ever do is walk memory word for word in a very predictable pattern. With a well sized ring buffer we can make reads and writes very independent reducing cache-contention etc. It does not perform any memory allocation or create garbage, which can be a problem in some domains.
It's also very simple, it is quite feasible to write a correct version of this data structure without PhD level work.
The design was communicated to me by Martin Thompson and has been implemented in real systems. I do not know how many cores it was used on.
The real question is, if we finally will be able to use lock-free, wait-free concurrent skip-lists with this new variant, without paying patent fees to Oracle?
FIFO-ish queue
--
We allocate a fairly large array of pointers (make it a power of two if you need absolute speed). We'll call it a ring buffer.
Writers:
Each writer acquires a random per-thread index into the ring buffer.
A write is performed by CAS into the index, with the assumption that the buffer entry at index is NULL. CAS succeeds index++, job's a good'un. CAS fails index++ and try again.
Readers:
Each reader acquires a random per-thead index into the ring buffer.
A read is performed by first reading a pointer from the index, if the pointer is NULL index++ and try again. If the pointer is not NULL then CAS to NULL. CAS succeeds index++, job's a good'un. CAS fails index++ and try again.
--
This has a number of desirable properties, it is very memory locality friendly, all we ever do is walk memory word for word in a very predictable pattern. With a well sized ring buffer we can make reads and writes very independent reducing cache-contention etc. It does not perform any memory allocation or create garbage, which can be a problem in some domains.
It's also very simple, it is quite feasible to write a correct version of this data structure without PhD level work.
The design was communicated to me by Martin Thompson and has been implemented in real systems. I do not know how many cores it was used on.