I don't think false sharing matters here. Cacheline bouncing of the counter, which is written even by readers, would be enough to explain the bottleneck. The barrier that is likely implied by, or explicit in, the atomic RMW used to update the counter will also stall the pipeline, preventing any other memory operation from happening concurrently.

All threads repeatedly CASing on the same memory location is enough to bring any CPU to its knees.
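To make the pattern concrete, here is a rough C++ sketch of what I mean. The names (`contended_total`, `sharded_total`, `Slot`) and the 64-byte line size are my own assumptions for illustration, not anything taken from the code under discussion. The first function has every thread CAS the same word; the second gives each thread its own slot, so the hot path touches no shared line.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Every thread CASes on the same word, so its cache line has to move
// between cores on each successful update.
std::uint64_t contended_total(int threads, int iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; ++i) {
                std::uint64_t seen = counter.load(std::memory_order_relaxed);
                // On failure compare_exchange_weak refreshes `seen`; retry.
                while (!counter.compare_exchange_weak(seen, seen + 1)) {
                }
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}

// Assumption: 64-byte cache lines. Each thread owns one padded slot.
struct alignas(64) Slot {
    std::atomic<std::uint64_t> n{0};
};

// Same total, but no two threads ever write the same location.
std::uint64_t sharded_total(int threads, int iters) {
    assert(threads > 0);
    std::vector<Slot> slots(static_cast<std::size_t>(threads));
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&slots, t, iters] {
            for (int i = 0; i < iters; ++i) {
                slots[t].n.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t total = 0;
    for (auto& s : slots) total += s.n.load(std::memory_order_relaxed);
    return total;
}
```

Both return `threads * iters`; the difference only shows up when you time them with a realistic thread count on real hardware.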

The critical part of the fix is that readers no longer have to modify any shared memory location; they rely on deferred reclamation instead.
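A minimal sketch of that idea, under heavy assumptions: `Published`, `Config`, and `run_demo` are hypothetical names, there is a single writer, and "deferred reclamation" is reduced to its crudest form (old versions are kept on a retire list and freed only after every reader thread has been joined). A real implementation would use RCU, epochs, or hazard pointers to decide when freeing is safe. The point is only that the read path is a single load with no store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct Config {
    int value;
};

class Published {
public:
    explicit Published(int v) : current_(new Config{v}) {}
    ~Published() {
        reclaim();
        delete current_.load();
    }

    // Reader path: one acquire load, no write to shared memory.
    const Config* read() const {
        return current_.load(std::memory_order_acquire);
    }

    // Single writer: publish a new version and retire the old one
    // instead of deleting it, since readers may still be using it.
    void update(int v) {
        Config* old = current_.exchange(new Config{v},
                                        std::memory_order_acq_rel);
        retired_.push_back(old);
    }

    // Call only when no reader can still hold a pointer from read().
    void reclaim() {
        for (Config* p : retired_) delete p;
        retired_.clear();
    }

private:
    std::atomic<Config*> current_;
    std::vector<Config*> retired_;  // touched only by the writer
};

// Runs `readers` reader threads against one writer and returns how many
// reads observed a valid (non-negative) value.
long run_demo(int readers, int iters) {
    Published pub(0);
    std::vector<long> seen(static_cast<std::size_t>(readers), 0);
    std::vector<std::thread> pool;
    for (int r = 0; r < readers; ++r) {
        pool.emplace_back([&pub, &seen, r, iters] {
            long local = 0;  // thread-private, written back once at the end
            for (int i = 0; i < iters; ++i) {
                const Config* c = pub.read();
                assert(c != nullptr);
                if (c->value >= 0) ++local;
            }
            seen[r] = local;
        });
    }
    std::thread writer([&pub, iters] {
        for (int i = 1; i <= iters; ++i) pub.update(i);
    });
    for (auto& th : pool) th.join();
    writer.join();
    pub.reclaim();  // safe here: all readers have finished
    long total = 0;
    for (long s : seen) total += s;
    return total;
}
```

Waiting for all readers to exit is obviously not practical for long-lived readers; it stands in for whatever grace-period mechanism the real fix uses.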