If there's contention between multiple CPU cores (or worse, multiple sockets), 3 atomic ops on the same cache line cap the max throughput at roughly 3-15M operations per second for the whole system. Probably around 10M on most 2-4 core CPUs. Multi-socket systems will be slower.
That's the point where the whole computer is on its knees due to cache coherency traffic between cores.
Note: these are just ballpark figures based on real-life experience on reasonably modern x86-64 systems. Your mileage may vary, so always measure actual performance.
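
If you want to measure this on your own machine, here's a minimal C++ sketch, not a rigorous benchmark. It assumes a single shared `std::atomic` counter that every thread increments with `fetch_add`, so all cores fight over one cache line. The names `ContentionResult` and `measure_contended_ops` are made up for illustration. It counts individual atomic ops, so divide the result by 3 to compare against the "3 atomic ops per operation" figure above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type for this sketch.
struct ContentionResult {
    std::uint64_t total_ops;   // atomic increments actually performed
    double seconds;            // wall-clock time for the whole run
    double ops_per_second;     // total_ops / seconds
};

// Hypothetical helper: every thread hammers the same atomic counter,
// so all participating cores contend for a single cache line.
ContentionResult measure_contended_ops(unsigned num_threads,
                                       std::uint64_t ops_per_thread) {
    assert(num_threads > 0);

    std::atomic<std::uint64_t> counter{0};
    std::atomic<bool> start{false};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // Spin until all threads have been created, so they start together.
            while (!start.load(std::memory_order_acquire)) {
            }
            for (std::uint64_t i = 0; i < ops_per_thread; ++i) {
                // Relaxed ordering is enough here: only the count matters.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    auto begin = std::chrono::steady_clock::now();
    start.store(true, std::memory_order_release);
    for (auto& w : workers) {
        w.join();
    }
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - begin).count();
    std::uint64_t total = counter.load();
    double rate = seconds > 0.0 ? static_cast<double>(total) / seconds : 0.0;
    return {total, seconds, rate};
}
```

Call it with something like `measure_contended_ops(std::thread::hardware_concurrency(), 10000000)` and compare against a run with 1 thread. Build with optimizations and thread support (e.g. `-O2 -pthread` on GCC/Clang). Numbers will swing a lot with core count, socket layout and whatever else is running.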