For popcounts this is the classical solution: make aggregation lazy, worry only about keeping k smaller numbers accurate cheaply, and let the reader worry about calculating the total, which is likely only eventually consistent anyway.
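As a rough sketch of that lazy-aggregation idea (not any particular library's implementation): each writer bumps its own slot and the reader sums the slots on demand. The names `StripedCounter`, `Slot`, and `demo` are made up for illustration, and the 64-byte alignment assumes a 64-byte cache line.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// One counter slot, aligned so each slot sits on its own cache line.
/// Assumption: cache lines are 64 bytes on the target machine.
#[repr(align(64))]
struct Slot(AtomicU64);

/// Hypothetical striped counter: k independent slots, summed lazily on read.
pub struct StripedCounter {
    slots: Vec<Slot>,
}

impl StripedCounter {
    pub fn new(k: usize) -> Self {
        assert!(k > 0, "need at least one slot");
        StripedCounter {
            slots: (0..k).map(|_| Slot(AtomicU64::new(0))).collect(),
        }
    }

    /// Writer path: touch only the slot picked by the writer's index.
    pub fn add(&self, writer: usize, n: u64) {
        let slot = &self.slots[writer % self.slots.len()];
        slot.0.fetch_add(n, Ordering::Relaxed);
    }

    /// Reader path: add up the slots one at a time. This is not an atomic
    /// snapshot; concurrent writes may or may not be included in the result.
    pub fn sum(&self) -> u64 {
        self.slots.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
    }
}

/// Spawns `writers` threads that each add 1 `per_writer` times, then returns
/// the total. All threads are joined before the read, so the total is exact.
pub fn demo(writers: usize, per_writer: u64) -> u64 {
    let counter = StripedCounter::new(writers);
    thread::scope(|s| {
        for w in 0..writers {
            let counter = &counter;
            s.spawn(move || {
                for _ in 0..per_writer {
                    counter.add(w, 1);
                }
            });
        }
    });
    counter.sum()
}
```

The total is only exact here because the read happens after every writer has finished. A reader calling `sum` while writers are still running gets a value somewhere between the old and new totals, which is the "eventually consistent" part.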
But for multiple-reader, exclusive-writer access this is never going to work, or at least not without extra silicon, which maybe should be a thing. The problem, of course, is that a multi-word atomic sum instruction would have to cross cache lines, because the counters must sit on separate lines to avoid false sharing. So you'd end up with counters on contiguous cache lines, each occupying 64 bytes (a full line on typical hardware), which is a lot.
It would almost call for a special region of memory with different cache-coherency behavior, and I can't see that being easy, fast, or backward compatible, which is maybe why we don't do it.
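To put a number on the "64 bytes apiece" cost, here is a small sketch comparing k packed 8-byte counters against k counters that each get their own line. `PaddedCounter` and `footprint_bytes` are hypothetical names, and the 64-byte line size is again an assumption about the target hardware.

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicU64;

/// A single 8-byte counter forced onto its own (assumed 64-byte) cache line.
#[repr(align(64))]
pub struct PaddedCounter(pub AtomicU64);

/// Returns (packed bytes, padded bytes) for k counters.
pub fn footprint_bytes(k: usize) -> (usize, usize) {
    (k * size_of::<AtomicU64>(), k * size_of::<PaddedCounter>())
}
```

For 16 counters that is 128 bytes packed versus 1024 bytes padded, an 8x overhead paid purely to keep writers off each other's lines.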