This is similar to how continuous load shifting works in some massively parallel database engines, though approached from a different architectural angle. This problem is at the core of scaling out mixed workloads when the distribution of data and load is intrinsically unpredictable and non-uniform. In the database case, however, the cache itself is sharded and of fixed size (essentially RAM pinned to a core) for performance and scalability reasons; the cache isn't a separate system.
Since you have all the local cache metadata and you can't change the cache size per node, you keep load approximately uniform across nodes by continuously and transparently shipping the smallest number of shards (while under full load) that gives each logical node roughly the same cache performance profile. This works well in my experience. It sounds onerous resource-wise, but even in a very dynamic, high-throughput system the number of shards you need to ship is relatively small if the algorithm picks them well, and the total latency to ship a shard can be fairly small (on the order of tens to hundreds of milliseconds, depending on the design). As a percentage of system load, the cost of shipping is nominal, and it all but eradicates hotspotting. However, the whole system has to be designed under the assumption that you'll be doing this; it isn't something you can reasonably bolt on later.
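To make the shard-selection step concrete, here is a minimal, hypothetical sketch of a greedy rebalancer in that spirit: given recent per-shard load measurements, it picks the fewest moves that bring every node within a small tolerance of the mean load. The `Shard` type, the load metric, and `pick_moves` are illustrative assumptions, not anything from a specific engine.

```python
# Hypothetical greedy shard-selection sketch: choose a small set of shard
# moves that evens out per-node load. All names here are made up for
# illustration; a real system would feed in its own load metrics.

from dataclasses import dataclass

@dataclass
class Shard:
    shard_id: int
    node: int
    load: float  # e.g. recent cache-hit-weighted request rate

def pick_moves(shards: list[Shard], num_nodes: int, tolerance: float = 0.05):
    """Return a short list of (shard_id, src_node, dst_node) moves."""
    node_load = [0.0] * num_nodes
    for s in shards:
        node_load[s.node] += s.load
    target = sum(node_load) / num_nodes
    moves = []

    # Repeatedly move the largest shard that fits from the hottest node
    # to the coldest node, stopping once the hottest node is within
    # tolerance of the mean (or no shard small enough to move remains).
    while True:
        hot = max(range(num_nodes), key=lambda n: node_load[n])
        cold = min(range(num_nodes), key=lambda n: node_load[n])
        surplus = node_load[hot] - target
        if surplus <= tolerance * target:
            break
        candidates = [s for s in shards if s.node == hot and s.load <= surplus]
        if not candidates:
            break
        shard = max(candidates, key=lambda s: s.load)
        moves.append((shard.shard_id, hot, cold))
        node_load[hot] -= shard.load
        node_load[cold] += shard.load
        shard.node = cold
    return moves
```

A real system would execute these moves incrementally while continuing to serve traffic and re-measure load afterwards; the point of the sketch is only that selecting which shards to ship is cheap compared to shipping them.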
It is great to see more work being published on achieving an effectively uniform distribution of load for data models whose data and load distributions are intrinsically unpredictable, dynamic, and non-uniform. There are many high-value data models that can't be handled at scale without addressing this problem directly.
Author here: this is very much a research project aimed at prototyping the idea that caching can help tail latency. Previously, people didn't believe caches could be used that way, and it turns out you really have to rethink how you use them.
We built a testbed and replayed production workloads; this is not running in production yet. We're actively looking for new workloads and scenarios to test this at scale!