Hacker News new | past | comments | ask | show | jobs | submit login
RobinHood: tail latency aware caching – dynamic reallocation (acolyer.org)
96 points by godelmachine on Oct 26, 2018 | hide | past | favorite | 7 comments



This is similar to how continuous load shifting works in some massively parallel database engines, though working from a different architectural perspective. This problem is at the core of how you scale out mixed workloads when the distribution of data and workload is intrinsically unpredictable and non-uniform. However, in the database cases the cache itself is sharded and a fixed size (basically RAM fixed to a core), for performance and scalability reasons -- the cache isn't a separate system.

Since you have all the local cache metadata and you can't change the cache size per node, you ensure approximately uniform load across nodes by continuously and transparently shipping the smallest number of shards (while under full load) that will give each logical node approximately the same cache performance profile. This works well in my experience. It sounds onerous resource-wise but the number of shards you typically need to ship even in a very dynamic and high-throughput system is relatively small if your algorithm picks them well and the total latency to ship a shard can be pretty small (tens or hundreds of milliseconds?) depending on the design. As a percentage of system load the cost of shipping is pretty nominal and it all but eradicates hotspotting. However, the whole system has to be designed under the assumption that you'll be doing this, it isn't something you can reasonably add on later.

It is great to see more work being published on achieving effectively and apparently uniform distribution of load for data models that intrinsically have unpredictable and dynamic non-uniform distribution of data and load. There are many high-value data models that can't be handled at scale without addressing this problem directly.


Author here: thank you pointing out the connection to parallel DBMs - I was not aware of the problem similarity but difference in approach.

Can you point me to a DBMs that does that, both an OSS or in production?



What's the license of RobinHood's core algorithm? This seems useful.

But this being sponsored research (Microsoft) brings me to thinking of the old ARC cache licensing problem.

http://www.varlena.com/GeneralBits/96.php


No, we do not expect any licensing problems. There's an open issue on github: https://github.com/dasebe/robinhoodcache/issues/1

The issue mentions 3 Clause BSD, which will probably be implemented today.


How widely has this been tested? Is this being used in production?


Author here: this is very much a research project with the purpose of prototyping the idea that caching can help tail latency. Previously, people didn't believe that caches can be used that way and it turns out you really have to rethink how you use the caches.

We build a testbed and replayed production workloads. This is not running in production yet. We're actively looking for new workloads and scenarios to test this at scale!




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: