Hacker News
Cache Concurrency Control (braze.com)
57 points by brazeepd on May 30, 2019 | 5 comments



People learn and re-learn this at almost every company, which I find fascinating. The redis nx bits aren't a bad way of dealing with it.

Since they mention memcached: I've been working on a protocol extension to bake this exact thing in more directly. Though in Braze's case, it's unclear to me why they didn't use the method of add'ing a secondary key with a low TTL, since that at least doesn't cross systems.

With the new protocol to memcached you get "win" tokens, which are very loosely similar to leases. Rather than explicit lease tokens, a client is notified of whether it "won" or whether an object is "stale", etc., and the existing CAS mechanisms are used for replacing objects.

I.e.: if you fetch an object and miss, it'll auto-create an object with a specified TTL and return a CAS value (a version number). The winner recaches; other clients are told to retry or wait.
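That miss path can be sketched with an in-memory map standing in for memcached; the flow below is illustrative of the described semantics, not the actual wire protocol:

```go
package main

import (
	"fmt"
	"sync"
)

type entry struct {
	value string
	cas   uint64 // version number, bumped on every write
	stub  bool   // placeholder auto-created on a miss
}

type cache struct {
	mu      sync.Mutex
	items   map[string]*entry
	nextCAS uint64
}

// getOrVivify returns the cached value; on a miss it auto-creates a
// placeholder and hands back a CAS token. The caller that receives
// won=true is the one allowed to recache.
func (c *cache) getOrVivify(key string) (value string, cas uint64, won bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.items[key]; ok {
		return e.value, e.cas, false // hit, or someone already won
	}
	c.nextCAS++
	c.items[key] = &entry{cas: c.nextCAS, stub: true}
	return "", c.nextCAS, true
}

// casSet replaces the value only if cas still matches, mirroring the
// existing memcached CAS mechanism the comment refers to.
func (c *cache) casSet(key, value string, cas uint64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok || e.cas != cas {
		return false // lost the race; someone else wrote first
	}
	c.nextCAS++
	*e = entry{value: value, cas: c.nextCAS}
	return true
}

func main() {
	c := &cache{items: map[string]*entry{}}
	_, cas, won := c.getOrVivify("user:42")
	fmt.Println(won) // the first fetcher wins
	_, _, won2 := c.getOrVivify("user:42")
	fmt.Println(won2) // later fetchers do not
	fmt.Println(c.casSet("user:42", "hydrated", cas))
}
```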

Closer to the braze use case, you can set a "TTL remaining threshold" with a request. If you fetch an object which initially had a 180s TTL, but now has a <90s one, you get a win token. Other clients get the existing value, but only one client is allowed to recache.

There's a bit more to it, as the changes are trying to stay flexible for a number of possible scenarios. It cuts out round trips and finally gives people more modern cache semantics built in. Hoping to ship this soon, but I need to track down client authors for feedback.


For patterns like this I like to reach for request coalescing. Here's an example of a package that does it from the Go x/sync subrepository (not the standard library proper): https://godoc.org/golang.org/x/sync/singleflight


Big fan of request coalescing; I was introduced to it by Varnish. It really helps when a cache entry drops.

I believe what the article did is called a "grace period".

Are there any open implementations of distributed request coalescing? Pretty easy to do using a service proxy, but what about something like a Raft cluster?


I'm curious if assigning a random value from an acceptable range for the expiry time is a common approach. I assume that wouldn't work for all cases, but might spread the avalanche for some.

Google searches on this topic (random expiry to mitigate an avalanche cache refresh) turn up very little.


It is very common and often referred to as jitter.

A variation, called scaled TTL, was nicely discussed in this video: https://www.youtube.com/watch?v=kxMKnx__uso



