I feel like the moral of this story is less "caching is a liability" and more garbage in, garbage out. Caching can help when used appropriately, but it's not magic.
Your performance topology is dictated by multiple hierarchies of caches, many of which you might not be aware of. If they disappear, your application may no longer function. The cache is no longer an optimization but a critical component. Figuring out how to fill it might be harder than how to invalidate it.
> Figuring out how to fill it might be harder than how to invalidate it.
This always seems to be a problem for someone else or for another day, which is part of the frustration.
Cold caches killing your app are a considerable problem, doubly so if you deploy in the Cloud, because now you aren't necessarily guaranteed to be deploying into a hot data center. I think there's an unspoken assumption we haven't really eliminated: if you came to work and the power was out, you not only expected none of the software to work, you peer-pressured other people into not complaining too loudly about it when the power came back on. Today is a loss. Everyone can see it.
But if, for instance, I work on a SaaS project in the Cloud, not only can I just lose a region, but for many problem domains my customers mostly have local customers of their own, so I have diurnal cycles of traffic that are customer specific. All the non-bot traffic is in New York until 6am GMT, say. I might want to spin regions down pretty aggressively during parts of the day, but if spinning back up results in a thundering herd problem because 'the cache makes us fast'? Well, then the value of my circuit breaker alerts drops to less than zero.
That's just the first time things break. Caches and capacity planning are also at odds, because capacity planning is about planning your worst case and caches are almost always about the happy path.
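Rough numbers make the gap between the happy path and the worst case obvious. Here is a back-of-the-envelope sketch; all of the figures (traffic, hit rate, database capacity) are hypothetical placeholders, not anything from the discussion above:

```python
# Back-of-the-envelope: database load in one region with a warm vs. cold cache.
# All numbers are hypothetical placeholders.

requests_per_sec = 2_000      # steady-state traffic hitting the region
warm_hit_rate = 0.98          # hit rate once the cache is warm
db_capacity_qps = 500         # what the database tier was provisioned for

warm_db_qps = requests_per_sec * (1 - warm_hit_rate)  # 40 qps on the happy path
cold_db_qps = requests_per_sec                        # every request misses at spin-up

print(f"warm cache: {warm_db_qps:.0f} qps ({warm_db_qps / db_capacity_qps:.0%} of DB capacity)")
print(f"cold cache: {cold_db_qps:.0f} qps ({cold_db_qps / db_capacity_qps:.0%} of DB capacity)")
# Capacity was planned against the warm number; the cold number is what the
# thundering herd actually delivers when the region spins back up.
```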
Caching adds leverage and therefore risk. That risk, in particular the "thundering herd" problem, is a special case of reducing the system from stable to metastable. People think it's stable but it's actually metastable, and cache flushes are where you push the system hard enough to find out where your stationary points really are.
> These metastable failures have caused widespread outages at large internet companies, lasting from minutes to hours. Paradoxically, the root cause of these failures is often features that improve the efficiency or reliability of the system.
> Caching can also make architectures vulnerable to sustained outages, especially look-aside caching. [...] If cache contents are lost in the vulnerable state, the database will be pushed into an overloaded state with elevated latency. Unfortunately, the cache will remain empty since the web application is responsible for populating the cache, but its timeout will cause all queries to be considered as failed. Now the system is trapped in the metastable failure state: the low cache hit rate leads to slow database responses, which prevents filling the cache.
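The trap described there is easy to reproduce in a toy look-aside cache. A minimal sketch, assuming an in-process dict as the stand-in cache and a hypothetical slow_db_query for the overloaded database: once the database is slow enough to blow the web tier's timeout, the miss path fails before the value is ever written back, so the hit rate never recovers.

```python
import time

CACHE = {}              # stand-in for memcached/Redis (look-aside: the app fills it)
QUERY_TIMEOUT_S = 0.5   # the web tier gives up after 500 ms

def slow_db_query(key):
    """Hypothetical database call; under overload its latency is elevated."""
    time.sleep(2.0)
    return f"value-for-{key}"

def get(key):
    # Look-aside read: check the cache first.
    if key in CACHE:
        return CACHE[key]
    # On a miss, the application is responsible for populating the cache.
    start = time.monotonic()
    value = slow_db_query(key)
    # A real client would cancel the call at the deadline; checking the
    # elapsed time afterwards is enough to show the trap.
    if time.monotonic() - start > QUERY_TIMEOUT_S:
        # The slow response is treated as a failure and never written back,
        # so the hit rate stays low and the database stays overloaded.
        raise TimeoutError(f"query for {key!r} exceeded {QUERY_TIMEOUT_S}s")
    CACHE[key] = value
    return value
```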
Moreover, in my experience, caching as reached for by the grandparent comment is done reflexively rather than after a true investigation of the nature of the temporal, spatial, or other locality actually present.
Indeed, data that is actually cacheable fits within a set of constraints that is tighter than typically considered, two dimensions of which are typically out of the control of the practitioner:
1. Whether the data exhibits sufficient temporal or spatial locality (at the place that it is accessed [1]) to facilitate caching,
2. Whether read consistency can be sufficiently relaxed by policy (e.g., tolerating TTL-bounded staleness), or else whether writes can be replicated to the caches losslessly and with low enough latency to achieve timely invalidation, and
3. Whether the cache size required to meet the hit-rate and TTL/eviction goals is feasible in the system.
If your data is small enough that a meaningful fraction fits in memory, exhibits good temporal locality so that there are "hot keys", and the TTL you impose gets you the hit rate you actually need while staying within your policy requirements for read consistency? OK, that can be a good fit for caching. But those are empirical questions which very much are NOT obvious beforehand, certainly not answerable reflexively with "just throw a cache at it", and they need to be scrutinized, especially with long TTLs, for the metastability reasons enumerated above.
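One way to make those questions empirical before committing is to replay (or approximate) the access pattern against a simulated cache. A rough sketch, where the Zipf-ish popularity, cache size, and TTL are placeholder assumptions you would swap for a replay of real access logs:

```python
import random
from collections import OrderedDict

# Placeholder workload assumptions -- replace with a replay of real access logs.
NUM_KEYS = 100_000
NUM_REQUESTS = 500_000
CACHE_SIZE = 5_000           # entries we can afford to keep in memory
TTL = 10_000                 # entry lifetime, measured in requests for simplicity

def zipf_key(rng):
    # Crude Zipf-ish popularity: low-rank keys are much hotter than high-rank ones.
    return int(rng.paretovariate(1.2)) % NUM_KEYS

def simulate(seed=0):
    rng = random.Random(seed)
    cache = OrderedDict()    # key -> insertion time, in LRU order
    hits = 0
    for now in range(NUM_REQUESTS):
        key = zipf_key(rng)
        inserted_at = cache.get(key)
        if inserted_at is not None and now - inserted_at < TTL:
            hits += 1
            cache.move_to_end(key)          # refresh LRU position on a hit
        else:
            cache[key] = now                # (re)fill on miss or expiry
            cache.move_to_end(key)
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)   # evict the least recently used entry
    return hits / NUM_REQUESTS

print(f"simulated hit rate: {simulate():.1%}")
```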
[1] Note that with modern horizontally scalable systems with round-robin load balancing, this means you actually need roughly X times as much locality if you're using in-memory caching rather than a network-attached cache like Redis or memcached, or else you also require sharding, X being the number of k8s pods or heroku dynos or ec2 instances or what have you. So even though your global temporal locality is maybe high, this might evaporate as you spread the load across random pods. Having a system that is "horizontally scalable", but where the cache hit rate your system depends on to stay warm gets smaller and smaller as you scale up, has ... interesting consequences, reminiscent of the multi-master scalability problems from scaling relational databases. It scales to a point, but probably not farther.
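That dilution is measurable with the same kind of toy simulation: hold each pod's in-memory cache at a fixed size and round-robin the identical workload across more pods. The pod counts, workload shape, and cache size here are again hypothetical stand-ins:

```python
import random

# Hypothetical workload: Zipf-ish popularity, round-robin load balancing across pods.
NUM_KEYS = 100_000
NUM_REQUESTS = 500_000
PER_POD_CACHE = 2_000        # each pod's in-memory cache, fixed regardless of pod count

def zipf_key(rng):
    # Crude Zipf-ish popularity, same shape for every pod count.
    return int(rng.paretovariate(1.2)) % NUM_KEYS

def hit_rate(num_pods, seed=0):
    rng = random.Random(seed)
    pods = [dict() for _ in range(num_pods)]   # key -> True, crude FIFO-style caches
    hits = 0
    for i in range(NUM_REQUESTS):
        pod = pods[i % num_pods]               # round-robin load balancing
        key = zipf_key(rng)
        if key in pod:
            hits += 1
        else:
            pod[key] = True
            if len(pod) > PER_POD_CACHE:
                pod.pop(next(iter(pod)))       # evict the oldest inserted entry
    return hits / NUM_REQUESTS

for n in (1, 4, 16, 64):
    print(f"{n:>3} pods: {hit_rate(n):.1%} hit rate")
```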
Silver bullets kill everyone, not just werewolves.
Garbage In, Garbage Out is not wrong per se, but we have in the real world the concept of an attractive nuisance, where the victim is not entirely at fault for their injuries. The thing, being prone to invite misuse and misadventure, bears some responsibility for protecting people from themselves.
Caching is an attractive nuisance. These days it's simply replaced Global Shared State which used to be the boogeyman, but sadly we haven't collectively clued in on them being equivalent.
In this situation it was stuff that would be terrible both with and without caching; it didn't have much to do with caching itself, the caching just provided a bit of a band-aid. I think the more apt metaphor might be "seatbelts make people drive recklessly".
Would fast computers also be an attractive nuisance because you don't have to think about performance as much?
> Would fast computers also be an attractive nuisance because you don't have to think about performance as much?
Sometime in the last week or two someone asserted just that. I didn't dig into that discussion very far, but on the face of it? Sure. Certainly grabbing the fastest hardware available to Man is an expensive proposition and you should hold that in reserve. You don't get to play that card very often, and much less so today than during the Golden Era. Post-Moore's Law for sure, but to an extent post-Dennard as well, since the power draw of a facility is a significant cost center. And without Dennard scaling you can make them more powerful, but keeping them efficient gets expensive.