For most applications the optimal size of a (time-based) cache is around the point that keeping an item around in the cache is equal to the cost of accessing it again. This is a point that Jim Gray made in the classic "The 5 Minute Rule for Trading Memory for Disc Accesses" (http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf) back in 1986.
A side effect of that is that systems that run faster, and therefore access each cached it more often, get more benefit out of a cache by pushing that set point out further. Whether that happens in any given system depends a lot on the access patterns, and especially temporal locality within those access patterns, but the effect is real in a lot of different kinds of systems.
> It's common wisdom that systems with caches can fail badly when the caches are cold. When every request misses the cache, the system is overloaded, and never recovers. What I observed was the mirror image.
The bottom line is that both locality and memoization are huge contributor to the performance of systems, both positive and negative. When folks (including me) point out the risks of caches, we're not implying that caches are always bad. Instead, it's pointing out a risk that sometimes they are too good, leading to a system that can't function when the normal level of locality and memoization are not available.
Caching is a weapon of last resort. In most cases it is the last major performance improvement you will see from your code, even if other options were available, and any information architecture problems you have are set in stone or will get worse.
Because of the locality aspect, people will assume that it's 'free' to look up a value multiple times instead of passing these data dependencies between the relevant methods. This turns your cache into global shared state, and once that happens you can't turn the cache off again. There are other problems with this arrangement as well.
The easiest to understand (and yet I've failed on at least one project) is that if the working set for a problem ever becomes proportional to the size of the cache (>1 for single task systems, > 1/n for systems with n concurrent tasks), then you can end up evicting data in the middle of a transaction. The obvious consequence is that it causes performance to jump off a cliff. Particularly bad behavior with LRUs but switching to a different statistical model simply reduces the height of the cliff. There is still a cliff. The less obvious consequence is now you have concurrency bugs, because the code assumes that decisions made about the data will go the same way on each call, but due to eviction, two halves of the transaction may make conflicting decisions with undefined behavior.
But speaking as the guy who fixes the perf issues other people gave up on: Once you introduce caching you poison the performance well. Flame charts lie when cache is used, and after people rely on the cache it is lying in the other direction when the cache is off, because you're calling that path 5 times now and it's swamping other bottlenecks. This is what I really mean when I say it's a weapon of last resort. Because once you use it, it attempts to take any other options off the table. Caching will be most of the last order of magnitude in scalability that your system ever sees. Every 'win' from there on will take more time and energy than most people are willing to invest.
I'm not sure everyone here is talking about the same thing when they talk about caching but if you took caching out of the modern Internet it would collapse immediately. There's caches in your CPU, there's caches in your disk drives, there's caches in Linux, there's caches in every layer of every major cloud service, there's caches in your network (CDN).
If your CPU had to fetch every instruction and every bit of data from main memory it would run x100 slower. If your disks didn't have caching they'd generally run slower. If Linux didn't cache your filesystem it'd need a lot more disk accesses. etc. etc.
Ofcourse a properly designed cache has to consider things like what are the access patterns, how to figure out what to keep in the cache and what to toss away, how large is the dataset we're caching, invalidation etc. But there's no performance optimization that's going to get that NetFlix movie to 100M subscribers directly from the disk it's stored on, through the network.
Feels like some of the discussion is about something like storing f(x) in a map instead of computing f(x). I wouldn't use this terminology (caching) in this case. This is more like your standard space to time algorithmic tradeoffs.
My first impulse was to disagree with you, but I realised I gave the exact same advice as a consultant to a group managing one of the bigger government websites that had some performance issues.
They wanted to paper over the issues by slapping a CDN in front of the site. Their users are all local, so the global aspect of the CDN wasn't the point, the benefit was that it would cache the content and provide a performance boost.
I told them that, yes, a CDN might help, but they should optimise their site first, and only then add a CDN on top if they really need to. Otherwise they'll just mask their issues. It's too easy to end up with queries that take 1 minute but are cached for, say, 1000 separate users. Most people most of the time will have a good experience, but you'll end up with a hideous 99th percentile that will be really hard to detect and fix.
I’d often take the opposite approach (and have done with consulting clients)
Slap the CDN on the site and then chase down the slow server responses
It delivers an improved experience for the people using the site sooner than optimizing the site first
The challenge of course is you then have to persuade people to go chase down the slow responses when they may response “there’s so few why should I care?”
From a web standpoint I reminded them that the browser already has a cache so you would be better off getting your cache semantics clear first, debug it at the client and then infill later if and as necessary.
> Caching is a weapon of last resort. In most cases it is the last major performance improvement
...is wrong. It is one of the very first things I reach for because it is the most effective and simplest, once you've eliminated O(n^2) behaviour etc. It's not a magic wand but nothing is.
That's not a very convincing argument. "I use it because it works" without refuting any reasons I presented as to why it "works" but breaks your other options.
How do you teach people to effectively use a profiler on code with pernicious caching?
Part of why I'm the "Cleaner" is because I've worked with any number of people who will present "evidence" that everything that can be done has been done. Here are the metrics from a year ago when we turned on the cache. Here are the flame charts. See? Nothing to see here. They get really uncomfortable six months later when I'm reporting 2-3x in code that can't be optimized further. And in each case everything was made more difficult by the people ahead of me chumming the waters.
They were partially correct. There was in fact nothing left for them to see, so nothing that they knew to do. But "there's nothing to see here" and "there's nothing here" are two very, very different statements.
There's a somewhat tangentially related heuristic to your points imho. When writing code, when in doubt, task CPU over memory, it will generally yield leaner, better software. Easier to profile, debug, and optimize. Much easier to scale in production environments too.
Code which optimizes for fast running time, tends to lean heavily into using memory-bloating data structures and caches. These are often poison for aggregate performance in real world environments, and like you said among other things, reduce visibility in the workings of system.
There's some implications of this stance; e.g. optimizations to reduce memory footprint are generally never wasted, while optimizations to squeeze out a few ns of runtime may have many more unexpected consequences.
The issue with caches is they are easy to apply. You can easily drop a cache in any place you load a resource without significant code changes.
I tend to think they are ok weapons of sometimes first resort (rarely changing resource with heavy system impact). But you'll definitely get better results simply by rethinking and rearchitecting stuff. (Did you really need to load the database just to lookup a single value?)
> You can easily drop a cache in any place you load a resource without significant code changes.
Only when the cache is brand new. The problem with people with cache addiction is that they think they're free. So once they know a cache is there, they begin to lean on it, and lean on it. I've seen it with many different teams in a number of verticals. Sometimes while I'm there, sometimes before I arrived. It's the people not the company.
This illusion that you can always turn it off later is dangerous. That's only true if you have an impeccable information architecture when you start. And if you have that, you've already done a lot of heavy lifting before caching becomes your only option.
> How do you teach people to effectively use a profiler on code with pernicious caching?
Profile the function that is called if object is not found in cache ? Have pluggable cache with plugin that just doesn't cache? That is not hard problem to solve.
> They get really uncomfortable six months later when I'm reporting 2-3x in code that can't be optimized further. And in each case everything was made more difficult by the people ahead of me chumming the waters
Well, they delivered (I'm guessing) 10x improvement in a month by adding cache. Caching is entirely reasonable first step and it can be good enough to be last one.
And frankly ability to use profiler and knowing some architectural optimizations seems to be pretty rare skill.
> Profile the function that is called if object is not found in cache ? Have pluggable cache with plugin that just doesn't cache? That is not hard problem to solve.
The more the dynamic behavior of the profiled system and the real system deviate, the harder it is to isolate what's causing the slowdowns.
> Well, they delivered (I'm guessing) 10x improvement in a month by adding cache.
Yes, but by doing this they also tremendously increased the complexity of the project, and (unless the team has someone like GP in it) prevented further optimizations. It may still be a good trade-off, but the overall sorry state of pretty much all software everywhere suggests it may not be a good one usually.
> And frankly ability to use profiler and knowing some architectural optimizations seems to be pretty rare skill.
That's kind of reinforcing the parent's point, though.
>> Profile the function that is called if object is not found in cache ? Have pluggable cache with plugin that just doesn't cache? That is not hard problem to solve.
>The more the dynamic behavior of the profiled system and the real system deviate, the harder it is to isolate what's causing the slowdowns.
Whose quote you repeat without thinking ? It's not that relevant if you know how profiling works...
>> Well, they delivered (I'm guessing) 10x improvement in a month by adding cache.
>Yes, but by doing this they also tremendously increased the complexity of the project
Not really in most cases its pretty simple to add it, and if you profiled correctly you'd start with only adding extra code in hottest part of the app
> and (unless the team has someone like GP in it) prevented further optimizations.
Uh, just no. It might've fixed problem well enough to not be touched but it ain't stopping anything, just a different solution.
> It may still be a good trade-off, but the overall sorry state of pretty much all software everywhere suggests it may not be a good one usually.
The assumption here that the same developers that fucked up caching would somehow manage to achieve same speedup with "just" optimizations and less bugs is bold one.
Hell, many low level optimizations are also basically "using cache" (but CPU one) well, althought compiler usually gets you out of most egregious mistakes (unless you decide to write multithreaded C/C++ carelessly I guess.
>> And frankly ability to use profiler and knowing some architectural optimizations seems to be pretty rare skill.
> That's kind of reinforcing the parent's point, though.
What point exactly ? That average developer is bad at their job ? Poster above explicitly said that the worse developers claimed it can't be optimized more in the first place, what else you expect from them ? Cache was the most cost-efficient solution most likely.
> The more the dynamic behavior of the profiled system and the real system deviate
Maybe. I've never found it a problem, but maybe. When I've added a cache, the cache comes with instrumentation for exactly that.
> Yes, but by doing this they also tremendously increased the complexity of the project
It depends. EG if you don't need to worry too much about cache invalidation, adding caching can be as simple as adding a single function call eg like python's lru_cache. It depends on the use case.
> Profile the function that is called if object is not found in cache ? Have pluggable cache with plugin that just doesn't cache? That is not hard problem to solve
Okay, now I have a profile that says 90% of the time is spent in the code the cache avoids. Now what? Narrator's note: When the cache was put in place it accounted for 40% of run time, but now people use it because it's 'free' and so the code has grown from looking up the data twice to looking it up five to seven times depending on conditional logic.
> Well, they delivered (I'm guessing) 10x improvement in a month by adding cache.
No, they didn't. That's the problem. No cache gets you 10x. 10x for one step of a 20 step process is a positive improvement but rarely more than 2x for the first one, 4x at the outside. If it does more that that then someone is being titanically stupid. From then on each cache is going to net you less, and less, and less. Only if someone measured wrong or you are blocked by some architecture change that's needed so that caches even make sense, will you see a bigger improvement this year than you did last.
But the pathology I'm discussing above didn't even have anything at all to do with caching. It's the psychology of (ir)rationalizing around performance, the assertion that you've done everything that can be done because you've done everything you know to do. It's a bizarre love child of learned helplessness and ego self defense. It's gotten way worse in a world where people reach for caches first (and thus, I assert, last), but it was always there.
I have a couple coworkers who are trying to be clever with a new cache architecture. Irritatingly by taking a statement of fact made by me 3+ years ago and jumping to the wrong conclusion. They promised 50% in 18 months (I was suggesting we target 90% in three years). Now we are somewhere past 30 months and so far they've delivered lower numbers than I did in 5 man weeks spread out from Christmas to mid March, while juggling other priorities and suffering from Seasonal Affective Disorder. Another coworker, usually cache averse, was making noises about adding a cache for some other pain point, and I started pulling on the thread for a bit until he was satisfied. It'll be next March before they surpass the 20% I turned in with about 8 modest changes. And the architectural improvement from both efforts will be roughly on a par. Top down is sexy, bottom up is not. But it's hard to do the former without the latter. Eat your vegetables.
>> Well, they delivered (I'm guessing) 10x improvement in a month by adding cache
>No, they didn't. That's the problem. No cache gets you 10x.
I was actually being conservative. We had almost 100x increase in some cases.
It was pretty particular app, site took few hundred (IIRC closer to second mark) ms to generate before we started at all.
First it was simple Varnish cache but that had problems, some components never changed, other were updated pretty often (stuff like "current users here")
First it was split to various components, each with different TTLs, then all of that was merged in Varnish ESI.
So you could say invalidate a part of the site and varnish would
* load this component from app server
* load the template it contains from its own cache
* render it.
So
* no refresh would be just direct from cache, single digit ms or lower latency
* refreshing one part of site would have latency of rendering only that tiny part of site on app server
later we also updated that with "grace" feature of Varnish where you could define time interval where cache would be valid but generate request to the backend to refresh
So if said low TTL component was refreshed often
* the component will always go from cache with lowest latency
* if it was on its last seconds varnish would send the request in the background and refresh cache
Now the app itself wasn't very well written in the first place (there is a book of horror stories to write about it) but they (we did ops and testing, they did the app stuff) spent less time than "fixing it properly" and instead rewrote it to a bit more modern architecture. They still use same cache to cache JSON responses to API endpoints tho.
That not even to say how much of DB performance is actually clever caching, after all index is just pre-generated partial cache...
But sure, some apps are better for it than other, 10x is definitely not universal.
> From then on each cache is going to net you less, and less, and less. Only if someone measured wrong or you are blocked by some architecture change that's needed so that caches even make sense, will you see a bigger improvement this year than you did last.
Well, doh, caching something you cached already ain't gonna be that effective. But if you see same request with same done 100 times in a minute, making it take 20ms instead of 100m ain't gonna be as effective as making it take 100us to get from cache. You might want to address it eventually to have lower jitter, but same cache can allow you to make 20 other methods faster on average and that is less code and effort than optimizing 20 other functions..
> I have a couple coworkers who are trying to be clever with a new cache architecture. Irritatingly by taking a statement of fact made by me 3+ years ago and jumping to the wrong conclusion. They promised 50% in 18 months (I was suggesting we target 90% in three years). Now we are somewhere past 30 months and so far they've delivered lower numbers than I did in 5 man weeks spread out from Christmas to mid March, while juggling other priorities and suffering from Seasonal Affective Disorder.
And do you think they would be able to instead go and optimize it better in that time ? That just sounds like incompetent co-worker and not anything to do with cache...
That guy just keeps repeating "it's hard" then 'explaining' why in obtuse terms and in ways that don't stand up (I don't think he understands database behaviour). Frankly I don't think he has much experience in this area. I'm going to give up now, he's spent a fair bit of your and my time and still can't accept he's not right.
I've done much better. Think sticking a a cache in front of a database with seriously complex queries. Yes you need to be be aware of staleness, no it's not a magic wand, but it delivered massive improvements.
> When the cache was put in place it accounted for 40% of run time
If looking shit up in a hash table, at a few hundred or thousand cpu ticks, cost 40% of your runtime then I smell PEBKAC[1], not cache problems.
What makes caching one of the hardest parts of computing is this assertion that it's easy. Everything is easy when you oversimplify it. Some things explode in complexity when you try to force them to be simple.
> Some things explode in complexity when you try to force them to be simple.
do you mean to say iow "simple in the small but when assembled (in the large) is more complex" whereas what we should be aiming for is perhaps "a little more complex in the small but with the benefit that it enables a simpler (and probably more scalable) architecture once assembled (in the large)"??
I kinda feel that your convincing counterargument is just ‘I have a different experience’.
I’ve never seen people reach for caches first, and then discard all further optimizations.
Or rather, I’ve seen them do that, but not out of some misplaced feeling that it was the only solution, just the one that was easiest to accomplish in very little time.
It's dificult to refute your argument because I don't understand your points:
> Because of the locality aspect, people will assume that it's 'free' to look up a value multiple times instead of passing these data dependencies between the relevant methods
what does that even mean? What does 'data dependencies' mean specicially?
> then you can end up evicting data in the middle of a transaction
So in the middle of a transaction, instead of looking up data from a key - very cheap - you should do ... what? Recalculate it instead?
> The less obvious consequence is now you have concurrency bugs, because the code assumes that decisions made about the data will go the same way on each call, but due to eviction, two halves of the transaction may make conflicting decisions with undefined behavior
Assuming I even understand this, you are saying where there's concurrencty there's a risk of pulling stale data out of a cache. Well, true. That's just something you have to be careful of. Add a Tx reference to the cache key.
> Once you introduce caching you poison the performance well.
Not my experience at all. I can't understand why you're saying this.
> Flame charts lie when cache is used
How so?
> and after people rely on the cache it is lying in the other direction when the cache is off, because you're calling that path 5 times now and it's swamping other bottlenecks
I've no idea what this means
> This is what I really mean when I say it's a weapon of last resort. Because once you use it, it attempts to take any other options off the table
Heh! Absolutely not! Other optimisations are orthogonal to caching IME:
Better algo - orthogonal
Go lower level - orthogonal
Go multicore - orthogonal
Noobish SQL query is begging for a rewrite - orthogonal
Use L1/2/3/ chip cache better, or RAM, by careful data layout - orthogonal
Use a better library - orthogonal
Use better data structure eg. than linked list - orthogonal
>> Because of the locality aspect, people will assume that it's 'free' to look up a value multiple times instead of passing these data dependencies between the relevant methods
> what does that even mean? What does 'data dependencies' mean specicially?
It means that instead of having a method like `entryPoint(int)` that looks up a `Foo` by id and then passes the `Foo` around (`verify(Foo)`, `Foo update(Foo)`) you instead wind up with a system that passes around an `int` (`verify(int)`, `/* Unused */ Foo update(int)`) and each layer then looks up `Foo` in the `FooService` (because `FooService` has a cache and subsequent lookups should be essentially-free).
Even worse when `FooService` is abstracted behind a thread-local like `ThreadLocal<Foo> currentFoo`, the implementation of which is `fooService.findById(currentFooId.get()) /* cheap, because fooService has a cache */`.
My own take on this is that caches are global-by-default when you really want local-by-default (e. g. dynamic programming instead of a `static final Cache theCacheForTheEntireProgram`). Local-by-default caches are easy to make transactional (when the call stack returns the cache is gone) and don't grow in an unbounded manner in the general case.
Local by default also has the feature/problem of making you look at the scale of the data involved in your query up front, before it's having a 100-1000 req/s hitting it in production. It's harder to pretend this large dependency graph will just magically take care of itself (or worse, be someone else's problem after you've declared victory) when it's staring you in the face.
In my extensive experience of optimizing stuff it is right.
I cannot even count how many issues were created downstream of the decision to start caching X, Y, or Z. At a point, every other software engineer gets the bright idea to cache their special little function or request or database call. Soon enough, there are N layers of cache responsible for delivering some special piece of data. The engineers build tooling to clear those caches on updates, either manually or automatically. Both suffer from issues, the latter occasionally failing to perform and just causing general confusion about why stale data is returning, and the former being far worse, recruiting dozens of business stakeholders into the problem of knowing-how-to-expire-cache. They start making their own internal wiki pages with steps or various cache tools to clear, often incorrect steps because they got lucky one day by clearing cache X and not cache Y that they did not know about.
Overall, in my experience, allowing software engineers to do caching as they wish without any real guardrails has caused massive problems most likely on the order of N!. I cannot roll my eyes far enough into the back of my head when I hear someone suggest caching as a solution any longer, at least not on an application that has more than a dozen business stakeholders managing data and expecting it to be updateable in any reasonable way so that customers see the most recent data when the business wants it.
I have one particularly baroque data path that I think passes through three layers of cache. The same people worked on it the whole time, and I'm 90% sure that at least one of those layers is doing absolutely nothing, except making the code harder to read. Once they're added nobody takes them out. Like all global state it's problematic to be absolutely sure it's not being used anywhere.
Caching can turn O(n²) algorithms into much faster algorithms. I would point to how transformative tabling is in Prolog, not to mention memoization rescuing the calculation of fibonacci numbers by recursion, which otherwise would be gross malpractice.
Caching is fine is there's no state involved. e.g: pure functions like the fibonacci sequence.
As soon as your cache depends on a mutable state, things get really hairy. Cache invalidation is one of the 2 hard problems in computer science, after all.
If it's really no state it should be a compilation step.
After watching some coworkers play keystone cops for 5 years I'm preparing to do just that to the last project they were responsible for. Their new project should have been that as well, but no it's going to be cache.
This gives me flashbacks to Intel's great disaster of Itanium which seems so ridiculous in retrospect but I guess very few people understood superscalar processors in the 1990s.
I did some serious thinking about processors that use the transport triggered architecture
which are a simple way to make a custom CPU (for the first time I thought, "I can make a custom CPU!") but do operations in lockstep in such a way that the whole thing blocks if a memory operation stalls. You might be able to get away with however because the TTA is meant to be codesigned with the code it runs on.
Itanium was smarter than a TTA but it was predicated on the idea that a compiler could know how operations can be interleaved at run time. The compiler has no visibility into what gets cached, but at runtime the processor can opportunistically run operations that aren't stalled. For a general-purpose processor that's supposed to run general code, conventional superscalar processors do better than VLIW processors that require compilers to guess ahead of time what is impossible to guess.
VLIW works _really_ well on a deterministic ISA. As in when you know how long a load is going to take at compile time because you know it's N clock cycles from that type of memory.
Otherwise I really like the GPU derivative form where N instructions are ready to go and the hardware picks one where the prerequisite memory access has completed, despite that being variable.
> Otherwise I really like the GPU derivative form where N instructions are ready to go and the hardware picks one where the prerequisite memory access has completed, despite that being variable
Hmm, isn't that OOO (out of order) execution? And isn't that something GPUs explicitly don't do? (AFAIUI)
I'm probably describing it poorly. A GPU (at least the AMD ones) has N execution units (of a few different types) and M threads (wavefronts) waiting to be scheduled onto them. Each cycle (ish), hardware picks one of the M threads, runs it for an instruction, suspends it again. Memory stalls / waits are handled by not scheduling the corresponding thread until it's ready. The operation might be a CAS over pcie so it's really variable in latency.
Provided you have enough (ideally independent) work item / task / coroutine things available to enqueue, and you're designing for throughput instead of latency, that keeps the execution units or memory bandwidth fully occupied. Thus win.
The x64 style model is deep pipeline + branch predict / speculation and it all goes slow when the guess fails. In contrast the GPU doesn't have to worry about guessing a branch destination incorrectly and unwinding wasted work. It does have to worry about having enough work queued up, and distribution of tasks across the compute unit / symmetric multiprocessor units, and the hardware-picks-the-order leaks into the concurrency model a bit.
Wouldn’t this whole problem be solved by a graph-based architecture, used already by some compilers as an intermediate? Then the CPU sees where and when everything is needed and schedule the loading of registers as optimally as it is feasible, at runtime.
These are very well-behaved instances of caching. They are so well-behaved that they can be fully described with a little bit of analysis. Once I/O is involved or the load isn't well-known beforehand, it becomes much more difficult. Lazy I/O in Haskell is a very divisive topic for a reason.
Memoization is caching and a form of dynamic programming; thus caching is sometimes an important tool in dynamic programming.
But the fact that caching can be useful, even essential if your scenario demands such a clear-cut space-time tradeoff as in memoization, seems beside your point.
Well, usually memoization gets you polynomial time from exponential time with the space tradeoff. It's unusual to start with `O(n^2)` and get a better DP solution.
If I had to pick the two worst sources of bugs I can imagine, it would be (1) concurrency and (2) caches. Those functions often require extensive performance and load testing and have tons of non-deterministic failure modes.
I like applying the Circuit Breaker approach to caching. Rather than applying a cache everywhere, setup a cache wrapper that only triggers if the accessed resource is slower than a certain threshold.
The goal is for the cache to go unused most of the time, but during times of unexpected system stress the caches will kick in, provide relief to the system and reduce impact to users.
In theory, the best way to manage this situation may be with a grayzone - if all is well, then there's only a (say) 5% chance of going to cache. (To at least catch "the cache is broken" errors sooner.) As things get worse, that chance goes up as you cross-fade to mostly using the cache.
Important Warning: If you do not have a very good understanding of your overall system, or neglect to carefully dampen the cross-fade functions - then the whole thing can demonstrate fast and unpredictable swings in behavior. Note that being subpoenaed to testify in a "Why did the Crater City chemical plant blow up?" investigation is usually bad for one's career.
That's better but composition is always the devil in the details. Which might be what you mean when you say:
> or neglect to carefully dampen the cross-fade functions
People with a 'caching problem' don't have one cache, they have a bunch of them. So that 1 in 20 chance of getting grey zoned on one cache, is 1 in 400 chance of getting grey zoned on two caches, and a 1 in 8000 chance to get grey zoned on 3. That is a very small fraction but when you're dealing with the Law of Large Numbers that's something that's happening at least once a second for many services.
But if you're using this you probably have a lot more than 3 caches, and then we have to talk about the odds that any request gets grey zoned at least once. which is going to be 1 - (19/20)^n. At six caches you're already over 1 in 4, which is going to potentially change your mean response time quite a bit.
> "...which is going to potentially change your mean response time quite a bit."
My experience is that the changing mean response time isn't nearly as bad as the changing standard deviation.
The really big problems tend to arise when you have unsuspected or under-appreciated positive feedback loops - whether in the software, or in associated physical systems and human behaviors. Or resonance effects. Those situations are where good dampening gets really important.
Depends on how large the cache is and how much users expect instant updates to their data. If real-time or near real-time is the expectation, then avoiding the cache usage unless necessary preserves their experience without incurring the hurdles of cache invalidation that grow more complex the more you cache.
EDIT:
Another area where this approach is beneficial is with a multi-tenant system where larger tenants querying more data could have a much more significant performance impact than other tenants on the system.
Time of day/week/month usage spikes can also cause a performance degrade for everyone even if it's not typically stressing the system.
This approach is just a mitigation to prevent the system from getting overload during those times. It's not the right approach for every situation. Just a spin on the Circuit Breaker Pattern.
I've always been a bit of a fan of debounce for this but many of the obvious implementations are cache based which is essentially inviting people to do the thing I just said I hate.
The cost of counting upvotes or logged in users is not worth the incremental value it affords the user. These are examples of situations where 'eventual consistency' can be embarrassingly easy. If you split the source of truth (SoT) from the system of record (SoR), then the cost of updating the 'truth' can be amortized across the writes to the SoR. The reads do a straight lookup from the SoT, which was the truth at some point but is now the fiction everyone is agreeing to. The distinction is that everyone agrees to the same source of truth. Nobody ever goes asking the SoR for a second opinion, except the SoR, which updates the SoT when appropriate. The SoT is dumb, the clients are dumber, only the code that manages the SoR has any smarts to it.
Read traffic can escalate pretty high in such a system without much stress.
Exactly. Another situation is a multitenant system where some tenants are much larger than others, triggering larger queries which potentially have a larger performance impact.
I'm beginning to wonder whether my personal moniker of "signal propagation" is nominatively equivalent to what others call "cache invalidation".
I.e. you cannot by definition have a state change until the signal thereof has traversed a link from A to B. No signal, no change. If you have a cache for one system that delays writes to a backing data store referenced by other systems that are unaware of that cache, then you have necessarily a partition problem.
Every problem in a distributed computing context can generally be reduced to a signal propagation problem.
It's only a problem if you're breaking deadlines. Like having someone's status in communication app update with 10s delay is probably not a a problem, so as long as you don't ever cache it for longer than that, it is entirely fine.
> If you have a cache for one system that delays writes to a backing data store referenced by other systems that are unaware of that cache, then you have necessarily a partition problem.
In most cases people mean caching reads so the problem here is stale reads, not delayed writes. It's not necessarily the same, some component caching (or essentially, queuing) write means that once the write is done, all readers (assuming they don't have any extra cache) see it at once, while with caching reads (and no invalidation), multiple readers can have different view
The one that really jams me (and others) up is that signal propagation and transaction lifetimes line up badly. Caches, being cheap imitations of ACID 'compliant' databases, which themselves cheat all day every day to make pretty numbers, almost never have read repeatable semantics, so if your code splits responsibility for fetching a piece of data you have one of the problems I complained about at/near the top level: Half of a decision can be made with one version of a piece of data and the other half with another, resulting in undefined behavior. In a system that asserts that A = !B you can have an interaction that did some of the activities for A and some of the ones for B (because we read A and A' and assumed they were identical).
You're making a lot of assertions without any backup, and they don't make much sense.
> Caches, being cheap imitations of ACID 'compliant' databases
No they're not, not remotely.
> which themselves cheat all day every day
I don't believe a word of this, have you any actual evidence (other than I believe Oracle's serialisable is actually snapshot isolation IIRC)
> Half of a decision can be made with one version of a piece of data and the other half with another, resulting in undefined behavior. In a system that asserts that A = !B you can have an interaction that did some of the activities for A and some of the ones for B
And this just makes no sense, you're claiming that databases are not producing the results they should, if that's the case then I don't believe you understand transaction isolation levels or indeed how database queries work, in which case you have problems other than and larger than caching.
> That's with everything that have any level of complexity, just see how many people got transaction isolation wrong just because they had some simplistic view on how it works in database they use.
Any database without MVCC works quite a bit differently than people think it does. And ones with MVCC run into the ceiling if you assume they work like databases that aren't based on MVCC. Like leaving long transactions running in the background.
The problem is, like caching, difficult to catch in a petri dish.
> This whole thread s just getting weirder.
Maybe you should take a break.
If you assume your database isn't doing much for you, you
> If real-time or near real-time is the expectation, then avoiding the cache usage
Because looking up a precalculated result in a hash table is so much slower than not looking it up? And what, recalculating it instead is faster than an O(1) hash table lookup? If so,why are you caching?
I am just getting more confused by what people are saying here.
I do think you've hit on a point of confusion in the parent, but...
In a (hard) real-time system the only thing that matters is the worst-case execution time -- if your worst-case time fits in the allowance you win, if it doesn't you lose.
Adding a cache improves the average case (typically), but hurts the worst case (always) -- your worst case goes from "do expensive operation" to "do lookup in cache, fail, do expensive operation." Even if you parallelize misses ("do lookup in cache | do expensive operation, on cache success cancel expensive operation, on cache miss wait for expensive operation" you're generally adding at least a few operations in the critical path.
> I am just getting more confused by what people are saying here.
Maybe the takeaway is that this shit is hard and everyone is a little bit confused in their own way.
I'm not saying the only way to win is not to play, but I can neither confirm nor deny rumors that I may have at some point muttered this under my breath.
You can queue cache refresh in the background if the object in cache is close to the "limit" of how fresh you want it to be. Need a bit of cleverness to avoid thundering herd problem but otherwise works pretty well.
Also if you really want near-real time you might want to actually push the updates directly to cache instead of relying client to trigger cache filling. cases where you need a ton of data and it served real-time and it's same object that constantly changes (and not say pointer to a position of video stream) are pretty rare
My problem with the term 'cache' here is that people have expectations about what cache is and when you violate them you tend to discover that they've returned the favor.
> push the updates directly to cache instead of relying client to trigger cache filling
You can use cache software to fulfill such a role but it's not exactly a cache anymore, and keeping people from treating it like one is some kind of cat herding.
You're using it as a data store at this point, which for one has very different eviction semantics - exactly 1 copy of each concept, no more, no less. You can store such things in Consul, for instance, and get very low stale read latency. The cache implementation scales higher but breaks sooner. Pick your poison.
Well, it depends on your data model. Like, if you write only a little (compared to reads), write-thru model of cache (write to cache as well as to backing store) will lose you tiny % of hit ratio at profit of never having to worry about stale data.
> My problem with the term 'cache' here is that people have expectations about what cache is and when you violate them you tend to discover that they've returned the favor.
That's with everything that have any level of complexity, just see how many people got transaction isolation wrong just because they had some simplistic view on how it works in database they use.
Everything nondeterministic adds difficulty to the Scientific Method. Maybe we'll have better ways to filter the noise in the future, I certainly hope so, but that's not where we are now. But this is a little whataboutism.
Caches in the CPU or anywhere else are a leaky abstraction, and sometimes you learn more about a problem by having the abstraction, sometimes you learn more by not having it. Part of being on a growing project is that you get more opportunities to see both sides of a transition and learn from it.
When you have coworkers who clearly have only learned from one half of a timeline and have a great deal to learn about the other, wanting to put the brakes on and have them learn some discipline first is not a crime. Or rather it is, but it shouldn't be. It's like a weird version of Fake it Til You Make it that is verging on fraud. We'll keep up the fakery until someone notices and then we'll get a new job and reset the clock.
> Instead, it's pointing out a risk that sometimes they are too good, leading to a system that can't function when the normal level of locality and memoization are not available.
And at that point cache management starts to be a huge problem. Clearing caches becomes a death spiral because your system can't actually support your normal load.
For example I worked with a large retailer, who had a giant monolithic enterprise Java ecommerce system. It was so slow that product detail pages took upwards of 30 seconds to render. There was a cache in front of that system (Ehcache), then a page-level cache (Varnish) in front of that, then finally Akamai took 98% of the load.
Rollouts were terrifying, you cleared the cache and the site might come back up. Changing Akamai configuration was also similarly risky.
It's critical that systems don't become cache dependent for load, they should be a performance boost, not a required component.
There are only two hard things in Computer Science: cache invalidation, and naming things, and off-by-one errors.
What people don't realize is caches are much harder than just invalidation.
I'm a fan of continually testing (and running) things without cache at all just to verify what they do, caching can often cover a multitude of sins that you really don't need to be doing.
Caches are global shared state. Correction: lossy global shared state.
'Naming things' when they are in a global namespace is really hard, to the point where I'm starting to thing that cache invalidation and naming things are not two problems but the same problem in two manifestations.
Not necessarily. I designed an object oriented simulation system which had a lot of calculated properties where the calculation was a substantial amount of work but the result would be queried many times before the input the function changed. Caching the results inside the object provided a substantial speed up (orders of magnitude) while keeping the cache completely local and hidden from the client code and keeping the program very modular.
Keeping the cache inside the objects allowed us to keep the parts of the program neatly separated and minimized the amount of knowledge that client code, and more importantly client coders, needed to have about the objects they were using.
What happens when the system of record changes the answer halfway through the transaction?
By caching answers to calculations you've moved the source of truth from the function to the cache. But the cache is also acting as a proxy for the system of record, which is a conflict of interests. That's substantially part of the problem with cache invalidation.
ETA: There's also a common category error here that I see with people who grossly misunderstand DP as 'caching'. First that they think that caching and memoization are the same thing, which they emphatically are not, and second that DP is about memoization, which means they listened for the first ten minutes and then fell asleep.
Tabulation is much more powerful than memoization, so equating DP with memoization is impoverishing it and yourself.
But importantly here, the distinction between memoization and caching is that memoization is scoped while caching is global. With memoization a single call tree gets the same answers every time, which means the mental process of Local Reasoning is maintained. If you know the call graph for this operation you know everything that interacts with it. You know if 'eviction' is even a thing. You also know that your unit tests are testing the code as it will behave under load, not some fictional happy path.
tl;dr: If your calculation is important it should show up in your data flow, not arrive out of band from a magic oracle.
If we're splitting hairs, your situation sounds more like memoization, where you have saved the output of an expensive but deterministic calculation. With the correct assumptions, this can never result in incorrect behavior. Worst case is excessive memory consumption. Whereas a stale cache is something that needs to be taken into account.
Multiple times I have found the impact of cache performance to be dramatic. I find that the usual optimisations in three aspects are usually obvious to achieve and can bring surprisingly big improvements.
The first is that RAM accessed randomly or in the wrong direction performs very poorly compared to sequential access in order. The same is true of cache because its much easier to predict and prefetched data when its accessed serially. The effect is so dramatic that its sometimes worth "accessing" data you don't need to get access to this hardware optimisation to maintain serial access. At times for example I have found Treemaps and listmaps perform better than hashmaps as the cheap search serially and in memory order surprisingly out does the O(log(n)) or O(n) algorithms and the cost of the hash.
If you have a data structure that is larger than optimal and overflows cache lines it can have big impacts on cache utilisation and performance if the algorithm doesn't need all the fields. Object orientation impacts quite heavily on this and certain designs can make very large objects and these often get poor cache usage.
Unnecessary allocation and deallocation of memory in algorithms often results in the cache getting thrashed and avoiding this and using the same memory can result in better hot cache usage and reducing garbage collection usage.
Most profilers can give you a good idea of the allocations but the other two are often harder to find and mostly just show up in cache misses.
Artickle: speed up lead to avoiding cache misses which lead to further speed up which lead to further less cache misses
This kind of interaction is important to keep in mind.
It applies to all kinds of caches in all kinds of situations. To some degree even the instruction and data caches of modern cpus.
But most interesting is that in complex systems in (very very) rare cases the "whole system effect" can go the other way around too (due to a now faster components trashing the cache for other components in a way which makes the system as a whole slower).
Caching db queries and media assets already got a big boost a decade ago going from spinning rust to ssd to even faster NVME.
Optimizing DB queries on spinning rust was an exciting experience though. Chasing milliseconds with caching and warming caches up still brings back good memories :)
Scale and architecture makes new designs much easier. 1Mb of code... A 100Mb database and 1Gb of media assets is becoming a relic like "big iron". Blank slate canvases should skip a lot of it.
Technology debt only gets harder to manage over time.
This happens in life, too. When your counterparts lose continuity with your work, they turn to other work and the context-switch to return to working with you becomes increasingly hard.
I recently went through flat remodeling which took a month longer than it should and resulted in lower quality, because at the very beginning, I couldn't supply enough work for the contractors (we had to rush into it because ${life reasons}, so not everything was specced out when the work started), so the initial team of workers ended up taking better jobs abroad, and our general contractor had to find people to fill in.
I wonder if he mentions running from a different directory and running as a different user to highlight how annoying the effect from the environment is, as USER and PWD are part of the environment.
If thats true, it seems like you could make a great optimization pass which chooses a code layout in order to make the program run faster. I wonder how predictable code layout is on performance across different CPUs.
There's a similarity in the performance of caches and ML algorithms -- they can have some theoretical absolute performance (or worst case scenario) but whether or not that falls into harmony or discord with the real world use is another question altogether.
Statistics may give you reasons for your faith, but you have to leap on to the next workload regardless.
When thinking about cache, instead of thinking about cache misses directly and thinking about time based optimization trying to improve the temporal locality, it is often easier and more fruitful to consider "working set size" - how much memory the code / loop / function / algorithm is touching at a time.
So.... they could have mostly fixed it by increasing the cache timeout?
No code changes, just increase the timeout. The real lesson is to really know your system and how it behaves. They ran a profiler, but in the call stack they did not see the computation was triggered by a cache miss...
About 15 years ago I was on a team that was having so many problems with caching that I began to see it for the liability that it is. It started with a 2-4 week delay each release cycle while they fixed caching bugs the testers had found, but once I joined the project I saw how deep the problem went.
A number of their features invited a full table scan, which is fundamentally incompatible with an LRU cache, and in particular has disastrous stair step functions as the working set grows. Note that this is also a couple of years before anyone had heard of memcached so these were DIY implementations with multiple potential sources of bugs.
About the time I found my feet on the project someone wanted to add caching to yet another workflow that was taking 30 seconds to run. It was an admin workflow so there was a little tolerance of slow response times, but not 30 seconds. I volunteered for the story because I thought we could do better than more caches.
Through conversation, step debugging, and profiling I found the spot where they had wanted the caching. What the feature was doing was two relatively expensive lookups of the same data, filtering it on two very different dimensions, and returning the intersection of the two. In typical ORM fashion, both of the concerns were taking it upon themselves to fetch their own data instead of having it handed to them. Since they were both called from the same function, I hoisted the lookup out of the callee into the caller, sending the same list to both. Then I changed the code again to feed the result of call 1 to call 2, to make it work more like consecutive filters.
The end result was that the page now loaded in 3 seconds, and I was able to write better unit tests for this code. One might expect that most of that improvement came from running the second filter on a much smaller list, but that's where you'd be wrong. The majority of that improvement came from the hoisting, not the chaining. Due to the amount of work in the filters, the initial lookup shouldn't have accounted for more than a third of the page load time, but just sharing the same data dropped the response time to around 6 seconds. Chaining the filters ended up only saving 3 seconds.
The problem with large data sets is that CPUs have caches too. And in a garbage collected programming language, there are additional costs to carrying around duplicated data, particularly if it is in tree form like this was. Most likely we were smashing the hell out of CPU caches and introducing extra GC pressure to boot. By passing in the same object tree the second call followed warm and/or hot paths through heap memory.
I'd had a similar but less dramatic experiences a few years before that, where I identified that a function, which accounted for 10% of the runtime, was being invoked roughly twice as often as the problem required. By that point 5% looked like low hanging fruit, so I eliminated the duplicate calls and reran the benchmarks. But the results didn't make sense so I tweaked and prodded and fiddled. The new time was 20% lower than the old time, for what should have been 5%. This is already way too long so I won't go into the reasons here, but in addition to the scenarios I mentioned for the previous issue, there's also problems of timing accuracy in profilers, and cyclic use patterns that can misattribute the expense of a call to siblings or the parent. A function that creates many objects may never reach the GC threshold, but a large allocation later in the sequence diagram may. Or collide with cache lines that are needed again later in the call graph.
I feel like the moral of this story is less "caching is a liability" and more garbage in garbage out. Caching can help things when used appropriately, but its not magic.
Your performance topology is dictated by multiple hierarchies of caches, many of which you might not be aware of. If they disappear your application may no longer function. The cache is no longer and optimization but a critical component. Figuring out how to fill it might be harder than how to invalidate it.
> Figuring out how to fill it might be harder than how to invalidate it.
This always seems to be a problem for someone else or for another day, which is part of the frustration.
Cold caches killing your app are a considerable problem, doubly so if you deploy in the Cloud because now you aren't necessarily guaranteed to be deploying into a hot data center. I think there's an unspoken assumption we haven't really eliminated that was if you came to work and the power was out, you not only expected none of the software to work, but you peer pressured other people into not complaining too loudly about it when the power came back on. Today is a loss. Everyone can see it.
But for instance if I work on an SaaS project in the Cloud, not only can I just lose a region, but for many problem domains my customers have mostly local customers, then I have diurnal cycles of traffic that are customer specific. All the non-bot traffic is in New York until 6am GMT, say.I might want to spin regions down pretty aggressively during parts of the day, but if spinning back up results in a thundering herd problem because 'the cache makes us fast'? Well then the value of my circuit breaker alerts drops to less than zero.
That's just the first time things break. Caches and capacity planning are also at odds, because capacity planning is about planning your worst case and caches are almost always about the happy path.
Caching adds leverage and therefore risk. That risk, in particular the "thundering herd problem", is a special case of reducing the system from stable to meta-stable. People think it's stable but it's actually meta-stable, and cache flushes are where you actually push the system hard enough to find out where your stationary points really are.
> These metastable failures have caused widespread outages at large internet companies, lasting from minutes to hours. Paradoxically, the root cause of these failures is often features that improve the efficiency or reliability of the system.
> Caching can also make architectures vulnerable to sustained
outages, especially look-aside caching. [...] If cache contents are lost in the vulnerable state, the database
will be pushed into an overloaded state with elevated latency.
Unfortunately, the cache will remain empty since the web
application is responsible for populating the cache, but its
timeout will cause all queries to be considered as failed. Now
the system is trapped in the metastable failure state: the
low cache hit rate leads to slow database responses, which
prevents filling the cache.
Moreover, in my experience, caching as reached for by grandparent comment is done reflexively rather than a true investigation about the nature of the temporal, spatial or other locality actually present.
Indeed, whether data is cacheable actually fits within a set of constraints than is smaller than typically considered, two dimensions of which are typically out of the control of the practitioner:
1. Whether the data exhibits sufficient temporal or spatial locality (at the place that it is accessed [1]) to facilitate caching,
2. Whether the read consistency can be sufficiently relaxed by policy (such as TTL vs. read consistency), or else whether the writes can be replicated to the caches in such a way as to achieve invalidation in sufficiently low latency in a lossless way, and
3. Whether the size of the cache that is required to meet the required hit rate and TTL/eviction goals is feasible in the system.
If your data is so small that a meaningful fraction can fit within memory, and exhibits good temporal locality so that there are "hot keys", and the TTL that you impose gets you the hit rate you actually need while being within your policy requirements for read consistency? OK, that can be a good fit for caching. But those are also empirical questions which very much are NOT obvious beforehand and certainly not reflexively as "just throw a cache at it", and also need to be scrutinized especially with long TTLs as above for the metastability reasons enumerated above.
[1] Note that with modern horizontally scalable systems with round robin load balancing, this means that you actually need roughly X times as much locality if you're using in-memory caching rather than network attached like Redis or memcached, or else you also require sharding - X being the number of k8s pods or heroku dynos or ec2 instances or what have you. So even though your global temporal locality is maybe high, this might evaporate as your spread the load across random pods. Having a system that is "horizontally scalable" but the cache hit rate that your system depends on to be warm becomes smaller and smaller as you scale up has ... interesting consequences, reminiscent of the multi-master scalability problems from scaling relational databases. It scales to a point, but probably not farther.
Silver bullets kill everyone, not just werewolves.
Garbage In, Garbage Out is not wrong per se, but we have in the real world the concept of an attractive nuisance, where the victim is not entirely at fault for their injuries. The thing, being prone to invite misuse and misadventure, bears some responsibility for protecting people from themselves.
Caching is an attractive nuisance. These days it's simply replaced Global Shared State which used to be the boogeyman, but sadly we haven't collectively clued in on them being equivalent.
In this situation it was stuff that would be terrible both with and without caching, that didn't have much to do with caching itself, just the caching just provided a bit of a bandaid. I think the more apt metaphor might be "seatbelts make people drive recklessly".
Would fast computers also be an attractive nuiscense because you dont have to think about performance as much?
> Would fast computers also be an attractive nuisance because you don't have to think about performance as much?
Sometime in the last week or two someone asserted just that. I didn't dig into that discussion very far but on the face of it? Sure. Certainly grabbing the fastest hardware available to Man is an expensive proposition and you should hold that in reserve. You don't get to play that card very often, and much less so today than during the Golden Era. Post Moore's Law for sure, but to an extend post-Dennard as well, since power draw of a facility is a significant cost center. And without Dennard you can make them more powerful but keeping them efficient gets expensive.
For most applications the optimal size of a (time-based) cache is around the point that keeping an item around in the cache is equal to the cost of accessing it again. This is a point that Jim Gray made in the classic "The 5 Minute Rule for Trading Memory for Disc Accesses" (http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf) back in 1986.
A side effect of that is that systems that run faster, and therefore access each cached it more often, get more benefit out of a cache by pushing that set point out further. Whether that happens in any given system depends a lot on the access patterns, and especially temporal locality within those access patterns, but the effect is real in a lot of different kinds of systems.
> It's common wisdom that systems with caches can fail badly when the caches are cold. When every request misses the cache, the system is overloaded, and never recovers. What I observed was the mirror image.
The bottom line is that both locality and memoization are huge contributor to the performance of systems, both positive and negative. When folks (including me) point out the risks of caches, we're not implying that caches are always bad. Instead, it's pointing out a risk that sometimes they are too good, leading to a system that can't function when the normal level of locality and memoization are not available.