> Caching is a weapon of last resort. In most cases it is the last major performance improvement
...is wrong. It is one of the very first things I reach for because it is the most effective and simplest, once you've eliminated O(n^2) behaviour etc. It's not a magic wand but nothing is.
That's not a very convincing argument: "I use it because it works", without refuting any of the reasons I presented for why it "works" but breaks your other options.
How do you teach people to effectively use a profiler on code with pernicious caching?
Part of why I'm the "Cleaner" is because I've worked with any number of people who will present "evidence" that everything that can be done has been done. Here are the metrics from a year ago when we turned on the cache. Here are the flame charts. See? Nothing to see here. They get really uncomfortable six months later when I'm reporting 2-3x in code that can't be optimized further. And in each case everything was made more difficult by the people ahead of me chumming the waters.
They were partially correct. There was in fact nothing left for them to see, so nothing that they knew to do. But "there's nothing to see here" and "there's nothing here" are two very, very different statements.
There's a somewhat tangentially related heuristic to your points, imho: when writing code and in doubt, spend CPU rather than memory; it will generally yield leaner, better software. Easier to profile, debug, and optimize. Much easier to scale in production environments too.
Code which optimizes for fast running time tends to lean heavily on memory-bloating data structures and caches. These are often poison for aggregate performance in real-world environments, and, like you said among other things, they reduce visibility into the workings of the system.
There are some implications of this stance; e.g. optimizations that reduce memory footprint are generally never wasted, while optimizations that squeeze out a few ns of runtime may have many more unexpected consequences.
The issue with caches is they are easy to apply. You can easily drop a cache in any place you load a resource without significant code changes.
I tend to think they are OK weapons of (sometimes) first resort (a rarely changing resource with heavy system impact). But you'll definitely get better results simply by rethinking and rearchitecting stuff. (Did you really need to load the database just to look up a single value?)
> You can easily drop a cache in any place you load a resource without significant code changes.
Only when the cache is brand new. The problem with people with cache addiction is that they think caches are free. So once they know a cache is there, they begin to lean on it, and lean on it. I've seen it with many different teams in a number of verticals. Sometimes while I'm there, sometimes before I arrived. It's the people, not the company.
This illusion that you can always turn it off later is dangerous. That's only true if you have an impeccable information architecture when you start. And if you have that, you've already done a lot of heavy lifting before caching becomes your only option.
> How do you teach people to effectively use a profiler on code with pernicious caching?
Profile the function that is called when the object is not found in the cache? Have a pluggable cache with a plugin that just doesn't cache? That is not a hard problem to solve.
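For what it's worth, a minimal Python sketch of that idea (all names here, `Cache`, `NoOpCache`, `expensive_load`, are hypothetical, not from any real codebase): the cache is an injected dependency, so a profiling run can swap in an implementation that never caches and the profiler sees every real call.

```python
import time
from typing import Any, Callable, Dict


class Cache:
    """Simple in-memory cache used in normal operation."""

    def __init__(self) -> None:
        self._store: Dict[Any, Any] = {}

    def get_or_compute(self, key: Any, compute: Callable[[], Any]) -> Any:
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]


class NoOpCache(Cache):
    """Drop-in replacement that never caches; swap it in while profiling."""

    def get_or_compute(self, key: Any, compute: Callable[[], Any]) -> Any:
        return compute()


def expensive_load(resource_id: int) -> str:
    time.sleep(0.01)  # stand-in for the slow lookup being cached
    return f"resource-{resource_id}"


def load_resource(resource_id: int, cache: Cache) -> str:
    return cache.get_or_compute(resource_id, lambda: expensive_load(resource_id))


if __name__ == "__main__":
    prod_cache = Cache()
    print(load_resource(42, prod_cache))   # miss: computed
    print(load_resource(42, prod_cache))   # hit: served from the cache
    print(load_resource(42, NoOpCache()))  # profiling run: always exercises the real code path
```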
> They get really uncomfortable six months later when I'm reporting 2-3x in code that can't be optimized further. And in each case everything was made more difficult by the people ahead of me chumming the waters
Well, they delivered (I'm guessing) a 10x improvement in a month by adding a cache. Caching is an entirely reasonable first step, and it can be good enough to be the last one.
And frankly, the ability to use a profiler and knowledge of architectural optimizations seem to be pretty rare skills.
> Profile the function that is called if object is not found in cache ? Have pluggable cache with plugin that just doesn't cache? That is not hard problem to solve.
The more the dynamic behavior of the profiled system and the real system deviate, the harder it is to isolate what's causing the slowdowns.
> Well, they delivered (I'm guessing) 10x improvement in a month by adding cache.
Yes, but by doing this they also tremendously increased the complexity of the project, and (unless the team has someone like GP in it) prevented further optimizations. It may still be a good trade-off, but the overall sorry state of pretty much all software everywhere suggests it usually isn't.
> And frankly ability to use profiler and knowing some architectural optimizations seems to be pretty rare skill.
That's kind of reinforcing the parent's point, though.
>> Profile the function that is called if object is not found in cache ? Have pluggable cache with plugin that just doesn't cache? That is not hard problem to solve.
>The more the dynamic behavior of the profiled system and the real system deviate, the harder it is to isolate what's causing the slowdowns.
Whose quote are you repeating without thinking? It's not that relevant if you know how profiling works...
>> Well, they delivered (I'm guessing) 10x improvement in a month by adding cache.
>Yes, but by doing this they also tremendously increased the complexity of the project
Not really; in most cases it's pretty simple to add, and if you profiled correctly you'd start by only adding extra code in the hottest part of the app.
> and (unless the team has someone like GP in it) prevented further optimizations.
Uh, just no. It might've fixed the problem well enough not to be touched again, but it ain't stopping anything; it's just a different solution.
> It may still be a good trade-off, but the overall sorry state of pretty much all software everywhere suggests it may not be a good one usually.
The assumption here that the same developers who fucked up caching would somehow manage to achieve the same speedup with "just" optimizations and fewer bugs is a bold one.
Hell, many low-level optimizations are also basically "using a cache" (the CPU one) well, although the compiler usually gets you out of the most egregious mistakes (unless you decide to write multithreaded C/C++ carelessly, I guess).
>> And frankly ability to use profiler and knowing some architectural optimizations seems to be pretty rare skill.
> That's kind of reinforcing the parent's point, though.
What point exactly? That the average developer is bad at their job? The poster above explicitly said that the weaker developers claimed it couldn't be optimized further in the first place; what else do you expect from them? The cache was most likely the most cost-efficient solution.
> The more the dynamic behavior of the profiled system and the real system deviate
Maybe. I've never found it a problem, but maybe. When I've added a cache, the cache comes with instrumentation for exactly that.
> Yes, but by doing this they also tremendously increased the complexity of the project
It depends. E.g. if you don't need to worry too much about cache invalidation, adding caching can be as simple as adding a single decorator, like Python's lru_cache. It depends on the use case.
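For instance (a tiny illustration only; `slow_lookup` is a made-up stand-in for whatever expensive, side-effect-free call you'd be caching):

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def slow_lookup(key: str) -> str:
    # Stand-in for an expensive, pure computation or query.
    return key.upper()


slow_lookup("foo")               # computed
slow_lookup("foo")               # served from the cache
print(slow_lookup.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=1024, currsize=1)
```

The caveat from upthread still applies: this only stays simple while the underlying data never changes.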
> Profile the function that is called if object is not found in cache ? Have pluggable cache with plugin that just doesn't cache? That is not hard problem to solve
Okay, now I have a profile that says 90% of the time is spent in the code the cache avoids. Now what? Narrator's note: When the cache was put in place it accounted for 40% of run time, but now people use it because it's 'free' and so the code has grown from looking up the data twice to looking it up five to seven times depending on conditional logic.
> Well, they delivered (I'm guessing) 10x improvement in a month by adding cache.
No, they didn't. That's the problem. No cache gets you 10x. 10x for one step of a 20-step process is a positive improvement, but rarely more than 2x for the first one, 4x at the outside. If it does more than that then someone is being titanically stupid. From then on each cache is going to net you less, and less, and less. Only if someone measured wrong, or you are blocked by some architecture change that's needed so that caches even make sense, will you see a bigger improvement this year than you did last.
But the pathology I'm discussing above didn't even have anything at all to do with caching. It's the psychology of (ir)rationalizing around performance, the assertion that you've done everything that can be done because you've done everything you know to do. It's a bizarre love child of learned helplessness and ego self defense. It's gotten way worse in a world where people reach for caches first (and thus, I assert, last), but it was always there.
I have a couple coworkers who are trying to be clever with a new cache architecture. Irritatingly by taking a statement of fact made by me 3+ years ago and jumping to the wrong conclusion. They promised 50% in 18 months (I was suggesting we target 90% in three years). Now we are somewhere past 30 months and so far they've delivered lower numbers than I did in 5 man weeks spread out from Christmas to mid March, while juggling other priorities and suffering from Seasonal Affective Disorder. Another coworker, usually cache averse, was making noises about adding a cache for some other pain point, and I started pulling on the thread for a bit until he was satisfied. It'll be next March before they surpass the 20% I turned in with about 8 modest changes. And the architectural improvement from both efforts will be roughly on a par. Top down is sexy, bottom up is not. But it's hard to do the former without the latter. Eat your vegetables.
>> Well, they delivered (I'm guessing) 10x improvement in a month by adding cache
>No, they didn't. That's the problem. No cache gets you 10x.
I was actually being conservative. We had an almost 100x improvement in some cases.
It was a pretty particular app; the site took a few hundred ms (IIRC closer to the one-second mark) to generate before we started at all.
First it was a simple Varnish cache, but that had problems: some components never changed, others were updated pretty often (stuff like "current users here").
So it was split into various components, each with different TTLs, and then all of that was merged with Varnish ESI.
So you could, say, invalidate a part of the site and Varnish would:
* load that component from the app server
* load the template it contains from its own cache
* render it.
So:
* with no refresh, the page would come straight from the cache, at single-digit-ms or lower latency
* refreshing one part of the site would only incur the latency of rendering that tiny part on the app server
Later we also updated that with the "grace" feature of Varnish, where you can define a time interval during which the cache is still served but a request is sent to the backend to refresh it.
So if said low-TTL component was requested often:
* the component would always be served from cache with the lowest latency
* if it was in its last seconds of validity, Varnish would send the request in the background and refresh the cache
(A rough Python sketch of that stale-while-revalidate pattern follows below.)
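In rough terms, grace mode amounts to a stale-while-revalidate pattern. A toy Python sketch of that idea (this is not Varnish's implementation, and every name here is made up):

```python
import threading
import time
from typing import Any, Callable, Dict, Tuple


class GraceCache:
    """Toy stale-while-revalidate cache: within the grace window, serve the
    stale value immediately and refresh it in a background thread."""

    def __init__(self, ttl: float, grace: float) -> None:
        self.ttl = ttl
        self.grace = grace
        self._store: Dict[Any, Tuple[Any, float]] = {}  # key -> (value, stored_at)
        self._lock = threading.Lock()

    def get(self, key: Any, fetch: Callable[[], Any]) -> Any:
        with self._lock:
            entry = self._store.get(key)
        if entry is None:
            return self._refresh(key, fetch)  # cold: fetch synchronously
        value, stored_at = entry
        age = time.monotonic() - stored_at
        if age <= self.ttl:
            return value  # fresh: straight from cache
        if age <= self.ttl + self.grace:
            # stale but within grace: serve the stale value now, refresh in the background
            threading.Thread(target=self._refresh, args=(key, fetch), daemon=True).start()
            return value
        return self._refresh(key, fetch)  # past grace: fetch synchronously

    def _refresh(self, key: Any, fetch: Callable[[], Any]) -> Any:
        value = fetch()
        with self._lock:
            self._store[key] = (value, time.monotonic())
        return value


cache = GraceCache(ttl=5.0, grace=30.0)
print(cache.get("users-online", fetch=lambda: "42 users online"))
```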
Now, the app itself wasn't very well written in the first place (there is a book of horror stories to write about it), but they (we did ops and testing, they did the app stuff) spent less time than "fixing it properly" would have taken and instead rewrote it to a somewhat more modern architecture. They still use the same cache to cache JSON responses for API endpoints, though.
That's not even getting into how much of DB performance is actually clever caching; after all, an index is just a pre-generated partial cache...
But sure, some apps are better for it than other, 10x is definitely not universal.
> From then on each cache is going to net you less, and less, and less. Only if someone measured wrong or you are blocked by some architecture change that's needed so that caches even make sense, will you see a bigger improvement this year than you did last.
Well, doh, caching something you cached already ain't gonna be that effective. But if you see the same request with the same parameters made 100 times a minute, making it take 20ms instead of 100ms ain't gonna be as effective as making it take 100us to serve from cache. You might want to address it eventually to get lower jitter, but the same cache can make 20 other methods faster on average, and that is less code and effort than optimizing 20 other functions.
> I have a couple coworkers who are trying to be clever with a new cache architecture. Irritatingly by taking a statement of fact made by me 3+ years ago and jumping to the wrong conclusion. They promised 50% in 18 months (I was suggesting we target 90% in three years). Now we are somewhere past 30 months and so far they've delivered lower numbers than I did in 5 man weeks spread out from Christmas to mid March, while juggling other priorities and suffering from Seasonal Affective Disorder.
And do you think they would have been able to optimize it better in that time instead? That just sounds like incompetent co-workers, not anything to do with caching...
That guy just keeps repeating "it's hard" then 'explaining' why in obtuse terms and in ways that don't stand up (I don't think he understands database behaviour). Frankly I don't think he has much experience in this area. I'm going to give up now, he's spent a fair bit of your and my time and still can't accept he's not right.
I've done much better. Think sticking a cache in front of a database with seriously complex queries. Yes, you need to be aware of staleness; no, it's not a magic wand; but it delivered massive improvements.
> When the cache was put in place it accounted for 40% of run time
If looking shit up in a hash table, at a few hundred or thousand CPU ticks, costs 40% of your runtime, then I smell PEBKAC[1], not cache problems.
What makes caching one of the hardest parts of computing is this assertion that it's easy. Everything is easy when you oversimplify it. Some things explode in complexity when you try to force them to be simple.
> Some things explode in complexity when you try to force them to be simple.
Do you mean to say, IOW, "simple in the small, but more complex once assembled (in the large)", whereas what we should perhaps be aiming for is "a little more complex in the small, but with the benefit that it enables a simpler (and probably more scalable) architecture once assembled (in the large)"?
I kinda feel that your convincing counterargument is just ‘I have a different experience’.
I’ve never seen people reach for caches first, and then discard all further optimizations.
Or rather, I’ve seen them do that, but not out of some misplaced feeling that it was the only solution, just the one that was easiest to accomplish in very little time.
It's difficult to refute your argument because I don't understand your points:
> Because of the locality aspect, people will assume that it's 'free' to look up a value multiple times instead of passing these data dependencies between the relevant methods
What does that even mean? What does 'data dependencies' mean specifically?
> then you can end up evicting data in the middle of a transaction
So in the middle of a transaction, instead of looking up data from a key - very cheap - you should do ... what? Recalculate it instead?
> The less obvious consequence is now you have concurrency bugs, because the code assumes that decisions made about the data will go the same way on each call, but due to eviction, two halves of the transaction may make conflicting decisions with undefined behavior
Assuming I even understand this, you are saying that where there's concurrency there's a risk of pulling stale data out of a cache. Well, true. That's just something you have to be careful of. Add a Tx reference to the cache key.
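A minimal sketch of the "add a Tx reference to the key" idea (hypothetical names; a real implementation would also need to evict on commit/rollback, which `end_transaction` only gestures at):

```python
from typing import Any, Callable, Dict, Tuple

# Cache keyed by (transaction id, logical key): entries written during one
# transaction are never observed by another, so both halves of a transaction
# see the same value they read the first time.
_tx_cache: Dict[Tuple[str, str], Any] = {}


def tx_cached_get(tx_id: str, key: str, load: Callable[[str], Any]) -> Any:
    cache_key = (tx_id, key)
    if cache_key not in _tx_cache:
        _tx_cache[cache_key] = load(key)
    return _tx_cache[cache_key]


def end_transaction(tx_id: str) -> None:
    # Drop everything scoped to this transaction so the cache can't grow forever.
    for cache_key in [k for k in _tx_cache if k[0] == tx_id]:
        del _tx_cache[cache_key]


# Usage: two reads inside the same transaction agree even if the source changes.
print(tx_cached_get("tx-1", "price", load=lambda k: 100))
print(tx_cached_get("tx-1", "price", load=lambda k: 200))  # still 100
end_transaction("tx-1")
```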
> Once you introduce caching you poison the performance well.
Not my experience at all. I can't understand why you're saying this.
> Flame charts lie when cache is used
How so?
> and after people rely on the cache it is lying in the other direction when the cache is off, because you're calling that path 5 times now and it's swamping other bottlenecks
I've no idea what this means
> This is what I really mean when I say it's a weapon of last resort. Because once you use it, it attempts to take any other options off the table
Heh! Absolutely not! Other optimisations are orthogonal to caching IME:
Better algo - orthogonal
Go lower level - orthogonal
Go multicore - orthogonal
Noobish SQL query is begging for a rewrite - orthogonal
Use L1/2/3/ chip cache better, or RAM, by careful data layout - orthogonal
Use a better library - orthogonal
Use better data structure eg. than linked list - orthogonal
>> Because of the locality aspect, people will assume that it's 'free' to look up a value multiple times instead of passing these data dependencies between the relevant methods
> what does that even mean? What does 'data dependencies' mean specicially?
It means that instead of having a method like `entryPoint(int)` that looks up a `Foo` by id and then passes the `Foo` around (`verify(Foo)`, `Foo update(Foo)`) you instead wind up with a system that passes around an `int` (`verify(int)`, `/* Unused */ Foo update(int)`) and each layer then looks up `Foo` in the `FooService` (because `FooService` has a cache and subsequent lookups should be essentially-free).
Even worse when `FooService` is abstracted behind a thread-local like `ThreadLocal<Foo> currentFoo`, the implementation of which is `fooService.findById(currentFooId.get()) /* cheap, because fooService has a cache */`.
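Rendered in Python rather than the Java-ish pseudocode above (everything here is made up for illustration), the contrast looks roughly like this:

```python
from typing import Dict


class FooService:
    """Stand-in service; in the anti-pattern every layer calls find_by_id
    again because 'the cache makes it free'."""

    def __init__(self) -> None:
        self._db: Dict[int, dict] = {1: {"id": 1, "name": "foo-one"}}
        self.lookups = 0

    def find_by_id(self, foo_id: int) -> dict:
        self.lookups += 1  # every call still costs something, cached or not
        return self._db[foo_id]


service = FooService()


# Anti-pattern: pass the int around and look Foo up again at every layer.
def verify_by_id(foo_id: int) -> None:
    assert service.find_by_id(foo_id)["name"]


def update_by_id(foo_id: int) -> dict:
    verify_by_id(foo_id)
    return service.find_by_id(foo_id)


# Alternative: look Foo up once at the entry point and pass the value along,
# which makes the data dependency explicit in the signatures.
def verify(foo: dict) -> None:
    assert foo["name"]


def update(foo: dict) -> dict:
    verify(foo)
    return foo


update_by_id(1)
print("lookups when passing the id:", service.lookups)     # 2

service.lookups = 0
update(service.find_by_id(1))
print("lookups when passing the value:", service.lookups)  # 1
```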
My own take on this is that caches are global-by-default when you really want local-by-default (e. g. dynamic programming instead of a `static final Cache theCacheForTheEntireProgram`). Local-by-default caches are easy to make transactional (when the call stack returns the cache is gone) and don't grow in an unbounded manner in the general case.
Local by default also has the feature/problem of making you look at the scale of the data involved in your query up front, before it's having a 100-1000 req/s hitting it in production. It's harder to pretend this large dependency graph will just magically take care of itself (or worse, be someone else's problem after you've declared victory) when it's staring you in the face.
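A small sketch of the local-by-default version (hypothetical code, not taken from the example above): the cache is created per request and dies when the call returns, so it's transactional by construction and bounded by one request's working set.

```python
from typing import Any, Callable, Dict


def handle_request(foo_id: int, load_foo: Callable[[int], Any]) -> Any:
    # Local-by-default cache: created here, gone when the stack unwinds, so it
    # can't serve stale data across requests or grow without bound.
    local_cache: Dict[int, Any] = {}

    def get_foo(fid: int) -> Any:
        if fid not in local_cache:
            local_cache[fid] = load_foo(fid)
        return local_cache[fid]

    first = get_foo(foo_id)
    second = get_foo(foo_id)  # repeated lookups within the request are free
    assert first is second
    return first


# Usage: load_foo is whatever expensive lookup you'd otherwise cache globally.
print(handle_request(7, load_foo=lambda fid: {"id": fid}))
```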
In my extensive experience of optimizing stuff it is right.
I cannot even count how many issues were created downstream of the decision to start caching X, Y, or Z. At some point, every other software engineer gets the bright idea to cache their special little function or request or database call. Soon enough, there are N layers of cache responsible for delivering some special piece of data. The engineers build tooling to clear those caches on updates, either manually or automatically. Both suffer from issues: the latter occasionally fails to perform and just causes general confusion about why stale data is returning, and the former is far worse, recruiting dozens of business stakeholders into the problem of knowing-how-to-expire-cache. They start making their own internal wiki pages with steps and the various cache tools to run, often with incorrect steps because they got lucky one day by clearing cache X and not cache Y, which they did not know about.
Overall, in my experience, allowing software engineers to do caching as they wish without any real guardrails has caused massive problems most likely on the order of N!. I cannot roll my eyes far enough into the back of my head when I hear someone suggest caching as a solution any longer, at least not on an application that has more than a dozen business stakeholders managing data and expecting it to be updateable in any reasonable way so that customers see the most recent data when the business wants it.
I have one particularly baroque data path that I think passes through three layers of cache. The same people worked on it the whole time, and I'm 90% sure that at least one of those layers is doing absolutely nothing, except making the code harder to read. Once they're added nobody takes them out. Like all global state it's problematic to be absolutely sure it's not being used anywhere.
Caching can turn O(n²) algorithms into much faster algorithms. I would point to how transformative tabling is in Prolog, not to mention memoization rescuing the calculation of fibonacci numbers by recursion, which otherwise would be gross malpractice.
Caching is fine if there's no state involved, e.g. pure functions like the fibonacci sequence.
As soon as your cache depends on a mutable state, things get really hairy. Cache invalidation is one of the 2 hard problems in computer science, after all.
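The pure-function case really is the best-behaved one; a minimal illustration with memoized fibonacci:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Naive recursion recomputes the same subproblems exponentially often;
    # memoization computes each fib(n) exactly once.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(200))  # instant; the uncached recursion would effectively never finish
```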
If there's really no state, it should be a compilation step.
After watching some coworkers play Keystone Cops for 5 years, I'm preparing to do just that to the last project they were responsible for. Their new project should have been that as well, but no, it's going to be a cache.
This gives me flashbacks to Intel's great disaster of Itanium which seems so ridiculous in retrospect but I guess very few people understood superscalar processors in the 1990s.
I did some serious thinking about processors that use the transport triggered architecture, which is a simple way to make a custom CPU (for the first time I thought, "I can make a custom CPU!") but does operations in lockstep in such a way that the whole thing blocks if a memory operation stalls. You might be able to get away with that, however, because a TTA is meant to be codesigned with the code it runs on.
Itanium was smarter than a TTA but it was predicated on the idea that a compiler could know how operations can be interleaved at run time. The compiler has no visibility into what gets cached, but at runtime the processor can opportunistically run operations that aren't stalled. For a general-purpose processor that's supposed to run general code, conventional superscalar processors do better than VLIW processors that require compilers to guess ahead of time what is impossible to guess.
VLIW works _really_ well on a deterministic ISA. As in when you know how long a load is going to take at compile time because you know it's N clock cycles from that type of memory.
Otherwise I really like the GPU derivative form where N instructions are ready to go and the hardware picks one where the prerequisite memory access has completed, despite that being variable.
> Otherwise I really like the GPU derivative form where N instructions are ready to go and the hardware picks one where the prerequisite memory access has completed, despite that being variable
Hmm, isn't that OOO (out of order) execution? And isn't that something GPUs explicitly don't do? (AFAIUI)
I'm probably describing it poorly. A GPU (at least the AMD ones) has N execution units (of a few different types) and M threads (wavefronts) waiting to be scheduled onto them. Each cycle (ish), hardware picks one of the M threads, runs it for an instruction, suspends it again. Memory stalls / waits are handled by not scheduling the corresponding thread until it's ready. The operation might be a CAS over pcie so it's really variable in latency.
Provided you have enough (ideally independent) work item / task / coroutine things available to enqueue, and you're designing for throughput instead of latency, that keeps the execution units or memory bandwidth fully occupied. Thus win.
The x64 style model is deep pipeline + branch predict / speculation and it all goes slow when the guess fails. In contrast the GPU doesn't have to worry about guessing a branch destination incorrectly and unwinding wasted work. It does have to worry about having enough work queued up, and distribution of tasks across the compute unit / symmetric multiprocessor units, and the hardware-picks-the-order leaks into the concurrency model a bit.
Wouldn't this whole problem be solved by a graph-based architecture, already used by some compilers as an intermediate representation? Then the CPU sees where and when everything is needed and can schedule the loading of registers as optimally as is feasible, at runtime.
These are very well-behaved instances of caching. They are so well-behaved that they can be fully described with a little bit of analysis. Once I/O is involved or the load isn't well-known beforehand, it becomes much more difficult. Lazy I/O in Haskell is a very divisive topic for a reason.
Memoization is caching and a form of dynamic programming; thus caching is sometimes an important tool in dynamic programming.
But the fact that caching can be useful, even essential if your scenario demands such a clear-cut space-time tradeoff as in memoization, seems beside your point.
Well, usually memoization gets you polynomial time from exponential time with the space tradeoff. It's unusual to start with `O(n^2)` and get a better DP solution.