
Why does prompt caching reduce costs? I'm assuming that the primary cost driver is GPU/TPU FLOPS, as opposed to any network / storage / etc costs.

My understanding is that an LLM will take in the stream of text, tokenize it (can be faster with caching, sure, but it's a minor drop in the bucket), then run a transformer on the entire sequence. You can't just cache the output of a transformer on a prefix to reduce workload.




Autoregressive models can't just resume from an earlier state, so without caching they have to re-process the entire prompt on every request.

By caching that state, the model can resume from where it left off, completely bypassing that recomputation.

For large contexts this could save a ton of compute!

I think this feature and structured outputs are some of the biggest inventions in LLMs this year.


Prompt caching has been a thing for LLMs since GPT-2 (e.g. `use_cache=True` and `past_key_values` in the transformers library); it's more of a surprise that it took this long for the main LLM providers to ship a good implementation.
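
A minimal sketch of what that looks like with the Hugging Face transformers API (as I understand it; GPT-2 small as a stand-in, and the prompt strings are just examples):

    # Run the shared prefix once, keep the per-layer K/V tensors, and reuse them.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prefix = tok("You are a helpful assistant. User question:", return_tensors="pt")

    with torch.no_grad():
        out = model(**prefix, use_cache=True)   # pay for the prefix once
        past = out.past_key_values              # the cached K/V tensors

        new = tok(" What is prompt caching?", return_tensors="pt")
        out2 = model(input_ids=new.input_ids,   # only the new tokens are processed
                     past_key_values=past, use_cache=True)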


I’m building an app with OpenAI, using structured outputs. Does OpenAI also support prompt caching?


I'm sure internally they use it for the system prompt at least, probably since launch. And maybe for common initial user queries that exactly match.


They are certainly not passing the savings on to the users.


Yet. I suspect OpenAI will release a similar offering soon. (hooray, free market competition!)


That $100 billion data center has to get paid for somehow.


Not currently.


You actually can cache the "output" of a transformer on a prefix by caching what the attention layers compute for that text string (specifically the "K" and "V" tensors). Since attention is a big part of the compute cost of the transformer, this does cut down FLOPs dramatically.
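
To make that concrete, here's a toy single-head version in PyTorch (hypothetical weight matrices and random tensors standing in for embeddings; real models do this per head and per layer):

    # Toy single-head attention with a K/V cache (illustrative, not a real model).
    import math
    import torch

    d = 64
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # hypothetical projection weights

    prefix = torch.randn(100, d)                  # embeddings of the cached prompt prefix
    K_cache, V_cache = prefix @ Wk, prefix @ Wv   # computed once, then stored

    new_tok = torch.randn(1, d)                   # embedding of the next token
    K = torch.cat([K_cache, new_tok @ Wk])        # only one new row of K/V is computed
    V = torch.cat([V_cache, new_tok @ Wv])

    q = new_tok @ Wq
    scores = q @ K.T / math.sqrt(d)               # (1, 101): new token attends to the whole prefix
    out = torch.softmax(scores, dim=-1) @ V       # attention output for the new token only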


Oh interesting, didn't know. How does this work past the first transformer block in the stack?


My understanding is that the attention in all transformer layers is "causal" - that is, the output of a transformer layer for token N depends only on tokens 0 through N.

This means that every attention layer can reuse previously calculated outputs for the same prompt prefix. So the model only needs to calculate from scratch starting from the first token that differs from the cached prefix.
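
A quick way to check this (a sketch, again assuming the transformers API; the sentence and the split point are arbitrary): compare the logits from one full-sequence pass against a cached-prefix pass followed by the suffix.

    # Logits from "prefix with cache, then suffix" match a full-sequence pass.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids
    prefix_ids, suffix_ids = ids[:, :6], ids[:, 6:]

    with torch.no_grad():
        full = model(ids).logits                                   # recompute everything
        past = model(prefix_ids, use_cache=True).past_key_values   # cache the prefix
        resumed = model(suffix_ids, past_key_values=past).logits   # resume from the cache

    # Same values for the suffix positions, up to floating-point noise.
    print(torch.allclose(full[:, 6:], resumed, atol=1e-4))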


I had the same question... my guess is you can do a layer-by-layer cache, i.e. a cache in the first layer, then another independent cache in the second layer, and so on.


The transformer only looks backwards, so if the first part of the sequence (the prompt) doesn't change, you don't need to rerun it on that part, just on the part after it that changed. For use cases with large prompts relative to the output size (e.g. lots of examples in the prompt), this can significantly speed up the workload.
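
Rough illustration of the savings for a long few-shot prompt (a sketch; the prompt text is a stand-in and the actual numbers depend entirely on hardware and model size):

    # Time the one-off prefix pass vs. resuming from its cache.
    import time
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    few_shot_prompt = "Example input -> example output\n" * 100   # stand-in for a big prompt
    prompt_ids = tok(few_shot_prompt, return_tensors="pt").input_ids
    question_ids = tok("New input ->", return_tensors="pt").input_ids

    with torch.no_grad():
        t0 = time.time()
        past = model(prompt_ids, use_cache=True).past_key_values   # paid once, then cached
        t1 = time.time()
        model(question_ids, past_key_values=past)                  # paid on every request
        t2 = time.time()

    print(f"prefill whole prompt: {t1 - t0:.2f}s, resume from cache: {t2 - t1:.2f}s")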


I think most architectures apply a normalization layer over the whole sequence of embeddings before calculating attention, which would make this infeasible

Shouldn’t be a huge deal to adjust imo

One of the bigger problems is that closed-model providers don't want to expose the embedding space and let users see what they have


I don't think the normalization makes it infeasible. They should be able to make an adjustment (the reverse of the normalization) in one operation. I think they are caching the attention calcs.

The hard thing (I think) is what to keep in the cache and where to keep it, given that you are serving lots of customers and the cached attention tensors get large pretty quickly.
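
For a flavor of that bookkeeping (a purely hypothetical sketch, not anything a provider has described): key the cached K/V tensors by a hash of the token prefix and evict least-recently-used entries under a memory budget.

    # Hypothetical serving-side cache: token-prefix hash -> per-layer K/V tensors,
    # with LRU eviction under a byte budget. Illustrative only.
    import hashlib
    from collections import OrderedDict

    class PrefixKVCache:
        def __init__(self, max_bytes):
            self.max_bytes = max_bytes
            self.entries = OrderedDict()   # key -> (kv_tensors, nbytes)
            self.used = 0

        @staticmethod
        def key(token_ids):
            return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

        def get(self, token_ids):
            k = self.key(token_ids)
            if k in self.entries:
                self.entries.move_to_end(k)          # mark as recently used
                return self.entries[k][0]
            return None

        def put(self, token_ids, kv_tensors, nbytes):
            k = self.key(token_ids)
            if k in self.entries:                    # already cached; refresh recency
                self.entries.move_to_end(k)
                return
            self.entries[k] = (kv_tensors, nbytes)
            self.used += nbytes
            while self.used > self.max_bytes and self.entries:
                _, (_, freed) = self.entries.popitem(last=False)   # evict LRU entry
                self.used -= freed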


It’s only “hard” because they don’t want to let customers supply the cache of course


I'm not sure that's it. Presumably they want to keep the cache in GPU memory?


That's largely it imo. If you could get the embedding representations, you could just recreate the model logic and then it's no longer closed source.


No, you couldn't. If they just handed you the embedding layer, it wouldn't help that much.

Plus, I'm reasonably certain they are caching the attention scores anyway.


They cache the results of the attention calc. For common prompt prefixes this makes a lot of sense. I'm surprised they can make it work though, given they are serving so many different users. Someone somewhere did some very clever engineering.


Why not? It's caching the state of the model after processing the cached prefix, so that part of the inference workload doesn't need to be run again.



