
Why does prompt caching reduce costs? I'm assuming that the primary cost driver is GPU/TPU FLOPS, as opposed to any network / storage / etc costs.

My understanding is that an LLM will take in the stream of text, tokenize it (can be faster with caching, sure, but it's a minor drop in the bucket), then run a transformer on the entire sequence. You can't just cache the output of a transformer on a prefix to reduce workload.




Autoregressive models can't just resume from an earlier state, so without caching they have to re-process the entire prompt on every request.

By caching that state, the model can resume from where it left off, completely bypassing that recomputation.

For large contexts this could save a ton of compute!

I think this feature and structured outputs are some of the biggest inventions in LLMs this year.


Prompt caching has been a thing for LLMs since GPT-2 (e.g. `use_cache=True` and `past_key_values` in the transformers library); it's more of a surprise that it took this long for the main LLM providers to ship a good implementation.
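
A minimal sketch of what that looks like with the Hugging Face transformers API (as I understand it; GPT-2 small as a stand-in, and the prompt strings are just examples):

    # Run the shared prefix once, keep the per-layer K/V tensors, and reuse them.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    prefix = tok("You are a helpful assistant. User question:", return_tensors="pt")

    with torch.no_grad():
        out = model(**prefix, use_cache=True)   # pay for the prefix once
        past = out.past_key_values              # the cached K/V tensors

        new = tok(" What is prompt caching?", return_tensors="pt")
        out2 = model(input_ids=new.input_ids,   # only the new tokens are processed
                     past_key_values=past, use_cache=True)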


I’m building an app with OpenAI, using structured outputs. Does OpenAI also support prompt caching?


I'm sure internally they use it for the system prompt at least, probably since launch. And maybe for common initial user queries that exactly match.


They are certainly not passing the savings on to the users.


Yet. I suspect OpenAI will release a similar offering soon. (hooray, free market competition!)


That $100 billion data center has to get paid for somehow.


Not currently.


You actually can cache the "output" of a transformer on a prefix by caching what the attention layers compute for that text string (specifically the "K" and "V" tensors). Since attention is a big part of the compute cost of the transformer, this does cut down FLOPs dramatically.
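
To make that concrete, here's a toy single-head version in PyTorch (hypothetical weight matrices and random tensors standing in for embeddings; real models do this per head and per layer):

    # Toy single-head attention with a K/V cache (illustrative, not a real model).
    import math
    import torch

    d = 64
    Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # hypothetical projection weights

    prefix = torch.randn(100, d)                  # embeddings of the cached prompt prefix
    K_cache, V_cache = prefix @ Wk, prefix @ Wv   # computed once, then stored

    new_tok = torch.randn(1, d)                   # embedding of the next token
    K = torch.cat([K_cache, new_tok @ Wk])        # only one new row of K/V is computed
    V = torch.cat([V_cache, new_tok @ Wv])

    q = new_tok @ Wq
    scores = q @ K.T / math.sqrt(d)               # (1, 101): new token attends to the whole prefix
    out = torch.softmax(scores, dim=-1) @ V       # attention output for the new token only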


Oh interesting, didn't know. How does this work past the first transformer block in the stack?


My understanding is that the attention in all transformer layers is "causal" - that is, the output of a transformer layer for token N depends only on tokens 0 through N.

This means that every attention layer can reuse previously calculated outputs for the same prompt prefix. So the model only needs to calculate from scratch starting from the first token that differs from the cached prefix.
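
A quick way to check this (a sketch, again assuming the transformers API; the sentence and the split point are arbitrary): compare the logits from one full-sequence pass against a cached-prefix pass followed by the suffix.

    # Logits from "prefix with cache, then suffix" match a full-sequence pass.
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids
    prefix_ids, suffix_ids = ids[:, :6], ids[:, 6:]

    with torch.no_grad():
        full = model(ids).logits                                   # recompute everything
        past = model(prefix_ids, use_cache=True).past_key_values   # cache the prefix
        resumed = model(suffix_ids, past_key_values=past).logits   # resume from the cache

    # Same values for the suffix positions, up to floating-point noise.
    print(torch.allclose(full[:, 6:], resumed, atol=1e-4))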


I had the same question... my guess is you can do a layer-by-layer cache, i.e. a cache in the first layer, then another independent cache in the second layer, and so on.


The transformer only looks backwards, so if the first part of the sequence (the prompt) doesn't change, you don't need to rerun it on that part, just on the part after it that changed. For use cases with large prompts relative to the output size (e.g. lots of examples in the prompt), this can significantly speed up the workload.
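
Rough illustration of the savings for a long few-shot prompt (a sketch; the prompt text is a stand-in and the actual numbers depend entirely on hardware and model size):

    # Time the one-off prefix pass vs. resuming from its cache.
    import time
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    few_shot_prompt = "Example input -> example output\n" * 100   # stand-in for a big prompt
    prompt_ids = tok(few_shot_prompt, return_tensors="pt").input_ids
    question_ids = tok("New input ->", return_tensors="pt").input_ids

    with torch.no_grad():
        t0 = time.time()
        past = model(prompt_ids, use_cache=True).past_key_values   # paid once, then cached
        t1 = time.time()
        model(question_ids, past_key_values=past)                  # paid on every request
        t2 = time.time()

    print(f"prefill whole prompt: {t1 - t0:.2f}s, resume from cache: {t2 - t1:.2f}s")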


I think most architectures apply a normalization layer over the whole sequence of embeddings before calculating attention, which would make this infeasible

Shouldn’t be a huge deal to adjust imo

One of the bigger problems is that closed-model providers don't want to expose the embedding space and let users see what they have


I don't think the normalization makes it infeasible. They should be able to make an adjustment (the reverse of the normalization) in one operation. I think they are caching the attention calcs.

The hard thing (I think) is what to keep in the cache and where to keep it, given that you are serving lots of customers and the cached attention tensors get large pretty quickly.
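
For a flavor of that bookkeeping (a purely hypothetical sketch, not anything a provider has described): key the cached K/V tensors by a hash of the token prefix and evict least-recently-used entries under a memory budget.

    # Hypothetical serving-side cache: token-prefix hash -> per-layer K/V tensors,
    # with LRU eviction under a byte budget. Illustrative only.
    import hashlib
    from collections import OrderedDict

    class PrefixKVCache:
        def __init__(self, max_bytes):
            self.max_bytes = max_bytes
            self.entries = OrderedDict()   # key -> (kv_tensors, nbytes)
            self.used = 0

        @staticmethod
        def key(token_ids):
            return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

        def get(self, token_ids):
            k = self.key(token_ids)
            if k in self.entries:
                self.entries.move_to_end(k)          # mark as recently used
                return self.entries[k][0]
            return None

        def put(self, token_ids, kv_tensors, nbytes):
            k = self.key(token_ids)
            if k in self.entries:                    # already cached; refresh recency
                self.entries.move_to_end(k)
                return
            self.entries[k] = (kv_tensors, nbytes)
            self.used += nbytes
            while self.used > self.max_bytes and self.entries:
                _, (_, freed) = self.entries.popitem(last=False)   # evict LRU entry
                self.used -= freed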


It’s only “hard” because they don’t want to let customers supply the cache of course


I'm not sure that's it. Presumably they want to keep the cache in GPU memory?


That's largely it imo. If you could get the embedding representations, you could just recreate the model logic and then it's no longer closed source.


No, you couldn't. If they just handed you the embedding layer, it wouldn't help that much.

Plus, I'm reasonably certain they are caching the attention scores anyway.


They cache the results of the attention calc. For common prompt prefixes this makes a lot of sense. I'm surprised they can make it work though, given they are serving so many different users. Someone somewhere did some very clever engineering.


Why not? It's caching the state of the model after processing the cached prefix, so that part of the inference workload doesn't need to be run again.



