About as soon as GPT-4 came out I said that OpenAI was doomed on the trajectory it was on because they could not afford to develop a GPT-5, GPT-6, etc.
Real innovation comes out of doing a lot of experiments, and that means doing experiments quickly with the resources you have. So you do most of your experiments with non-frontier models, enough to make a good prediction of what would happen if you maxed out your model size, then you go big. That's how you make everyone else have a "DeepSeek moment".
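To make "predict what would happen if you maxed out" concrete, here's a minimal sketch of the idea with entirely made-up numbers, using the usual power-law form for compute scaling laws:

```python
# Fit a compute scaling law to cheap, non-frontier runs, then extrapolate to a
# budget you haven't spent yet. All data points and constants are made up
# purely for illustration.
import numpy as np
from scipy.optimize import curve_fit

compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])     # e.g. PF-days of small runs
loss    = np.array([3.10, 2.85, 2.62, 2.43, 2.27])    # eval loss of each run

def scaling_law(c, a, b, irreducible):
    # L(C) = a * C^-b + irreducible_loss, the usual power-law-plus-offset form
    return a * c ** (-b) + irreducible

params, _ = curve_fit(scaling_law, compute, loss, p0=[1.5, 0.2, 1.5])
print(f"predicted loss at 100x more compute: {scaling_law(10_000, *params):.2f}")
```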
A company like Apple wants to pick something on the frontier and keep advancing in a straight line. That works great if you want to make an M1, M2, M3, ... ARM chip, but that's not how progress works in AI today.
You are deeply misunderstanding what the KV cache referred to here is. It's not for storing data. This is the KV cache used when running the model, there to reduce self-attention's quadratic compute complexity to linear (per generated token). It's not stored on SSD; it's in VRAM (or in CPU memory if you're not using a GPU).
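For context, a minimal single-head sketch of what that cache is (PyTorch, toy dimensions, not any particular model's code): the per-token K/V tensors live in the same device memory as the model, and each decode step attends against them instead of recomputing the whole prefix.

```python
# Minimal single-head decode loop with a KV cache, toy dimensions.
# Each new token appends one K and one V row and attends over the cached
# history, so per-token work is linear in sequence length instead of quadratic.
import math
import torch

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
k_cache = torch.empty(0, d)   # grows by one row per generated token
v_cache = torch.empty(0, d)

def decode_step(x):           # x: (d,) hidden state of the newest token
    global k_cache, v_cache
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache = torch.cat([k_cache, k[None]])   # cache, don't recompute old K/V
    v_cache = torch.cat([v_cache, v[None]])
    attn = torch.softmax(q @ k_cache.T / math.sqrt(d), dim=-1)
    return attn @ v_cache

for _ in range(5):
    out = decode_step(torch.randn(d))
```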
They do, in fact, mention the inference KV cache as a use case in the README. The most advanced KV caching uses a hierarchy of GPU RAM / regular RAM / SSD. It seems they were able to use their storage abstraction for the last tier.
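Conceptually, something like this (the class and method names here are hypothetical, not 3FS's actual API; the point is just that the coldest tier is backed by the storage layer):

```python
# Hypothetical sketch of a tiered KV-cache store: GPU memory, then host RAM,
# then a storage tier (where an abstraction like 3FS would sit). Not real
# 3FS code -- names and structure are invented for illustration.
import os
import torch

class TieredKVCache:
    def __init__(self, storage_dir):
        self.gpu = {}                   # request_id -> tensor on the GPU (hottest)
        self.host = {}                  # request_id -> tensor in CPU RAM
        self.storage_dir = storage_dir  # coldest tier: serialized on disk/DFS
        os.makedirs(storage_dir, exist_ok=True)

    def put(self, req_id, kv, tier="gpu"):
        if tier == "gpu" and torch.cuda.is_available():
            self.gpu[req_id] = kv.cuda()
        elif tier in ("gpu", "host"):
            self.host[req_id] = kv.cpu()
        else:
            torch.save(kv.cpu(), os.path.join(self.storage_dir, f"{req_id}.pt"))

    def get(self, req_id):
        if req_id in self.gpu:
            return self.gpu[req_id]
        if req_id in self.host:
            return self.host[req_id]
        path = os.path.join(self.storage_dir, f"{req_id}.pt")
        return torch.load(path) if os.path.exists(path) else None
```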
KVCache is a technique used to optimize the LLM inference process. It avoids redundant computations by caching the key and value vectors of previous tokens in the decoder layers. The top figure demonstrates the read throughput of all KVCache clients (1×400Gbps NIC/node), highlighting both peak and average values, with peak throughput reaching up to 40 GiB/s.
That's because DeepSeek uses MLA, which apparently does allow offloading the KV cache. That doesn't apply to all models, particularly the open-weight models that are primarily GQA, AFAIK.
Any model allows offloading the KV cache; it's not a matter of model architecture, only of implementation. The only real difference is for non-transformer models. For all attention models it's the same thing: a blob of data per token. It's much worse for older MHA models because their KV cache is just too big, and it's best for DeepSeek because their KV cache is the smallest, but it's fine for the current generation of GQA models as well.
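Rough per-token numbers make that comparison concrete (representative layer counts and dimensions, not any specific model's exact config; the MLA line uses DeepSeek-V3-like figures):

```python
# Back-of-the-envelope KV-cache size per token in fp16. Layer counts and
# dimensions are representative, not exact configs of any particular model.

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    return layers * kv_heads * head_dim * 2 * bytes_per_elem   # x2 for K and V

mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # older MHA model
gqa = kv_bytes_per_token(layers=80, kv_heads=8,  head_dim=128)  # GQA, Llama-70B-ish
mla = 61 * 576 * 2   # DeepSeek-V3-like MLA: one ~576-dim latent per layer, fp16

print(f"MHA ~{mha >> 10} KiB, GQA ~{gqa >> 10} KiB, MLA ~{mla >> 10} KiB per token")
```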
Are you sure about that? GQA applies self-attention to every KV cache entry. If you're offloading, then you have to dynamically page all the KV cache entries into the GPU, which is quite slow since the CPU/GPU link only has so much bandwidth. My understanding is that MLA reduces the size of the KV cache and doesn't necessarily attend to every KV token at every step, which is why offloading to disk works (i.e. most of the tokens can remain on disk without ever being loaded into the GPU).
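For a sense of scale, rough numbers for what dynamically paging a long context over the link would cost:

```python
# Rough arithmetic on why paging the whole cache across the CPU/GPU link for
# every decode step would hurt. All numbers are illustrative ballparks.
kv_bytes_per_token = 320 * 1024     # ~320 KiB/token, the GQA figure above
context_tokens = 100_000
link_bandwidth = 64e9               # ~64 GB/s, PCIe 5.0 x16 ballpark

seconds = kv_bytes_per_token * context_tokens / link_bandwidth
print(f"~{seconds:.2f} s to move the full cache once")   # ~0.5 s
```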
Offloading in this case doesn't mean keeping the KV cache on disk/in storage all the time; it means keeping it there when the request isn't in the process of generation. While a request is being generated, its KV cache is indeed in VRAM.
As for MLA: DeepSeek, just like the others, attends to all historical tokens. The only difference is that instead of caching full KV entries it caches lower-dimensional KV entries, which are projected into full-blown KV entries on the fly during attention. It's similar to GQA, except that instead of just duplicating KV entries by the group size, it applies a linear transformation.
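A toy sketch of that contrast (illustrative dimensions; simplified in that real MLA also derives V from the same latent and keeps a small decoupled RoPE key per token):

```python
# GQA vs. MLA at cache time, toy dimensions. GQA caches one K/V per KV head
# and conceptually duplicates it across each group of query heads; MLA caches
# a small per-token latent and up-projects it into full K entries on the fly.
import torch

tokens, n_q_heads, n_kv_heads, head_dim, latent_dim = 10, 32, 8, 128, 64

# GQA: cached entries per token = n_kv_heads; expanded by duplication
gqa_cache_k = torch.randn(tokens, n_kv_heads, head_dim)
group = n_q_heads // n_kv_heads
gqa_k = gqa_cache_k.repeat_interleave(group, dim=1)         # (10, 32, 128)

# MLA: cached entry per token = one latent vector; expanded by a linear map
mla_cache = torch.randn(tokens, latent_dim)
W_uk = torch.randn(latent_dim, n_q_heads * head_dim)        # learned up-projection
mla_k = (mla_cache @ W_uk).view(tokens, n_q_heads, head_dim)

print(gqa_cache_k.numel(), mla_cache.numel())   # cached elements: 10240 vs 640
```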
Ah, OK. So this is for resuming chat context cheaply. What I said is still correct: 3FS is not part of the inference flow and isn't relevant to the paper, which is about optimizing KV cache usage at runtime.