
Well, current popular systems are pretty much limited to Lustre and the new kid Weka; mostly Lustre though, tbh.

You can try to use "standard" options like MinIO/Ceph (RADOS)/SeaweedFS, but you will very quickly learn those systems aren't remotely fast enough for these use cases.

AI training is what this is used for, not inference (which has absolutely no need for any filesystem at all). What makes the workload somewhat special is that it's entirely random reads and not cacheable at all, since most reads are one and done.
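A rough way to picture that access pattern (the path and record size below are made up for illustration): a shuffled sampler touches every record exactly once per epoch at effectively random offsets, so there is nothing for a cache to reuse.

    # Hypothetical shard layout: fixed-size records in one big file on the shared FS.
    import random

    RECORD_SIZE = 1 << 20                      # 1 MiB per sample (made up)
    PATH = "/mnt/fs/shard-000.bin"             # made-up path

    def epoch_reads(num_records):
        order = list(range(num_records))
        random.shuffle(order)                  # random-read access pattern
        with open(PATH, "rb", buffering=0) as f:
            for idx in order:                  # each record read exactly once
                f.seek(idx * RECORD_SIZE)
                yield f.read(RECORD_SIZE)      # no reuse, so caching buys nothing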

Would Lustre be perfectly fine at 6 TiB/s? Yes. Is it a huge pain in the ass to operate and make remotely highly available? Also yes. If this thing is capable of the throughput but easier to operate and generally more modern and less baroque, it's probably an improvement. TLDR is Lustre is fast, but that is literally its only redeeming quality. I have lost far too many hours of my life to the Lustre gods.




> What makes the workload somewhat special is

I'll add that latency also doesn't matter that much. You are doing batched data loading for batch n+1 on the CPU while the GPUs are churning through batch n-1 and copying batch n from host memory at the same time.

So as long as your "load next batch" doesn't run for something like >1s, it's fine. But a single "load next batch" on one worker means thousands (if not more) of random reads.
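A minimal sketch of that pipelining with a PyTorch DataLoader (the dataset and model here are placeholders, not anything from the article): CPU workers prefetch the next batches while the GPU is busy, and pinned memory plus non_blocking=True lets the host-to-device copy of one batch overlap with compute on the previous one.

    import torch
    from torch.utils.data import DataLoader

    loader = DataLoader(
        my_dataset,          # placeholder dataset doing the random reads
        batch_size=256,
        num_workers=8,       # CPU workers load batch n+1 in the background
        pin_memory=True,     # needed for async host-to-device copies
        prefetch_factor=4,   # keep a few batches in flight per worker
    )

    for batch in loader:
        batch = batch.cuda(non_blocking=True)  # copy of batch n overlaps compute
        loss = model(batch)                    # placeholder forward/loss
        loss.backward()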


They’re using the FS for caching the KV caches of past requests. It’s why they’re able to charge so little on prompt cache hits.


Ahh, I missed that. Yes, prefix caching and RAG are two cases where you will want something like this at inference time.
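For the prefix-caching case, a hedged sketch of how a shared FS could be used (paths and helper names are invented for illustration, not how it's actually implemented): key the stored KV cache by a hash of the prompt's token prefix and reload it on a hit instead of re-running prefill.

    import hashlib, os, torch

    CACHE_DIR = "/mnt/fs/kv-cache"   # made-up mount on the shared FS

    def prefix_key(token_ids):
        return hashlib.sha256(" ".join(map(str, token_ids)).encode()).hexdigest()

    def load_or_prefill(token_ids, prefill_fn):
        path = os.path.join(CACHE_DIR, prefix_key(token_ids) + ".pt")
        if os.path.exists(path):      # prompt cache hit: skip prefill
            return torch.load(path)
        kv = prefill_fn(token_ids)    # miss: pay for prefill once
        torch.save(kv, path)
        return kv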



