> Where do you move the weights anyway once they are loaded into the GPU VRAM?
The GPU can’t do anything with weights while they are in VRAM. They have to be moved into the GPU itself first.
So it is about memory round-trips, but not between RAM and VRAM. It’s the round-trips between the VRAM and the registers on the GPU die. When batch processing, the calculations for all batched requests can be done while the model parameters are in the GPU registers. If the requests were processed sequentially instead, the number of trips between the VRAM and the GPU would be multiplied by the number of individual inferences.
Also, batched prompts and outputs are indeed mathematically independent of each other.
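Concretely, that independence is easy to check with a toy numpy sketch (made-up shapes, not a real inference stack): applying one weight matrix to a whole batch gives exactly the same result per request as running each request on its own.

```python
import numpy as np

# Toy "layer": one weight matrix applied to a batch of 4 requests.
W = np.random.randn(8, 8)   # hypothetical layer weights
X = np.random.randn(8, 4)   # 4 independent requests, one activation column each

batched = W @ X             # one matmul over the whole batch

# Column i depends only on request i, never on the other columns,
# so the batched result matches the one-request-at-a-time result.
for i in range(4):
    assert np.allclose(batched[:, i], W @ X[:, i])
```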
Round-trips between VRAM and GPU registers? That's what the cache hierarchies are for. I think you conflated quite a few concepts here.
Moving data to and from VRAM costs ~100ns of latency. Moving data from RAM to VRAM over PCIe 5.0 costs 1-10us of latency. So, ~1 to ~2 orders of magnitude of difference.
And this is the reason batching is used: you don't want to pay that latency for each and every CPU-to-GPU request; instead you push as much data as you can through a single round-trip.
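As a back-of-envelope sketch of that amortization (using the rough, assumed latency figure above, not a measurement):

```python
# Paying one RAM->VRAM round-trip per request vs. one per batch.
pcie_latency_s = 5e-6    # assumed midpoint of the ~1-10us PCIe figure above
requests = 1000

sequential_s = requests * pcie_latency_s   # one round-trip per request
batched_s = pcie_latency_s                 # one round-trip for the whole batch

print(f"sequential: {sequential_s * 1e3:.1f} ms of latency")   # 5.0 ms
print(f"batched:    {batched_s * 1e6:.0f} us of latency")      # 5 us
```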
Model weights are significantly larger than cache in almost all cases. Even an 8B-parameter model is ~16GB in half precision. The caches are not large enough to actually cache that.
Every weight has to be touched on every forward pass, meaning you have to wait for 16GB to transfer from VRAM -> SRAM -> registers. That's not even close to 100ns: on a 4090 with ~1TB/s memory bandwidth it's 16 milliseconds. PCIe latency to launch kernels or move 20 integers or whatever is functionally irrelevant on this scale.
The real reason for batching is that it lets you reuse that gigantic VRAM -> SRAM transfer across the batch and sequence dimensions. Instead of paying a 16ms memory tax for each token, you pay it once for the whole batched forward pass.
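For a back-of-envelope check of those numbers (assumed figures, not measurements):

```python
# Rough arithmetic behind the "16ms memory tax" claim.
params = 8e9             # 8B-parameter model
bytes_per_param = 2      # fp16 / half precision
bandwidth_bps = 1e12     # ~1 TB/s VRAM bandwidth (4090 ballpark)

weight_bytes = params * bytes_per_param   # ~16 GB of weights
pass_s = weight_bytes / bandwidth_bps     # time to stream all weights once
print(f"weights streamed once per pass: ~{pass_s * 1e3:.0f} ms")   # ~16 ms

# Batching amortizes that one stream across every sequence in the batch.
for batch in (1, 8, 64):
    print(f"batch={batch:3d}: ~{pass_s * 1e3 / batch:.2f} ms of weight traffic per sequence")
```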
You've made several incorrect assumptions, and I am not bothered enough to try to correct them, so I apologize for my ignorance. I'll just say that the 16ms memory tax figure is wildly incorrect.
You either have a massive misconception of GPT-like decoder transformers, or of how GPU data paths are architected, or you are trolling.
Go talk to a modern reasoning model to get yourself some knowledge; it's going to be much better than what you appear to have.
That’s the core point though. If you batch, the caches and registers are already primed and ready. The model runs in steps/layers, accessing different weights in VRAM along the way, and batching takes advantage of that.
I’m in agreement that RAM-to-VRAM transfer matters too, but I think the key speedup from inference batching is the point above.
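To make that layer-by-layer reuse concrete, here's a toy numpy sketch (invented shapes, not real transformer code): each layer's weights are fetched once per forward pass and then reused for every sequence in the batch.

```python
import numpy as np

# Stand-in for a stack of decoder layers: 4 weight matrices.
layers = [np.random.randn(8, 8) for _ in range(4)]
batch = np.random.randn(8, 64)   # 64 sequences batched together

x = batch
for W in layers:
    # One fetch of W serves all 64 columns; processing sequences one
    # at a time would refetch these same weights 64x as often.
    x = np.tanh(W @ x)           # toy nonlinearity, not a real layer
```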