> Where do you move the weights anyway once they are loaded into the GPU VRAM?
The GPU can’t do anything with weights while they are in VRAM. They have to be moved into the GPU itself first.
So it is about memory round-trips, but not between RAM and VRAM. It’s the round-trips between the VRAM and the registers on the GPU die. When batch processing, the calculations for all batched requests can be done while the model parameters are in the GPU registers. If the requests were processed sequentially instead, the number of trips between the VRAM and the GPU would be multiplied by the number of individual inferences.
Also, batched prompts and outputs are indeed mathematically independent of each other.
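Concretely, that independence is easy to check with a toy numpy sketch (made-up shapes, not a real inference stack): applying one weight matrix to a whole batch gives exactly the same result per request as running each request on its own.

```python
import numpy as np

# Toy "layer": one weight matrix applied to a batch of 4 requests.
W = np.random.randn(8, 8)   # hypothetical layer weights
X = np.random.randn(8, 4)   # 4 independent requests, one activation column each

batched = W @ X             # one matmul over the whole batch

# Column i depends only on request i, never on the other columns,
# so the batched result matches the one-request-at-a-time result.
for i in range(4):
    assert np.allclose(batched[:, i], W @ X[:, i])
```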
Round-trips between VRAM and GPU registers? That's what the cache hierarchies are for. I think you conflated quite a few concepts here.
Moving data to and from VRAM costs ~100ns of latency. Moving data from RAM to VRAM over PCIe 5.0 costs 1-10us of latency. So, ~1 to ~2 orders of magnitude of difference.
And this is the reason batching is used: you don't want to pay that latency for each and every CPU-to-GPU request; instead you push as much data as you can through a single round-trip.
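As a back-of-envelope sketch of that amortization (using the rough, assumed latency figure above, not a measurement):

```python
# Paying one RAM->VRAM round-trip per request vs. one per batch.
pcie_latency_s = 5e-6    # assumed midpoint of the ~1-10us PCIe figure above
requests = 1000

sequential_s = requests * pcie_latency_s   # one round-trip per request
batched_s = pcie_latency_s                 # one round-trip for the whole batch

print(f"sequential: {sequential_s * 1e3:.1f} ms of latency")   # 5.0 ms
print(f"batched:    {batched_s * 1e6:.0f} us of latency")      # 5 us
```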
Model weights are significantly larger than cache in almost all cases. Even an 8B-parameter model is ~16GB in half precision. The caches are not large enough to actually cache that.
Every weight has to be touched on every forward pass, meaning you have to wait for 16GB to transfer from VRAM -> SRAM -> registers. That's not even close to 100ns: on a 4090 with ~1TB/s memory bandwidth it's 16 milliseconds. PCIe latency to launch kernels or move 20 integers or whatever is functionally irrelevant on this scale.
The real reason for batching is that it lets you reuse that gigantic VRAM -> SRAM transfer across the batch and sequence dimensions. Instead of paying a 16ms memory tax for each token, you pay it once for the whole batched forward pass.
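For a back-of-envelope check of those numbers (assumed figures, not measurements):

```python
# Rough arithmetic behind the "16ms memory tax" claim.
params = 8e9             # 8B-parameter model
bytes_per_param = 2      # fp16 / half precision
bandwidth_bps = 1e12     # ~1 TB/s VRAM bandwidth (4090 ballpark)

weight_bytes = params * bytes_per_param   # ~16 GB of weights
pass_s = weight_bytes / bandwidth_bps     # time to stream all weights once
print(f"weights streamed once per pass: ~{pass_s * 1e3:.0f} ms")   # ~16 ms

# Batching amortizes that one stream across every sequence in the batch.
for batch in (1, 8, 64):
    print(f"batch={batch:3d}: ~{pass_s * 1e3 / batch:.2f} ms of weight traffic per sequence")
```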
You've made several incorrect assumptions, and I am not bothered enough to try to correct them, so I apologize for my ignorance. I'll just say that the 16ms memory tax figure is wildly incorrect.
You either have a massive misconception of GPT-like decoder transformers, or of how GPU data paths are architected, or you are trolling.
Go talk to a modern reasoning model to get yourself some knowledge; it's going to be much better than what you appear to have.
That’s the core point though. If you batch, the caches and registers are already primed and ready. The model runs in steps/layers, accessing different weights in VRAM along the way, and batching takes advantage of that.
I’m in agreement that RAM-to-VRAM transfer matters too, but I think the key speedup from inference batching is the point above.
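To make that layer-by-layer reuse concrete, here's a toy numpy sketch (invented shapes, not real transformer code): each layer's weights are fetched once per forward pass and then reused for every sequence in the batch.

```python
import numpy as np

# Stand-in for a stack of decoder layers: 4 weight matrices.
layers = [np.random.randn(8, 8) for _ in range(4)]
batch = np.random.randn(8, 64)   # 64 sequences batched together

x = batch
for W in layers:
    # One fetch of W serves all 64 columns; processing sequences one
    # at a time would refetch these same weights 64x as often.
    x = np.tanh(W @ x)           # toy nonlinearity, not a real layer
```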