According to NVIDIA:
> a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second.
It would not be that slow, since it is an MoE model that only activates 37B parameters per token.
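Back-of-the-envelope on why MoE helps here, assuming FP8 weights and ignoring KV cache and attention overhead (my figures, not NVIDIA's):

```python
# Per-token weight reads: dense vs. MoE (rough sketch, FP8 assumed).
TOTAL_PARAMS = 671e9    # DeepSeek-R1 total parameters
ACTIVE_PARAMS = 37e9    # parameters activated per token via expert routing
BYTES_PER_PARAM = 1     # FP8

print(f"dense would read ~{TOTAL_PARAMS * BYTES_PER_PARAM / 1e9:.0f} GB/token")
print(f"MoE reads only   ~{ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9:.0f} GB/token")
```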
Still, 8x3090 (192 GB total) gives you only ~2.25 bits per weight, which is not a healthy quantization. You would need PCIe bifurcation to get up to 16x3090 for lightning-fast inference with 4-bit quants.
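For reference, the bits-per-weight figure is just total VRAM over parameter count (weights only, ignoring KV cache and activations):

```python
# Bits of VRAM available per weight for N cards (weights only, no KV cache).
def bits_per_weight(num_gpus, vram_gb_each, params=671e9):
    return num_gpus * vram_gb_each * 1e9 * 8 / params

print(f" 8x3090 (24 GB each): {bits_per_weight(8, 24):.2f} bpw")   # ~2.29
print(f"16x3090 (24 GB each): {bits_per_weight(16, 24):.2f} bpw")  # ~4.58, enough for 4-bit
```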
At that point, though, the system becomes very hard to build: PCIe lanes, signal integrity, physical volume, heat, and power requirements all become serious constraints.
This is the advantage of moving up to Quadro-class workstation cards: half the power for 2-4x the VRAM (the top-end Blackwell Quadro is expected to ship with 96 GB).
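A rough sketch of what that buys you, assuming ~4.5 effective bits/weight for a 4-bit quant (roughly Q4_0 in llama.cpp) and the rumored 96 GB card:

```python
# Card count for a 4-bit quant of 671B params, weights only (assumed figures).
params = 671e9
bpw = 4.5                             # 4-bit quant plus scale overhead
weights_gb = params * bpw / 8 / 1e9   # ~377 GB

print(f"~{weights_gb:.0f} GB of weights -> {weights_gb / 96:.1f} x 96 GB cards")
print(f"vs. {weights_gb / 24:.1f} x 24 GB 3090s (before any KV cache headroom)")
```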
Yeah, there is a clear bottleneck somewhere in llama.cpp; even high-end hardware struggles to get good numbers. Measured throughput still sits well below the theoretical memory-bandwidth limit.
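Here is a naive bandwidth-bound ceiling for comparison, assuming perfect scaling across cards, ~4.5 effective bits/weight, and the 3090's 936 GB/s memory bandwidth (real llama.cpp numbers land far below this, which is the bottleneck):

```python
# Single-stream decode ceiling if you were purely memory-bandwidth bound.
ACTIVE_PARAMS = 37e9
BPW = 4.5
bytes_per_token = ACTIVE_PARAMS * BPW / 8   # ~20.8 GB read per token

bw_3090 = 936e9                             # RTX 3090 memory bandwidth (bytes/s)
for n in (8, 16):
    print(f"{n}x3090 ideal: {n * bw_3090 / bytes_per_token:.0f} tok/s")
```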
Is it possible that eight graphics cards is the most practical configuration? How do you even set that up? I guess server mobos have crazy numbers of PCIe slots?