This is neat, but what I really want to see is someone running it on 8x 3090/4090/5090 cards, and what the most practical configuration for that would be.

According to NVIDIA:

> a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second.

You can rent a single H200 for $3/hour.
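
Back-of-envelope cost from those two numbers (a rough sketch in Python; it assumes the $3/hour rental price and NVIDIA's claimed throughput both hold, and ignores everything else):

    # Cost per token for an 8x H200 box, using the figures quoted above.
    # Assumptions: $3/hr per GPU and a sustained 3,872 tok/s for the whole server.
    gpus = 8
    cost_per_gpu_hour = 3.0            # USD/hour, rental price quoted above
    tokens_per_second = 3872           # NVIDIA's claimed aggregate throughput

    cost_per_hour = gpus * cost_per_gpu_hour          # $24/hour
    tokens_per_hour = tokens_per_second * 3600        # ~13.9M tokens
    print(f"~${cost_per_hour / (tokens_per_hour / 1e6):.2f} per million tokens")
    # prints roughly $1.72 per million tokens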


I have been searching for a single example of someone running it like this (or on 8x P40 and the like), and have found nothing.

8x 3090 will net you around 10-12 tok/s.

It would not be that slow, as it is an MoE model with only 37B activated parameters.

Still, 8x 3090 gives you ~2.25 bits per weight, which is not a healthy quantization. PCIe bifurcation to get up to 16x 3090 would be necessary for lightning-fast inference with 4-bit quants.
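
The arithmetic behind that, as a quick Python sketch (assumes 671B total weights and 24 GB usable VRAM per 3090, with nothing reserved for KV cache or activations, so the real numbers are a bit worse):

    # Bits per weight that fit across N 3090s for a 671B-parameter model.
    # Assumptions: 24 GB usable per card, zero overhead for KV cache/activations.
    total_params = 671e9
    vram_per_card = 24e9    # bytes

    for cards in (8, 16):
        bits_per_weight = cards * vram_per_card * 8 / total_params
        print(f"{cards}x 3090: ~{bits_per_weight:.2f} bits/weight")
    # 8x 3090:  ~2.29 bits/weight -- below a comfortable quantization
    # 16x 3090: ~4.58 bits/weight -- enough headroom for 4-bit quants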

At that point, though, it becomes very hard to build a system due to PCIe lanes, signal integrity, the physical space required, the heat generated, and the power requirements.

This is the advantage of moving up to Quadro cards: half the power for 2-4x the VRAM (the top-end Blackwell Quadro is expected to have 96 GB).


Yeah, there is a clear bottleneck somewhere in llama.cpp; even high-end hardware struggles to get good numbers. The theoretical limit is well above what's currently being achieved.

Benchmarks: https://github.com/ggerganov/llama.cpp/issues/11474#issuecom...
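
For a rough sense of where the ceiling should be: single-stream decode is usually memory-bandwidth-bound, so an upper bound is bandwidth divided by the bytes of weights read per token. A Python sketch (assumes 37B activated parameters, ~2.25 bits/weight, ~936 GB/s on a 3090, and a layer split where only one GPU is busy at a time; all rough assumptions):

    # Crude bandwidth-bound ceiling for single-stream decode on 8x 3090.
    # Assumptions: 37e9 activated params, 2.25 bits/weight, 936 GB/s per card,
    # layer-split inference so only one card's bandwidth counts at any moment.
    activated_params = 37e9
    bits_per_weight = 2.25
    bandwidth = 936e9    # bytes/s per 3090

    bytes_per_token = activated_params * bits_per_weight / 8    # ~10.4 GB
    print(f"~{bandwidth / bytes_per_token:.0f} tok/s ceiling")  # ~90 tok/s
    # Far above the ~10-12 tok/s people actually report, which is why this
    # looks like a software bottleneck rather than a hardware one.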


Is it possible that eight graphics cards is the most practical configuration? How do you even set that up? I guess server mobos have crazy numbers of PCIe slots?


