This is neat, but what I really want to see is someone running it on 8x 3090/4090/5090 cards, and what the most practical configuration for that would be.

According to NVIDIA:

> a single server with eight H200 GPUs connected using NVLink and NVLink Switch can run the full, 671-billion-parameter DeepSeek-R1 model at up to 3,872 tokens per second.

You can rent a single H200 for $3/hour.
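
Back-of-envelope cost from those two numbers (a rough sketch in Python; it assumes the $3/hour rental price and NVIDIA's claimed throughput both hold, and ignores everything else):

    # Cost per token for an 8x H200 box, using the figures quoted above.
    # Assumptions: $3/hr per GPU and a sustained 3,872 tok/s for the whole server.
    gpus = 8
    cost_per_gpu_hour = 3.0            # USD/hour, rental price quoted above
    tokens_per_second = 3872           # NVIDIA's claimed aggregate throughput

    cost_per_hour = gpus * cost_per_gpu_hour          # $24/hour
    tokens_per_hour = tokens_per_second * 3600        # ~13.9M tokens
    print(f"~${cost_per_hour / (tokens_per_hour / 1e6):.2f} per million tokens")
    # prints roughly $1.72 per million tokens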


I have been searching for a single example of someone running it like this (or on 8x P40 and the like), and have found nothing.

8x 3090 will net you around 10-12 tok/s.

It would not be that slow, as it is an MoE model with only 37B activated parameters.

Still, 8x 3090 gives you ~2.25 bits per weight, which is not a healthy quantization. PCIe bifurcation to get up to 16x 3090 would be necessary for lightning-fast inference with 4-bit quants.
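
The arithmetic behind that, as a quick Python sketch (assumes 671B total weights and 24 GB usable VRAM per 3090, with nothing reserved for KV cache or activations, so the real numbers are a bit worse):

    # Bits per weight that fit across N 3090s for a 671B-parameter model.
    # Assumptions: 24 GB usable per card, zero overhead for KV cache/activations.
    total_params = 671e9
    vram_per_card = 24e9    # bytes

    for cards in (8, 16):
        bits_per_weight = cards * vram_per_card * 8 / total_params
        print(f"{cards}x 3090: ~{bits_per_weight:.2f} bits/weight")
    # 8x 3090:  ~2.29 bits/weight -- below a comfortable quantization
    # 16x 3090: ~4.58 bits/weight -- enough headroom for 4-bit quants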

At that point, though, it becomes very hard to build a system due to PCIe lanes, signal integrity, the physical space required, the heat generated, and the power requirements.

This is the advantage of moving up to Quadro cards: half the power for 2-4x the VRAM (the top-end Blackwell Quadro is expected to have 96 GB).


Yeah, there is a clear bottleneck somewhere in llama.cpp; even high-end hardware struggles to get good numbers. The theoretical limit is well above what's currently being achieved.

Benchmarks: https://github.com/ggerganov/llama.cpp/issues/11474#issuecom...
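
For a rough sense of where the ceiling should be: single-stream decode is usually memory-bandwidth-bound, so an upper bound is bandwidth divided by the bytes of weights read per token. A Python sketch (assumes 37B activated parameters, ~2.25 bits/weight, ~936 GB/s on a 3090, and a layer split where only one GPU is busy at a time; all rough assumptions):

    # Crude bandwidth-bound ceiling for single-stream decode on 8x 3090.
    # Assumptions: 37e9 activated params, 2.25 bits/weight, 936 GB/s per card,
    # layer-split inference so only one card's bandwidth counts at any moment.
    activated_params = 37e9
    bits_per_weight = 2.25
    bandwidth = 936e9    # bytes/s per 3090

    bytes_per_token = activated_params * bits_per_weight / 8    # ~10.4 GB
    print(f"~{bandwidth / bytes_per_token:.0f} tok/s ceiling")  # ~90 tok/s
    # Far above the ~10-12 tok/s people actually report, which is why this
    # looks like a software bottleneck rather than a hardware one.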


Is it possible that eight graphics cards is the most practical configuration? How do you even set that up? I guess server mobos have crazy numbers of PCIe slots?


