
Hi, author here. I crowd-sourced the devices for benchmarking from my friends; it just happened that none of them had this device.


Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much, much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench instead of ollama.

Running gpt-oss-120b on an RTX 5090 with 2/3 of the experts offloaded to system RAM (less than half the memory bandwidth of this thing), my machine gets ~4100 tps prefill and ~40 tps decode.
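For reference, that run is the kind of thing llama-bench makes easy (assuming a recent llama.cpp build with -ot support; the model path and the regex, which pins the expert tensors of 24 of gpt-oss-120b's 36 blocks to CPU, are illustrative, not my exact invocation):

    llama-bench -m gpt-oss-120b-mxfp4.gguf -ngl 99 -fa 1 \
      -ot "blk\.(1[2-9]|2[0-9]|3[0-5])\.ffn_.*_exps\.=CPU" \
      -p 2048 -n 128

The point of -ot is that only the big, sparsely-activated expert tensors go to system RAM, while attention and the dense layers stay in VRAM.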

Your spreadsheet shows the Spark getting ~94 tps prefill and ~11 tps decode.

Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar, or the Spark a touch faster.
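Back-of-the-envelope, since batch-1 decode is bound by how fast you can stream the active weights: a rough ceiling, using gpt-oss-120b's published ~5.1B active parameters per token and assuming ~4.25 effective bits per weight for MXFP4 (scales included; KV-cache and activation traffic ignored):

    active_params  = 5.1e9   # gpt-oss-120b active params per token (published figure)
    bits_per_param = 4.25    # MXFP4: 4-bit values plus shared scales (assumed)
    gb_per_token   = active_params * bits_per_param / 8 / 1e9   # ~2.7 GB

    for name, bw_gbs in [("DGX Spark (~256 GB/s)", 256),
                         ("dual-channel DDR5-6000 (~96 GB/s)", 96)]:
        print(f"{name}: ~{bw_gbs / gb_per_token:.0f} tok/s ceiling")

That gives the Spark a ceiling around ~94 tok/s and plain dual-channel DDR5 around ~35 tok/s. Even with real-world overheads, 11 tps is nowhere near the Spark's ceiling, while my ~40 tps (with a third of the experts still in VRAM) is about where you'd expect.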


Your system RAM probably has 1/20th the VRAM bandwidth of the 5090 (way, way less than half); unless you're running a workstation board with quad- or eight-channel RAM, in which case it's still only about 1/10th or 1/5th respectively.


I'm saying it's less than half of this DGX Spark's: dual-channel DDR5-6000 vs. quad-channel LPDDR5-8000.
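Sanity-checking both sets of ratios from peak theoretical bandwidth (bus widths assumed to be the standard configurations: 128-bit for dual-channel DDR5, 256-bit for the Spark's LPDDR5, 512-bit GDDR7 at 28 Gbps on the 5090):

    def gbs(bus_bits, mt_per_s):
        return bus_bits / 8 * mt_per_s / 1000   # peak GB/s

    ddr5_dual = gbs(128, 6000)    #   96 GB/s, dual-channel DDR5-6000
    spark     = gbs(256, 8000)    #  256 GB/s, quad-channel LPDDR5-8000
    rtx5090   = gbs(512, 28000)   # 1792 GB/s, 512-bit GDDR7 @ 28 Gbps

    print(ddr5_dual / rtx5090)    # ~0.054 -> roughly 1/20, as the parent says
    print(ddr5_dual / spark)      # ~0.375 -> well under half of the Spark

So both claims hold: ~1/20th of the 5090's VRAM, and less than half of the Spark.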


We actually profiled one of the models and saw that the last GEMM, which is completely memory-bound, accounts for a large share of the runtime, which reduces the token speed considerably.


The parent is right; the issue is on your side.



