
Hi, author here. I crowd-sourced the devices for benchmarking from my friends; it just happened that none of them had this device.


Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much, much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench instead of ollama.

Running gpt-oss-120b on an RTX 5090 with 2/3 of the experts offloaded to system RAM (less than half the memory bandwidth of this thing), my machine gets ~4100 tps prefill and ~40 tps decode.
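For reference, that run is the kind of thing llama-bench makes easy (assuming a recent llama.cpp build with -ot support; the model path and the regex, which pins the expert tensors of 24 of gpt-oss-120b's 36 blocks to CPU, are illustrative, not my exact invocation):

    llama-bench -m gpt-oss-120b-mxfp4.gguf -ngl 99 -fa 1 \
      -ot "blk\.(1[2-9]|2[0-9]|3[0-5])\.ffn_.*_exps\.=CPU" \
      -p 2048 -n 128

The point of -ot is that only the big, sparsely-activated expert tensors go to system RAM, while attention and the dense layers stay in VRAM.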

Your spreadsheet shows the Spark getting ~94 tps prefill and ~11 tps decode.

Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar, or the Spark a touch faster.
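Back-of-the-envelope, since batch-1 decode is bound by how fast you can stream the active weights: a rough ceiling, using gpt-oss-120b's published ~5.1B active parameters per token and assuming ~4.25 effective bits per weight for MXFP4 (scales included; KV-cache and activation traffic ignored):

    active_params  = 5.1e9   # gpt-oss-120b active params per token (published figure)
    bits_per_param = 4.25    # MXFP4: 4-bit values plus shared scales (assumed)
    gb_per_token   = active_params * bits_per_param / 8 / 1e9   # ~2.7 GB

    for name, bw_gbs in [("DGX Spark (~256 GB/s)", 256),
                         ("dual-channel DDR5-6000 (~96 GB/s)", 96)]:
        print(f"{name}: ~{bw_gbs / gb_per_token:.0f} tok/s ceiling")

That gives the Spark a ceiling around ~94 tok/s and plain dual-channel DDR5 around ~35 tok/s. Even with real-world overheads, 11 tps is nowhere near the Spark's ceiling, while my ~40 tps (with a third of the experts still in VRAM) is about where you'd expect.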


Your system RAM probably has 1/20th the VRAM bandwidth of the 5090 (way, way less than half); unless you're running a workstation board with quad- or eight-channel RAM, in which case it's still only about 1/10th or 1/5th respectively.


I'm saying it's less than half of this DGX Spark's: dual-channel DDR5-6000 vs. quad-channel LPDDR5-8000.
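Sanity-checking both sets of ratios from peak theoretical bandwidth (bus widths assumed to be the standard configurations: 128-bit for dual-channel DDR5, 256-bit for the Spark's LPDDR5, 512-bit GDDR7 at 28 Gbps on the 5090):

    def gbs(bus_bits, mt_per_s):
        return bus_bits / 8 * mt_per_s / 1000   # peak GB/s

    ddr5_dual = gbs(128, 6000)    #   96 GB/s, dual-channel DDR5-6000
    spark     = gbs(256, 8000)    #  256 GB/s, quad-channel LPDDR5-8000
    rtx5090   = gbs(512, 28000)   # 1792 GB/s, 512-bit GDDR7 @ 28 Gbps

    print(ddr5_dual / rtx5090)    # ~0.054 -> roughly 1/20, as the parent says
    print(ddr5_dual / spark)      # ~0.375 -> well under half of the Spark

So both claims hold: ~1/20th of the 5090's VRAM, and less than half of the Spark.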


We actually profiled one of the models and saw that the last GEMM, which is completely memory-bound, accounts for a large share of the runtime, which reduces the token speed considerably.


The parent is right; the issue is on your side.



