yvbbrjdr's comments | Hacker News

Hi, author here. I crowd-sourced the devices for benchmarking from my friends. It just happened that none of my friends has this device.


Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench instead of ollama.

Running gpt-oss-120b with an RTX 5090 and 2/3 of the experts offloaded to system RAM (less than half of the memory bandwidth of this thing), my machine gets ~4100tps prefill and ~40tps decode.

Your spreadsheet shows the spark getting ~94tps prefill and ~11tps decode.

Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar or the spark a touch faster.


Your system RAM probably has about 1/20th the VRAM bandwidth of the 5090 (way, way less than half), unless you're running a workstation board with quad- or 8-channel RAM, in which case it's only about 1/10th or 1/5th respectively.


I'm saying it's less than half of this DGX Spark's: dual-channel DDR5-6000 vs quad-channel LPDDR5-8000.
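
A quick back-of-envelope with those figures (treating every channel as 64 bits wide, which is an assumption; exact channel widths differ between platforms):

  # rough peak bandwidth from channel count and transfer rate (sketch)
  def peak_gb_per_s(channels, mt_per_s, bits_per_channel=64):
      return channels * (bits_per_channel / 8) * mt_per_s / 1000

  dual_ddr5_6000   = peak_gb_per_s(2, 6000)  # ~96 GB/s
  quad_lpddr5_8000 = peak_gb_per_s(4, 8000)  # ~256 GB/s
  print(dual_ddr5_6000 / quad_lpddr5_8000)   # ~0.375, i.e. less than half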


We actually profiled one of the models and saw that the last GEMM, which is completely memory-bound, takes a lot of time, which drags down the token speed considerably.
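
As a rough roofline sketch of the decode ceiling (every number below is an assumption, not a measurement: ~256 GB/s of usable bandwidth, ~5.1B active parameters for gpt-oss-120b, ~4.25 bits per parameter for MXFP4 including scales):

  # decode t/s is roughly bounded by how fast the active expert weights
  # can be streamed from memory for each generated token (sketch)
  bandwidth_bytes_per_s  = 256e9       # assumed usable memory bandwidth
  active_params          = 5.1e9       # assumed active params per token
  bytes_per_param        = 4.25 / 8    # assumed MXFP4 footprint
  weight_bytes_per_token = active_params * bytes_per_param
  print(bandwidth_bytes_per_s / weight_bytes_per_token)  # ~94 t/s ceiling

Real decode lands well below that once attention, KV-cache traffic, and kernel overheads are added; the point is just that the memory-bound GEMMs set the ceiling.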


The parent is right, the issue is on your side.


Hi, author here. I crowd-sourced the devices for benchmarking from my friends. It just happened that one of my friends has this device.


FYI you should have used llama.cpp to do the benchmarks. It performs almost 20x faster than ollama for the gpt-oss-120b model. Here are some sample results on my Spark:

  ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
  | model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |          pp4096 |       3564.31 ± 9.91 |
  | gpt-oss 20B MXFP4 MoE          |  11.27 GiB |    20.91 B | CUDA       |  99 |     2048 |  1 |            tg32 |         53.93 ± 1.71 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |          pp4096 |      1792.32 ± 34.74 |
  | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | CUDA       |  99 |     2048 |  1 |            tg32 |         38.54 ± 3.10 |


Is this the full-weight model or a quantized version? The GGUFs distributed on Hugging Face and labeled as MXFP4 quantization have some layers quantized to int8 (q8_0) instead of the bf16 suggested by OpenAI.

For example, blk.0.attn_k.weight (among other layers) is q8_0:

https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...

The same weight in Ollama's copy, by contrast, is BF16:

https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360
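
If anyone wants to check their own files, here's a minimal sketch using the gguf Python package (the file name is hypothetical and the field names are from memory, so treat it as a sketch rather than a recipe):

  # print each tensor's quantization type in a GGUF file (sketch)
  from gguf import GGUFReader

  reader = GGUFReader("gpt-oss-20b-mxfp4.gguf")  # hypothetical local path
  for t in reader.tensors:
      print(t.name, t.tensor_type.name)          # e.g. blk.0.attn_k.weight Q8_0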


I see! Do you know what's causing the slowdown for ollama? They should be using the same backend..


Dude, ggerganov is the creator of llama.cpp. Kind of a legend. And of course he is right, you should've used llama.cpp.

Or you can just ask the ollama people about the ollama problems. Ollama is (or was) just a Go wrapper around llama.cpp.


Was. They've been diverging.


Now this looks much more interesting! Is the top one input tokens and the second one output tokens?

So 38.54 t/s on 120B? Have you tested filling the context too?


Yes, I provided detailed numbers here: https://github.com/ggml-org/llama.cpp/discussions/16578


Makes sense you have one of the boxes. What's your take on it? [Respecting any NDAs/etc/etc of course]


Curious how this compares to running on a Mac.


TTFT on a Mac is terrible and only gets worse as the context grows; that's why many are selling their M3 Ultra 512GB machines.


So, so many… an eBay search shows only 15 results, 6 of them being ads for new systems…

https://www.ebay.com/sch/i.html?_nkw=mac+studio+m3+ultra+512...


We use DeepSeek v3 in prod. Works even better than GPT-4o.


Join the Discord server to see Athena in action! https://discord.gg/X38GnhdTH8


Yes. What I'm targeting is the mass population, and I really need to figure out a way to solve this dilemma.


Under the hood, we use the Argon2i algorithm to derive the secret key from an arbitrarily long password string. We use the term "password" because that's what ordinary people understand (zip, for example, uses the same term for its secret keys). In practice, people should choose a password that's long enough to resist brute forcing, just like picking a password for their online accounts.
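
Roughly what that derivation looks like, sketched here with PyNaCl (the library and the parameters are illustrative, not necessarily what the app actually ships):

  # sketch: derive a 32-byte key from a password with Argon2i, then use it
  # for nonce'd, keyed (authenticated) encryption
  import nacl.pwhash, nacl.secret, nacl.utils

  password = b"correct horse battery staple"
  salt = nacl.utils.random(nacl.pwhash.argon2i.SALTBYTES)  # stored alongside the ciphertext
  key = nacl.pwhash.argon2i.kdf(
      nacl.secret.SecretBox.KEY_SIZE, password, salt,
      opslimit=nacl.pwhash.argon2i.OPSLIMIT_MODERATE,
      memlimit=nacl.pwhash.argon2i.MEMLIMIT_MODERATE,
  )
  ciphertext = nacl.secret.SecretBox(key).encrypt(b"hello")  # random nonce prepended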

Using a public-key system is a good idea, but it really confuses new users who have never used PKI before. Nevertheless, for advanced users we have a key-exchange feature built into the app that allows two parties to negotiate a shared secret using X25519.
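
For the curious, the shape of that X25519 negotiation, sketched with the cryptography package (illustrative only, not the app's actual code):

  # sketch: two parties derive the same shared secret over X25519
  from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

  alice = X25519PrivateKey.generate()
  bob   = X25519PrivateKey.generate()

  # each side sends its public key to the other, then computes:
  shared_a = alice.exchange(bob.public_key())
  shared_b = bob.exchange(alice.public_key())
  assert shared_a == shared_b  # 32-byte secret; run it through a KDF before use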


Wow! This project seems to do exactly what ours does right now, with an even better UI/UX... but they don't seem to support any kind of nonce'd and keyed encryption?

For some reason it's no longer on the App Store anywhere.


Interesting project! But yeah, it takes little effort to detect.


Yes! I'm currently creating a steganography scheme with an NLP model.

We've actually tried implementing a browser extension before. The problem is ordinary people just don't use browser extensions..


Ordinary people can use browser extensions OK on desktop, but on mobile it's a mess. Chrome for Android doesn't support extensions, and no one uses the Android browsers that do. Installing an extension for Safari on iOS requires following many unintuitive steps. I hope mobile extensions become easier to install/use with time!


The original version was a browser extension. It was very painful to maintain support for all the different types of input fields. Most large social media sites do not use standard text areas.

https://github.com/XCF-Babble/babble


Thanks for your advice! However, there are several problems with self-hosted platforms in China.

1. People are unaware of their existence, because those projects are very technical and hard to deploy or join, and they don't have good clients on mobile platforms. People will trade their privacy for all the convenience that, say, WeChat brings, because all of their contacts are already on WeChat. It's hard to convince people to switch to your Matrix server.

2. Cloud services are also monitored by the government. There are programs running in the background inside VPSes that monitor every process on your server.

3. If you want to host a website, you have to register it with a state agency, so if there's any content on your website that the government doesn't like, the website will be shut down and you'll be held responsible.

As for the walled garden Apple created, I heard the EU has passed a law mandating that Apple allow third-party app stores. It'll be very interesting to see what happens.


As for getting people to join: LEAVE THE APPLE WALLED GARDEN. After that it's as easy as sticking up QR codes or the equivalent.

I'm not talking about working within the system. Buy crypto, and with it rent a self-hosted NON-CHINESE SERVER, not a website, and do your best to keep the box reachable via popular, not-yet-banned VPNs used in China, for the day the firewall gets you.

Again, with regard to getting people to join these services: if people aren't willing to put up with the minor discomfort of not using the WeChat interface, they're hardly likely to stand next to you in a street protest.

Yes, if you want to start a large viral movement you have to dress it up a little (improve the Chinese locale or fix some UI issues), but that is massively easier than starting from a text editor or compiler on a remote box. If you just want to go viral, use WeChat, get a knocked-off account or ten, and expect that knock on the door when they turn up, because you're protesting _WITHIN_ the system.

Again, VPNs are massively technical but hugely popular even in mainland China (I've known enough Chinese Apple users to know this is the case). People are capable of following "click here" instructions better than most imagine; otherwise technophobes wouldn't have social media.

