Something is wrong with your numbers: gpt-oss-20b and gpt-oss-120b should be much, much faster than what you are seeing. I would suggest you familiarize yourself with llama-bench instead of ollama.
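For reference, a minimal run looks something like this (the model path is a placeholder and the flags are from memory, so double-check against llama-bench --help):

    # measure prefill (pp512) and decode (tg128) throughput with all layers on GPU
    llama-bench -m gpt-oss-120b-mxfp4.gguf -ngl 99 -p 512 -n 128

It prints clean pp/tg tokens-per-second tables without ollama's serving overhead in the way.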
Running gpt-oss-120b on an RTX 5090 with 2/3 of the experts offloaded to system RAM (which has less than half the memory bandwidth of this thing), my machine gets ~4100 tps prefill and ~40 tps decode.
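If you want to reproduce that split: llama.cpp can pin expert tensors to CPU with --override-tensor (-ot). The regex below is illustrative, assuming the usual GGUF expert tensor naming and gpt-oss-120b's 36 blocks, so inspect your file's tensor names before copying it:

    # everything on GPU except the expert FFN weights of blocks 0-23 (2/3 of 36)
    llama-bench -m gpt-oss-120b-mxfp4.gguf -ngl 99 \
      -ot "blk\.([0-9]|1[0-9]|2[0-3])\.ffn_.*_exps=CPU"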
Your spreadsheet shows the Spark getting ~94 tps prefill and ~11 tps decode.
Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar, or even a touch faster on the Spark.
Your system RAM is probably 1/20th the VRAM bandwidth of the 5090 (way, way less than half), unless you're running a workstation board with quad- or 8-channel RAM, in which case it's still only about 1/10th or 1/5th respectively.
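To put rough numbers on that: dual-channel DDR5-6000 moves about 96 GB/s against roughly 1.8 TB/s of GDDR7 on the 5090, i.e. ~1/19; quad-channel lands around 192 GB/s (~1/9) and 8-channel around 384 GB/s (~1/5). Ballpark figures that scale with your actual RAM speed.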
We actually profiled one of the models and saw that the final GEMM, which is completely memory-bound, accounts for a large share of the step time, which drags token speed down considerably.
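For context, at batch size 1 that GEMM is effectively a GEMV, so its runtime floor is just bytes moved divided by memory bandwidth, independent of compute. As a back-of-envelope with assumed specs: gpt-oss-120b's ~5.1B active params at MXFP4's ~4.25 bits/weight is about 2.7 GB read per token, which at a ~273 GB/s memory spec would bound decode near 100 tps; activations, KV cache, and kernel overheads push the real number well below that.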