Max RAM for Strix Halo is 128GB. It's not a competitor to the Mac Ultra which goes up to 512GB.
You shouldn't need another GPU to do prompt processing on Strix Halo, since the biggest model it can realistically run is around 70B. Offloading prompt processing wouldn't help much anyway: its GPU is good enough for that, but its memory bandwidth is only 256 GB/s (~210 GB/s effective), and that's the real bottleneck.
Despite the hype, the 512GB Mac is not really a good buy for LLMs. The ability to run a giant model on it is a novelty that will wear off quickly... it's just too slow at that size, and in practice it has the same 30-70B sweet spot you'd get from a much cheaper machine with a discrete GPU, except without the advantage of running smaller models at full GPU-accelerated speed.
2 to 3 tokens per second was actually probably fine for most things last year.
Now, with reasoning and deep-research models, you're going to generate 1,000 or more tokens just from the model talking to itself to figure out what to do for you.
So while everyone's focused on how big a model you can fit in your RAM, inference speed is now more important than it was.
The thinking models really hurt. I was happy with anything that ran at least as fast as I could read, then "thinking" became a thing and now I need it to run ten times faster.
I guess code is tough too. If I'm talking to a model I'll read everything it says, so 10-20 tok/s is well and good, but that's molasses slow if it's outputting code and I'm scanning it to see if it looks right.
counterpoint: thinking models are good since they give similar quality at smaller RAM sizes. if a 16b thinking model is as good as a 60b one-shot model, you can spend more compute without as much of a RAM bottleneck
The $2000 strix halo with 128 GB might not compete with the $9000 Mac Studio with 512 GB but is a competitor to the $4000 Mac Studio with 96 GB. The slow memory bandwidth is a bummer, though.
Not really. The M4 Max has 2x the GPU power, 2.13x the memory bandwidth, and a faster CPU.
The $2000 M4 Pro Mini is more of a direct comparison. The Mini only goes up to 64 GB of RAM, but realistically a 32B model is the biggest you want to run with less than 300 GB/s of bandwidth anyway.
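That rule of thumb falls out of a simple bandwidth calculation: decode on these machines is roughly memory-bound, since each generated token has to stream every active weight through memory once. A minimal sketch (the bandwidth and quantization figures here are my own illustrative assumptions, not from the thread):

```python
# Rough ceiling on decode speed for a memory-bandwidth-bound LLM:
# tokens/sec <= memory bandwidth / bytes of weights read per token.

def decode_tps_ceiling(bandwidth_gbs: float, params_b: float,
                       bytes_per_param: float) -> float:
    """Upper bound on tokens/sec, ignoring compute and overhead."""
    model_gb = params_b * bytes_per_param  # weights streamed per token
    return bandwidth_gbs / model_gb

# 32B model at 4-bit (~0.5 bytes/param) on ~273 GB/s (M4 Pro-class):
print(round(decode_tps_ceiling(273, 32, 0.5), 1))  # ~17 tok/s ceiling

# 70B model at 4-bit on Strix Halo's ~210 GB/s effective bandwidth:
print(round(decode_tps_ceiling(210, 70, 0.5), 1))  # ~6 tok/s ceiling
```

Real throughput lands below these ceilings, but the ratio explains the sweet spot: past ~32B on sub-300 GB/s hardware you drop under reading speed, and thinking models make that much worse.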
The original poster arrogantly and confidently proclaims that a device that costs around $2000 isn't going to be able to compete with a $10,000 SKU of another device.
I'm wondering how you arrive at such a conclusion?