Max RAM for Strix Halo is 128GB. It's not a competitor to the Mac Ultra, which goes up to 512GB.

You shouldn't need another GPU to do prompt processing for Strix Halo, since the biggest model it can realistically run is a 70B. Offloading prompt processing won't help much anyway: its GPU is good enough, but its memory bandwidth is only 256 GB/s (~210 GB/s effective), so token generation is the bottleneck.
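
Back-of-the-envelope, decode is bandwidth-bound: every generated token streams all the active weights through memory, so tok/s is roughly effective bandwidth divided by model size in bytes. A quick Python sketch (sizes and bandwidths here are loose assumptions, not measurements):

  # Bandwidth-bound decode estimate: each generated token streams the (dense)
  # model weights through memory once. Rough assumptions, not benchmarks.
  def decode_tok_per_s(bandwidth_gb_s, model_gb):
      return bandwidth_gb_s / model_gb

  model_gb = 40  # a 70B model at q4 is very roughly 40 GB of weights
  print(decode_tok_per_s(210, model_gb))  # Strix Halo, ~210 GB/s effective -> ~5 tok/s
  print(decode_tok_per_s(819, model_gb))  # M3 Ultra, ~819 GB/s -> ~20 tok/s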




Despite the hype, the 512GB Mac is not really a good buy for LLMs. The ability to run a giant model on it is a novelty that will wear off quickly... it's just too slow at that size, and in practice it has the same 30-70B sweet spot you'd get from a much cheaper machine with a discrete GPU, minus that machine's advantage of running smaller models at full GPU-accelerated speed.


There’s so much flux in LLM requirements.

2 to 3 tokens per second was actually probably fine for most things last year.

Now, with reasoning and deep-research models, you're gonna generate 1000 or more tokens just from the model talking to itself to figure out what to do for you.

So while everyone's focused on how big a model you can fit in your RAM, inference speed is now more important than it used to be.
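
To put rough numbers on that (the reasoning budget and speeds below are illustrative assumptions):

  # Wall-clock wait on hidden "thinking" tokens before any visible answer.
  thinking_tokens = 1000
  for tok_per_s in (3, 10, 30):
      print(f"{tok_per_s} tok/s -> {thinking_tokens / tok_per_s / 60:.1f} min of silence")

At 3 tok/s that's over five minutes before the answer even starts.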


Absolutely.

The thinking models really hurt. I was happy with anything that ran at least as fast as I could read; then "thinking" became a thing, and now I need it to run ten times faster.

I guess code is tough too. If I'm talking to a model I'll read everything it says, so 10-20 tok/s is all well and good, but that's molasses-slow if it's outputting code and I'm scanning it to see if it looks right.


Counterpoint: thinking models are good since they give similar quality at smaller RAM sizes. If a 16B thinking model is as good as a 60B one-shot model, you can spend more compute without as much of a RAM bottleneck.
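
A toy version of that trade-off (every number here is hypothetical):

  # Compute proxy = params (B) * tokens generated; RAM proxy = weights at q4
  # (~0.5 GB per B params). The small thinking model burns more total compute
  # but needs far less RAM than the big one-shot model.
  def run(params_b, tokens):
      return params_b * tokens, params_b * 0.5

  print(run(16, 1200))  # 16B thinking, 1200 tokens -> (19200 compute, 8 GB)
  print(run(60, 200))   # 60B one-shot, 200 tokens  -> (12000 compute, 30 GB)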


Counter-counterpoint: RAM costs are coming down fast this year. Compute, not so much.

I still agree, though.


It runs DeepSeek R1 q4 MoE well enough.


It does have an edge on being able to run large MoE models.


The $2000 Strix Halo with 128 GB might not compete with the $9000 Mac Studio with 512 GB, but it is a competitor to the $4000 Mac Studio with 96 GB. The slow memory bandwidth is a bummer, though.


> but is a competitor to the $4000 Mac Studio with 96 GB. The slow memory bandwidth is a bummer, though.
Not really. The M4 Max has 2x the GPU power, 2.13x the bandwidth, and a faster CPU.

The $2000 M4 Pro Mini is more of a direct comparison. The Mini only has 64GB max RAM, but realistically a 32B model is the biggest you want to run with less than 300 GB/s of bandwidth.
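
Same bandwidth-bound arithmetic as earlier in the thread, applied to the 32B case (weight size is an assumption):

  # A 32B model at q4 is roughly 19 GB of weights; decode speed ~ bandwidth / size.
  weights_gb = 19
  for name, bw in (("M4 Pro ~273 GB/s", 273), ("Strix Halo ~210 GB/s", 210)):
      print(f"{name}: ~{bw / weights_gb:.0f} tok/s")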


You will be limited to a much smaller context size with half the RAM even if you're using a smaller model.


> Max RAM for Strix Halo is 128GB. It's not a competitor to the Mac Ultra, which goes up to 512GB.

What a... strange statement. How did you get to that conclusion?


Why do you think it's strange?


The original poster arrogantly and confidently proclaims that a device that costs around $2000 isn't going to be able to compete against a $10,000 SKU of another device.

I'm wondering how you get to such a conclusion.


I honestly don't understand what you're trying to say.


Running something like QwQ 32B q4 with a ~50k context will use up those 128GB with the large KV cache.
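
For anyone who wants to sanity-check the KV cache math, a rough sizing sketch; the hyperparameters below are hypothetical placeholders, not QwQ's exact config:

  # KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
  #                  * bytes_per_element * context_tokens
  def kv_cache_gb(layers, kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
      return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

  # Hypothetical 64-layer model, head_dim 128, fp16 cache, 50k context:
  print(kv_cache_gb(64, 8, 128, 50_000))   # GQA with 8 KV heads: ~13 GB
  print(kv_cache_gb(64, 64, 128, 50_000))  # full MHA, 64 heads: ~105 GB

On top of that you add the weights themselves, so how much of the 128GB a 50k context eats depends heavily on whether the model uses GQA.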


So what makes you think Strix Halo, with such a weak GPU and slow memory bandwidth, can handle 50k context with a usable experience for a 32B model?

Let's be realistic here.

The compute, bandwidth, and capacity (if 128GB) are completely imbalanced on Strix Halo. The M4 Pro with 64GB is much more balanced.


You're probably right. However, with sparse models and MoE, 128GB may be useful.


Of course it's a competitor. Only a fraction of the M3 Ultras sold will have 512GB of RAM.



