
If you have 128 GB of RAM, try running MoE models; they're a far better fit for Apple's hardware because they trade memory for inference speed. Something like WizardLM-2 8x22B needs a huge amount of memory to hold all ~141B parameters, but only about two 22B experts are active per token, so you get token speeds closer to a ~40B dense model than a 141B one.
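
If you want to try this locally, here's a minimal sketch using llama-cpp-python with Metal offload. The GGUF file name and the parameters are assumptions; point it at whatever quant you actually downloaded:

    # Minimal sketch: pip install llama-cpp-python (built with Metal support).
    # The model path is hypothetical; use the GGUF quant you downloaded.
    from llama_cpp import Llama

    llm = Llama(
        model_path="wizardlm-2-8x22b.Q4_K_M.gguf",  # hypothetical file name
        n_gpu_layers=-1,  # offload every layer to the GPU on Apple silicon
        n_ctx=8192,       # keep context modest; long prompts slow prefill a lot
    )

    out = llm("Q: Why do MoE models decode quickly?\nA:", max_tokens=128)
    print(out["choices"][0]["text"])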




I guess they're tired of people buying Macs for AI.


As a counterpoint, I haven’t had great luck with WizardLM-2. Token generation is unbearably slow, though I might have been using too large a context window. It’s an interesting model for sure, and I remember the output being decent, but I think it’s already been surpassed by other models like Qwen.


Long context windows are a problem. I gave Qwen 2.5 72B a ~115k-token context and it took ~20 minutes for the answer to finish. The upside of MoE models over dense 70B+ models is that they have much more world knowledge, since they pack far more total parameters for a similar token speed.
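
That timing is roughly what you'd expect from prefill alone. Back-of-the-envelope, assuming ~100 tok/s prompt processing for a 70B-class model on this hardware (an assumption, not a measurement):

    115,000 tokens ÷ ~100 tok/s prefill ≈ 1,150 s ≈ 19 min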


Do you have any recommendations on models to try?


Mixtral and DeepSeek use MoE. Most others don't.



I like the Llama models personally, my feelings about Meta aside. Qwen is fairly popular too, and there are a number of variants to try out. Ollama is a good starting point for trying things quickly (see the sketch below). You're definitely going to have to tolerate things crashing or not working before you understand what your hardware can handle.
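
If you go the Ollama route, here's a minimal sketch with its Python client. It assumes the Ollama server is running and you've already pulled a model; the tag below is just one example:

    # Minimal sketch: pip install ollama, run `ollama serve`, and pull a model
    # first (e.g. `ollama pull qwen2.5:72b` -- pick a size your RAM can hold).
    import ollama

    resp = ollama.chat(
        model="qwen2.5:72b",  # example tag; swap in whatever you pulled
        messages=[{"role": "user", "content": "Explain MoE inference briefly."}],
    )
    print(resp["message"]["content"])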



In addition to the ones others have listed: WizardLM-2 8x22B. Microsoft briefly released it and then pulled it, but the weights are still available.


You can also run the experts on separate machines over low-bandwidth networking, or even across the internet, though the token rate ends up limited by round-trip time, as the rough numbers below show.
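
Back-of-the-envelope on why RTT dominates, assuming one network round trip per MoE layer per token. Both numbers below are illustrative, not measured:

    # Rough bound on token rate when experts live on other machines.
    layers = 56    # Mixtral-8x22B-ish depth; illustrative
    rtt_s = 0.020  # 20 ms round trip between machines; illustrative
    print(f"~{1 / (layers * rtt_s):.1f} tok/s")  # ~0.9 tok/s from network alone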





