
I'm trying to understand how this works... does it actually run the model on the MacBook Pro? Sorry, I'm totally new to this.



Yes, it runs a quantized [1] version of the model locally. Quantization represents the weights and activations with low-precision data types (e.g. 8-bit integers instead of 32-bit floats), which shrinks the model's memory footprint. The specific model published by Ollama uses 4-bit quantization [2], which is why it can run on a MacBook Pro.
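
If it helps to see the idea, here is a toy sketch of symmetric 8-bit quantization in Python (just an illustration of the concept, not the exact 4-bit scheme Ollama ships):

    # Toy sketch of symmetric int8 quantization: store weights as 1-byte
    # integers plus a scale, then reconstruct approximate floats at inference.
    import numpy as np

    weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for layer weights

    scale = np.abs(weights).max() / 127.0                 # map the largest magnitude to 127
    q = np.round(weights / scale).astype(np.int8)         # 1 byte per weight instead of 4
    approx = q.astype(np.float32) * scale                 # dequantized values used at inference

    print("max abs error:", np.abs(weights - approx).max())

You lose a little precision, but the weights take a quarter of the memory at 8 bits, and an eighth at 4 bits.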

If you want to try it out, this blog post [3] walks through it step by step; it's pretty straightforward. There is also a minimal scripted version sketched after the links below.

[1] https://huggingface.co/docs/optimum/concept_guides/quantizat...

[2] https://ollama.ai/library/codellama:70b

[3] https://annjose.com/post/run-code-llama-70B-locally/
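
If you'd rather drive it from a script than the terminal, here is a minimal sketch against Ollama's local HTTP API (assumes the Ollama server is running on its default port 11434 and you've already pulled codellama:70b; the prompt is just an example):

    # Minimal sketch: ask a locally running Ollama server for a completion.
    import json
    import urllib.request

    payload = {
        "model": "codellama:70b",
        "prompt": "Write a Python function that reverses a linked list.",
        "stream": False,   # return a single JSON object instead of a stream
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])

Everything stays on your machine; the only network traffic is to localhost.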


Thanks, I got all but the 70B model to work. That one slows to a crawl on the Mac with 36 GB of RAM.


Cool. I am running this on an M2 Max with 64 GB. Here is how it looks in my terminal [1]. Btw, the very first run after downloading the model is a bit slow, but subsequent runs are fine.

[1] https://asciinema.org/a/fFbOEfeTxRShBGbqslwQMfJS4 Note: This recording is in real-time speed, not sped-up.
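
The difference between the two machines also lines up with a rough back-of-envelope estimate (very rough; it ignores the KV cache and runtime overhead):

    # Rough memory estimate for the 70B model's weights alone.
    params = 70e9                 # Code Llama 70B
    for bits in (4, 8, 16):
        gb = params * bits / 8 / 1e9
        print(f"{bits}-bit weights: ~{gb:.0f} GB")
    # 4-bit  -> ~35 GB
    # 8-bit  -> ~70 GB
    # 16-bit -> ~140 GB

So even at 4 bits the weights alone are around 35 GB, which leaves almost no headroom on a 36 GB machine (hence the crawl once it starts swapping), while 64 GB has room to spare.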


I am returning the M3 Max with 36 GB and picking up the M2 Max with 64 GB instead. Saves me a grand, and it seems to be much more capable for this.


What do you mean you are returning it? It has been used already.



