
I'm trying to understand how this works... does it actually run the model on the MacBook Pro? Sorry, I'm totally new to this.



Yes, it runs a quantized [1] version of the model locally. Quantization represents the weights and activations with low-precision data types (e.g. 8-bit integers instead of 32-bit floats), which shrinks the model's memory footprint. The specific model published by Ollama uses 4-bit quantization [2], which is why it can run on a MacBook Pro.
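
If it helps to see the idea, here is a toy sketch of symmetric 8-bit quantization in Python (just an illustration of the concept, not the exact 4-bit scheme Ollama ships):

    # Toy sketch of symmetric int8 quantization: store weights as 1-byte
    # integers plus a scale, then reconstruct approximate floats at inference.
    import numpy as np

    weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for layer weights

    scale = np.abs(weights).max() / 127.0                 # map the largest magnitude to 127
    q = np.round(weights / scale).astype(np.int8)         # 1 byte per weight instead of 4
    approx = q.astype(np.float32) * scale                 # dequantized values used at inference

    print("max abs error:", np.abs(weights - approx).max())

You lose a little precision, but the weights take a quarter of the memory at 8 bits, and an eighth at 4 bits.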

If you want to try it out, this blog post [3] walks through it step by step; it's pretty straightforward. There is also a minimal scripted version sketched after the links below.

[1] https://huggingface.co/docs/optimum/concept_guides/quantizat...

[2] https://ollama.ai/library/codellama:70b

[3] https://annjose.com/post/run-code-llama-70B-locally/
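
If you'd rather drive it from a script than the terminal, here is a minimal sketch against Ollama's local HTTP API (assumes the Ollama server is running on its default port 11434 and you've already pulled codellama:70b; the prompt is just an example):

    # Minimal sketch: ask a locally running Ollama server for a completion.
    import json
    import urllib.request

    payload = {
        "model": "codellama:70b",
        "prompt": "Write a Python function that reverses a linked list.",
        "stream": False,   # return a single JSON object instead of a stream
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])

Everything stays on your machine; the only network traffic is to localhost.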


Thanks, I got all but the 70B model to work. That one slows to a crawl on the Mac with 36 GB of RAM.


Cool. I am running this on an M2 Max with 64 GB. Here is how it looks in my terminal [1]. Btw, the very first run after downloading the model is a bit slow, but subsequent runs are fine.

[1] https://asciinema.org/a/fFbOEfeTxRShBGbqslwQMfJS4 Note: This recording is in real-time speed, not sped-up.
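
The difference between the two machines also lines up with a rough back-of-envelope estimate (very rough; it ignores the KV cache and runtime overhead):

    # Rough memory estimate for the 70B model's weights alone.
    params = 70e9                 # Code Llama 70B
    for bits in (4, 8, 16):
        gb = params * bits / 8 / 1e9
        print(f"{bits}-bit weights: ~{gb:.0f} GB")
    # 4-bit  -> ~35 GB
    # 8-bit  -> ~70 GB
    # 16-bit -> ~140 GB

So even at 4 bits the weights alone are around 35 GB, which leaves almost no headroom on a 36 GB machine (hence the crawl once it starts swapping), while 64 GB has room to spare.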


I am returning the M3 Max with 36 GB and picking up the M2 Max with 64 GB instead. Saves me a grand, and it seems to be much more capable for this.


What do you mean you are returning it? It has been used already.



