Hacker News

You don't need 16-bit weights. For most models, the accuracy drop from 8-bit quantization is less than 5%.



Even 4-bit is fine.
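For anyone who wants to try it, here is a minimal sketch of loading a model at reduced precision with the Hugging Face transformers + bitsandbytes integration. The repo id and the exact options are just placeholders for illustration, not something from this thread:

    # pip install transformers accelerate bitsandbytes
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-70b-hf"  # placeholder; any causal LM repo id

    # 4-bit NF4 quantization; use BitsAndBytesConfig(load_in_8bit=True) for 8-bit instead
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # spread layers across available GPUs/CPU
    )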

To be more precise, it's not that there's no decrease in quality; it's that with the RAM savings you can fit a much better model. E.g. with LLaMA, if you start with the 70B model and quantize it more and more aggressively, it still performs considerably better at 3-bit than LLaMA 33B running at 8-bit.
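Back-of-the-envelope arithmetic makes the tradeoff concrete. This only counts weight storage at the stated bit widths and ignores quantization overhead (scales/zero points), activations, and the KV cache:

    # Approximate weight storage for the models mentioned above
    def weight_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9  # bytes -> GB

    print(weight_gb(70, 3))   # ~26 GB  (70B at 3-bit)
    print(weight_gb(33, 8))   # ~33 GB  (33B at 8-bit)
    print(weight_gb(70, 16))  # ~140 GB (70B at fp16, for reference)

So the heavily quantized 70B model actually needs less memory than the 33B model at 8-bit.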


True. The one problem with lower-bit quantization, though, is that the model struggles to follow long prompts.



