Hacker News

You don't need 16-bit weights. For most models, the accuracy drop from 8-bit quantization is less than 5%.



Even 4-bit is fine.
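For anyone who wants to try it, here is a minimal sketch of loading a model at reduced precision with the Hugging Face transformers + bitsandbytes integration. The repo id and the exact options are just placeholders for illustration, not something from this thread:

    # pip install transformers accelerate bitsandbytes
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-70b-hf"  # placeholder; any causal LM repo id

    # 4-bit NF4 quantization; use BitsAndBytesConfig(load_in_8bit=True) for 8-bit instead
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # spread layers across available GPUs/CPU
    )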

To be more precise, it's not that there's no decrease in quality; it's that with the RAM savings you can fit a much better model. E.g. with LLaMA, if you start with the 70B model and quantize it more and more aggressively, it still performs considerably better at 3-bit than LLaMA 33B running at 8-bit.
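Back-of-the-envelope arithmetic makes the tradeoff concrete. This only counts weight storage at the stated bit widths and ignores quantization overhead (scales/zero points), activations, and the KV cache:

    # Approximate weight storage for the models mentioned above
    def weight_gb(params_billions: float, bits_per_weight: float) -> float:
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9  # bytes -> GB

    print(weight_gb(70, 3))   # ~26 GB  (70B at 3-bit)
    print(weight_gb(33, 8))   # ~33 GB  (33B at 8-bit)
    print(weight_gb(70, 16))  # ~140 GB (70B at fp16, for reference)

So the heavily quantized 70B model actually needs less memory than the 33B model at 8-bit.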


True. The one problem with lower-bit quantization, though, is that the model struggles to follow long prompts.



