A model trained in BF16 whose weights fall within the FP16 range has an effective bit rate of at most 13 bits (sign + 5 exponent bits + 7 mantissa bits, i.e. e5m7). Reasonable 8-bit quantization gets the weight error (i.e. L2 distance on weights) below 1e-3, and a full diffusion run (30 steps) produces a result whose L2 distance from the unquantized output is below 5e-2.
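As a rough illustration of the kind of weight error involved, here is a minimal sketch of per-tensor symmetric 8-bit quantization and its relative L2 error on a synthetic weight matrix. The weight distribution, shapes, and scaling scheme are all assumptions for illustration, not any particular model's recipe:

```python
import numpy as np

# Synthetic "weights": assumed normal, std 0.02, purely for illustration.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)

# Per-tensor symmetric int8 quantization: map max |w| to 127.
scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale  # dequantize back to float

# Relative L2 distance between original and round-tripped weights.
rel_l2 = np.linalg.norm(w - w_dq) / np.linalg.norm(w)
print(f"relative L2 error: {rel_l2:.2e}")
```

Per-channel or block-wise scales (as real quantization schemes use) push this error lower than the naive per-tensor version sketched here.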
I think there is a line somewhere between 4-bit and 8-bit below which performance suffers (for both diffusion models and LLMs). But I doubt the line is between 8-bit and 13-bit.
(Another case in point: you can use generic lossless compression to get model weights from 13 bits down to about 11 bits just by zipping the exponents and mantissas separately, which suggests the effective bit rate of a full-precision model is lower than 13 bits.)
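The exponent/mantissa split is easy to demonstrate: BF16 is 1 sign + 8 exponent + 7 mantissa bits, so the high byte is mostly exponent (highly clustered, compresses well) and the low byte is mostly mantissa (near-random, barely compresses). A small sketch with synthetic normally-distributed weights, purely illustrative of the effect:

```python
import zlib
import numpy as np

# Synthetic weights, then truncate float32 to its top 16 bits = BF16 bits.
rng = np.random.default_rng(0)
w32 = rng.normal(0, 0.02, size=1 << 20).astype(np.float32)
bf16 = (w32.view(np.uint32) >> 16).astype(np.uint16)

hi = (bf16 >> 8).astype(np.uint8)    # sign + top 7 exponent bits
lo = (bf16 & 0xFF).astype(np.uint8)  # last exponent bit + mantissa

raw = bf16.tobytes()
# Compress the two planes separately vs. the interleaved raw bytes.
split = zlib.compress(hi.tobytes(), 9) + zlib.compress(lo.tobytes(), 9)
print("interleaved:", len(zlib.compress(raw, 9)) / len(raw))
print("split:      ", len(split) / len(raw))
```

The split stays lossless (the two byte planes reassemble into the original bits exactly), and the savings come almost entirely from the exponent plane.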
I buy these kinds of arguments, but the moment a company or a NeurIPS researcher claims even a "tiny" bit of loss, I become suspicious. I don't buy most "we get 99%+ of the performance" claims made in practice.
But yes, I do believe we will find properly lossless quants, and eventually (for real this time) get "only a little bit of loss" quants; I just don't think the current 8-bit quants are there yet.
Also, quantized models often have worse GPU utilization, which hurts tokens/s if your hardware is capable of running the unquantized types. It seems to depend on the quant: SD models seem to get faster when quantized, but LLMs are often slower. Very weird.
I wonder if there is a case for thinking about it in terms of the technical definitions:
If we start by peering into quantization itself, we can show it is lossy by definition, unless every weight happens to have no significant bits past the quantization width.
So our lower bound must be that 0.03% error mentioned above.
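The "lossy by definition" point can be made concrete: a quantize-dequantize round trip is the identity only for values that already sit exactly on the quantization grid, and discards information for everything else. A minimal sketch (grid spacing is an arbitrary assumption here):

```python
import numpy as np

def quantize_roundtrip(x, scale):
    """Snap each value to the nearest multiple of `scale`."""
    return np.round(x / scale) * scale

scale = 1 / 128  # hypothetical grid spacing, exactly representable in binary

on_grid = np.array([0.0, scale, 5 * scale])       # already multiples of scale
off_grid = np.array([0.3 * scale, 1.7 * scale])   # between grid points

# Values on the grid survive unchanged; values off the grid do not.
assert np.array_equal(quantize_roundtrip(on_grid, scale), on_grid)
assert not np.array_equal(quantize_roundtrip(off_grid, scale), off_grid)
```

Since real trained weights essentially never land exactly on the coarser grid, some per-weight error is unavoidable; the only question is whether that floor of error is large enough to matter downstream.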