A model trained in BF16 whose weights fall within the FP16 range has an effective bit rate of at most 13 bits (sign + 5 exponent bits + 7 mantissa bits, i.e. e5m7). Reasonable 8-bit quantization gets the weight error (i.e. L2 distance on weights) below 1e-3, and a full diffusion run (30 steps) produces a result whose L2 distance from the unquantized output is below 5e-2.
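As a rough illustration of the kind of weight error involved, here is a minimal sketch of per-tensor symmetric 8-bit quantization and its relative L2 error on a synthetic weight matrix. The weight distribution, shapes, and scaling scheme are all assumptions for illustration, not any particular model's recipe:

```python
import numpy as np

# Synthetic "weights": assumed normal, std 0.02, purely for illustration.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)

# Per-tensor symmetric int8 quantization: map max |w| to 127.
scale = np.abs(w).max() / 127.0
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dq = w_q.astype(np.float32) * scale  # dequantize back to float

# Relative L2 distance between original and round-tripped weights.
rel_l2 = np.linalg.norm(w - w_dq) / np.linalg.norm(w)
print(f"relative L2 error: {rel_l2:.2e}")
```

Per-channel or block-wise scales (as real quantization schemes use) push this error lower than the naive per-tensor version sketched here.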
I think there is a line somewhere between 4-bit and 8-bit below which performance suffers (for both diffusion models and LLMs). But I doubt the line is between 8-bit and 13-bit.
(Another case in point: you can use generic lossless compression to get model weights from 13 bits down to about 11 bits just by zipping the exponents and mantissas separately, which suggests the effective bit rate of a full-precision model is lower than 13 bits.)
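The exponent/mantissa split is easy to demonstrate: BF16 is 1 sign + 8 exponent + 7 mantissa bits, so the high byte is mostly exponent (highly clustered, compresses well) and the low byte is mostly mantissa (near-random, barely compresses). A small sketch with synthetic normally-distributed weights, purely illustrative of the effect:

```python
import zlib
import numpy as np

# Synthetic weights, then truncate float32 to its top 16 bits = BF16 bits.
rng = np.random.default_rng(0)
w32 = rng.normal(0, 0.02, size=1 << 20).astype(np.float32)
bf16 = (w32.view(np.uint32) >> 16).astype(np.uint16)

hi = (bf16 >> 8).astype(np.uint8)    # sign + top 7 exponent bits
lo = (bf16 & 0xFF).astype(np.uint8)  # last exponent bit + mantissa

raw = bf16.tobytes()
# Compress the two planes separately vs. the interleaved raw bytes.
split = zlib.compress(hi.tobytes(), 9) + zlib.compress(lo.tobytes(), 9)
print("interleaved:", len(zlib.compress(raw, 9)) / len(raw))
print("split:      ", len(split) / len(raw))
```

The split stays lossless (the two byte planes reassemble into the original bits exactly), and the savings come almost entirely from the exponent plane.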
I buy these kinds of arguments, but the moment a company or a NeurIPS researcher claims even a "tiny" bit of loss, I become suspicious. I don't buy most "we get 99%+ of the performance" claims made in practice.
But yes, I do believe we will find properly lossless quants, and eventually (for real this time) get "only a little bit of loss" quants; I just don't think the current 8-bit quants are there yet.
Also, quantized models often have worse GPU utilization, which hurts tokens/s if your hardware is capable of running the unquantized types. It seems to depend on the quant: SD models seem to get faster when quantized, but LLMs are often slower. Very weird.
I wonder if there is a case for thinking about it in terms of the technical definitions:
If we start by peering into quantization itself, we can show it is lossy by definition, unless every weight happens to have no significant bits past the quantization width.
So our lower bound must be that 0.03% error mentioned above.
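The "lossy by definition" point can be made concrete: a quantize-dequantize round trip is the identity only for values that already sit exactly on the quantization grid, and discards information for everything else. A minimal sketch (grid spacing is an arbitrary assumption here):

```python
import numpy as np

def quantize_roundtrip(x, scale):
    """Snap each value to the nearest multiple of `scale`."""
    return np.round(x / scale) * scale

scale = 1 / 128  # hypothetical grid spacing, exactly representable in binary

on_grid = np.array([0.0, scale, 5 * scale])       # already multiples of scale
off_grid = np.array([0.3 * scale, 1.7 * scale])   # between grid points

# Values on the grid survive unchanged; values off the grid do not.
assert np.array_equal(quantize_roundtrip(on_grid, scale), on_grid)
assert not np.array_equal(quantize_roundtrip(off_grid, scale), off_grid)
```

Since real trained weights essentially never land exactly on the coarser grid, some per-weight error is unavoidable; the only question is whether that floor of error is large enough to matter downstream.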