
It's a legit question; the model will be worse in some way. I've seen it discussed that, all things being equal, more parameters is better (meaning it's better to take a big model and quantize it to fit in memory than to use a smaller unquantized model that fits), but a quantized model wouldn't be expected to run identically to, or as well as, the full model.
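To make the memory side of that concrete, here's a rough back-of-envelope sketch (illustrative sizes only; real loaders add overhead for activations, KV cache, and quantization metadata):

    # Memory footprint = parameters * bits per parameter / 8,
    # ignoring runtime overhead.
    def model_size_gb(params_billions: float, bits_per_param: float) -> float:
        return params_billions * bits_per_param / 8

    print(model_size_gb(70, 16))  # 70B at fp16:  ~140 GB
    print(model_size_gb(70, 4))   # 70B at 4-bit:  ~35 GB
    print(model_size_gb(13, 16))  # 13B at fp16:   ~26 GB

A 4-bit 70B model fits in roughly the space of an fp16 17B one, which is why the "bigger but quantized" tradeoff comes up at all.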


You don’t stop being andy99 just because you’re a little tired, do you? Being tired makes everyone a little less capable at most things. Sometimes, a lot less capable.

In traditional software, the same program compiled for 32-bit and 64-bit architectures won’t be able to handle all of the same inputs, because the 32-bit version is limited by the available address space. It’s still the same program.

If we’re not willing to declare that you are a completely separate person when you’re tired, or that 32-bit and 64-bit versions are completely different programs, then I don’t think it’s worth getting overly philosophical about quantization. A quantized model is still the same model.

The quality loss from using 4+ bit quantization is minimal, in my experience.

Yes, it has a small impact on accuracy, but with massive efficiency gains. I don’t really think anyone should be running the full models outside of research in the first place. If anything, the quantized models should be considered the “real” models, and the full fp16/fp32 model should just be considered a research artifact, distinct from the model people actually use. But this philosophical rabbit hole doesn’t seem to lead anywhere interesting to me.

Various papers have shown that 4-bit quantization is a great balance. One example: https://arxiv.org/pdf/2212.09720.pdf
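And in practice, running in 4-bit is close to a one-liner with common tooling. A minimal sketch using Hugging Face transformers with bitsandbytes (the model id here is a placeholder, not a real checkpoint):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # NF4 4-bit weights with bf16 compute, the QLoRA-style configuration
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "some-org/some-model",  # placeholder model id
        quantization_config=quant_config,
        device_map="auto",
    )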


I don't like the metaphor: when I'm tired, I will be alert again later. Quantization is lossy compression: the human equivalent would be more like a traumatic brain injury affecting recall, especially of fine details.

The question of whether I am still me after a traumatic brain injury is philosophically unclear, and likely depends on specifics about the extent of the deficits.


The impact on accuracy is somewhere in the single-digit percentages at 4-bit quantization, from what I’ve been able to gather. Very small impact. To draw the analogy out further, if the model was able to get an A on a test before quantization, it would likely still get a B at worst afterwards, given a drop in the score of less than 10%. Depending on the task, the measured impact could even be negligible.

It’s far more similar to the model being perpetually tired than it is to a TBI.

You may nitpick the analogy, but analogies are never exact. You also ignore the other piece that I pointed out, which is how we treat other software that comes in multiple slightly different forms.


Reminds me of the never ending MP3 vs FLAC argument.

The difference can be measured empirically, but is it noticeable in real-world usage?


But we’re talking about a coding LLM here. A single digit percentage reduction in accuracy means, what, one or two times in a hundred, it writes == instead of !=?


I think that’s too simplified. The best LLMs will still frequently make mistakes. Meta is advertising a HumanEval score of 67.8%. In a third of cases, the code generated still doesn’t satisfactorily solve the problem in that automated benchmark. The additional errors that quantization would introduce would only be a very small percentage of the overall errors, making the quantized and unquantized models practically indistinguishable to a human observer. Beyond that, lower accuracy can manifest in many ways, and “do the opposite” seems unlikely to be the most common way. There might be a dozen correct ways to solve a problem. The quantized model might choose a different path that still turns out to work, it’s just not exactly the same path.
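To put rough numbers on that (assuming, say, a 5% relative accuracy drop from 4-bit quantization; the exact penalty varies by model and task):

    base_pass_rate = 0.678     # advertised HumanEval score
    relative_penalty = 0.05    # assumed hit from quantization

    quantized = base_pass_rate * (1 - relative_penalty)
    print(round(quantized, 3))  # ~0.644, vs 0.678 unquantized

Roughly a third of problems fail either way, so the quantization penalty disappears into the noise.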

As someone else pointed out, FLAC is objectively more accurate than MP3, but how many people can really tell? Is it worth 3x the data to store/stream music in FLAC?

The quantized model would probably run at 4x the speed of the unquantized model, assuming you had enough memory to choose between them. Is speed worth nothing? If I have to wait all day for the LLM to respond, I can probably do the work faster myself without its help. Is being able to fit the model onto the hardware you have worth nothing?

In essence, quantization here is a 95% “accurate” implementation of a 67% accurate model, which yields a 300% increase in speed while using just 25% of the RAM. All numbers are approximate; even the HumanEval benchmark should be taken with a large grain of salt.
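Spelled out, with the same approximations as above:

    fp16_bits, int4_bits = 16, 4

    ram_fraction = int4_bits / fp16_bits   # 0.25 -> "25% of the RAM"
    # If inference is memory-bandwidth bound, as local LLM inference
    # usually is, moving 4x less data gives roughly 4x the tokens/sec,
    # i.e. the "300% increase in speed".
    speedup = fp16_bits / int4_bits

    print(f"RAM: {ram_fraction:.0%}, speed: ~{speedup:.0f}x")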

If you have opulent computational resources, you can enjoy the luxury of the full 67.8% accurate model, but that just feels both wasteful and like a bad user experience.



