Well I guess it's the “blowing up parameter count to make up for it” that confuses me, but maybe it's just ignorance.
Like what would be the expected factor of this blow-up to make up the difference between ternary and whatever 16-bit encoding they were using?
I mean intuitively I'd expect to need ~10× as many symbols to encode the same information? Are they using an order of magnitude more parameters, or is that not how it works?
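As a back-of-the-envelope check (my own arithmetic, not from the paper): a ternary symbol carries log2(3) ≈ 1.585 bits, so matching 16 bits of raw storage per weight takes roughly 10 ternary symbols. That's purely a storage argument, though, not a claim about how many parameters are needed to reach the same model quality. A minimal sketch:

    import math

    # Information content per weight, storage only (not a claim about quality).
    bits_per_ternary = math.log2(3)        # ~1.585 bits per {-1, 0, +1} symbol
    ratio = 16 / bits_per_ternary          # ~10.09 ternary symbols per fp16 weight
    print(f"{bits_per_ternary:.3f} bits/symbol, {ratio:.2f} symbols to match fp16")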
It does result in a significant degradation relative to an unquantized model of the same size, but even with simple llama.cpp K-quantization it's still worth it all the way down to 2-bit. The chart in this llama.cpp PR speaks for itself:
Oh wow, you’re right. Though it seems they're using very small weight group sizes: either 16 or 32 (one fp16 scaling factor per group). In this paper there seems to be no weight grouping, so it's a bit apples to oranges.
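To make the group-size point concrete, here's a toy sketch (my own, not llama.cpp's actual K-quant code) of symmetric group-wise quantization with one fp16 scale per group of 16 or 32 weights; the per-group scale adds 0.5–1.0 effective bits per weight on top of the nominal 2 bits, overhead that a groupless ternary scheme doesn't pay:

    import numpy as np

    def quantize_groups(w, group_size=32, bits=2):
        # Toy symmetric group-wise quantization: one fp16 scale per group.
        # (Illustrative only; llama.cpp's K-quants are more elaborate.)
        qmax = 2 ** (bits - 1) - 1            # 1 for 2-bit, i.e. levels {-2,-1,0,+1}
        w = w.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / max(qmax, 1)
        scale = np.maximum(scale, 1e-8).astype(np.float16)   # per-group fp16 scale
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def effective_bits(group_size, bits=2, scale_bits=16):
        # Storage cost of the per-group scale, amortized over the group.
        return bits + scale_bits / group_size

    w = np.random.randn(4096).astype(np.float32)
    q, s = quantize_groups(w, group_size=32)
    print(effective_bits(32))   # 2.5 bits/weight with 32-weight groups
    print(effective_bits(16))   # 3.0 bits/weight with 16-weight groups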