Well I guess it's the “blowing up parameter count to make up for it” that confuses me, but maybe it's just ignorance.
Like what would be the expected factor of this blow-up to make up the difference between ternary and whatever 16-bit encoding they were using?
I mean intuitively I'd expect to need ~10× as many symbols to encode the same information? Are they using an order of magnitude more parameters, or is that not how it works?
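As a back-of-the-envelope check (my own arithmetic, not from the paper): a ternary symbol carries log2(3) ≈ 1.585 bits, so matching 16 bits of raw storage per weight takes roughly 10 ternary symbols. That's purely a storage argument, though, not a claim about how many parameters are needed to reach the same model quality. A minimal sketch:

    import math

    # Information content per weight, storage only (not a claim about quality).
    bits_per_ternary = math.log2(3)        # ~1.585 bits per {-1, 0, +1} symbol
    ratio = 16 / bits_per_ternary          # ~10.09 ternary symbols per fp16 weight
    print(f"{bits_per_ternary:.3f} bits/symbol, {ratio:.2f} symbols to match fp16")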
It does result in a significant degradation relative to an unquantized model of the same size, but even with simple llama.cpp K-quantization it's still worth it all the way down to 2-bit. The chart in this llama.cpp PR speaks for itself:
Oh wow, you’re right. Though it seems they're using very small weight group sizes: either 16 or 32 (one fp16 scaling factor per group). In this paper there seems to be no weight grouping, so it's a bit apples to oranges.
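To make the group-size point concrete, here's a toy sketch (my own, not llama.cpp's actual K-quant code) of symmetric group-wise quantization with one fp16 scale per group of 16 or 32 weights; the per-group scale adds 0.5–1.0 effective bits per weight on top of the nominal 2 bits, overhead that a groupless ternary scheme doesn't pay:

    import numpy as np

    def quantize_groups(w, group_size=32, bits=2):
        # Toy symmetric group-wise quantization: one fp16 scale per group.
        # (Illustrative only; llama.cpp's K-quants are more elaborate.)
        qmax = 2 ** (bits - 1) - 1            # 1 for 2-bit, i.e. levels {-2,-1,0,+1}
        w = w.reshape(-1, group_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / max(qmax, 1)
        scale = np.maximum(scale, 1e-8).astype(np.float16)   # per-group fp16 scale
        q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def effective_bits(group_size, bits=2, scale_bits=16):
        # Storage cost of the per-group scale, amortized over the group.
        return bits + scale_bits / group_size

    w = np.random.randn(4096).astype(np.float32)
    q, s = quantize_groups(w, group_size=32)
    print(effective_bits(32))   # 2.5 bits/weight with 32-weight groups
    print(effective_bits(16))   # 3.0 bits/weight with 16-weight groups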