More ELI5 than the other comments. Consider the softmax network:
During quantization we find that values in the network vary from 0->5000, but 95% of values are <100. Quantizing this to 8 bits would mean that our values would be in increments of about 20. Remembering that 95% of our values are below 100, we would only have about 5 discrete values for that 95% - so we would be losing a lot of "resolution" (entropy/information). For example (assuming rounding is used), an original value of 19 would be quantized to 20 and 30 would be quantized to 40. The original values differ by 11, but the quantized values differ by 20!
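To make that concrete, here's a rough sketch of what that uniform 8-bit quantization does to the two toy values (the step is just 5000/256; none of these numbers come from a real model):

    # Uniform 8-bit quantization over 0..5000; step is 5000/256 ~= 19.5 ("about 20").
    # Toy values only - nothing here is measured from a real network.
    STEP = 5000 / 256

    def quantize(x, step=STEP):
        # snap to the nearest representable level
        return round(x / step) * step

    a, b = quantize(19), quantize(30)
    print(a, b, b - a)   # 19.53125 39.0625 19.53125

With the step rounded to exactly 20 you get the 20 and 40 from above; either way the quantized gap is a whole step instead of 11.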
This is where exotic encodings come into play. We might try a logarithmic scheme, for example, which would give us finer resolution at lower values - but we would probably still waste bits, and it would require more APU cycles.
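A minimal sketch of what such a logarithmic scheme could look like (this particular mapping is made up for illustration, not taken from any real quantizer):

    import math

    # Quantize uniformly in log space so small values get finer resolution.
    MAX_VAL = 5000.0
    LEVELS = 256
    SCALE = math.log1p(MAX_VAL)   # log1p avoids log(0) at the bottom of the range

    def log_quantize(x):
        code = round(math.log1p(x) / SCALE * (LEVELS - 1))   # 8-bit integer code
        return math.expm1(code / (LEVELS - 1) * SCALE)       # reconstructed value

    print(log_quantize(19), log_quantize(30))   # ~19.2, ~30.2; step is ~0.7 near 20 but ~170 near 5000

Even in this mapping, roughly half the codes still end up covering 100-5000, where only 5% of the values live, and you pay for the log/exp on every encode and decode.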
Now switch to the softmax1 network:
The range of values is less important than the distribution - instead of 95% of the values falling in a small range, we would see the values more evenly spread out. Assuming that the range is now 105 (so the 5% outlying neurons from the softmax network are still >100), we would have about 243 values, in steps of roughly 0.41 (105/256), to represent everything under 100. The same example with 19 and 30 would result in roughly 19.27 and 30.34 respectively, a difference of 11.07 - which is very close to the unquantized difference of 11. We have retained more information in the quantized version of the network.
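The same sketch with the softmax1-style range of 0..105 (toy numbers again; with round-to-nearest the reconstructed values come out slightly different from 19.27/30.34, but the preserved gap of ~11.07 is the point):

    # Uniform 8-bit quantization over 0..105; step is 105/256 ~= 0.41.
    STEP = 105 / 256

    def quantize(x, step=STEP):
        return round(x / step) * step

    a, b = quantize(19), quantize(30)
    print(round(a, 2), round(b, 2), round(b - a, 2))   # 18.87 29.94 11.07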
Information is lost either way, but what's important is how much information is lost.
The reason that the large values appear is that the heads attempt to "scream really loud" when they are certain that they are right. This is an emergent behavior due to softmax - ironically, it sucks at paying attention to only a few of the heads: it boosts the volume of the heads that are trying to abstain, and mutes the volume of the heads that are trying to vote.
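For reference, "softmax1" here is the proposed variant with an extra +1 in the denominator, i.e. exp(x_i) / (1 + sum_j exp(x_j)), so a head can push all of its attention weights toward zero instead of being forced to hand out a full 1.0. A toy comparison with made-up logits:

    import math

    def softmax(xs):
        es = [math.exp(x) for x in xs]
        s = sum(es)
        return [e / s for e in es]

    def softmax1(xs):
        # extra +1 in the denominator: all outputs can shrink toward 0 together
        es = [math.exp(x) for x in xs]
        s = 1.0 + sum(es)
        return [e / s for e in es]

    logits = [-4.0, -4.0, -4.0]   # a head with nothing useful to contribute
    print(softmax(logits))        # ~[0.33, 0.33, 0.33] - forced to vote anyway
    print(softmax1(logits))       # ~[0.017, 0.017, 0.017] - effectively abstains

Under plain softmax a head can't actually go quiet - the weights must sum to 1 - so a head that wants to abstain ends up pushing some logits way out to pile everything on one harmless position, which is roughly the outlier story described above.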
> During quantization we find that values in the network vary from 0->5000, but 95% of values are <100. Quantizing this to 8 bits would mean that our values would be in increments of about 20.
Instead of using an 8-bit integer with even step-size quantization, wouldn't they still use an 8-bit float?
No one quantizes blindly without accounting for the data. If 95% of your values are in 0-100 you'll probably do something like dedicate 200 of the 256 levels to 0-100 and the remaining 56 to 101-5000. You don't have to use uniform steps, and shouldn't when your data is that concentrated.
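A crude two-segment sketch of that idea (the 200/56 split is just for illustration):

    # Non-uniform 8-bit scheme: most codes for the dense 0..100 region,
    # the rest for the sparse 100..5000 tail. The split is illustrative only.
    LOW_CODES, HIGH_CODES = 200, 56

    def encode(x):
        if x <= 100:
            return round(x / 100 * (LOW_CODES - 1))
        return LOW_CODES + round((x - 100) / 4900 * (HIGH_CODES - 1))

    def decode(code):
        if code < LOW_CODES:
            return code / (LOW_CODES - 1) * 100
        return 100 + (code - LOW_CODES) / (HIGH_CODES - 1) * 4900

    print(decode(encode(19)), decode(encode(30)))   # ~19.1, ~30.15; step ~0.5 below 100 vs ~89 above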
If I'm following correctly, does this mean that with this change, combined with quantization, we could see models that are 5% of the size (on the file system) and memory usage but almost identical in output?