> The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man. The problem here is exacerbated with multi-head attention, as a specialized head is more likely to want to “pass” than a general-purpose one. These attention heads are needlessly noisy, a deafening democracy where abstention is disallowed.
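For concreteness, here's roughly what "abstention is disallowed" looks like numerically, along with my reading of the post's proposed fix (an extra 1 in the softmax denominator). Toy numpy, not the post's actual code:

```python
import numpy as np

def softmax(x):
    # Standard softmax: the weights always sum to exactly 1,
    # so the head is forced to annotate with *something*.
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_one(x):
    # My reading of the post's fix: an extra 1 in the denominator acts
    # like an implicit always-zero logit, so when every score is very
    # negative all the weights can shrink toward 0 (the head "passes").
    e = np.exp(x - x.max())
    return e / (np.exp(-x.max()) + e.sum())

scores = np.array([-8.0, -9.0, -7.5])   # a head with nothing useful to say
print(softmax(scores))       # ~[0.33, 0.12, 0.55]: still forced to pick
print(softmax_one(scores))   # ~[3e-4, 1e-4, 6e-4]: effectively abstains
```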
Can't the MLP that processes the concatenated outputs of the attention heads handle this? I don't understand why it should be critical that a head be allowed to put something close to zero in its segment of the concatenated vector if it's immediately going to get projected by an MLP anyway.
But then you are wasting some of the model's capacity on learning to ignore that information, so I don't think the change would hurt. However, if I followed the reasoning correctly, the biggest win is not improved performance but a reduced range of weights:
> This is what’s been happening in LLMs – for reasons that are only partially understood, Transformer models contain these outlier weights and are emitting Black Swan mega-activations that are much, much, much larger, like orders of magnitude larger, than their peers ...
meaning that once quantized, you can either use a finer quantization (since the range of possible values is smaller) or pick a coarser strategy that saves bits for each weight.
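To spell out the quantization point, here's a toy sketch (not from the post; symmetric per-tensor int8 chosen purely for illustration) of how one outlier blows up the step size for everything else:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: the step size is set by the
    # largest absolute value, so a single outlier coarsens every value.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale, scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=1000)

# Ordinary values only: small step size, small rounding error.
_, scale = quantize_int8(acts)
print(scale)                          # ~0.03

# Same values plus one mega-activation: the step size explodes and the
# rounding error on the ordinary values grows with it.
with_outlier = np.append(acts, 500.0)
deq, scale = quantize_int8(with_outlier)
print(scale)                          # ~3.9
print(np.abs(acts - deq[:-1]).max())  # error now comparable to the values
```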
Right, I get the goal of removing the outlier activations, but I just don't understand why outlier activations are a consequence of the model trying to "pass". The story from the paper linked earlier in the post (https://arxiv.org/pdf/2306.12929.pdf) is that the model does the following (sketched with toy numbers after the list):
- Learn a near-zero representation for some otherwise low-importance token, like delimiters or whitespace.
- When a head wants to "pass", emit an outlier activation to attend to that token nearly-exclusively.
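Concretely, with made-up numbers (just my reading of the paper's mechanism):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Values for three "real" tokens plus a delimiter whose learned
# representation is close to zero.
values = np.array([
    [1.0,  2.0],   # token A
    [3.0, -1.0],   # token B
    [0.5,  0.5],   # token C
    [1e-3, 1e-3],  # delimiter with a near-zero value vector
])

# A head that wants to "pass" emits an outlier logit for the delimiter,
# so the softmax collapses onto the near-zero value.
logits = np.array([0.2, -0.1, 0.3, 40.0])
weights = softmax(logits)
print(weights)           # ~[0, 0, 0, 1]
print(weights @ values)  # ~[0.001, 0.001]: the head's output is ~nothing
```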
But I'm surprised the model can't just use its existing tools (the post-concat projection layer and the following MLP block) to achieve the same thing. And if the answer is that it could do that, but tends to learn to use the outlier activation trick instead, will giving it a new tool that still allows the use of outlier activations be sufficient?
The projection and MLP layers don't compare all embedding pairs the way attention does, so they can't distinguish contexts where delimiters are low-importance from contexts where they are high-importance. The projection layer always mixes the heads in the same way, and the same MLP is applied at every position.
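A rough sketch of that contrast (toy shapes, single head, no biases):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(5, d))   # 5 token embeddings

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Attention: the mixing weights come from comparing every query/key
# pair, so they are recomputed for each new context.
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
attn_weights = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
print(attn_weights.shape)   # (5, 5), and it changes whenever X changes

# Output projection (and likewise the MLP): one fixed map applied to
# each position independently, identical regardless of context.
Wo = rng.normal(size=(d, d))
print((X @ Wo).shape)       # (5, 4), same Wo at every position
```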