
But you are wasting some of the model's capacity on learning to ignore that information. I don't think it would hurt, though. If I followed the reasoning correctly, the biggest win is reducing the range of the weights rather than improving performance.

> This is what’s been happening in LLMs – for reasons that are only partially understood, Transformer models contain these outlier weights and are emitting Black Swan mega-activations that are much, much, much larger, like orders of magnitude larger, than their peers ...

meaning that, once quantized, you can either use a finer quantization (since the range of possible values is smaller) or pick a coarser strategy that saves bits for each weight.
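
To make that concrete, here is a rough sketch (plain numpy, made-up numbers, nothing taken from the post itself) of how a single outlier inflates the quantization step and costs precision on all the ordinary values:

    import numpy as np

    # Toy illustration: symmetric int8 quantization of an activation vector,
    # with and without a single outlier value.
    def quantize_int8(x):
        scale = np.abs(x).max() / 127.0          # step size is set by the largest magnitude
        q = np.round(x / scale).astype(np.int8)  # map to integers in [-127, 127]
        return q * scale, scale                  # dequantized values and the step size

    rng = np.random.default_rng(0)
    ordinary = rng.normal(0, 1, 1000)            # "ordinary" activations
    with_outlier = np.append(ordinary, 60.0)     # plus one mega-activation

    for name, x in [("no outlier", ordinary), ("with outlier", with_outlier)]:
        deq, scale = quantize_int8(x)
        err = np.abs(deq[:1000] - x[:1000]).mean()
        print(f"{name}: step={scale:.4f}, mean error on ordinary values={err:.4f}")

Dropping the single outlier shrinks the step size (and the rounding error on everything else) by more than an order of magnitude, which is where the range reduction pays off.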




Right, I get the goal of removing the outlier activations, but I just don't understand why outlier activations are a consequence of the model trying to "pass". The story from the paper linked earlier in the post (https://arxiv.org/pdf/2306.12929.pdf) is that the model does the following:

- Learn a near-zero representation for some otherwise low-importance token, like delimiters or whitespace.

- When a head wants to "pass", emit an outlier activation to attend to that token nearly exclusively.
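
In toy form (made-up shapes and scores, plain numpy, not code from the paper), my reading of that trick is:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    d = 8
    values = rng.normal(0, 1, (4, d))        # value vectors for 4 tokens
    values[0] = 1e-3 * rng.normal(0, 1, d)   # token 0: delimiter with a near-zero value vector

    # To "pass", the head produces one huge pre-softmax score for the delimiter,
    # so the softmax puts essentially all of its weight there.
    scores_pass = np.array([40.0, 1.2, -0.3, 0.8])  # 40.0 is the outlier
    scores_work = np.array([0.1, 1.2, -0.3, 0.8])   # ordinary scores for comparison

    out_pass = softmax(scores_pass) @ values  # ~= values[0] ~= 0: effectively a no-op
    out_work = softmax(scores_work) @ values  # a normal mixture of value vectors

    print(np.linalg.norm(out_pass), np.linalg.norm(out_work))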

But I'm surprised the model can't just use its existing tools (the post-concat projection layer and the following MLP block) to achieve the same thing. And if the answer is that it could do that, but tends to learn to use the outlier activation trick instead, will giving it a new tool that still allows the use of outlier activations be sufficient?


The projection and MLP layers don't compare all embedding pairs the way attention does, so they can't distinguish contexts where delimiters are low-importance from contexts where they are high-importance. The projection layer always mixes the heads' outputs in the same way, and the same MLP is applied at every position.
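
A toy way to see the difference (hypothetical shapes, plain numpy): the projection is one fixed matrix applied no matter which tokens are present, while the attention mixing weights are recomputed from the tokens themselves.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    W_proj = rng.normal(0, 1, (d, d))    # post-concat projection: one fixed matrix
    ctx_a = rng.normal(0, 1, (4, d))     # tokens from one context
    ctx_b = rng.normal(0, 1, (4, d))     # tokens from a different context

    # Projection (and likewise the MLP): the same weights hit every input.
    proj_a = ctx_a @ W_proj
    proj_b = ctx_b @ W_proj

    # Attention: mixing weights come from pairwise token similarities,
    # so they change with the context and can single out (or ignore) a delimiter.
    def attn_weights(tokens):
        scores = tokens @ tokens.T / np.sqrt(d)
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    print(attn_weights(ctx_a)[0])        # how token 0 attends in context A
    print(attn_weights(ctx_b)[0])        # ...and in context B: different weights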



