
>The problem with using softmax is that it forces each attention head to make an annotation, even if it has no information to add to the output vector. Using softmax to choose among discrete alternatives is great; using it for optional annotation (i.e. as input into addition) is, like, not cool, man. The problem here is exacerbated with multi-head attention, as a specialized head is more likely to want to “pass” than a general-purpose one. These attention heads are needlessly noisy, a deafening democracy where abstention is disallowed.
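
(For anyone who hasn't read the post: the fix it proposes, as I understand it, is adding 1 to the softmax denominator so the attention weights can sum to less than one, i.e. a head can abstain. A toy NumPy sketch of the difference, with made-up logits:)

    import numpy as np

    def softmax(x):
        # Standard softmax: the weights always sum to 1, so the head
        # has to spend its full attention budget somewhere.
        e = np.exp(x - x.max())
        return e / e.sum()

    def softmax_one(x):
        # "+1 in the denominator" variant: weights can sum to < 1,
        # so a head can pass by pushing all logits strongly negative.
        m = max(x.max(), 0.0)
        e = np.exp(x - m)
        return e / (e.sum() + np.exp(-m))

    logits = np.array([-8.0, -9.0, -7.5])  # head has nothing to say
    print(softmax(logits))      # still sums to 1: forced annotation
    print(softmax_one(logits))  # sums to ~0.001: effectively a pass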

Can't the MLP that processes the concatenated outputs of the attention heads handle this? I don't understand why it should be critical that a head be allowed to put something close to zero in its segment of the concatenated vector if it's immediately going to get projected by an MLP anyway.




But then you are wasting some of the model's capacity on learning to ignore that information, so I think it wouldn't hurt. However, if I followed the reasoning correctly, the biggest win is reducing the range of the weights rather than improving performance.

> This is what’s been happening in LLMs – for reasons that are only partially understood, Transformer models contain these outlier weights and are emitting Black Swan mega-activations that are much, much, much larger, like orders of magnitude larger, than their peers ...

meaning that when you quantize, you can either get a finer quantization (since the range of possible values is smaller) or pick a coarser strategy that saves bits on each weight.
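
Toy illustration of that trade-off with a symmetric int8 scheme (numbers are made up; real quantizers are fancier):

    import numpy as np

    def int8_scale(w):
        # Symmetric per-tensor int8: the step size is set by the largest
        # magnitude, so a single outlier stretches the grid for everyone.
        return np.abs(w).max() / 127

    rng = np.random.default_rng(0)
    w = rng.normal(0, 1, 1000)

    print(int8_scale(w))                    # ~0.03: fine steps for typical values
    print(int8_scale(np.append(w, 100.0)))  # ~0.79: one mega-activation makes the steps coarse for everything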


Right, I get the goal of removing the outlier activations, but I just don't understand why outlier activations are a consequence of the model trying to "pass". The story from the linked paper earlier in the post (https://arxiv.org/pdf/2306.12929.pdf) is that the model is doing the following:

-Learn a near-zero representation for some otherwise low-importance token, like delimiters or whitespace.

-When a head wants to "pass", emit an outlier activation to attend to that token nearly-exclusively.

But I'm surprised the model can't just use its existing tools (the post-concat projection layer and the following MLP block) to achieve the same thing. And if the answer is that it could do that, but tends to learn to use the outlier activation trick instead, will giving it a new tool that still allows the use of outlier activations be sufficient?
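
(For concreteness, here's a toy single-head version of that "pass" trick as I read the paper, with made-up numbers: a huge logit on a near-zero delimiter token makes the head's output vanish.)

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Four tokens; token 0 stands in for the delimiter with a learned
    # near-zero value vector.
    values = np.array([
        [0.001, -0.002],   # delimiter: ~zero representation
        [0.9,    0.4],
        [-0.3,   1.2],
        [0.7,   -0.8],
    ])

    # To "pass", the head emits an outlier logit on the delimiter...
    scores = np.array([40.0, 1.0, 0.5, -0.2])
    attn = softmax(scores)

    print(attn)            # ~[1, 0, 0, 0]: nearly all weight on the delimiter
    print(attn @ values)   # ~[0, 0]: the head contributes almost nothing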


The projection and MLP layers don't compare all embedding pairs the way attention does, so they can't distinguish between contexts where delimiters are low- vs. high-importance. The projection layer always mixes the heads in the same way, and the same MLP is applied independently to every position.
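
Rough sketch of that shape difference (not any particular implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    seq, d = 5, 8
    x = rng.normal(size=(seq, d))
    wq, wk, wo = (rng.normal(size=(d, d)) for _ in range(3))

    # Attention scores depend on *pairs* of positions, so the mixing
    # pattern changes with the content of the sequence.
    scores = (x @ wq) @ (x @ wk).T    # shape (seq, seq), content-dependent

    # The output projection (and likewise the MLP) applies the same
    # weights to every position independently; no token-token comparison.
    projected = x @ wo                # shape (seq, d), position-wise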



