I might be missing something obvious, but I am not sure why everyone in the comm...

alevskaya · on July 24, 2023

Yeah we used to use this in our older models years ago... I don't recall the details exactly, but I don't think it ever did very much.

I certainly don't think it will help at all with stability. Things like Q/K layernorm are better tricks for softmax stability when scaling: https://arxiv.org/pdf/2302.05442.pdf

ggerganov · on July 24, 2023

> I don't recall the details exactly, but I don't think it ever did very much.

How would you have known if the trick actually reduces the outliers in the weights? Even if the transformer quality does not improve overall, having fewer outliers as a result is very beneficial for more accurate quantization of the data.

danielmarkbruce · on July 24, 2023

Are you asking "why would you have bothered to look at"?

The "how" is pretty straightforward.

p1esk · on July 25, 2023

He's questioning the statement: "I don't think [the trick] ever did very much", because no one has yet looked at whether the trick helps reduce outliers in very large models. If it does help with this, as the blog author believes, then it is indeed a very useful trick.

danielmarkbruce · on July 25, 2023

Is he? A surface level reading suggests he's asking "how would you know".. and the answer is... by looking at the parameters. People do that.

>> because no one has yet looked at whether the trick helps reduce outliers in very large models

Given a softmax version doing exactly as the blog post says is baked into a google library (see this thread), and you can set it as a parameter in a pytorch model (see this thread), this claim seems off. "Let's try X, oh, X doesn't do much, let's not write a paper about it" is extremely common for many X.

tudorw · on July 25, 2023

This would seem like a really good argument as to why failures should be written up, otherwise where is the list of what has been tried before?

danielmarkbruce · on July 25, 2023

Yup, it is. But it isn't going to happen.

ggerganov · on July 25, 2023

Yes, I assumed that checking the weights for presence and amount of outliers is not something that is usually done and effects on this can be overlooked. If my assumption is wrong and researchers do usually look at such metrics, then my question is not very relevant.

Agree - the "how" is straightforward
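
For readers wondering what "looking at the parameters" might involve, here is a minimal sketch, not anything the commenters describe doing. The two metrics (excess kurtosis and the ratio of the largest absolute value to the standard deviation) and the toy weight lists are assumptions chosen purely for illustration.

```python
import math

def excess_kurtosis(values):
    """Population excess kurtosis; heavier tails (outliers) give larger values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / var ** 2 - 3.0

def max_abs_over_std(values):
    """How many standard deviations the largest-magnitude entry spans."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return max(abs(v) for v in values) / math.sqrt(var)

# Hypothetical toy "weights"; a real check would iterate over a model's tensors.
plain = [-1.0, -0.5, 0.0, 0.5, 1.0]
with_outlier = plain + [20.0]

print(excess_kurtosis(plain), excess_kurtosis(with_outlier))
print(max_abs_over_std(plain), max_abs_over_std(with_outlier))
```

Both numbers rise once the outlier is added, which is the kind of signal one could compare between a model trained with the +1 and one trained without it.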

zorgmonkey · on July 24, 2023

If popular models are still making this mistake then it still seems noteworthy and making a blog post or paper to increase awareness definitely seems worthwhile. Also multiple independent discovery of good ideas is quite common.

the__prestige · on July 25, 2023

The question is whether people have attempted quantization (the int8 / GGML / GPTQ approaches) and whether the "flattening" of the distribution due to a larger denominator results in better quantization behavior. You'd have to specifically try quantization with and without the +1 to understand the advantage. OP argues that the advantage could be significant.

PartiallyTyped · on July 24, 2023

The argument / reasoning is a bit dubious.

Technically, softmax is not implemented as presented but through exp(x_i-max(x)), with the sum over those terms in the denominator. But maybe I am missing something.

Furthermore, the residuals are used exactly because the networks can't learn the identity function, but they can learn zero, at which point the residual is `f(x): x+g(x)` with `g` being `g: x ~> 0` (i.e. approximately 0).

It is also the case that `f(x): x+g(x)` makes it easier for gradients to flow through.

mrfox321 · on July 24, 2023

You are misreading things.

Regardless of numerical stability tricks (e.g. exp(x_i-max(x))), you are still simply normalizing the logits such that the probabilities sum to 1.

The blog adds an additional hidden logit (equal to 0) to allow for softmax(x) = 0 when x -> -inf.
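
A small sketch of the two points above, under the assumption that the blog's variant is equivalent to appending one extra logit fixed at 0 and discarding its output. The function names are made up for this example.

```python
import math

def softmax(xs):
    """Standard softmax, computed with the usual max-subtraction trick."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(xs):
    """Softmax over xs plus a hidden logit equal to 0, whose output is dropped."""
    return softmax(list(xs) + [0.0])[:-1]

# Shifting every logit by a constant leaves plain softmax unchanged,
# so the stability trick does not alter what is being computed.
print(softmax([1.0, 2.0, 3.0]))
print(softmax([101.0, 102.0, 103.0]))

# With all logits strongly negative, plain softmax still sums to 1,
# while the +1 variant lets every output approach 0.
logits = [-20.0, -21.0, -22.0]
print(sum(softmax(logits)))
print(sum(softmax_plus_one(logits)))
```

The hidden logit absorbs the probability mass, which is how the outputs can all go toward zero as the real logits go toward -inf.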

PartiallyTyped · on July 24, 2023

How can `x -> -inf` occur in the first place when nearly everything is within [-2,2], it's a dot product, and there's normalization before that too?

uoaei · on July 25, 2023

The use of the "nearly" in your comment is exactly occluding the issue as presented.

Enough weights don't fall under that "nearly" that we require more bits per weight to cover those edge cases. If we were able to delete the "nearly" we would need fewer bits (smaller models).

PartiallyTyped · on July 25, 2023

So the concern is not that x->-inf due to values but it happens due to numerical issues arising out of lower precision?

uoaei · on July 25, 2023

The idea is that if your range of values is small enough you need fewer bits to distinguish between meaningfully different values. The problem is that exp(x) << exp(y) for sufficiently wide ranges [x, y], so that when normalizing in the softmax and subsequently quantizing you don't get the fidelity you need and too much information is lost between layers. The proposed solution is that modifying the softmax step slightly brings x and y close enough to zero that exp(x) and exp(y) are close enough so that more compact quantizations are useful instead of useless.
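
To make the range argument concrete, here is a toy sketch. It assumes symmetric absmax quantization to 8 bits, which is only one of several schemes and is not specified anywhere in this thread; the values and the outlier are invented.

```python
def quantize_absmax(values, bits=8):
    """Round-trip values through symmetric absmax quantization (assumed scheme)."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8 bits
    scale = max(abs(v) for v in values) / levels
    if scale == 0.0:                          # all zeros: nothing to quantize
        return list(values)
    return [round(v / scale) * scale for v in values]

def mean_abs_error(original, restored):
    """Average absolute round-trip error over the paired entries."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

small = [0.1, -0.2, 0.3, 0.05]
with_outlier = small + [60.0]                 # hypothetical outlier widens the range

err_narrow = mean_abs_error(small, quantize_absmax(small))
err_wide = mean_abs_error(small, quantize_absmax(with_outlier)[:len(small)])
print(err_narrow, err_wide)
```

With the outlier present, the quantization step is set by 60.0, so the small values land on only one or two grid points and their round-trip error grows by orders of magnitude.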

Piezoid · on July 24, 2023

Implementations usually replace the 1 in the denominator with exp(-max(x)) for this reason.
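
A sketch of that rewrite, as one possible reading of the comment rather than any particular library's code. Clamping the shift at 0 is an added assumption here, so that exp(-m) cannot overflow when every logit is very negative; with a non-negative maximum it reduces to the exp(-max(x)) form described above.

```python
import math

def softmax1_stable(xs):
    """exp(x_i) / (1 + sum_j exp(x_j)), computed with a shift m for stability."""
    m = max(max(xs), 0.0)                     # assumption: clamp the shift at 0
    exps = [math.exp(x - m) for x in xs]
    denom = math.exp(-m) + sum(exps)          # the "1" becomes exp(-m) after shifting
    return [e / denom for e in exps]

# Agrees with the direct formula on moderate inputs...
xs = [1.0, 2.0, 3.0]
direct = [math.exp(x) / (1.0 + sum(math.exp(y) for y in xs)) for x in xs]
print(softmax1_stable(xs), direct)

# ...and stays finite where the direct formula would overflow.
print(softmax1_stable([1000.0, 999.0]))
```

Dividing numerator and denominator by exp(m) is what turns the 1 into exp(-m), so the result is mathematically unchanged.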