
1. Summary

The author is suggesting that we add 1 to the denominator of the softmax that is used within attention mechanisms (not the final output softmax).

The softmax inside an attention unit lets it treat key/query matches as probabilities; those probabilities support a continuous-valued version of a key-value lookup (instead of the 1/0 output of a discrete lookup, we get weights, where a high weight marks the desired key-value match).

Adding 1 to the denominator changes an attention unit so that it no longer works with a true probability vector of weights, but with weights that sum to less than 1. The motivation is that the network can learn to provide high scores so that the adjusted softmax stays very close to a probability vector, and it gains a new option: providing all-low scores, which yield all-low output weights, meaning it can opt out of having high confidence in anything.
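For concreteness, a rough NumPy sketch of the two variants (the max-subtraction is just the usual numerical-stability trick; "softmax1" follows the naming used in the article and below, but this is my paraphrase, not the author's code):

    import numpy as np

    def softmax(x):
        # standard softmax: output weights always sum to exactly 1
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def softmax1(x):
        # proposed variant: +1 in the denominator (shifted along with the logits),
        # so the output weights sum to less than 1
        e = np.exp(x - np.max(x))
        return e / (e.sum() + np.exp(-np.max(x)))

    scores = np.array([-4.0, -5.0, -6.0])   # all-low scores: the unit "opts out"
    print(softmax(scores).sum())            # 1.0 no matter how low the scores are
    print(softmax1(scores).sum())           # ~0.03, far from a probability vector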

(switching to opinion mode)

2. How can we tell if this is good?

2a. We should just try it out: Train an LLM with this, see if it works.

2b. There are two reasons I suspect it won't make a big difference.

First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output. Then we're basically taking an average of a bunch of vectors (vs a weighted average that is more like choosing one of them). Statistically, we expect that averaged vector to be close to zero. In other words, the node already has a way to effectively opt-out by providing a near-zero output vector.

Second, in a transformer, each attention unit has many other learned weights that can support the ability to opt out. Both the V matrix and the feed-forward layer after the attention unit give that module a way to provide low values to the activation function after the feed-forward layer, which would result in a value as small as you like — again, a way to opt out.

3. I appreciate the non-academic tone of the article and the willingness to play around with fundamental ideas. Although I'm not totally convinced by the note, I'd love to read more stuff like this.




The way I understood it, the author is saying that, with this change, big values disappear, and we can then use fewer bits to encode the output of transformers, which means reducing the memory requirements of the network. Memory being the limiting factor in running large models, this would be a big deal.


> The Qualcomm AI researchers found that 97%+ of outlier activations in LLMs occur in whitespace and punctuation positions.

This is striking. If true, why not try to ignore whitespace and punctuation?

In old Latin, scriptio continua [1] was a way to write continuously, for the exact same reason: to save space. Some modern languages still do that, and are no less parseable.

Granted, it's unlikely a commercial LLM would become popular if it produced output without spaces or punctuation; but an open source one that promised to be much more compressible, and therefore work on smaller machines, might be super useful.

It's not hard for a human to add spaces afterwards. It used to be a job for beginning journalists at the time of telex machines: press releases were sent in all caps without spaces, and interns were tasked with adding slashes between words. In French it was called "bâtonner les dépêches" (literally: add sticks to press releases -- not sure about the idiomatic English translation).

[1] https://simple.wikipedia.org/wiki/Scriptio_continua


In the Qualcomm paper cited, they explain/hypothesize that Transformers learn to attend to these low-meaning tokens when they want to avoid adding too much extra info to the residual stream. So it's not an issue that the models attend to spaces and punctuation at these outlier positions – it's the workaround the models come up with to get around the fact that attention has to go somewhere.

This post's author has a different solution, and one that theoretically could avoid causing large outliers that prevent efficient quantization. These large outliers seem to be an unfortunate side-effect of the models' learned solution.

So getting rid of spaces would do nothing to solve the problem, and would instead force the models to learn a new solution, one that presumably isn't as optimal.


> This is striking. If true, why not try to ignore whitespace and punctuation?

It is initially, but thinking about it some more, there's a lot of information packed in whitespace and punctuation choice.

Scriptio continua may have worked because the few readers who lived back then expected it to encode some form of legal or religious prose, but even then they could learn things from the overall shape of the document. LLMs are working in a much richer domain of document types, but the only thing they can "see" is a stream of tokens. There's no spatial or geometric data attached there. So whitespace and punctuation are the only thing an LLM has to make inferences about otherwise textually identical inputs. Such as:

  (see: other)  -- vs -- {see: other}
One being likely a text fragment, the other likely a piece of code.

Or how spacing may imply Markdown or YAML being used. Or how it may imply a list. Or a poem. Or a song. Or specific writing style, such as "lol im a casual who not care bout comms" vs. "I am a distinguished professor, about to retire. Elites like us put two spaces after full stop."


> the few readers who lived back then expected it to encode some form of legal or religious prose

The Latin literature was extremely rich, from Cicero to Tacitus, and was certainly not limited to legal information.

Here's part of your comment with white space and punctuation stripped:

scriptocontinuamayhaveworkedbecausethefewreaderswholivedbackthenexpectedittoencodesomeformoflegalorreligiousprosebuteventhentheycouldlearnthingsfromtheoverallshapeofthedocumentllmsareworkinginamuchricherdomainofdocumenttypesbuttheonlythingtheycanseeisastreamoftokenstheresnospatialorgeometricdataattachedtheresowhitespaceandpunctuationaretheonlythinganllmhastomakeinferencesaboutotherwisetextuallyidenticalinputs

It's a little hard to read, but not that hard. I think one would get used to it.

Also, for creative use of LLM, it may be a feature, as trying to find the words could be inspiring.

I think it would be worth a try.


Now do a modern structured document with sections and bullet points and logical connectives.


    string.replace(/[\s\.\*<>\!\?,;:\-–\|"'\[\]\(\)]/g, '')


...there's only 1 space there though


Those paratextual phenomena are probably important for the model's representations -- not something to get rid of, and not easily compressible either. Have a look at predictive features for authorship attribution in stylometry, for example: whitespace and punctuation are always decisive.


LLMs are also pretty good at programming and I would expect this to nuke programming ability completely, wouldn't it?


Seems you could make a pipeline where a much simpler model adds spaces and punctuation to output from the main model.


I suspect punctuation adds significant meaning to the models, that could be why so much computation is applied to it.

That's not to say a pipeline couldn't be effective.


For spaces, at least in some (maybe most?) languages, you don't even need a NN to add them back: words made of two or more words aren't that common, and when they do occur you probably want the composite one anyway, so it boils down to starting from the beginning of the text and looking up in a dictionary the longest string that is a valid word (see the sketch below). The only language I know of that uses a lot of composite words (I mean words made by sticking two or more words together) is German, but I think that looking for the longest sequence occurring in a dictionary would be correct most of the time.
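A rough Python sketch of that greedy longest-match idea (the tiny dictionary is just for illustration; a real one would be a full word list):

    def segment(text, dictionary, max_word_len=20):
        # greedy longest-match segmentation: at each position, take the longest
        # dictionary word that starts here; fall back to a single character
        words, i = [], 0
        while i < len(text):
            for j in range(min(len(text), i + max_word_len), i, -1):
                if text[i:j] in dictionary or j == i + 1:
                    words.append(text[i:j])
                    i = j
                    break
        return words

    vocab = {"the", "long", "longest", "match", "wins", "here"}
    print(segment("thelongestmatchwinshere", vocab))
    # ['the', 'longest', 'match', 'wins', 'here']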


I think you're significantly underestimating how many words could be retokenized into multiple words even before considering how concatenation affects things. For example: Concatenate is a word, but so are con, catenate, cat, and enate. Yes, no two of those are likely to be used in sequence, but I don't think that's a very reliable rule overall—"a" and "an" are both common words and negative prefixes.


Maybe you're right. I was biased by my native language, which doesn't have the a/an problem that English has.


Yes I was thinking about that, it should be quite easy afterwards.


Yeah, good to bring it back to the original point. Reading the article felt exciting, but in hindsight I am now missing a key detail.

The equations all seem to be matrix operations with a fixed number of rows / columns (you can take me as a real layman here). Unless you change that, I don't understand _how_ you can reduce memory needs. Granted, I'm probably putting my foot in my mouth not understanding transformers.


More ELI5 than the other comments. Consider the softmax network:

During quantization we find that values in the network vary from 0->5000, but 95% of values are <100. Quantizing this to 8bits would mean that our values would be in increments of about 20. Remembering that 95% of our values are below 100, we would only have about 5 discrete values for 95% of our values - so we would be losing a lot of "resolution" (entropy/information). For example (assuming rounding is used), an original value of 19 would be quantized to 20 and 30 would be quantized to 40. The original values differ by 11, but the quantized values differ by 20!

This is where exotic encodings come into play. We might try to use a logarithmic scheme, for example. This would result in higher value densities at lower values - but we would probably still waste bits and it would require more APU cycles.

Now switch to the softmax1 network:

The range of values is less important than the distribution - instead of 95% of the values falling in a small range, we would see the values more evenly spread out. Assuming that the range is now 105 (so the 5% outlying neurons from the softmax network are still >100), we would have 243 values to represent everything under 100. The same example with 19 and 30 would result in 19.27 and 30.34 respectively, a difference of 11.07 - which is very close to the unquantized difference of 11. We have retained more information in the quantized version of the network.

Information is lost either way, but what's important is how much information is lost.
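A rough sketch of that arithmetic with plain uniform int8 quantization (the ranges and the 19/30 example are the illustrative numbers from above, not measurements from a real model):

    import numpy as np

    def quantize_roundtrip(x, lo, hi, bits=8):
        # uniform quantization: snap values to one of 2**bits evenly spaced levels in [lo, hi]
        step = (hi - lo) / (2 ** bits - 1)
        return np.round((np.asarray(x, dtype=float) - lo) / step) * step + lo

    x = [19.0, 30.0]                          # where 95% of the values live
    wide = quantize_roundtrip(x, 0, 5000)     # softmax network: outliers stretch the range to ~5000
    narrow = quantize_roundtrip(x, 0, 105)    # softmax1 network: outliers tamed, range ~105
    print(wide, wide[1] - wide[0])            # ~[19.6, 39.2], gap ~19.6 (true gap is 11)
    print(narrow, narrow[1] - narrow[0])      # ~[18.9, 30.1], gap ~11.1 (close to the true 11)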

The reason that the large values appear is because the heads attempt to "scream really loud" when they are certain that they are right. This is an emergent behavior due to softmax - it ironically sucks at paying attention to a few of the heads: it boosts the volume of the heads that are trying to abstain, and mutes the volume of the heads that are trying to vote.


> During quantization we find that values in the network vary from 0->5000, but 95% of values are <100. Quantizing this to 8bits would mean that our values would be in increments of about 20.

Instead of using an 8bit integer with even step size quantization, wouldn't they still use an 8bit float?


Possibly, it depends on the distribution of the values. It would also make my examples far less straightforward :)

Either way you would still only have 256 discrete values.


No one quantizes blindly without accounting for data. If 95% of your values are in 0-100 you’ll probably do something like have 20 values for 0-100 and the remaining 12 for 101-5000. You don’t have to apply a uniform distribution and shouldn’t when your data is that concentrated.


Third paragraph.


If I'm following correctly, does this mean that with this change along with a model being quantized, we could see models that are 5% the size (on file system) and memory usage but almost identical in output?


The values I selected were arbitrary. The size reduction will be 32bits/8bits - so it will be 4 times smaller.


It has to do with the precision of the values stored in those rows and columns. If they could be coerced into a narrower range (without losing information) then we could effectively store them each with 8 bits or something. The +1 prevents blowups when the denominator in its current form approaches 0, and without those blowups we can use fewer bits, in theory.


That is only true if using the new softmax changes the dynamic range of the values. We are using floating point, not fixed point. So if before our values went from 1 to 5000 and now they go from 0.0002 to 1, we still have the same dynamic range and so still need the same resolution.


The quantized versions are not floats but ints.


The activations (outputs) of one layer must be encoded in the same way as the weights of that layer as well as the weights of the next layer or the computation fails (unless you manage to write clever kernels for doing math at different levels of precision simultaneously, but even then you're introducing even more lossiness than just using a binary representation for those values).

Example: multiplying a bunch of float16s together gives you a float16. That is passed on to the next layer of float16s. Why should forcing the output of the first step to be float8 confer any advantage here? The only way I can see this argument working is if you make all the layers float8 too, and the reason you can do that is that the output of the first step can be faithfully represented as float8 because it doesn't ever blow up. If that's what the author is saying, it wasn't very clear.


You can reduce the number of bits per float (scalar).


I actually prefer the conceptual model the author suggests:

> Originally I wanted to call this function ghostmax, as you can think of there being an extra zero-valued entry in x (as exp(0)=1), as well as a zero vector in the V matrix that attenuates the result.

Don't think of this as weighting the options so that some of the time none of them is chosen. ("Weights that add up to less than 1.") Instead, think of this as forcing the consideration of the option "do nothing" whenever any set of options is otherwise considered. It's the difference between "when all you have is a hammer, everything looks like a nail [and gets hammered]" and "when all you have is a hammer, nails get hammered and non-nails get ignored".
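You can check that framing directly: appending a constant zero logit to an ordinary softmax and then dropping its weight gives exactly the +1-denominator version (a quick sketch, not the article's code):

    import numpy as np

    softmax  = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()
    softmax1 = lambda x: np.exp(x) / (1.0 + np.exp(x).sum())

    x = np.array([1.5, -0.3, 0.7])
    ghost = softmax(np.append(x, 0.0))[:-1]   # extra zero-valued "do nothing" entry, weight discarded
    print(np.allclose(ghost, softmax1(x)))    # True

In the attention unit, the corresponding row of V is the zero vector from the quote above, so the discarded weight really does contribute nothing to the output.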

I like this framing because, as an example, it bothers me that our speech-to-text systems use this method:

1. A human predetermines what language the input will use.

2. Audio in that language is fed to transcribing software.

3. You get, with modern technology, a pretty decent transcription.

3(a). ...if the audio sample was really in the language chosen in step 1.

If you ignore the choice of language and feed French audio to an English transcriber, you get gibberish. This is wildly at odds with how humans do transcription, where absolutely the first thing that a system that only knows how to transcribe English will do, when given French audio, is object "hey, this is definitely not English".


Most STT systems also tend to still train on normalized text which is free of the punctuation and capitalization complexities and other content you find in text LLMs. I suspect we continue in this way in part due to lack of large scale resources for training, and due to quality issues - Whisper being an outlier here. Anecdotally 8bit quantization of larger pre-normalized STT models seems to not suffer the same degradation you see with LLMs but I can't speak to whether that's due to this issue.


This seems like a good way to look at it. Another way to put it is, there is a certain "origin" or "default" confidence which is pinned to some fixed value pre-softmax, ie, all outputs are necessarily compared to that fixed value (pretending zero is another input to the softmax) rather than merely each other.


I like your description because it's relatively succinct and intuitively suggests why the modified softmax can help the model handle edge cases. It's nice to ask: How could the model realistically learn to correctly handle situation X?


Yea - the way we can "tell if this is good" is by

a) train two identical models on a large dataset, one with the +1 in the denominator for the softmax steps of the attention modules, one without

b) show that they have similar performance (doubt the +1 will make performance better, but we need to show it doesn't make things worse)

c) show that there are fewer "blowups" in the model with the +1, and that it can therefore be quantized more effectively.


> train two identical models on a large dataset

Yes but how much would this cost?

Would it be possible to build a small dataset that produces known outlier values, and test on that?


It doesn't need to be two huge models. If there is an advantage to doing this, I'd expect that you would see it even in a small test case. I'm sure we'll see something by the end of the week if not earlier if there's something to it.


One of the most significant quantization papers of the last year [1] found precisely that these outliers only start occurring with LLMs at 6.7B parameters and above.

One of the most important keys to the success of deep learning in the last couple years has been the fact that emergent features exist after certain scales, so I wouldn't be too quick to dismiss things that don't help at smaller scales, nor would I be certain that all the tricks that help in small data/parameter regimes will necessarily help in larger models. Unfortunately!

[1] https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...


Looking at that paper, they appear to be saying that 6.7B is where the problem becomes so intense that no single quantization method can keep up. From what I gather, the paper claims that such outliers start to occur in models as small as 125M params, then at around 1.3B they begin to affect the FFN, and at around 6.7B the issue really becomes apparent because "100% of layers use the same dimension for outliers."

So while you obviously wouldn't be able to conclusively prove the idea fixes the issue in larger models, if you know what you are looking for you should be able to validate that the method works in general down to very small models.

That said, consumer grade cards should be able to train an 8B model with quantization, so you might as well train the whole thing.


The reason it might need to be huge is because the long tail of extreme weights might only begin to show up then, but yes best to just start w something you can run on a laptop.


That is a good start. I wonder though if the change affects the ideal hyperparameters. Do you need more or less dropout if you make the change? What about learning rate?

So you might want to re-search the hyper params for a fair shot.


> First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output.

Disagree here, I think neural nets are quite bad at implicitly learning low entropy transforms, similar to how they struggle to model the identity function, necessitating residual connections. In both cases the change doesn't increase expressivity, but it does bake these needle-in-a-haystack transformations into the model that may be hard to access with gradient descent.

Can't speak to how useful it is though.


Surely you mean high-entropy, ie, uniform? We are talking about extremely low-entropy predictions as being the problem here.


yep - always get that the wrong way round haha


This is a technique that's been known for years and is in PyTorch. It's not widely used because people tried it and, in practice, it doesn't work as well.

OP calling it a "bug that's been overlooked for 8+ years" is click bait.


> ...is in PyTorch

Could anyone kindly point me to this as I can't find it.


The add_zero_attn parameter in PyTorch [1] is used for this, but by default their softmax is the regular kind. It has been in flaxformer for a couple of years now, though it claims to be a compatibility variant for older models [2], and I haven't seen any mention of it in their recent papers (though I've not checked exhaustively).
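For anyone looking for it, a minimal usage sketch (the dimensions are arbitrary; add_zero_attn appends an all-zero key and value, so the extra slot can soak up attention weight while contributing nothing to the output):

    import torch
    import torch.nn as nn

    embed_dim, num_heads, seq_len, batch = 64, 4, 10, 2
    attn = nn.MultiheadAttention(embed_dim, num_heads, add_zero_attn=True)

    x = torch.randn(seq_len, batch, embed_dim)   # (seq, batch, embed) -- the default layout
    out, weights = attn(x, x, x)                 # self-attention with the extra zero-attention slot
    print(out.shape)                             # torch.Size([10, 2, 64])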

[1]: https://pytorch.org/docs/stable/generated/torch.nn.Multihead... [2]: https://github.com/google/flaxformer/blob/main/flaxformer/co...


> Statistically, we expect that averaged vector to be close to zero.

I'm not sure that's the case, especially in high dimensions.

The expected absolute value of the sum of n random variables, each uniform on [-1,1], grows with n. I'm pretty sure it's proportional to the sqrt of n.

Also, random walks in high dimension return to zero with probability zero, so the sum of random variables in high dimensions going close to zero seems unlikely as well.


Both of your points are basically true, but I think a better way to model the problem is as a set of similar-length vectors being linearly combined by a probability vector.

Mathematically, we can write v_out = V * w,

where v_out is the vector of output from the attention unit, w is the probability vector from the softmax, and V is the set of input vectors, where each column is an input vector.

For a moment, pretend that the columns of V are orthonormal to each other. This might not be true, but it's an interesting case.

When the model wants the output to be small, it can set w = 1/n, meaning all coordinates of vector w are 1/n. (n = the number of columns in V)

In that case, the length ||v_out|| will be 1/sqrt(n) exactly, which is small compared to the input lengths of 1 (since we're pretending they were orthonormal).

Now if we stop pretending they are orthonormal, the worst case is that they're all the same vector, in which case the weights w can't change anything. But that's a mighty weird case, and in high dimensions, if you have any randomness at all to a set of vectors, they tend to point in wildly different directions with dot products close to zero, in which case the same intuition for the orthonormal case applies, and we'd expect a uniform distribution coming out of the softmax to give us a vector that's much smaller than any of the input vectors.
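A quick numerical check of that intuition, with random unit vectors standing in for the columns of V (the dimensions are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 1024, 64                          # embedding dim, number of value vectors
    V = rng.standard_normal((d, n))
    V /= np.linalg.norm(V, axis=0)           # unit-length columns; nearly orthogonal in high dimension

    w = np.full(n, 1.0 / n)                  # uniform weights from a "low-confidence" softmax
    print(np.linalg.norm(V @ w))             # ~0.125
    print(1 / np.sqrt(n))                    # 0.125 -- the orthonormal-case prediction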


One caveat is that the average of many normally distributed vectors in many dimensions is normally distributed with 0 mean but is not typically close to 0. In fact the average norm is quite large. Try it yourself and see!


Don't most softmax implementations include an epsilon in the denominator which likely serves the same purpose? So the suggestion is to set that epsilon to 1?


I agree with your conclusions, but not necessarily with the reasons you present. I don't think it's _that_ easy for a current transformer to pass the information unaltered (i.e. to effectively replace softmax with 0).

In particular, I think the feedforward point you list in your "Second" is actually wrong. Replacing a softmax with 0, as the OP wants to do, is tantamount to passing the information unchanged, because the attention block is within a residual (skip) connection. If it's set to zero, the next output is identical to the previous layer output. There is no way to recover this effect with the feedforward layer.

The point that you can set V to zero is true, but it's somewhat a different idea: the Q and K parts should be able to drive the attention to 0 if no token wants to be "close" to some other token, in some sense. But the V layer shouldn't "know" about this, because it can't look at other tokens. This is of course only how we think of transformers, which might or might not (more likely, the latter) be how it actually works. But nevertheless, having a 0 value coming out of the K.Q^T part only would be very meaningful.

Your "first" point is technically true (albeit logically false): if you have a sequence of length 32k, like GPT4-32k, and your softmax logits all predict the same value, the result will be an average of the V layer, divided by 32k, which is effectively close to zero. However, calibrating "exactly the same value" is extremely hard for a neural network, and there is no "default value" it can predict to make sure that's the case - even if you push all the values to one side, the result doesn't change, because softmax is translation invariant. Plus, if you have a short sentence, that's not true anymore. If you only have two tokens, one of them must be activated, or both with only a 0.5 factor. Surely if you have very few tokens there's much more contamination between Q, K, and V, so in that case V can indeed take a 0 value, but it's non-trivial and requires more layers.
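(A tiny sketch of that translation-invariance point; softmax1 here is just the +1 variant under discussion. Shifting every logit by the same amount does nothing to the standard softmax, but with the +1 it becomes a genuine way to abstain:)

    import numpy as np

    softmax  = lambda x: np.exp(x) / np.exp(x).sum()
    softmax1 = lambda x: np.exp(x) / (1.0 + np.exp(x).sum())

    x = np.array([2.0, 1.0, 0.5])
    print(np.allclose(softmax(x - 10), softmax(x)))   # True: no "default value" to retreat to
    print(softmax1(x - 10).sum())                     # ~0.0005: pushing all logits down now means "opt out"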

All in all, adding that "+1" isn't quite meaningless, I think. Nevertheless, I believe it won't change much: these very big models have ways to get around any kind of smart small modification you make. If the intuition is very right, it might be that you can squeeze out 1% more accuracy in a handful of tests, after you carefully optimize all other parameters, which would be enough to get you a paper in a top conference. And it might also be implemented as a standard from then on (because, in this case, it basically doesn't cost any more computation, so it's "free"). But I would bet it won't be a major revolution.

That said, as you say, the only way to know would be to train a few models with this option and check their actual quality (certainly not GPT-style, nor GPT4-size, models to begin with, but something quicker to train and easier to test in a fully automated way; old "boring" models like those in the BERT family would be a good place to start testing). But to do that effectively, you'd need somebody skilled in training this kind of model, with the cleaned data ready at hand, etc. (and a small compute budget, of course, but nothing revolutionary; a few thousand $ in GPU credits could be enough).


I am a transformer, it should definitely work



