As far as I understand, in this article he argues that compression happens in the output of, say, ChatGPT.
But does it really?
To make a similarly low-resolution metaphor: the “Bayesian kaleidoscope” of a language model doesn’t necessarily blur the “word pixels” it is moving around. Moving them around and rearranging them is essentially what it does, even if in opaque ways; it does not degrade them, change letters in words, or deliberately scramble the word order of a sentence.
Making sense of the “image” an LLM produces is left up to us, so it is also up to us to decide whether anything has been compressed. And then, how do you measure it?
If you cut a painting into pieces, then glue them back together at random, thus making a new painting, would that constitute a “compression” or just a new painting, which could be worse or better than the original?
I quite like Chiang’s writing, but not this time. If anything, his take on this slightly undermines what he previously wrote, painting him as more of an LLM than he would probably like to admit :)
A better metaphor would be to say it compresses the internet and creates a Markov chain based on that compression. Then, to make it work, it compresses your prompt so that it can find it in the Markov chain, moves to the next step, makes a lossy decompression into a text token, and appends it. The lossy decompression here is the temperature: a higher temperature means more loss and more random words, but since it is lossy in the "meaning" space, the random words would still have a meaning very similar to before.
That isn't a perfect metaphor, but it explains very well how it can do most of the things it can do. The lossy compression means that it can work with large prompts and capture their essence instead of trying to look them up literally, and the lossy decompression lets it vary its output, so the text moves in slightly different directions instead of just repeating text it has seen. The magical bit is that this compression and decompression is much smarter than before: it parses text into a format much closer to its meaning, and that lets us do the above much more intelligently.
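To make the temperature part of the metaphor concrete, here is a minimal sketch of temperature-scaled sampling over next-token scores. The tokens and scores are made up for illustration, and the function names are hypothetical; this is the textbook softmax-with-temperature idea, not a claim about how ChatGPT is actually implemented. It only shows the "more random at higher temperature" part, not the claim that the random picks stay close in meaning.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng=random):
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical next-token scores, invented for this example.
tokens = ["cat", "dog", "car"]
logits = [2.0, 1.5, -1.0]

cold = softmax_with_temperature(logits, 0.5)  # sharper: the top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: unlikely tokens get more mass
```

With these made-up scores, the top token gets a larger share of the probability at temperature 0.5 than at 2.0, which is the sense in which a higher temperature gives "more random words".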
Edit: Thinking about it a bit, maybe we could make these models way cheaper to run if we made them work as a compression to meaning rather than as the huge models they are now? They do have an internal understanding/meaning of the tokens they get, so it should be possible to create a compression/decompression function based on these models that transforms text into its world-model state, and once we start working with world-model states, things should be super cheap relative to what we have now.
Also, maybe it doesn't do lossy decompression and get words with similar meaning; that is just another way I could see the models being smaller and cheaper while keeping their essence. The Markov chain step could be all it uses currently. But it definitely creates that space and Markov chain: it parses the previous thousand or so tokens and uses those to guess the next token, and that is a Markov chain. It just has a very sophisticated way of parsing those thousand tokens into a logical format.
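For reference, this is roughly what "use the previous N tokens to guess the next token" looks like as a literal lookup-table Markov chain. It is a toy sketch with a made-up corpus and hypothetical function names, and there is no "sophisticated parsing" step at all: states are raw token windows, which is exactly the part the metaphor says an LLM improves on.

```python
import random
from collections import Counter, defaultdict

def build_chain(tokens, order):
    """Map each window of `order` tokens to counts of the token that followed it."""
    chain = defaultdict(Counter)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        chain[state][tokens[i + order]] += 1
    return chain

def generate(chain, seed, length, rng=random):
    """Extend `seed` by sampling followers of the last len(seed) tokens."""
    out = list(seed)
    order = len(seed)
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break  # unseen state: a plain lookup table has nothing to offer
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out

# Tiny made-up corpus, split on whitespace for simplicity.
corpus = "the cat sat on the mat and the cat ran off".split()
chain = build_chain(corpus, order=2)
```

Note the `break` on an unseen state: a raw table like this only knows windows it has literally seen, so any useful version has to map many different contexts onto similar states somehow.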
> creates a Markov chain based on that compression
I dislike that interpretation. It suggests it builds a very basic statistical model, but a very basic statistical model simply wouldn't be able to do what these models can do.
Or alternatively, if you want to consider the model as a Markov chain mapping the previous four thousand tokens to a probability for the next token, then the state space is astronomically large. Beyond astronomically and even economically large: there are ~50,000^4096 possible input states.
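A quick back-of-the-envelope check of that figure, taking the ~50,000-token vocabulary and 4096-token context from above as given (the variable names are just for illustration):

```python
import math

vocab_size = 50_000      # approximate vocabulary size, as stated above
context_length = 4096    # tokens of context, as stated above

# log10(vocab_size ** context_length), computed without building the huge integer
log10_states = context_length * math.log10(vocab_size)

# number of decimal digits in vocab_size ** context_length
digits = math.floor(log10_states) + 1
```

That comes out to a number with 19,247 decimal digits, i.e. roughly 10^19247 possible states. For scale, commonly quoted rough estimates for the number of atoms in the observable universe are around 10^80, so a literal table over these states is out of the question.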
> but a very basic statistical model simply wouldn't be able to do what these models can do.
Why do you think that? Why do you think a basic statistical continuation of the logic of a text wouldn't do what the current model does? There are trillions of conversations out there it can rely on to continue the text: people playing theatre, people roleplaying, tutorials, people playing opposite games, people brainstorming, etc. Create a parser that can parse those down to logic, then make a Markov chain based on that, and I have no problem seeing the current ChatGPT skills manifesting from that.
> Or alternatively, if you want to consider the model as a markov chain mapping the probability from the previous four thousand tokens to the next token then the space is astronomically large. Beyond astronomically and even economically large, there are ~50,000^4096 possible input states.
Yes, that is the novel thing: it compresses the states down to something manageable without losing the essence of the text, and then builds a model of the likely next token in that compressed space.