
They are just predicting the next word, but using a pretty deep understanding of prior context (the past). They do not plan AHEAD.



This is kind of an epistemological debate at this level, and I make an effort to link to some source code [1] any time it seems contentious.

LLMs (of the decoder-only, generative-pretrained family everyone means) are next token predictors in a literal implementation sense (there are some caveats around batching and what not, but none that really matter to the philosophy of the thing).
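For concreteness, the loop itself is roughly this simple once you strip away batching and caching (a toy sketch, not the linked Llama code; model and tokenizer here are hypothetical stand-ins):

    import torch

    def generate(model, tokenizer, prompt, max_new_tokens=64):
        # Encode the prompt into token ids.
        ids = tokenizer.encode(prompt)
        for _ in range(max_new_tokens):
            # One forward pass: logits over the vocabulary at every position.
            logits = model(torch.tensor([ids]))    # (1, seq_len, vocab_size)
            next_logits = logits[0, -1]            # only the last position is used
            # Turn logits into a distribution and sample a single next token.
            probs = torch.softmax(next_logits, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1).item()
            ids.append(next_id)                    # append it and go again
        return tokenizer.decode(ids)

Everything is literally "predict one token, append it, repeat"; anything that looks like foresight has to live inside that single forward pass.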

But they have some emergent behaviors that are a trickier beast. Probably the best way to think about a typical Instruct-inspired “chat bot” session is that the model samples from a distribution with a KL-style adjacency to the training corpus, at a response granularity, the same way a diffusion/U-net/de-noising model samples at the image batch (NCHW/NHWC) level. (Sidebar: this is why shops that do and don’t train/tune on MMLU get ranked so differently than e.g. the arena rankings.)

The corpus is stocked with everything from sci-fi novels with computers arguing their own sentience to tutorials on how to do a tricky anti-derivative step-by-step.

This mental model has adequate explanatory power for anything a public LLM has ever been shown to do, but that only heavily implies it’s what they’re doing.

There is active research into whether there is more going on, and so far it is not conclusive to the satisfaction of an unbiased consensus. I personally think that research will eventually show it’s just sampling, but that’s a prediction, not consensus science.

They might be doing more; there is some research that offers circumstantial evidence that they are.

[1] https://github.com/meta-llama/llama/blob/54c22c0d63a3f3c9e77...


They are absolutely planning ahead inasmuch as what they are outputting is setting up a continuation. They’re not even word predictors, remember - they are token predictors. Are you really saying that when you prompt an LLM with ‘name a large grey land animal’ and it outputs ‘ele’, it isn’t ‘planning’ that the next token will likely be ‘phant’?

The ‘decision’ to output ‘elephant’ is being made further up the neural network than final token selection - after all, it might want to output ‘Ele’ or ‘an’ (with a view to ultimately outputting ‘an elephant’) or ‘a’ (with a view to ultimately outputting ‘a common large grey land animal is an elephant’), or maybe it has been LoRA-trained to output all responses as JSON so the first token it needs to output is ‘{’… but surely the neural activations for that prompt are firing off ‘elephanty’ messages somewhere in the network, right?

So if there’s some sort of symbol activation ahead of token selection, why would it be hard to believe that a large neural network is forming more complex decisions about what it intends to output, in an abstract way, before it selects how to express itself?

And in what way is that distinct from ‘planning ahead’?


> Are you really saying that when you prompt an LLM with ‘name a large grey land animal’ and it outputs ‘ele’, it isn’t ‘planning’ that the next token will likely be ‘phant’?

The model outputs words, not tokens, so that is not a great example.

Any prompt will have multiple possible (predict-next-word) continuations, which you can think of as branching futures. Many possible next words, each of which has many possible following words, etc, etc.

The model is essentially predicting over all these possible futures. You can call it planning if you like, but remember that the model has no idea which of these branching futures it is going to follow - it literally doesn't even know which word it is going to output next - it is just providing a bunch of probabilities (predictions) for the next word, and the sampling process then picks one - not necessarily the most confident next-word prediction.
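As a toy illustration of that split between the model and the sampler (made-up numbers; a real vocabulary has tens of thousands of entries):

    import random

    # Hypothetical distribution the model might emit after
    # "name a large grey land animal:" -- the model's job ends here.
    next_word_probs = {"elephant": 0.62, "rhino": 0.21, "hippo": 0.12, "the": 0.05}

    # The sampler, not the model, commits to an output; it will sometimes
    # pick something other than the most probable word.
    words, weights = zip(*next_word_probs.items())
    print(random.choices(words, weights=weights, k=1)[0])

Greedy decoding would always print "elephant" here; sampling will occasionally pick one of the others.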

The model really is winging it word by word, even if those (multiple alternative) next words are only probable because they are part of coherent following sentences in the training data.


Why so adamant that models work on ‘words’?

ChatGPT 3.5/4 tokens:

   “Elephant”: 46439, 28022 - “Ele” “phant”
   “elephant”: 10274, 28022 - “ele” “phant”
   “ Elephant”: 79189
   “ elephant”: 46840
   “ elephantine”: 46840, 483 - “ elephant” “ine”
Tokens are tokens. If it were limited to words it wouldn’t be able to produce non-words, but GPT and other LLMs are quite capable of inventing words, outputting nonsense words, and modifying words.

Regarding the ‘no idea which future it is going to follow’ - sure, it doesn’t know which future; indeed the sampler phase is going to pick an output merely based on the probabilities it’s outputting. But it’s outputting higher probabilities for some tokens because they are good tokens to use to lead to probable futures. It’s suggesting taking steps down certain paths because those paths are likely to lead to useful places.


I didn't say WORK on words, I said OUTPUT words.

But, it doesn't make any difference whether you are considering tokens or words. There are multiple possible continuations of the prompt, and the next word (or token) output does not - in general - force the word (or token) after that ...

Your "large grey mammal" could be an "elected official in a grey suit".


Right, it’s possible, but when the LLM places a high probability on the “ele” token it’s not because it predicts “elected official” is a likely continuation. It’s because it’s thinking about elephants.

Likewise when a coding LLM starts outputting a for-each loop, it’s doing so because it expects to want to write some code that operates on each item in a list. I don’t see how you can explain that behavior without thinking that it must be generating some sort of high-level algorithmic plan that causes it to feel like the next thing it should output is some sort of ‘foreach’ token.


I'm not disagreeing with what is presumably happening, but rather on how to characterize that.

Of course next word predictions are not based directly on surface level word sequence patterns - they are based on internal representations of what these word sequences mean, and predicted continuations are presumably going to be at a similar level of abstraction/representation (what you are calling a plan). This continuation "plan" then drives actual word selection/prediction.

Where we seem to differ is whether this high level continuation representation can really be considered as a "plan". To me the continuation is just a prediction, as are the words that might be used to start expressing that continuation, and presumably it's not even a single continuation with multiple ways of expressing it (turning it into a word sequence), but rather some superposition of multiple alternate continuations.

When we get to the level of words output it becomes even less plan-like, since the actual word output is randomly sampled, and when fed back in as part of the "sentence so far" it may cause the model to predict a different continuation (or set of continuations) than it had at the prior step. So, any "plan" (aka predicted continuation) is potentially changing continuously from word to word, rather than being decided ahead of time and then executed. As I noted elsewhere in this thread, the inability to plan multiple words ahead is behind these models' generally poor performance on the "give me a sentence ending in <word>" task, as opposed to perfect performance on the "give me a sentence starting with <word>" one.

If we contrast this behavior of a basic LLM to the "tree of thoughts" mechanism that has been proposed, it again highlights how unplan-like the basic behavior is. In the tree of thoughts mechanism the model is sampled from multiple times generating multiple alternate (multi-word) continuations, which are then evaluated with the best being chosen. If the model were really planning ahead of time it seems this should not be necessary - planning would consist of considering the alternatives BEFORE deciding what to generate.
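A minimal sketch of that sample-then-evaluate shape (not the published Tree of Thoughts code; generate and score are hypothetical stand-ins for "sample one multi-word continuation" and "judge it"):

    def best_continuation(prompt, generate, score, n_candidates=5):
        # Sample several alternate multi-word continuations of the same prompt...
        candidates = [generate(prompt) for _ in range(n_candidates)]
        # ...then evaluate them after the fact and keep the winner.
        # A model that truly planned before emitting would not need this outer loop.
        return max(candidates, key=score)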


> The model outputs words, not tokens, so that is not a great example.

Virtually all modern transformer models use pieces, which may be words, but also subwords. Theoretically, they could be longer units, but in most cases some characters (like whitespace) are used as piece boundaries when training the piece vocabulary. If they didn’t use pieces, they’d work terribly on languages where e.g. compounds are a single word.

In most realistic piece vocabs, ‘elephant’ will be a single piece, since it’s a fairly frequent word. But it’s totally possible in a small vocab that it would be split like the parent said, and conversely, that the model would generate ‘elephant’ by first predicting one piece and then the other.

Some piecing methods, like BBPE, have bytes as the smallest unit, so theoretically an unknown token could be split up (and generated) as pieces consisting of bytes.
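You can inspect this directly; a quick sketch assuming the tiktoken package (exact splits depend on which vocab you load):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")   # vocab used by the GPT-3.5/4 chat models
    for text in ["elephant", " elephant", " elephantine", "flumph"]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(repr(text), ids, pieces)

Frequent words come back as a single piece, while rare or invented ones come back as several.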


If you work out the loss function for next-token prediction, next-2-token prediction, or next-n-token prediction, you will find they are identical. So it's equally correct to say the model is trained to find the most probable unlimited continuation. Saying "it only predicts the next token" is not untrue but easily leads to wrong conclusions.
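The identity is just the chain rule of probability: summing per-token cross-entropy over a sequence is the same as the negative log-likelihood of the whole continuation,

    \log p(x_1, \dots, x_T) = \sum_{t=1}^{T} \log p(x_t \mid x_1, \dots, x_{t-1})

so the "next token" objective and the "whole sequence" objective differ only in how you group the terms.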


> Saying "it only predicts the next token" is not untrue but easily leads to wrong conclusions.

Indeed, it's akin to saying that "only quantum fields exist" and then concluding that therefore people do not exist.


What would it mean to plan ahead? Decoding strategies like beam search are popular and effectively predict many words ahead.
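For example, a toy beam search over a hypothetical next_token_probs(sequence) function that returns (token, probability) pairs:

    import math

    def beam_search(prompt_ids, next_token_probs, beam_width=3, steps=5):
        # Each beam is a (token sequence, summed log-probability) pair.
        beams = [(list(prompt_ids), 0.0)]
        for _ in range(steps):
            expanded = [
                (seq + [tok], logp + math.log(p))
                for seq, logp in beams
                for tok, p in next_token_probs(seq)
            ]
            # Keep only the most probable multi-token continuations so far.
            beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
        return beams[0][0]   # best continuation found

It only commits to tokens after looking several steps ahead, which is one concrete reading of "planning".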


Think before generating output - plan the entire sentence up front, instead of generating the first word(s) and maybe talking yourself into a corner. Tree-of-Thoughts (not Chain) is one way to provide something a bit similar - kind of like DeepBlue or AlphaGo generating possible branching future lines of play and picking the one with the best outcome.

To be more brain-like you'd really want the system to generally be "looping" internally - a bit like our thalamo-cortical loop - and only start outputting when the thought had gelled.


It's a shame HN doesn't use an LLM to upvote/downvote rather than people. Take the emotion out of technical discussions and rate based on factuality instead.

I suppose whoever downvoted this either hasn't heard of tree-of-thoughts, or doesn't understand what it is and what problem it is addressing. Or, maybe they just didn't like that their "gotcha" question had a simple answer.



The parameters are also optimized with the loss from future tokens in the sequence.


I mean, are we as humans planning ahead of the next few words? I certainly am not. But what matters is a deeper understanding of the context and the language model itself, which can then produce sensible spontaneous output. We as humans have the advantage of a non-language world model as well as abstract concepts, but all of human language is a pretty strong proxy for it.

The spontaneity of it isn't the issue; it's what's driving the spontaneity that matters. For example, a 1M context window is going to produce wildly more relevant output than a 1K context window.


> I mean, are we as humans planning ahead of the next few words? I certainly am not.

For me, sometimes either way. At least, that's my subjective self-perception, which is demonstrably not always a correct model for how human brains actually work.

We also sometimes appear to start with a conclusion and then work backwards to try to justify it; we can also repeatedly loop over our solutions in the style of waterfall project management, or do partial solutions and then seek out the next critical thing to do in the style of agile project management.

Many of us also have a private inner voice, which I think LLMs currently lack by default, though they can at least simulate it regardless of what's really going on inside them and us (presumably thanks to training sets that include stories where a character has an inner monologue).


> I mean, are we as humans planning ahead of the next few words? I certainly am not.

Sometimes we do, sometimes not.

Sometimes we just say stock phrases such as "have a nice day" or "you too" that are essentially "predict next word", but if I asked you something you'd never done before, such as "how can we cross this river, using this pile of materials", you'd have to think it through.

Some people may use their inner monologue (or visualization) to think before speaking, and others may essentially use "chain of thought" by just talking it through and piecing together their own realizations: "well, we could take that rope and tie it to the tree ...".



