Maybe we should teach Prolog/CLP and PDDL to LLMs. Unfortunately the training set would be too small.

It would be cool to have logic based modeling jobs, even if the goal is just to feed the LLMs.




Considering GPT can do programming and logic to some level, I assume it has had training of that sort? It can seem to do logic even on some completely made-up abstract notions. For example "Consider a jumajambi has 2 jimijimis. Each jimijimi is a jomololo or a joobajooba. How many possible variations of jumajambi are there if there are 4 jumajambi?".
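
For what it's worth, the puzzle itself is small combinatorics, and the answer depends on how you read it. A sketch of one reading (ordered jimijimi slots, independently varying jumajambis - my assumption, since the wording is ambiguous):

    # One reading of the toy puzzle: each jumajambi has two ordered jimijimi
    # slots, and the 4 jumajambis vary independently of each other.
    from itertools import product

    kinds = ["jomololo", "joobajooba"]

    one_jumajambi = list(product(kinds, repeat=2))            # 2^2 = 4 variations
    four_jumajambi = list(product(one_jumajambi, repeat=4))   # 4^4 = 256 variations

    print(len(one_jumajambi), len(four_jumajambi))            # 4 256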

People keep calling them "next token predictors", but clearly there is something more going on, and I would love for someone to give a simple explanation.


> People keep calling them "next token predictors", but clearly there is something more going on, and I would love for someone to give a simple explanation.

Next token prediction is the objective function. The model is asked to predict the next word, yes, but it's also allowed to compute the answer, and more importantly, the entire training process is supposed to be the model learning and figuring out what sort of computations aid the prediction of the corpus it's trained on.

If your corpus is text in language A followed by its translation in language B, then there's little choice but for the model to learn computations that translate as loss goes down.

If your corpus is chess moves, then again, it's going to have to learn how to compute chess games to reduce loss.

You can see this with toy models trained on toy problems. Example: a tiny transformer trained on addition examples (x + y = z) learning an algorithm for addition.

https://cprimozic.net/blog/reverse-engineering-a-small-neura...
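
For flavor, here's a minimal sketch of what "next-token prediction on addition examples" means in practice (my own toy version in PyTorch, not the code from that post): tokenize strings like "12+34=46", shift them by one position, and train on cross-entropy against the true next character.

    # Minimal sketch (assumptions mine, not the linked post's code): a tiny
    # character-level transformer trained purely on next-token prediction
    # over padded strings like "12+34=46  ".
    import random
    import torch
    import torch.nn as nn

    vocab = list("0123456789+= ")
    stoi = {c: i for i, c in enumerate(vocab)}

    def sample():
        a, b = random.randint(0, 49), random.randint(0, 49)
        s = f"{a}+{b}={a + b}".ljust(10)                 # pad to a fixed length
        return torch.tensor([stoi[c] for c in s])

    class TinyLM(nn.Module):
        def __init__(self, d=64):
            super().__init__()
            self.emb = nn.Embedding(len(vocab), d)
            self.pos = nn.Embedding(10, d)
            layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            self.enc = nn.TransformerEncoder(layer, num_layers=2)
            self.out = nn.Linear(d, len(vocab))

        def forward(self, x):                            # x: (batch, seq)
            T = x.size(1)
            h = self.emb(x) + self.pos(torch.arange(T))
            causal = nn.Transformer.generate_square_subsequent_mask(T)
            return self.out(self.enc(h, mask=causal))    # logits for every position

    model = TinyLM()
    opt = torch.optim.Adam(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(1000):
        batch = torch.stack([sample() for _ in range(64)])
        logits = model(batch[:, :-1])                    # predict token t+1 from tokens <= t
        loss = loss_fn(logits.reshape(-1, len(vocab)), batch[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()     # the only "feedback" is this loss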

"Pick the right word" is not a trivial exercise for the vast majority of text data.

And again, because people often make this mistake: an LLM's ultimate objective is NOT to produce "text that looks right" but "text that is right". Of course, "right" as determined by the training corpus, but basically any time it picks a wrong word is an opportunity for the model to learn, and learn it does.


> People keep calling them "next token predictors", but clearly there is something more going on

I think this depends on what you mean by "something more going on".

Now, if someone says that it is "just" "next token prediction", in a dismissive way, I think that's an error.

But, while the RLHF'd ones aren't exactly trained just to match the observed distribution, but rather with the RLHF objective, it is nonetheless true that the model produces a probability distribution over possible next tokens, conditioned on the previous tokens, and samples from that. (There are also things done as part of the sampling on top of these conditional probabilities, beyond just sampling according to the probabilities at a given temperature; I don't know exactly how that part works, but I think it's mostly a trick to get a little more quality, not a major part of how the model behaves, and not part of the NN itself in any case.)
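
Concretely, that last part (the sampling knobs on top of the conditional distribution) is usually something like temperature scaling plus top-k or top-p truncation; a sketch, assuming logits is a 1-D tensor of scores over the vocabulary:

    # Sketch of turning a model's next-token logits into an actual choice.
    import torch

    def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
        logits = logits / max(temperature, 1e-6)                   # temperature rescales confidence
        kth = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))   # drop low-ranked tokens
        probs = torch.softmax(logits, dim=-1)                      # p(next token | previous tokens)
        return torch.multinomial(probs, num_samples=1).item()      # sample one token id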


> People keep calling them "next token predictors", but clearly there is something more going on, and I would love for someone to give a simple explanation.

Starting from a point of outputting random gibberish, the only feedback these models are given during training is whether their next word prediction was right or wrong (i.e. the same as the next word in the training sample they are being fed). So, calling these models "next word predictors" is technically correct from that point of view - this is their only "goal" and the only feedback they are given.

Of course, what these models can accomplish, reflecting what they have learnt, is way more impressive than what one might naively expect from such a modest goal.

The simple, usual, and rather inadequate explanation for this mismatch between training goal and capability is that in order to get really, REALLY good at "predict the next word", you need to learn to understand the input extremely well. If the input is "1+2=" then the model needs to have learnt math to predict the next word and get it right. If the input is a fairy tale, then it needs to learn to recognize that, and learn how to write fairy tales.

This is how these LLMs' "predict the next word" goal turns into a need for them to learn "everything about everything" in order to minimize their training error.

The question of course then becomes: how do they do it? We are training them on pretty much everything on the internet, so there's plenty to learn from, but only giving them some extremely limited feedback ("no, that's not the correct next word"), so what magic is inside them that lets them learn so well?!

Well, the magic is a "transformer", a specific (and surprisingly simple) neural network architecture, but this is pretty much where the explanation ends. It's relatively easy to describe what a transformer does - e.g. learning which parts of its input to pay attention to when predicting the next word, and doing this in a very flexible way using "keys" that it learns and can search for in the input - but it is extremely hard to explain how this mechanism lets it learn what it does. Interpreting what is really going on inside a transformer is an ongoing research area.
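
The "keys it learns and can search for" part is, mechanically, just scaled dot-product attention; a bare-bones sketch (learned projection matrices omitted):

    # Bare-bones scaled dot-product attention: each position forms a query,
    # scores it against every position's key, and takes a weighted mix of values.
    import numpy as np

    def attention(Q, K, V):                              # all shaped (seq_len, d)
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # query/key match scores
        scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ V

    # In a real transformer, Q, K and V are learned linear projections of the
    # token embeddings, and the result feeds the next-token prediction.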

I think that maybe the best that can be said is that the transformer designers stumbled upon (I'm not sure they were predicting ahead of time how powerful it would be) an extremely powerful and general type of sequence processor, and one that appears to be very well matched to how we ourselves generate and recognize language. Maybe there is some insight to be learnt there in terms of how our own brains work.


> Starting from a point of outputting random gibberish, the only feedback these models are given during training is whether their next word prediction was right or wrong (i.e. the same as the next word in the training sample they are being fed). So, calling these models "next word predictors" is technically correct from that point of view - this is their only "goal" and the only feedback they are given.

This is true for pretraining - creating a "base model" - but it's not true for instruction tuning. There's a second stage (RLHF, DPO, whatever) where it's trained again with the objective being "take questions and generate answers" and from there "generate correct answers".

I would expect there could be further advancements where we actually program algorithms into transformers (which can be done) and then merge models with proven capabilities together rather than trying to train everything by example. Or emit tool-running tokens which can do unbounded computation.
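
The tool-token idea is roughly how function calling already works at the harness level: the model emits a marker, the runtime executes the call and splices the result back into the context. A toy sketch (the TOOL:/END markers and the generate() callback are made up here, just to show the loop):

    # Toy sketch of intercepting a tool-call marker during generation.
    # "generate", "TOOL:" and "END" are hypothetical stand-ins, not a real API.
    import re

    def run_tool(expr: str) -> str:
        # e.g. a calculator tool for arithmetic the model can't do reliably
        return str(eval(expr, {"__builtins__": {}}))     # demo only; never eval untrusted input

    def answer(prompt: str, generate) -> str:
        context = prompt
        while True:
            chunk = generate(context)                    # model writes until it stops
            m = re.search(r"TOOL:(.*?)END", chunk)
            if not m:
                return chunk                             # no tool call: we're done
            result = run_tool(m.group(1))
            # splice the tool result back in and let the model keep going
            context = context + chunk[: m.end()] + f"\nRESULT: {result}\n"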

> so what magic is inside them that lets them learn so well?!

Funny thing is there _are_ known limits to what it can do. In particular, it can't do reverse association from anything it learned going forwards. This is called the "reversal curse".

i.e. if you give GPT4 a line from a song it can tell you what the line after it is, but it's a lot worse at the line before it!


> This is true for pretraining - creating a "base model" - but it's not true for instruction tuning. There's a second stage (RLHF, DPO, whatever) where it's trained again with the objective being "take questions and generate answers" and from there "generate correct answers".

Yes, but those are essentially filters, applied after the base model has already learnt its world model. I think these are more about controlling what the model generates than what it learns, since you don't need much data for this.

> merge models with proven capabilities together rather than trying to train everything by example

Merging specialist LLMs is already a thing. I'm not sure how it works exactly, but it's basically merging weights post-training. Yannic Kilcher mentioned this in one of his recent YouTube videos.
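
The simplest variant ("model soup" style merging) really is just parameter averaging of models fine-tuned from the same base. A hedged sketch, assuming two PyTorch models with identical architectures and parameter names:

    # Naive weight merge of two fine-tunes that share a base model.
    # Real merge methods (SLERP, TIES, DARE, ...) are more careful than this.
    import torch

    def merge_state_dicts(sd_a, sd_b, alpha=0.5):
        return {name: alpha * sd_a[name] + (1 - alpha) * sd_b[name] for name in sd_a}

    # merged = merge_state_dicts(model_a.state_dict(), model_b.state_dict())
    # model_a.load_state_dict(merged)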

> if you give GPT4 a line from a song it can tell you what the line after it is, but it's a lot worse at the line before it!

I suppose a bidirectional transformer like BERT would handle this better, but generative language models are deliberately only using the past to predict the future, so this might be expected. Some short term memory (an additional "context" persisting across tokens) would presumably help.
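
"Only using the past" is enforced with a causal attention mask, which is the main structural difference from a BERT-style bidirectional model; a quick illustration:

    # Causal vs. bidirectional attention masks for a 4-token sequence
    # (True = this position is allowed to be attended to).
    import torch

    T = 4
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # GPT-style: token t sees tokens <= t
    bidirectional = torch.ones(T, T, dtype=torch.bool)       # BERT-style: every token sees all tokens
    print(causal)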


Does Quiet-STaR [0] address the association issue, i.e. reasoning forward using past learning?

[0] https://arxiv.org/abs/2403.09629


No; it can reason backwards from things it found in context, just not things trained into the model. If you have lines A, B, C there's no association in the model back from C to B. I don't think this can be solved by better reasoning.

A proposed solution I saw recently was to feed every training document in backwards as well as forwards.
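
That proposal amounts to a data-augmentation pass over the pretraining corpus; a sketch (word-level reversal is my assumption here - work on the reversal curse also experiments with reversing at the fact/entity level):

    # Augment a corpus with reversed copies of each document, so that
    # "A is followed by B" is also seen as "B is followed by A" in training.
    def augment_with_reversals(docs):
        for doc in docs:
            yield doc
            yield " ".join(reversed(doc.split()))        # word-level reversal (one possible choice)

    corpus = ["the quick brown fox jumps over the lazy dog"]
    print(list(augment_with_reversals(corpus)))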


So I understand how classical ML classification/regression works. I can see how, if you applied the same method to each word, you could produce sentences.

"Dogs are" -> "animals that..."

Where I'm confused is how this method would learn logic. I can imagine that after seeing a certain number of patterns - "dogs are animals", "birds are animals", "cats are animals" - it encodes a concept like "x are animals", which is connected to other concepts like "animals are y".

How is this encoded in the model? Can we see the abstraction it has formed?

"for every word I give you, reverse every alternating word".

How does this fit into next word prediction? It's not just a matter of seeing what a reversal or an alternating word is; it has to actually compute these things. That can't just be predicting the next word iteratively.
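
For comparison, the task itself is trivial to state as an ordinary program, which is exactly what makes it a good probe: the model has to perform an equivalent computation while emitting one token at a time, rather than retrieving anything it has memorized. One reading of the instruction:

    # "Reverse every alternating word", written as a plain program, for
    # comparison with what the model must effectively compute token by token.
    def reverse_alternating(text: str) -> str:
        words = text.split()
        return " ".join(w[::-1] if i % 2 else w for i, w in enumerate(words))

    print(reverse_alternating("please reverse every alternating word here"))
    # -> "please esrever every gnitanretla word ereh"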

I imagine it's a system of different models, with a "master" model trained to direct prompts to the right type of model to use.


I've seen successful projects that add a constraint solver as a layer in a neural network, so it's potentially something that could be integrated at an even deeper level than our current finetuning for tool use.

It's not a priority for the current big model architecture, but there's a bunch of stuff we could be doing with network architecture.



