I don’t disagree with you, but I always think the “they’re just predicting the next token” argument kind of misses the magic for the sideshow.
Yes, they do, but in order to do that, LLMs soak up the statistical regularities of just about every sentence ever written across a wide swath of languages, and from that infer underlying concepts common to all languages. That in turn, if you subscribe at least partially to the Sapir-Whorf hypothesis, means LLMs do encode concepts of human cognition.
Predicting the next token is simply a task that requires an LLM to find and learn these structural elements of our language and hence thought, and thus serves as a good error function to train the underlying network. But it’s a red herring when discussing what LLMs actually do.
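To make the "error function" point concrete, here is a minimal sketch of a next-token cross-entropy loss in plain Python. The three-word vocabulary, the logits, and the function names are all made up for illustration; real training code works on tensors and far larger vocabularies, so treat this as a toy under those assumptions.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average cross-entropy between predicted distributions and the true next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy vocabulary: index 0 = "the", 1 = "cat", 2 = "sat".
# Context "the" should be followed by "cat", then "sat".
targets = [1, 2]

# A model that has learned the pattern puts high scores on the right tokens.
confident = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
# A model that has learned nothing scores every token equally.
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_confident = next_token_loss(confident, targets)
loss_uniform = next_token_loss(uniform, targets)
print(loss_confident, loss_uniform)
```

The uniform model's loss is ln(3), about 1.10, and the confident model's loss is much smaller. The loss only measures how well the next token was guessed; anything the network has to learn in order to guess well is a side effect of driving that number down.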
I am disappointed your comment did not get more responses, because I'm very interested in deconstructing this argument I've heard over and over again ("it just predicts the next words in the sentence").
Yet explanations of how GPT-style LLMs work describe a layering of structures: the first levels encode some understanding of syntax, grammar, etc., and as more transformer layers are added, contextual and logical meanings are eventually encoded.
I really want to see a developed conversation about this.
Zooming out, what are we humans even doing? We're processing the current inputs to determine what best to do in the present, the near future, or even the far future. Sometimes, in a more relaxed space (say a "brainstorming" meeting), we relax our prediction capabilities to the point that our ideas come from a hallucination realm if no boundaries are imposed.
LLMs mimic these things in the spoken language space quite well.
> ... means LLMs do encode concepts of human cognition
AND
> ... do encode structural elements of our language and hence thought
Quite true. I think the trivial "proof" that what you are saying is correct is that a significantly smaller model can generate sentence after sentence of fully grammatical nonsense. Therefore the additional information encoded into the larger network must be knowledge and not syntax (word order).
Similarly, when too much quantization is applied, the output starts to resemble that of a grammatical sentence generator and is harder to mistake for intelligence.
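As an extreme illustration of the "tiny model, fluent nonsense" end of the spectrum, here is a sketch of a bigram (one-word-of-context) generator. It is not a small LLM; it is a much simpler stand-in, and the corpus and function names are invented for the example.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model

def generate(model, start, length, seed=0):
    """Sample a word sequence, always picking a word seen after the previous one."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    out = [start]
    for _ in range(length - 1):
        followers = model.get(out[-1])
        if not followers:  # dead end: this word never had a successor
            break
        out.append(rng.choice(followers))
    return " ".join(out)

# Hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the dog sat on the rug and the cat saw the dog"
model = build_bigram_model(corpus)
sample = generate(model, "the", 12)
print(sample)
```

Every adjacent pair of words in the output has been seen in the corpus, so the text is locally plausible, but nothing ties the sentence together beyond one word of context. Whatever makes a larger model's output coherent has to come from something more than this kind of word-order statistic.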
I make the argument about LLMs being a time series predictor because they happen to be a predictor that does something that is a bit magical from the perspective of humans.
In the same way that pesticides convincingly mimic the chemical signals that creatures use to make decisions, LLMs convincingly produce output that feels to humans like intelligence and reasoning.
Future LLMs will be able to convincingly create the impression of love, loyalty, and many other emotions.
Humans too know how to feign reasoning and emotion and to detect bad reasoning, false loyalty, etc.
Last night I baked a batch of gingerbread cookies with a recipe suggested by GPT-4. The other day I asked GPT-4 to write a dozen more unit tests for a code library I am working on.
> just about every sentence ever written across a wide swath of languages
I view LLMs as a new way that humans can access/harness the information of our civilization. It is a tremendously exciting time to be alive to witness and interact with human knowledge in this way.
I listened to a radio segment last week where the hosts were lamenting that Europe was able to pass AI regulation but the US Congress was far from doing so. The fear and hype are fueling a reaction to a problem that IMO does not exist. There is no AI. What we have is a wonder of what can be achieved through LLMs, but it's still a tool rather than a being. Unfortunately there's a lot of money to be made pitching it as the latter.