they’re simply statistical systems predicting the likeliest next words in a sentence
They are far from "simply": for that "miracle" to happen (we still don't fully understand why this approach works so well, I think, since we don't really understand what the model learns from its data), they have a HUGE number of relationships encoded in their weights, and AFAIK for each token ALL the available relationships need to be processed -- hence the importance of huge memory speed and bandwidth.
And I fail to see why our human brains couldn't be doing something very, very similar with our language capability.
So beware of what we are calling a "simple" phenomenon...
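To make the "for each token, all relationships get processed" point concrete, here is a toy sketch (my own illustration, not any real model's code) of single-head attention: generating each new token requires reading the keys and values of every previous token, which is why memory bandwidth ends up mattering so much at inference time.

```python
import numpy as np

def attend(query, keys, values):
    """Toy single-head attention for ONE new token.

    The new token's query is compared against the keys of ALL past
    tokens, so the entire key/value cache must be streamed from memory
    at every generation step -- a bandwidth-bound workload.
    """
    scores = keys @ query / np.sqrt(query.size)   # one score per past token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over ALL past tokens
    return weights @ values                       # weighted mix of past values

rng = np.random.default_rng(0)
d = 4                                             # tiny hypothetical dimension
keys = rng.normal(size=(10, d))                   # 10 cached past tokens
values = rng.normal(size=(10, d))
out = attend(rng.normal(size=d), keys, values)
print(out.shape)  # (4,)
```

The cache grows with every generated token, so the memory traffic per step grows with context length even though the arithmetic per comparison stays tiny.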
Onus of proof fallacy (basically "find the idea I'm referring to yourself"). You might want to clarify or distill your point from that publication without requiring someone to read through it.
A simple statistical system based on a lot of data can arguably still be called a simple statistical system (because the system as such is not complex).
Last time I checked, a GPT is not something simple at all... I'm not the weakest at understanding maths (I coded a fairly advanced 3D engine from scratch a long time ago), and it still looks really complex to me. And we keep adding features on top of it that I can hardly follow...
It's not even true in a facile way for non-base models, since those systems are further trained with RLHF -- i.e., the models are trained not just to produce the most likely token, but also to produce "good" responses, as judged by a reward model that was itself trained on human preference data.
Of course, even just within the regime of "next-token prediction", the choice of training data influences what is learned, and to do a good job of predicting the next token, the model necessarily builds a rich internal representation of the world described by that training set.
See e.g. the fascinating report on Golden Gate Claude (1).
Another way to think about this: let's say you're a human who doesn't speak any French, and you are kidnapped, held in a cell, and subjected to repeated "predict the next word" tests in French. You would not be able to get good at these tests, I submit, without also learning French.
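For anyone unfamiliar with what "predict the next word" means mechanically, here's a minimal sketch (made-up vocabulary and scores, not any real model): the model outputs one score per word in its vocabulary, and the next word is sampled from the softmax of those scores. RLHF doesn't change this mechanism; it shifts the scores so that "good" continuations become more likely.

```python
import numpy as np

# Hypothetical 5-word vocabulary with made-up logits for illustration.
vocab = ["the", "cat", "sat", "mat", "paris"]
logits = np.array([2.0, 0.5, 1.0, 0.1, -1.0])

# Softmax: turn raw scores into a probability for each candidate next word.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sample the next word in proportion to those probabilities.
rng = np.random.default_rng(0)
next_word = vocab[rng.choice(len(vocab), p=probs)]
print(next_word)
```

Doing well at this game over a huge corpus is exactly the "test" in the kidnapping analogy: assigning high probability to the right continuation requires modeling what the text is actually about.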