It's not even true in a facile way for non-base models, since those systems are further trained with RLHF -- i.e., the models are trained not just to produce the most likely next token, but also to produce "good" responses, as judged by a reward model that was itself trained on human preference data.
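To make the contrast concrete, here's a minimal made-up sketch in plain Python -- nothing resembling real training code. `ToyLM`, `ToyRewardModel`, and both objectives are stand-ins I'm inventing purely to show the shape of "predict the next token" versus "maximize a reward model's score":

```python
# Toy sketch only: the point is the shape of the two objectives, not realism.
import math
import random

class ToyLM:
    """Stand-in language model with uniform next-token probabilities."""
    def __init__(self, vocab):
        self.vocab = list(vocab)
    def prob(self, next_token, context):
        return 1.0 / len(self.vocab)                 # pretend probability
    def generate(self, prompt, length=5):
        return [random.choice(self.vocab) for _ in range(length)]

class ToyRewardModel:
    """Stand-in for the reward model trained on human preference data."""
    def score(self, prompt, response):
        return -abs(len(response) - 5)               # pretend "goodness" score

def next_token_loss(model, tokens):
    # Pretraining objective: average negative log-likelihood of each token
    # given the tokens before it.
    return -sum(math.log(model.prob(tokens[i], tokens[:i]))
                for i in range(1, len(tokens))) / (len(tokens) - 1)

def rlhf_objective(model, prompt, reward_model):
    # RLHF-style objective: sample a whole response, then score it with the
    # reward model (something to maximize, rather than a token-level loss).
    response = model.generate(prompt)
    return reward_model.score(prompt, response)

lm = ToyLM(vocab=["the", "cat", "sat"])
print(next_token_loss(lm, ["the", "cat", "sat"]))    # plain next-token objective
print(rlhf_objective(lm, ["the"], ToyRewardModel())) # reward-model objective
```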
Of course, even just within the regime of "next token prediction", the choice of training data influences what is learned. To do a good job of predicting the next token, the model necessarily has to build a rich internal understanding of the world described by the training set.
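As a toy illustration of how next-token prediction by itself soaks up the structure of the training data, here's a counting-based bigram predictor (again, the corpus and everything else here is made up for illustration):

```python
# Hedged toy example: a bigram "next token predictor" trained by counting.
from collections import Counter, defaultdict

corpus = "the golden gate bridge spans the golden gate strait".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1      # the "knowledge" lives entirely in these counts

def predict_next(prev):
    # Most likely next token given the previous one, per the training data.
    return counts[prev].most_common(1)[0][0] if counts[prev] else None

print(predict_next("golden"))   # -> "gate", learned purely from the corpus
```

The only way this thing gets good at its one job -- predicting the next word -- is by internalizing regularities of whatever text it was trained on.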
See e.g. the fascinating report on Golden Gate Claude (1).
Another way to think about this: say you're a human who doesn't speak any French, and you are kidnapped, held in a cell, and subjected to repeated "predict the next word" tests in French. You would not be able to get good at these tests, I submit, without also learning French.
(1) https://www.anthropic.com/news/golden-gate-claude