Have you read the "Sparks of AGI" paper about GPT-4? It suggested that text alone can give an LLM a rich world model, based on the TikZ drawings of a unicorn that got progressively better as GPT-4 precursors were trained on more and more data (and, interestingly, the drawings got worse after the model was RLHF'd for safety).
Yes, of course, it's very possible that scaling alone solved the problem, but the model is so good that I wonder whether they actually did something different and pre-trained it on image tokens as well.