The conclusion may be obvious to you and me (although it's hard to know for certain, since the available LLMs are black boxes). But it's definitely not obvious to everyone. There are plenty of people saying this is the dawn of AGI, or that we're a few short steps from AGI. Whereas people like Gary Marcus (who knows tons more than I do) say LLMs are going off in the wrong direction.
Yes, LLMs can't reason 100% correctly, but neither can humans. We often reason correctly, but not always.
Even reasoning, fundamental as it is, comes from feedback. Feedback from our actions teaches us how to reason. Learning from feedback is more general than reasoning - AI agents can definitely learn this way too, if they have enough freedom to explore. But you can't do it with supervised training sets alone.
You need to put language models into agents in environments, give them goals and rewards. Then they can make their own training data and their own mistakes, and build up their own experience. You can't teach an AI from how people make mistakes; it needs to fix its own mistakes, and that means deploying it in the wild, where errors have consequences.
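To make the shape of that loop concrete, here's a rough sketch. Everything in it (ToyEnv, the policy stub, the experience buffer) is a made-up stand-in, not any particular framework's API; the point is the structure, not the contents.

```python
# A minimal sketch of the agent-in-an-environment loop described above.
# ToyEnv, policy, and the experience buffer are illustrative stand-ins.
import random
from dataclasses import dataclass, field

@dataclass
class ToyEnv:
    """Trivial environment: the agent must guess a hidden digit."""
    target: int = field(default_factory=lambda: random.randint(0, 9))

    def step(self, action: int) -> float:
        # Reward is the only feedback; there are no labelled examples.
        return 1.0 if action == self.target else 0.0

def policy(observation: str) -> int:
    """Stand-in for an LLM proposing an action given a prompt."""
    return random.randint(0, 9)

experience = []  # the agent's own training data, built from its own mistakes
for episode in range(100):
    env = ToyEnv()
    obs = "Guess the hidden digit (0-9)."
    action = policy(obs)
    reward = env.step(action)
    experience.append((obs, action, reward))
    # A real system would periodically update the model on `experience`
    # (e.g. a policy-gradient or RLHF-style step); omitted here.
```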
If you remember, DeepMind first tried to train a Go model on human game play, but that approach was limited. Then they started from scratch, and learning from feedback alone the model surpassed human level, even though the feedback was a single bit of information at the end of each whole self-play game. And it had no pre-training prior to learning Go, unlike human players.
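To spell out what "a single bit at the end of a whole game" means as a training signal, here's an illustrative sketch; the names and the alternating-players assumption are mine, not DeepMind's code.

```python
# Sketch of how one terminal outcome becomes a training signal for every
# position in a self-play game (AlphaZero-style value targets).
# `game_positions` is a placeholder for real self-play output; assume the
# first player moved at index 0 and the players alternate.
def value_targets(game_positions: list, first_player_won: bool) -> list:
    """Turn one bit of end-of-game feedback into a value label for
    every position, from the perspective of the player to move."""
    winner = 1.0 if first_player_won else -1.0
    targets = []
    for i, position in enumerate(game_positions):
        player_to_move = 1.0 if i % 2 == 0 else -1.0
        targets.append((position, winner * player_to_move))
    return targets

# Example: a 3-move game won by the first player yields labels +1, -1, +1.
print(value_targets(["p0", "p1", "p2"], first_player_won=True))
```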
That presupposes that language encodes the world we live in completely, whereas in fact language is meaningless without the shared context of reality. 'up' and 'down' are only meaningful to an intelligence that can experience space.
Essentially, LLMs are just oracles for the shadows on the wall of Plato's cave.
The LLMs do indeed deal with Plato's shadows, but so do we - what we "see", after all, is not the actual up or down, but a series of neural activations from our retinas (which aren't even 3D, so concepts like "behind" are only captured by proxy). Such activations can all be readily encoded into tokens, which is exactly what models specifically trained to describe images do.
Do a reverse Chinese room experiment - remove from a human all the abilities multi-modal LLMs gain after training on human media. What's left? Homo ferus.
Most of our intelligence is encoded in the environment and language, it's a collective process, not an individual one. We're collectively, not individually, very smart.
Of course, you're saying that LLMs can only train on textual data, whereas we're now developing multimodal AI that takes visual, audio, and other kinds of sensor data and turns it into actionable information.
TLDR: Internal LLM representations correspond to an understanding of the visual world. We've all seen the Othello example, which is too constrained a world to mean much, but even more interesting is that LLMs can caption tokenized images with no pretraining on visual tasks whatsoever. Specifically: pass an image to an encoder-decoder vision model trained in a completely unsupervised manner on images -> take the encoded representation -> pass that representation to an LLM as tokens -> get accurate captions. The tests were done on GPT-J, which is not multimodal and only has about 6bn params. The only caveat is that a linear mapping model needs to be trained to map the vector space of the encoder-decoder model to the embedding space of the language model, but this isn't doing any conceptual labour; it's only needed to align the completely arbitrary coordinate axes of the vision and language models, which were trained separately (akin to an American and a European agreeing on whether to use metric or imperial; neither's conception of the world changes).
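Roughly, the setup looks like the sketch below. This is my reconstruction, not the authors' code: GPT-J's 4096-wide embeddings are real, but the 1024-dim vision features and the prompt handling are assumptions.

```python
# Sketch of the linear-mapping setup described above (illustrative only).
import torch
import torch.nn as nn

VISION_DIM = 1024   # feature width of the frozen, unsupervised image encoder (assumed)
LLM_DIM = 4096      # GPT-J's token-embedding width

# The only trained component: one linear layer aligning two coordinate systems.
# No nonlinearity, so no room for it to do conceptual labour of its own.
projector = nn.Linear(VISION_DIM, LLM_DIM)

def build_caption_inputs(image_features: torch.Tensor,
                         prompt_embeds: torch.Tensor) -> torch.Tensor:
    """image_features: (num_patches, VISION_DIM) from the frozen vision model.
    prompt_embeds:   (prompt_len, LLM_DIM) embeddings of a text prompt such
    as 'A picture of'. The result is fed to the frozen LLM as input embeddings
    (soft tokens), and the LLM's ordinary decoding produces the caption."""
    visual_tokens = projector(image_features)              # (num_patches, LLM_DIM)
    return torch.cat([visual_tokens, prompt_embeds], dim=0)

# Toy usage with dummy tensors in place of real encoder/LLM outputs:
dummy_image_feats = torch.randn(196, VISION_DIM)   # e.g. 14x14 patches
dummy_prompt = torch.randn(4, LLM_DIM)
soft_prompt = build_caption_inputs(dummy_image_feats, dummy_prompt)  # (200, 4096)

# Training updates only `projector`, with the LLM's usual next-token loss on
# (image, caption) pairs; both the vision model and the LLM stay frozen.
```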
It's not intuitive, but it's hard to argue with these results. Even small LLMs can caption images. Sure, they don't get the low-level details like the texture of grass, but they get the gist.
I keep reading your sort of analysis, but honestly, those priors need updating. I had to update mine when I learned this. If 6bn params can do it, 175bn params with multimodality certainly can.
It's true that humans need symbol grounding, but we also don't see hundreds of billions of token sequences. There are theoretical reasons (cf. category theory) why this kind of cross-modal alignment could work, albeit probably limited to gist rather than detail.
The real question isn't whether the LLM can reason.
The question is whether an assembly of components, one of which is an LLM (others would include memory and whatever else is needed to make it a self-contained loop with a notion of self-identity), can reason.
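Something shaped like this, say. All the names are hypothetical, and llm_complete is a placeholder for whatever model you'd actually call; the interesting part is that memory and a persistent identity live outside the LLM, in the surrounding loop.

```python
# Sketch of the "assembly" framing: the LLM is one component inside a loop
# that also holds memory and a persistent self-description.
from dataclasses import dataclass, field

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to some language model."""
    return "noop"

@dataclass
class Agent:
    identity: str = "You are agent-0, pursuing the user's goal."
    memory: list = field(default_factory=list)

    def act(self, observation: str) -> str:
        # Whatever reasoning happens belongs to the whole loop, not the LLM
        # alone: the prompt stitches together identity, recent memory, and
        # the new observation.
        prompt = "\n".join([self.identity, *self.memory[-10:], observation])
        action = llm_complete(prompt)
        self.memory.append(f"obs: {observation} -> act: {action}")
        return action

agent = Agent()
agent.act("The user asks: what's 2 + 2?")
```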