Are they displaying reasoning, or the outcome of reasoning, leading you to a false conclusion?
Personally, I see ChatGPT say "water doesn't freeze at 27 degrees F" and think "how can it possibly do advanced reasoning when it can't do basic reasoning?"
I'm not saying it reasons reliably, at all (nor has much success with anything particularly deep: I think in a lot of cases it's dumber than a lot of animals in this respect). But it does a form of general reasoning which other, more focused AI efforts have generally struggled with, and it's a lot more successful than random chance. For example, see how ChatGPT can be persuaded to play chess. It will still try to make illegal moves sometimes, hallucinating pieces in the board state or otherwise losing the plot. But if you constrain it and only consider the legal moves, it'll usually beat the average person (i.e. someone who understands the rules but has very little experience), even if it'll be trounced by an experienced player. You can't do this just by memorisation or random guessing: chess goes off-book (i.e. into a game state that has never existed before) very quickly, so it must have some understanding of chess and how to reason about the moves to make, even if it doesn't color within the lines as well as a comparatively basic chess engine.
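For what it's worth, the "constrain it and only consider the legal moves" setup is easy to sketch with the python-chess library. This is just a minimal sketch: ask_llm_for_move() is a hypothetical stand-in for whatever API call you'd actually make (here it plays a random legal move so the snippet runs as-is); the point is the legality filter and the fallback.

    import chess
    import random

    def ask_llm_for_move(board: chess.Board) -> str:
        # Hypothetical stand-in for the actual LLM call. In a real test this
        # would send the game so far (PGN/FEN) to the model and return its
        # reply in SAN, e.g. "Nf3". Here it picks a random legal move so the
        # sketch runs end to end.
        return board.san(random.choice(list(board.legal_moves)))

    def next_move(board: chess.Board, retries: int = 3) -> chess.Move:
        # "Only consider the legal moves": keep asking until the proposal
        # parses and is legal on the current board, otherwise fall back.
        for _ in range(retries):
            try:
                return board.parse_san(ask_llm_for_move(board))
            except ValueError:  # garbled notation, hallucinated piece, illegal move
                continue
        return random.choice(list(board.legal_moves))

    board = chess.Board()
    while not board.is_game_over():
        board.push(next_move(board))
    print(board.result())

In an actual test you'd presumably make the fallback a re-prompt (e.g. listing the legal moves) rather than a random move, but the shape is the same.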
(Basically, I don't think there's a bright line here: saying "they can't reason" isn't very useful; it's more useful to talk about what kinds of things they can reason about, and how reliably. On one hand it's kind of amazing that this is an emergent behaviour of training on text prediction, but on the other hand, because prediction is the objective function of the training, it's a very fuzzy kind of reasoning and it's not obvious how to make it more rigorous or deeper in practice)
This is the most pervasive bait-and-switch when discussing AI: "it's general reasoning."
When you ask an LLM "what is 2 + 2?" and it says "2 + 2 = 4", it looks like it's recognizing two numbers and the addition operation, and performing a calculation. It's not. It's finding a common response in its training data and returning that. That's why you get hallucinations on any uncommon math question, like multiplying two random 5-digit numbers. It's not carrying out the logical operations, it's trying to extract an answer by next token prediction. That's not reasoning.
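If you want to check that claim yourself, the experiment is a few lines of Python. ask_llm() is a hypothetical placeholder for whatever client you actually use (it returns a dummy answer here so the sketch runs as-is); the comparison against the exact product is the part that matters:

    import random

    def ask_llm(prompt: str) -> str:
        # Hypothetical placeholder for a real LLM API call; returns a dummy
        # answer so the sketch runs. Swap in whatever client you use.
        return "0"

    trials, correct = 20, 0
    for _ in range(trials):
        a, b = random.randint(10_000, 99_999), random.randint(10_000, 99_999)
        reply = ask_llm(f"What is {a} * {b}? Reply with just the number.")
        try:
            correct += int(reply.replace(",", "").strip()) == a * b  # exact check
        except ValueError:
            pass  # non-numeric reply counts as wrong
    print(f"{correct}/{trials} exactly correct")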
When you ask "will water freeze at 27F?" and it replies "No, the freezing point of water is 32F", what's happening is that it's not recognizing the 27 and 32 are numbers, that a freezing point is an upper threshold, and that any temperature lower than that threshold will therefore also be freezing. It's looking up the next token and finding nothing about how 27F is below freezing.
Again, it's not reasoning. It's not exercising any logic. Its huge training data set and tuned proximity matching help it find likely responses, and when it seems right, that's because the token relationships already existed in the training data set.
That it occasionally breaks the rules of chess just shows it has no concept of those rules, only that the next token for a chess move is most likely legal, because most of its chess training data is of legal games, not illegal moves. I'm unsurprised to find that it can beat an average player if it doesn't break the rules: most chess information in the world is about better-than-average play.
If an LLM came up with a proof no one had seen before and it checks out, that doesn't prove it's reasoning either: it's still next token prediction that came up with it. It found token relationships no one had noticed before, but those were inherent in the training data, not the product of a reflective intelligence doing logic.
When we discuss things like reinforcement learning and chain of reasoning, what we're really talking about are ways of restricting/strengthening those token relationships. It's back-tuning of the training data. Still not doing logic.
Put more succinctly: if it came up with a new proof in math that was then verified, and you went back and said "no, that's wrong", it would immediately present a different proof, denying the validity of its first proof, because it didn't construct anything logical that it can stand on to say "no, I'm right".
These are all examples of how they're not very good at reasoning, not that they don't reason at all. Being a perfectly consistent logical process is not a requirement for reasoning.