Many of my PhD and postdoc colleagues who emigrated from Korea, China, and India, and who didn't have English as their medium of instruction, would struggle with this question. They only recover when you give them a hint, and they're some of the smartest people I know. If you stop trying to stump these models with trick questions and ask straightforward reasoning questions instead, they are extremely performant (o1 is definitely a step up, though not revolutionary in my testing).
I live in one of the countries you mentioned and just showed this to one of my friends, a local who struggles with English. They had no problem concluding that the doctor was the child's dad. Full disclosure: they assumed the doctor was pretending to be the child's dad, which is also a perfectly sound answer.
The claim was that "it knows english at or above a level equal to most fluent speakers". If the claim is that it's very good at producing reasonable responses to English text, posing "trick questions" like this would seem to be a fair test.
Does fluency in English make someone good at solving trick questions? I usually don't even bother trying, mostly because trick questions don't fit my definition of entertaining.
No, it's necessary to either know that it's a trick question or to have a feeling that it is, based on context. The entire point of a question like that is to trick your understanding.
You're tricking the model because it has seen this specific trick question a million times and shortcuts to its memorized solution. Ask it literally any other question, as subtle as you want, and the model will pick up on the intent, as long as you don't try to mislead it.
I mean, I don't even get how anyone thinks this means literally anything. I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand; they simply did what any human does and made predictions based on similar questions.
> I can trick people who have never heard of the trick with the 7 wives and 7 bags and so on. That doesn't mean they didn't understand
They could fail because they didn't understand the language, didn't have a good enough memory to hold all the steps, or couldn't reason through it. We could pose more questions to probe which reason is more plausible.
The trick with the 7 wives and 7 bags and so on is that no long reasoning is required. You just have to notice one part of the question that invalidates the rest and not shortcut to doing arithmetic because it looks like an arithmetic problem. There are dozens of trick questions like this and they don't test understanding, they exploit your tendency to predict intent.
But sure, we could ask more questions, and that's what we should do. If we do that with LLMs, we quickly see that once we leave the basin of the memorized answer by rephrasing the problem, the model solves it. We would also see that we can ask the model billions of questions and it understands us just fine.
Some people solve trick questions easily simply because they are slow thinkers who pay attention to every question, even non-trick questions, and don't fast-path the answer based on its similarity to a past question.
Interestingly, people who make bad fast-path answers often call these people stupid.
It does mean something. It means that the model is still leaning more on memorization than on independently evaluating a question apart from the body of knowledge it has amassed.
No, that's not a conclusion we can draw, because there isn't much to do with this specific trick question other than memorize the answer. That's why it's a trick question: it goes against expectations, and therefore against the generalized intuitions you have about the domain.
We can see that it doesn't memorize much at all by simply asking other questions that do require subtle understanding and generalization.
You could ask the model to walk you through an imaginary environment, describing your actions. Or you could simply talk to it, quickly noticing that for any longer conversation it becomes impossibly unlikely to be found in the training data.
Its knowledge is broad and general; it does not have insight into the specifics of a person's discussion style. Many humans struggle to pick up on sarcasm, for instance. It's hard to fault the model for not being in alignment with the speaker and their strangely phrased riddle.
It answers better when told "solve the below riddle".
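For what it's worth, here's a minimal sketch of how you could check that framing effect yourself, assuming the OpenAI Python client with an API key in the environment; the model name is illustrative and the riddle text is left as a placeholder to paste in.

```python
# Compare the model's answer with and without the "solve the below riddle" framing.
# Assumes the OpenAI Python client (pip install openai) and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

RIDDLE = "..."  # paste the riddle being discussed here


def ask(prompt: str) -> str:
    # Send a single user message and return the model's text reply.
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name; swap in whichever you're testing
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


# Bare question vs. the explicitly framed version.
print(ask(RIDDLE))
print(ask("Solve the below riddle:\n\n" + RIDDLE))
```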
I think you have particularly dumb colleagues, then. If you pose this question to an average STEM PhD in China (not even one from China, one in China), they'll get it right.
This question is the "unmisleading" version of a very common misleading question about sexism. ChatGPT has learned the original, misleading version so well that it can't answer the unmisleading one.
Humans who don't have the original version ingrained in their brains will answer it with ease. It's not even a tricky question to humans.
In general, LLMs seem to function more reliably when you use pleasant language and good manners with them. I assume this is because the same bias also shows up in the training data.