> Mercury gets this right - while as of right now ChatGPT 4o get it wrong.
This is so common a puzzle it's discussed all over the internet. It's in the data used to build the models. What's so impressive about a machine that can spit out something easily found with a quick web search?
There is sufficient stochasticity in LLMs to invalidate most comparisons at this level. Minor changes in the prompt text, or even repeated runs of the identical prompt on the same model, will produce different results (depending on temperature and other sampling parameters), let alone runs across different models.

Try re-running your test on the same model multiple times with the identical prompt, or with small variations of it. Depending on how much context the service you choose keeps across a conversation, the behavior can change. Something as simple as following an incorrect response with a request to try again because the result was wrong can produce a different answer.

Statistically, the model will eventually hit on the right combination of vectors and generate the right words from the training set, and as I noted before, this problem has a very high probability of being in the training data used to build all the easily available models.
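If you want to see the run-to-run variation for yourself, here's a rough sketch of the kind of loop I mean, assuming the OpenAI Python SDK with an API key in the environment; the model name, prompt text, run count, and temperature are placeholders you'd swap for whatever you're actually testing. Each call starts a fresh, context-free conversation, so any differences between runs come from sampling alone:

```python
# Minimal sketch: re-run the identical prompt N times and compare the answers.
# Assumes the OpenAI Python SDK (openai >= 1.0) and OPENAI_API_KEY set in the env.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

PROMPT = "paste the puzzle text here"  # placeholder
N_RUNS = 10                            # placeholder

answers = []
for _ in range(N_RUNS):
    resp = client.chat.completions.create(
        model="gpt-4o",      # or whichever model you're comparing
        temperature=1.0,     # lower this to see less run-to-run variance
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers.append(resp.choices[0].message.content)

# Eyeball how much the responses differ from run to run.
for i, a in enumerate(answers, 1):
    print(f"--- run {i} ---\n{a}\n")
```

Even a quick loop like that usually shows the same model wandering between right and wrong answers on these puzzle prompts, which is why a single run per model doesn't tell you much.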