> Mercury gets this right - while as of right now ChatGPT 4o get it wrong.
This is so common a puzzle it's discussed all over the internet. It's in the data used to build the models. What's so impressive about a machine that can spit out something easily found with a quick web search?
There is sufficient stochasticity in LLMs to invalidate most comparisons at this level. Minor changes in the prompt text, or even repeated runs of the identical prompt on the same model, will produce different results (depending on temperature and other sampling parameters), let alone runs across different models.

Try re-running your test on the same model multiple times with the identical prompt, or with small variations of it. Depending on how much context the service you choose keeps across a conversation, the behavior can change. Something as simple as following an incorrect response with a request to try again because the result was wrong can produce a different answer.

Statistically, the model will eventually hit on the right combination of vectors and generate the right words from the training set, and as I noted before, this problem has a very high probability of being in the training data used to build all the easily available models.
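If you want to see the run-to-run variation for yourself, here's a rough sketch of the kind of loop I mean, assuming the OpenAI Python SDK with an API key in the environment; the model name, prompt text, run count, and temperature are placeholders you'd swap for whatever you're actually testing. Each call starts a fresh, context-free conversation, so any differences between runs come from sampling alone:

```python
# Minimal sketch: re-run the identical prompt N times and compare the answers.
# Assumes the OpenAI Python SDK (openai >= 1.0) and OPENAI_API_KEY set in the env.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

PROMPT = "paste the puzzle text here"  # placeholder
N_RUNS = 10                            # placeholder

answers = []
for _ in range(N_RUNS):
    resp = client.chat.completions.create(
        model="gpt-4o",      # or whichever model you're comparing
        temperature=1.0,     # lower this to see less run-to-run variance
        messages=[{"role": "user", "content": PROMPT}],
    )
    answers.append(resp.choices[0].message.content)

# Eyeball how much the responses differ from run to run.
for i, a in enumerate(answers, 1):
    print(f"--- run {i} ---\n{a}\n")
```

Even a quick loop like that usually shows the same model wandering between right and wrong answers on these puzzle prompts, which is why a single run per model doesn't tell you much.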