At this point I really only take rigorous research papers in to account when considering this stuff. Apple published research just this month that the parent post is referring to. A systematic study is far more compelling than an anecdote.
That study shows 4o, o1-mini and o1-preview's new scores are all within margin error on 4/5 of their new benchmarks(some even see increases). The one that isn't involves changing more than names.
Changing names does not affect the performance of Sota models.
>That study very clearly shows 4o, o1-mini and o1-preview's new scores are all within margin error on 4/5 of their new benchmarks.
Which figure are you referring to? For instance figure 8a shows a -32.0% accuracy drop when an insignificant change was added to the question. It's unclear how that's "within the margin of error" or "Changing names does not affect the performance of Sota models".
Table 1 in the Appendix. GSM-No-op is the one benchmark that sees significant drops for those 4 models as well (with preview dropping the least at -17%).
No-op adds "seemingly relevant but ultimately inconsequential statements". So "change names, performance drops" is decidedly false for today's state of the art.
Thanks. I wrongly focused on the headline result of the paper rather than the specific claim in the comment chain about "changing name, different results".
Only if there isn’t a systemic fault, eg bad prompting.
Their errors appear to disappear when you correctly set the context from conversational to adversarial testing — and Apple is actually testing the social context and not its ability to reason.
I’m just waiting for Apple to release their GSM-NoOp dataset to validate that; preliminary testing shows it’s the case, but we’d prefer to use the same dataset so it’s an apples-to-apples comparison. (They claim it will be released “soon”.)
https://machinelearning.apple.com/research/gsm-symbolic