OpenAI was never caught cheating on it, because we didn’t cheat on it.
As with any eval, you have to take our word for it, but I’m not sure what more we can do. Personally, if I learned that an OpenAI researcher had purposely or accidentally trained on it, and we didn’t quickly disclose this, I’d quit on the spot and disclose it myself.
(I work at OpenAI.)
Generally, I don’t think anyone here is cheating, and I think we’re relatively diligent with our evals. The gray zone where things could go wrong is the differing levels of care used in scrubbing training data of equivalent or similar problems. At some point the line between learning and memorizing becomes blurry. If an MMLU question asks about an abstract algebra proof, is it cheating to have trained on papers about abstract algebra?
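To make "scrubbing" concrete: the common baseline is an exact n-gram overlap check between training documents and eval questions. Here's a minimal sketch of that idea; the function names, the 13-gram window, and the threshold are made up for illustration, not any lab's actual pipeline:

    import re

    def ngrams(text, n=13):
        # Normalize to lowercase words and collect word-level n-grams.
        words = re.findall(r"\w+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def is_contaminated(train_doc, eval_questions, n=13, threshold=1):
        # Flag a training document if it shares at least `threshold`
        # n-grams with any eval question. This catches verbatim leakage,
        # not paraphrases or merely "similar" problems.
        doc_grams = ngrams(train_doc, n)
        for question in eval_questions:
            if len(doc_grams & ngrams(question, n)) >= threshold:
                return True
        return False

The gray zone is everything a check like this misses: paraphrases, translated problems, or questions that teach the same trick without sharing a single n-gram.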
The false statement is “OpenAI got caught [gaming FrontierMath] a while back.”
From a primary source:
"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."
Why are you being disingenuous? Simply having access to the eval in question is already enough for your synthetics people to match its distribution. Of course you don't contaminate the training set directly; that would be stupid, and you'd get caught. But if the eval informs the reward, the result is the same. You _should_ quit, but you wouldn't, because you've already convinced yourself you're doing RL God's work, not sleight of hand.
> If an MMLU question asks about an abstract algebra proof, is it cheating to have trained on papers about abstract algebra?
This kind of disingenuous bullshit is exactly why people call you cheaters.
> Generally, I don’t think anyone here is cheating and I think we’re relatively diligent with our evals.
You guys should follow Apple's cult guidelines: never stand out. Think different
"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."
The guidelines apply to everyone. I didn't like that part of their comment either, and the comment would have been better without it, but they were put on the defensive by what had come before and they went on to explain their position. Your phrase "got caught with your hand in the cookie jar" is a swipe that the thread could also have done without. It's no big deal, and it applies to everyone on the subthread; we want everyone to avoid barbs like that on HN.
It happens with basically all benchmark papers, on all topics. Benchmarks are useful when they are first introduced and used to measure models that were released before the benchmark existed. After that, their usefulness rapidly declines.