Hacker News

Do you have a source for this? That's interesting (if true).



They got the dataset from Epoch AI for one of the benchmarks and pinky swore that they wouldn't train on it.

https://techcrunch.com/2025/01/19/ai-benchmarking-organizati...


I don't see anything in the article about being caught. Maybe I missed something?


davidcbc is spreading fake rumors.

OpenAI was never caught cheating on it, because we didn’t cheat on it.

As with any eval, you have to take our word for it, but I'm not sure what more we can do. Personally, if I learned that an OpenAI researcher purposely or accidentally trained on it, and we didn't quickly disclose this, I'd quit on the spot and disclose it myself.

(I work at OpenAI.)

Generally, I don’t think anyone here is cheating and I think we’re relatively diligent with our evals. The gray zone where things could go wrong is differing levels of care used in scrubbing training data of equivalent or similar problems. At some point the line between learning and memorizing becomes blurry. If an MMLU question asks about an abstract algebra proof, is it cheating to have trained on papers about abstract algebra?
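The "scrubbing" the comment describes is commonly done with an n-gram overlap filter between training documents and benchmark questions. A minimal sketch of that idea (the function names, threshold, and approach here are illustrative, not any lab's actual pipeline):

```python
# Illustrative n-gram decontamination filter (a common heuristic,
# not any specific lab's pipeline): drop training documents that
# share a long-enough word n-gram with any benchmark question.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of n-word shingles in `text` (lowercased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(train_docs, eval_questions, n: int = 13):
    """Keep only training docs with no n-gram overlap with the eval set.

    A 13-gram threshold is a commonly cited heuristic. Note it only
    catches near-verbatim copies, not paraphrases or merely similar
    problems, which is exactly where the learning/memorizing line blurs.
    """
    eval_grams = set()
    for q in eval_questions:
        eval_grams |= ngrams(q, n)
    return [doc for doc in train_docs if not (ngrams(doc, n) & eval_grams)]
```

Exact-match filters like this are why the gray zone exists: a paper proving the same abstract algebra result in different words sails straight through.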


Your comment would be better without the personal swipe in the first sentence.

https://news.ycombinator.com/newsguidelines.html


>They got the dataset from Epoch AI for one of the benchmarks and pinky swore that they wouldn't train on it

Is anything here actually false or do you not like the conclusions that people may draw from it?


The false statement is “OpenAI got caught [gaming FrontierMath] a while back.”

From a primary source:

"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."

https://x.com/__nmca__/status/1882563755806281986


Well, only if we trust you. And you had the extra dataset when no one else did.

Why are you being disingenuous? Simply having access to the eval in question is already enough for your synthetic-data team to match its distribution. Of course you don't contaminate the training set directly; that would be stupid, and you would get caught. But if the eval informs the reward, the result is the same. You _should_ quit, but you won't, because you've already convinced yourself you're doing RL God's work, not sleight of hand.

> If an MMLU question asks about an abstract algebra proof, is it cheating to have trained on papers about abstract algebra?

This kind of disingenuous bullshit is exactly why people call you cheaters.

> Generally, I don’t think anyone here is cheating and I think we’re relatively diligent with our evals.

You guys should follow Apple's cult guidelines: never stand out. Think different.




Which part of my post was incorrect?

Sorry you got caught with your hand in the cookie jar, but you can take comfort in others doing the same, I guess.


Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

https://news.ycombinator.com/newsguidelines.html


What about the accusation that I was spreading false rumors? That's far more unkind.


The guidelines apply to everyone. I didn't like that part of their comment either, and the comment would have been better without it, but they were put on the defensive by what had come before and they went on to explain their position. Your phrase "got caught with your hand in the cookie jar" is a swipe that the thread could also have done without. It's no big deal, and it applies to everyone on the subthread; we want everyone to avoid barbs like that on HN.

The incorrect part is “OpenAI got caught [gaming FrontierMath] a while back.”

From a primary source:

"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."

https://x.com/__nmca__/status/1882563755806281986


This is like saying “oh no we didn’t cheat because they didn’t tell us the answer out loud” when someone winked at you instead.

Stop being obtuse.


Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.

https://news.ycombinator.com/newsguidelines.html


This happens with basically all benchmark papers, on all topics. Benchmarks are useful when they are first introduced and used to measure systems that were released before the benchmark existed. After that, their usefulness rapidly declines.




