OpenAI was never caught cheating on it, because we didn’t cheat on it.
As with any eval, you have to take our word for it, but I’m not sure what more we can do. Personally, if I learned that an OpenAI researcher had purposely or accidentally trained on it, and we didn’t quickly disclose this, I’d quit on the spot and disclose it myself.
(I work at OpenAI.)
Generally, I don’t think anyone here is cheating, and I think we’re relatively diligent with our evals. The gray zone where things could go wrong is the differing levels of care used in scrubbing training data of equivalent or similar problems. At some point the line between learning and memorizing becomes blurry. If an MMLU question asks about an abstract algebra proof, is it cheating to have trained on papers about abstract algebra?
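To make "scrubbing" concrete: the common baseline is an exact n-gram overlap check between training documents and eval questions. Here's a minimal sketch of that idea; the function names, the 13-gram window, and the threshold are made up for illustration, not any lab's actual pipeline:

    import re

    def ngrams(text, n=13):
        # Normalize to lowercase words and collect word-level n-grams.
        words = re.findall(r"\w+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def is_contaminated(train_doc, eval_questions, n=13, threshold=1):
        # Flag a training document if it shares at least `threshold`
        # n-grams with any eval question. This catches verbatim leakage,
        # not paraphrases or merely "similar" problems.
        doc_grams = ngrams(train_doc, n)
        for question in eval_questions:
            if len(doc_grams & ngrams(question, n)) >= threshold:
                return True
        return False

The gray zone is everything a check like this misses: paraphrases, translated problems, or questions that teach the same trick without sharing a single n-gram.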
The false statement is “OpenAI got caught [gaming FrontierMath] a while back.”
From a primary source:
"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."
Why are you being disingenuous? Simply having access to the eval in question is already enough for your synthetics people to match its distribution. Of course you don't contaminate the training set directly; that would be stupid, and you'd get caught. But if the eval informs the reward, the result is the same. You _should_ quit, but you wouldn't, because you've already convinced yourself you're doing RL God's work, not sleight of hand.
> If an MMLU question asks about an abstract algebra proof, is it cheating to have trained on papers about abstract algebra?
This kind of disingenuous bullshit is exactly why people call you cheaters.
> Generally, I don’t think anyone here is cheating and I think we’re relatively diligent with our evals.
You guys should follow Apple's cult guidelines: never stand out. Think different
"OpenAI did not use FrontierMath data to guide the development of o1 or o3, at all.... we only downloaded FrontierMath for our evals long after the training data was frozen, and only looked at o3 FrontierMath results after the final announcement checkpoint was already picked."
The guidelines apply to everyone. I didn't like that part of their comment either, and the comment would have been better without it, but they were put on the defensive by what had come before and they went on to explain their position. Your phrase "got caught with your hand in the cookie jar" is a swipe that the thread could also have done without. It's no big deal, and it applies to everyone on the subthread; we want everyone to avoid barbs like that on HN.
It happens with basically all benchmark papers, on all topics. Benchmarks are useful when they are first introduced and used to measure models that were released before the benchmark existed. After that, their usefulness rapidly declines.