What's your opinion on the validity of this benchmark comparison, given that o3 was fine-tuned and the others were not? Can you give more details on how much data was used to fine-tune o3? It's hard to put the result into perspective given this confounder.
I can’t provide more information than is currently public, but from the ARC post you’ll note that we trained on about 75% of the public training set (which contains 400 examples total, so roughly 300 tasks), which is within the ARC rules, and evaluated on the semi-private set.
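For a sense of scale, here's a minimal sketch of what a 75% split of the 400 public training tasks could look like. The paths assume the layout of the public fchollet/ARC-AGI repository; the actual subset and fine-tuning pipeline OpenAI used are not public, so this is purely illustrative:

```python
import json
import random
from pathlib import Path

# Assumed path: the public ARC-AGI repo ships the 400 public training
# tasks as individual JSON files under data/training/.
TRAIN_DIR = Path("ARC-AGI/data/training")

task_files = sorted(TRAIN_DIR.glob("*.json"))  # 400 public training tasks
random.seed(0)
random.shuffle(task_files)

# Illustrative 75% split: ~300 tasks used for fine-tuning, the rest held out.
cutoff = int(len(task_files) * 0.75)
finetune_files, holdout_files = task_files[:cutoff], task_files[cutoff:]

finetune_tasks = [json.loads(f.read_text()) for f in finetune_files]
print(f"{len(finetune_tasks)} tasks for fine-tuning, "
      f"{len(holdout_files)} held out")

# Scoring would then happen on ARC's semi-private evaluation set, which is
# not publicly distributed, so it is not loaded here.
```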
That's completely understandable; leveraging the training set is within the rules. But my point is that the comparison is against models that were run zero-shot and not tuned on ARC at all. It isn't apples to apples; it's apples to orchards.