So now not only are the models closed, but so are their evals?! This is a "semi-...

ZeroCool2u · 2024-12-20T18:37:53 1734719873

The private evaluation set is kept from both the public and OpenAI, so companies can't train on those problems and overfit their way to a high score.

jsheard · 2024-12-20T18:45:36 1734720336

If the models run on OpenAI's servers, then surely they could still see the questions being fed to them if they wanted to cheat? That could only be prevented by making the evaluation a one-time deal that can't be repeated, or by having OpenAI distribute their models for evaluators to run themselves, which I doubt they're inclined to do.

foobarqux · 2024-12-20T19:04:27 1734721467

Yes, that's why it is "semi"-private. From the ARC website: "This set is 'semi-private' because we can assume that over time, this data will be added to LLM training data and need to be periodically updated."

I presume evaluation on the test set is gated (you have to ask ARC to run it).

cchance · 2024-12-20T18:38:52 1734719932

The evals are the questions and answers. ARC-AGI doesn't share the questions and answers for a portion of them, so that models can't be trained on them. For the public ones, the public knows the questions, so there's a chance models could have been at least partially trained on the questions (if not the actual answers).

That's how I understand it.