So now not only are the models closed, but so are their evals?! This is a "semi-private" eval. WTH is that supposed to mean? I'm sure the model is great but I refuse to take their word for it.
The private evaluation set is private from the public/OpenAI so companies can't train on those problems and cheat their way to a high score by overfitting.
If the models run on OpenAIs servers then surely they could still see the questions being put into it if they wanted to cheat? That could only be prevented by making the evaluation a one-time deal that can't be repeated, or by having OpenAI distribute their models for evaluators to run themselves, which I doubt they're inclined to do.
Yes that's why it is "semi"-private: From the ARC website "This set is "semi-private" because we can assume that over time, this data will be added to LLM training data and need to be periodically updated."
I presume evaluation on the test set is gated (you have to ask ARC to run it).
the evals are the question/answers, ARC-AGI doesn't share the questions and answers for a portion so that models can't be trained on them, the public ones... the public knows the questions so theres a chance they could have been at least partially been trained on the question (if not the actual answer).