ARC is a silly benchmark; the other results in math and coding are much more impressive.

o3 is just o1 scaled up. The main thing people should take away from this line of work is that we now have a proven way to RL our way to superhuman performance on tasks where sampling is cheap and the final output is easy to verify. Programming falls squarely in that category: they focused on known benchmarks, but the same process can be applied to ordinary programs, using parsers, compilers, existing functions and unit tests as verifiers (sketch below).
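
For concreteness, here's a minimal sketch of what "parsers, compilers and unit tests as verifiers" could look like. The function name and test format are my own illustration, not anything from OpenAI's actual pipeline:

    # Hypothetical verifier: syntax check via the parser/compiler first,
    # then unit tests. Reward depends only on observable behavior.
    def verify_candidate(source: str, func_name: str, tests) -> float:
        try:
            code = compile(source, "<candidate>", "exec")  # parser as first-stage verifier
        except SyntaxError:
            return 0.0
        namespace = {}
        try:
            exec(code, namespace)  # NB: sandbox untrusted model output in practice
            fn = namespace[func_name]
            # Unit tests as the final verifier.
            return float(all(fn(*args) == expected for args, expected in tests))
        except Exception:
            return 0.0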

Pre-o1, we only really had next-token prediction, which requires high-quality human-produced data. With o1, you optimize for success instead of for the MLE of the next token. In simpler terms: the model can get reward for any implementation of a function that reproduces the expected result, rather than only for the exact implementation that appears in the training set.
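
Concretely, with the sketch above, both of these (made-up) candidates earn full reward, even though neither needs to match a reference solution token for token:

    tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
    verify_candidate('def add(a, b): return a + b', 'add', tests)        # -> 1.0
    verify_candidate('def add(a, b): return sum([a, b])', 'add', tests)  # -> 1.0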

Put another way, it's just like RLHF, but instead of optimizing against learned human preferences, the model is trained to satisfy a verifier.
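
As a toy illustration of that swap (my own bare-bones sketch, not how o1/o3 are actually trained), here's a REINFORCE loop where the reward signal comes from the verifier above instead of a learned preference model:

    import math, random

    candidates = [
        'def add(a, b): return a + b',  # behaviorally correct
        'def add(a, b): return a - b',  # wrong
        'def add(a, b): return a * b',  # wrong
    ]
    tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
    logits = [0.0] * len(candidates)
    lr = 0.5

    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        z = sum(exps)
        return [e / z for e in exps]

    for _ in range(200):
        probs = softmax(logits)
        i = random.choices(range(len(candidates)), weights=probs)[0]
        reward = verify_candidate(candidates[i], 'add', tests)  # verifier, not a reward model
        # REINFORCE for a categorical policy: d log p(i) / d logit_j = 1[j == i] - p(j)
        for j in range(len(logits)):
            logits[j] += lr * reward * ((1.0 if j == i else 0.0) - probs[j])

    # Probability mass ends up concentrated on the verified candidate.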

This should work just as well in VLA models for robotics, self-driving and computer agents.