I was completely surprised that the reasoning comes from within the model. When using gpt-o1 I thought it was actually some optimized multi-prompt chain hidden behind an API endpoint.
Something like: collect some thoughts about this input; review the thoughts you created; create more thoughts if needed or provide a final answer; ...
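Concretely, I pictured an orchestration loop like the sketch below (just an illustration of what I imagined; `generate` is a placeholder for any single call to a plain, non-reasoning model, not a real API):

```python
from typing import Callable

def answer_with_hidden_chain(question: str,
                             generate: Callable[[str], str],
                             max_rounds: int = 5) -> str:
    """Drive a plain model with a fixed prompt chain: think, review, repeat, answer."""
    thoughts: list[str] = []
    for _ in range(max_rounds):
        # collect some thoughts about this input
        thoughts.append(generate(
            f"Question: {question}\nThoughts so far:\n" + "\n".join(thoughts)
            + "\nAdd the next reasoning step:"))
        # review the thoughts you created
        verdict = generate(
            "Are these thoughts sufficient to answer the question? "
            "Reply DONE or CONTINUE.\n" + "\n".join(thoughts))
        # create more thoughts if needed, or stop and provide a final answer
        if "DONE" in verdict.upper():
            break
    return generate(
        f"Question: {question}\nThoughts:\n" + "\n".join(thoughts)
        + "\nFinal answer:")
```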
I think part of the reason it works is that chain-of-thought (CoT), in the original paper by Denny Zhou et al., also worked from "within". The observation was that if you do CoT, answers get better.
Later on, the community did SFT on such chains of thought. Arguably, R1 shows that was a distraction, and a clean RL reward would've been better suited.
I always assumed the reason is that you are working with the pretrained model rather than against it. Whatever “logic” rules or functions the model came up with to compress (make more sense of) the vast amounts of pretraining data, it then uses those same functions during RL. Of course, distillation from a strong, huge model might still help more than RL applied directly to the small model, because the strong model came up with much better functions/reasoning during pretraining, which the small model can simply copy. These models all learn differently from most humans, so human-written SFT data can only go so far.
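And the copying itself is nothing fancy: as far as I understand the R1 distillation recipe, the big model just writes out full reasoning traces and the small model does plain SFT on them. A minimal sketch of the data-building step (`teacher_generate` is a placeholder callable, not any real API):

```python
from typing import Callable

def build_distillation_set(questions: list[str],
                           teacher_generate: Callable[[str], str]) -> list[dict]:
    """The strong model writes full reasoning traces; the small model is then
    fine-tuned (plain SFT) on these pairs and inherits the teacher's 'functions'."""
    dataset = []
    for q in questions:
        trace = teacher_generate(f"Solve step by step, then state the final answer:\n{q}")
        dataset.append({"prompt": q, "completion": trace})
    return dataset
```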
SFT forces the model to output _that_ specific reasoning trace you have in the data. RL allows any reasoning trace and only penalizes it if it does not reach the correct final answer.
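A toy way to see the difference in the training signal (the tensor shapes and the `Answer:` extraction are illustrative assumptions, not any particular codebase):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, reference_trace: torch.Tensor) -> torch.Tensor:
    """SFT: cross-entropy against _that_ trace from the dataset.
    Every single token of the reference trace is enforced."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           reference_trace.view(-1))

def outcome_reward(sampled_text: str, gold_answer: str) -> float:
    """Outcome-based RL reward: the sampled trace itself is unconstrained;
    only whatever follows the final 'Answer:' is checked."""
    final = sampled_text.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final == gold_answer else 0.0
```

In practice that scalar reward then feeds a policy-gradient update (GRPO in R1's case), but the point stands: no token of the trace itself is directly supervised.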