I was completely surprised that the reasoning comes from within the model. When using gpt-o1 I thought it was actually some optimized multi-prompt chain hidden behind an API endpoint.

Something like: collect some thoughts about this input; review the thoughts you created; create more thoughts if needed or provide a final answer; ...
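
To make that concrete, here's a minimal sketch (in Python) of the kind of orchestration loop I had imagined. call_model is just a hypothetical stand-in for whatever chat-completion call you like, and the prompts and stopping rule are made up for illustration, not how o1 actually works:

    def call_model(prompt: str) -> str:
        # hypothetical stand-in: plug in your chat-completion API of choice
        raise NotImplementedError

    def orchestrated_reasoning(question: str, max_rounds: int = 5) -> str:
        thoughts: list[str] = []
        for _ in range(max_rounds):
            # 1. collect some thoughts about the input
            thoughts.append(call_model(
                f"Question: {question}\nExisting thoughts:\n" + "\n".join(thoughts)
                + "\nAdd one more step of reasoning:"))
            # 2. review the thoughts created so far
            verdict = call_model(
                "Review these thoughts. Reply DONE if they are enough to answer:\n"
                + "\n".join(thoughts))
            if "DONE" in verdict:
                break
        # 3. provide a final answer conditioned on the accumulated thoughts
        return call_model(
            f"Question: {question}\nThoughts:\n" + "\n".join(thoughts)
            + "\nFinal answer:")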

I think the reason it works is also that chain-of-thought (CoT), in the original paper by Denny Zhou et al., worked from "within". The observation was that if you get the model to do CoT, answers get better.

Later on, the community did SFT on such chains of thought. Arguably, R1 shows that was a distraction, and that a clean RL reward would have been better suited.
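
For what it's worth, a "clean RL reward" here can be as simple as scoring only the final answer (and the output format), never the trace itself. A rough sketch, where the answer tags and the exact scoring are my simplifications rather than R1's actual recipe:

    import re

    def outcome_reward(completion: str, gold_answer: str) -> float:
        """Reward the final answer (and the format), never the reasoning itself."""
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if match is None:
            return 0.0  # no parseable answer: no reward (format matters)
        return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0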


One big question will be whether chain of thought within the embedding space will work better than in the token space.

This recent paper is relevant: https://arxiv.org/abs/2412.06769
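
Rough sketch of the idea, loosely in the spirit of that paper (Coconut): instead of sampling a token at each step, feed the last hidden state back in as the next input embedding for a few "latent" steps, and only drop back to tokens at the end. The model choice and step count are arbitrary here, and the paper's actual training procedure is omitted:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # small stand-in model; any causal LM would do
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    prompt = "Q: If I have 3 apples and eat one, how many are left? A:"
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(ids)  # token space -> embedding space

    with torch.no_grad():
        for _ in range(4):  # a few "latent thoughts", no tokens sampled
            out = model(inputs_embeds=embeds, output_hidden_states=True)
            last_hidden = out.hidden_states[-1][:, -1:, :]    # final layer, last position
            embeds = torch.cat([embeds, last_hidden], dim=1)  # keep reasoning in embedding space
        # only now drop back into token space to read out an answer
        next_id = model(inputs_embeds=embeds).logits[:, -1, :].argmax(-1)
        print(tok.decode(next_id))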

Do you understand why RL is better than SFT for training on reasoning traces?

I always assumed the reason is that you are working with the pretrained model rather than against it. Whatever "logic" rules or functions the model came up with to compress (make more sense of) the vast amounts of pretraining data, it then reuses those same functions during RL. Of course, distillation from a strong, huge model might still help more than RL applied directly to the small model, because the strong model came up with much better functions/reasoning during pretraining, which the small model can simply copy. These models learn in ways quite different from most humans, so human-based SFT can only go so far.

SFT forces the model to output _that_ reasoning trace you have in the data. RL allows any reasoning trace and only penalizes the model if it does not reach the correct answer.
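
Toy version of the difference between the two objectives (names and shapes are illustrative, not anyone's actual training code):

    import torch
    import torch.nn.functional as F

    def sft_loss(logits: torch.Tensor, trace_ids: torch.Tensor) -> torch.Tensor:
        # SFT: maximize the likelihood of *this exact* reasoning trace, token by token.
        return F.cross_entropy(logits.view(-1, logits.size(-1)), trace_ids.view(-1))

    def reinforce_loss(sampled_logprobs: torch.Tensor, reward: float) -> torch.Tensor:
        # RL (REINFORCE flavor): the sampled trace is unconstrained; only the scalar
        # reward for the final answer scales the policy update.
        return -(reward * sampled_logprobs.sum())

In practice you'd use a baseline or group-normalized advantage (GRPO, as in R1) rather than the raw reward, but the asymmetry is the point: in RL the trace tokens are never compared against a reference trace.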


