I was completely surprised that the reasoning comes from within the model. When using gpt-o1 I thought it was actually some optimized multi-prompt chain hidden behind an API endpoint.
Something like: collect some thoughts about this input; review the thoughts you created; create more thoughts if needed or provide a final answer; ...
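Concretely, I pictured an orchestration loop like the sketch below (just an illustration of what I imagined; `generate` is a placeholder for any single call to a plain, non-reasoning model, not a real API):

```python
from typing import Callable

def answer_with_hidden_chain(question: str,
                             generate: Callable[[str], str],
                             max_rounds: int = 5) -> str:
    """Drive a plain model with a fixed prompt chain: think, review, repeat, answer."""
    thoughts: list[str] = []
    for _ in range(max_rounds):
        # collect some thoughts about this input
        thoughts.append(generate(
            f"Question: {question}\nThoughts so far:\n" + "\n".join(thoughts)
            + "\nAdd the next reasoning step:"))
        # review the thoughts you created
        verdict = generate(
            "Are these thoughts sufficient to answer the question? "
            "Reply DONE or CONTINUE.\n" + "\n".join(thoughts))
        # create more thoughts if needed, or stop and provide a final answer
        if "DONE" in verdict.upper():
            break
    return generate(
        f"Question: {question}\nThoughts:\n" + "\n".join(thoughts)
        + "\nFinal answer:")
```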
I think part of the reason it works is that chain-of-thought (CoT), in the original paper by Denny Zhou et al., also worked from "within". The observation was that if you do CoT, answers get better.
Later on, the community did SFT on such chains of thought. Arguably, R1 shows that was a distraction, and a clean RL reward would've been better suited.
I always assumed the reason is that you are working with the pretrained model rather than against it. Whatever “logic” rules or functions the model came up with to compress (make more sense of) the vast amounts of pretraining data, it then uses those same functions during RL. Of course, distillation from a strong, huge model might still help more than RL applied directly to the small model, because the strong model came up with much better functions/reasoning during pretraining, which the small model can simply copy. These models all learn differently from most humans, so human-written SFT data can only go so far.
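And the copying itself is nothing fancy: as far as I understand the R1 distillation recipe, the big model just writes out full reasoning traces and the small model does plain SFT on them. A minimal sketch of the data-building step (`teacher_generate` is a placeholder callable, not any real API):

```python
from typing import Callable

def build_distillation_set(questions: list[str],
                           teacher_generate: Callable[[str], str]) -> list[dict]:
    """The strong model writes full reasoning traces; the small model is then
    fine-tuned (plain SFT) on these pairs and inherits the teacher's 'functions'."""
    dataset = []
    for q in questions:
        trace = teacher_generate(f"Solve step by step, then state the final answer:\n{q}")
        dataset.append({"prompt": q, "completion": trace})
    return dataset
```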
SFT forces the model to output _that_ specific reasoning trace you have in the data. RL allows any reasoning trace and only penalizes it if it does not reach the correct final answer.
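A toy way to see the difference in the training signal (the tensor shapes and the `Answer:` extraction are illustrative assumptions, not any particular codebase):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, reference_trace: torch.Tensor) -> torch.Tensor:
    """SFT: cross-entropy against _that_ trace from the dataset.
    Every single token of the reference trace is enforced."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           reference_trace.view(-1))

def outcome_reward(sampled_text: str, gold_answer: str) -> float:
    """Outcome-based RL reward: the sampled trace itself is unconstrained;
    only whatever follows the final 'Answer:' is checked."""
    final = sampled_text.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final == gold_answer else 0.0
```

In practice that scalar reward then feeds a policy-gradient update (GRPO in R1's case), but the point stands: no token of the trace itself is directly supervised.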