TinyZero (github.com/jiayi-pan)
200 points by fzliu 1 day ago | 27 comments





So to my understanding, this work reproduces DeepSeek R1's reinforcement learning mechanism in a very small language model.

The AI gets "rewards" (like points) for doing two things correctly:

Accuracy: Getting the right answer. For example, math answers must be in a specific format (e.g., inside a box) so a computer can easily check them. For coding problems, test cases verify if the code works.

Format: Using the <think> and <answer> tags properly. This forces the AI to organize its responses clearly.

So in this case, the training program can extract the model's answer by parsing the <answer> tag, then evaluate whether it's correct or not: if it is, give a reward; otherwise, no reward.

Sample N such answers for a single question and build an array of N rewards. That is enough for the RL algorithm to guide the model toward getting smarter.
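
Here's a minimal sketch of that kind of rule-based reward in Python. The tag regexes, reward values, and function names are my own assumptions for illustration, not TinyZero's actual code:

    import re

    # Illustrative rule-based reward: a format reward for proper <think>/<answer>
    # tags plus an accuracy reward for matching the ground-truth answer.
    def compute_reward(response: str, ground_truth: str) -> float:
        has_format = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                                    response, re.DOTALL))
        format_reward = 0.1 if has_format else 0.0  # small bonus; value is made up

        match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        answer = match.group(1).strip() if match else ""
        accuracy_reward = 1.0 if answer == ground_truth.strip() else 0.0

        return format_reward + accuracy_reward

    # Score N sampled responses for one question, giving the N-element reward
    # array the RL algorithm works from.
    def rewards_for_group(responses: list[str], ground_truth: str) -> list[float]:
        return [compute_reward(r, ground_truth) for r in responses]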


I've been trying to follow the literature on PPO/GRPO as applied to LLMs. From what I understand, since the reward is only given once the entire CoT sequence is sampled, traditional RL techniques would require some form of credit assignment to distribute that reward among individual tokens – which is where the critic/value network comes in, right?

Instead, DeepSeek (with GRPO) seems to just omit that value function entirely and use only sparse rewards. How does this end up being more efficient, since I thought the sparse nature of rewards makes it harder to converge to the optimal policy?


I don't think it's only using sparse rewards, because of the format rewards. The training recipe is pretty comprehensive and involves multiple stages.[1] The paper mentions that when only using the RL technique, the output is often not suitable for reading (language mixing, etc.). That feels like an AlphaZero moment for LLMs?

[1]: https://www.reddit.com/r/LocalLLaMA/comments/1i8rujw/notes_o...


The R1 paper says that they didn't use "process reward modeling". And the paper that introduced GRPO says that it can be used either with "outcome supervision" or "process supervision", with outcome supervision "only provid[ing] a reward at the end of each output". Put together, doesn't that imply R1 uses sparse rewards provided only at the end of the CoT sequence?

Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1. Your "sparse reward" means only providing reward at the end of each output.

> Ah sorry, you might be right. I meant "sparse reward" as a reward system that is mostly 0 but occasionally 1.

Did we introduce the abusive pressure of Korean educational culture to machines?


I think the reward is relative to other sampled answers for the same question. This way the signal is strong right at the margin of what is possible with a given model, and there is less noise from impossible or too-easy questions.

There is some confusion - because they do compute that simple reward, but then they convert it to a relative value and call it advantage. And I think they use that advantage to update the model - not the base reward.


Yes, you're right. In their paper I think they say the process of sampling multiple traces and then taking relative rewards is supposed to Monte Carlo approximate the value network? I don't really have the intuition for that, but it does make sense that rather than simply nudging probabilities in the direction of the trace with the highest absolute reward, you want to favor the trace which had the best reward relative to the current state. E.g. for quick intuition: if absolute rewards for traces were {0, 0, 0, 0.01}, then using absolute rewards would only give a weak signal for the last trace (nudge weights proportional to 0.01 * logprob), but using relative rewards (based on z-score) would give a much stronger one, roughly 1.5 * logprob.
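
A small sketch of that group-relative normalization, assuming a GRPO-style mean/std normalization over the sampled group (I use the sample standard deviation so the numbers match the example above; the epsilon is my own addition to avoid division by zero):

    import statistics

    # Group-relative advantages: normalize each trace's reward by the mean and
    # standard deviation of the group sampled for the same question.
    def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
        mean = statistics.mean(rewards)
        std = statistics.stdev(rewards)  # sample standard deviation
        return [(r - mean) / (std + eps) for r in rewards]

    print(group_advantages([0.0, 0.0, 0.0, 0.01]))
    # -> roughly [-0.5, -0.5, -0.5, 1.5]: the single rewarded trace gets a strong
    #    positive signal while the others get a mild negative one.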

Not only that - if you have {0, 0, 0, 0.01}, then the probability that you would get any reward in one shot would be very low. And I also have the intuition that giving the rewards to traces at the edge is more efficient - because the model needs only a small perturbation to get it right. If you gave negative rewards to traces that are very far from being right, then the model might be steered in a wrong direction.

It looks like the 'old-school' RL to me, which makes me wonder why it took so long to get here

Nothing like acronyms to make me feel dumb and ill-informed.


The part I found strange: these RL formulations give no reward for incorrect solutions, so unless there are training examples that are easy enough for the base model to solve, the RL process won’t do anything.

So is the actual magic that the base models are good enough to sometimes generate successful CoT output in their unmodified state? Or did I miss something in the R1 paper and the code here?


I think this is where the relative rewards come into play - they sample many thinking traces and reward those that are correct. This works at the current 'cutting edge' for the model - exactly where it could be improved.

I was wondering the same thing. I feel there is too large of a gap between a raw base model and a model that produces fully correct answers and follows a specific format. My guess is their rule-based reward system is more nuanced than just correctness and format.

Yeah I find this part not clearly expressed as well. My best guess is that it's not simply binary "correct/incorrect" but rather the reward is made up of multiple parts (e.g. format + correctness) and structured in a way such that "close enough" answers still get some reward. From there I would expect that a base model might at least be able to "autocomplete" the format/style, at which point RL machinery would kick in to tune it to properly obey the format, and once that's mastered eventually correctness.

They did mention something about tuning an un-SFT'd base model being much slower than 'warming it up' with some existing reasoning traces first.


The author notes in their Twitter announcement [a] that their model's reasoning abilities are only validated within the domain of their Countdown training material. They admit that the real test of this training method is whether it produces outputs that pass the sniff test in other subject domains, or even abstract reasoning. However, given that there are "standardized test style" abstract reasoning tests with relatively small corpora (e.g. ZebraLogic [b], on the order of 1000 or so cases), I do think they missed an opportunity to… do _some_ small benchmark for abstract reasoning before the announcement.

[a] https://threadreaderapp.com/thread/1882839370505621655.html - thanks @Tepix

[b] https://huggingface.co/blog/yuchenlin/zebra-logic


Unrolled non-X link with the announcement: https://threadreaderapp.com/thread/1882839370505621655.html

What does it mean to reproduce DeepSeek R1-Zero? Like they have a model of equivalent performance? Is there a simple explanation of this post for those who aren't machine learning experts?

Also is the technique here related at all to the technique people think DeepSeek themselves used, where they apparently trained the model using OpenAI outputs?


Reminds me of an old Polish encyclopedia: horse - everyone knows what a horse is

https://en.wikipedia.org/wiki/Nowe_Ateny


R1-Zero is trained differently than most reasoning models, such as the "normal" R1 model, in terms of which steps are done during training. TinyZero applies the same approach (but only on a subset of use cases) to a much smaller model, to show it works on much smaller models as well.

The details of how it's trained differently start to get into "machine learning expert" territory, but you can get a decent high-level view via a casual read-through of the DeepSeek link if you want to dive deeper.


Reproducing the AlphaZero-like "model learns to reason on its own without supervised fine-tuning" phenomenon that DeepSeek-R1-Zero exhibited.

Could you provide a source for training the model on OpenAI outputs? I can't find any news about that.

I don't have a source to share, but I saw this claim on social media a few times in the last couple days, where people said their conversation with the model revealed that it thought it was some other OpenAI model. I have no idea how such training can work using another model's output, but I saw comments claiming that this is why their training was so cheap.

I think there are 2 levels in the brain.

One is used for programming, the other for language. Doing them in parallel fails for some reason.

A lot of GH projects just don't have a solid explanation - I don't know what they built.


> What does it mean to reproduce DeepSeek R1-Zero?

means it's reproducible


Westerners can't reproduce Chinese geniuses


