
Could someone explain how the RL works here? I don't understand how it can be a training objective with an LLM.

> To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:

> Accuracy rewards: The accuracy reward model evaluates whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases.

> Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between ‘<think>’ and ‘</think>’ tags.
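
To make the rule-based rewards concrete, here's a minimal sketch of what the two reward functions could look like. This is my own illustration, not the paper's code: the \boxed{} extraction regex and the 1:1 weighting between the two rewards are guesses.

    import re

    def format_reward(response: str) -> float:
        # 1.0 if the reasoning is wrapped in <think>...</think> tags, else 0.
        return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

    def accuracy_reward(response: str, ground_truth: str) -> float:
        # The paper requires the final answer in a specified format; here I
        # assume LaTeX \boxed{...}, then compare against the known answer.
        match = re.search(r"\\boxed\{([^}]*)\}", response)
        if match is None:
            return 0.0
        return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

    def total_reward(response: str, ground_truth: str) -> float:
        # Combined rule-based reward; the 1:1 weighting is an assumption.
        return format_reward(response) + accuracy_reward(response, ground_truth)

Because the reward comes from deterministic rules (string matching, compilers, test cases) rather than a learned reward model, there's no separate reward model to train and no neural judge to reward-hack.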

This is a post-training step that aligns an existing pretrained LLM. The state space is the set of all possible contexts, and the action space is the set of tokens in the vocabulary. The training data is a set of math/programming questions with unambiguous, easily verifiable right and wrong answers. RL updates the model's weights so that its output distribution favors tokens likely to lead to a correctly formatted right answer.
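
Here's a heavily simplified sketch of that loop. The paper actually uses GRPO, which normalizes rewards over a group of sampled completions per prompt; this is plain single-sample REINFORCE with a placeholder model and hyperparameters, reusing total_reward from the sketch above.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder; any causal LM works for the sketch
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

    def reinforce_step(prompt: str, ground_truth: str):
        # "State" is the context so far; "action" is the next token. Sample
        # a full completion, score it with the rule-based reward, and nudge
        # the policy toward (or away from) every token it chose.
        inputs = tokenizer(prompt, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]
        with torch.no_grad():
            sequences = model.generate(
                **inputs, do_sample=True, max_new_tokens=256,
                pad_token_id=tokenizer.eos_token_id,
            )
        response = tokenizer.decode(sequences[0, prompt_len:], skip_special_tokens=True)
        reward = total_reward(response, ground_truth)  # rule-based, as above

        # Recompute log-probs of the sampled tokens with gradients enabled.
        # Logits at position t predict the token at position t+1.
        logits = model(sequences).logits[:, :-1, :]
        log_probs = torch.log_softmax(logits, dim=-1)
        chosen = sequences[:, 1:].unsqueeze(-1)
        token_log_probs = log_probs.gather(-1, chosen).squeeze(-1)
        completion_log_prob = token_log_probs[:, prompt_len - 1:].sum()

        # REINFORCE: maximize reward-weighted log-likelihood of the sample.
        loss = -reward * completion_log_prob
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The key point is that the reward is only observed for the whole completion, and the gradient update spreads that signal across all the token choices that produced it.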

(Not an expert, this is my understanding from reading the paper.)
