zby 5 months ago | on: TinyZero: Reproduction of DeepSeek R1 Zero in coun...

I think this is where the relative rewards come into play: they sample many thinking traces for the same problem and reward the ones that turn out to be correct. This works right at the model's current 'cutting edge', on problems it solves only some of the time, which is exactly where it can be improved.
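
To make the "relative" part concrete, here is a rough sketch, assuming the group-normalized scheme from GRPO (the method DeepSeek R1 Zero was trained with): each trace's reward is compared to the mean of its own group and scaled by the group's standard deviation. The function name and the 0/1 reward values are made up for illustration, not taken from TinyZero.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled trace relative to the other traces for the same prompt.

    Assumption: GRPO-style normalization, (reward - group mean) / group std.
    eps avoids division by zero when every trace gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for 4 traces per prompt: 1.0 = correct, 0.0 = wrong.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])      # sometimes solved
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # already mastered
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])  # out of reach

print(mixed)      # correct traces positive, wrong traces negative
print(all_right)  # all zeros: nothing to learn from
print(all_wrong)  # all zeros: nothing to learn from
```

Under this assumption, only the mixed group produces a nonzero signal. Prompts the model always gets right or always gets wrong contribute nothing, so the learning pressure lands on the problems it can solve some of the time.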