scarmig 89 days ago | on: An analysis of DeepSeek's R1-Zero and R1

Probably it's something like "give feedback that's on average slightly more correct than incorrect," though you'd get more signal from perfect feedback.

That said, I suspect the signal is very weak even today and probably not too useful except for learning about human stylistic preferences.
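
To put rough numbers on the "slightly more correct than incorrect" idea, here is a minimal sketch. It assumes each piece of feedback is an independent binary label that is correct with probability p, which is a simplification and not a claim about how any real training pipeline treats feedback. The function names `bits_per_label` and `majority_correct_prob` are made up for illustration.

```python
from math import comb, log2


def bits_per_label(p: float) -> float:
    """Information (in bits) carried by one binary label that is correct
    with probability p, modeled as a binary symmetric channel: 1 - H(p)."""
    if p in (0.0, 1.0):
        return 1.0  # a deterministic label carries a full bit
    entropy = -(p * log2(p) + (1 - p) * log2(1 - p))
    return 1.0 - entropy


def majority_correct_prob(p: float, n: int) -> float:
    """Exact probability that a majority of n independent labels is correct,
    when each label is correct with probability p. Requires odd n (no ties)."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority always exists")
    k_min = n // 2 + 1  # smallest number of correct labels that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    # Compare barely-better-than-chance feedback with perfect feedback.
    for p in (0.55, 0.75, 1.0):
        print(f"p={p}: {bits_per_label(p):.4f} bits per label")
    # Aggregating many weak labels still recovers the right answer eventually.
    for n in (1, 101, 1001):
        print(f"n={n}: majority correct with prob {majority_correct_prob(0.55, n):.4f}")
```

Under this toy model, feedback that is only a little better than a coin flip carries a small fraction of a bit per label, so you need a lot of it before the aggregate becomes reliable, while perfect feedback gives a full bit each time.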