Yes and no. In their paper they said they trained two models. One is purely RL-based (R1-Zero), so that one is trained like you described, i.e. it has to stumble upon the correct answer. They found it performed well but had problems like repetition and language mixing.
The main R1 model was first fine-tuned on synthetic CoT data before going through RL, IIUC.
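For concreteness, here is a rough sketch of the difference in plain Python. This is my own illustration, not DeepSeek's actual code: the function names, the exact-match reward, and the "####" answer delimiter are all assumptions, but the shape matches the two recipes as described (pure RL from the base model vs. a supervised CoT warm-up followed by the same RL loop).

```python
# Illustrative sketch only -- not DeepSeek's code. It shows the two pipelines
# at a conceptual level: R1-Zero skips supervised fine-tuning entirely, while
# R1 is first fine-tuned on synthetic chain-of-thought data before RL.

def outcome_reward(model_output: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 only if the final answer matches.

    The model gets no credit for partial reasoning, which is why the
    pure-RL model has to 'stumble upon' correct answers at first.
    """
    answer = model_output.split("####")[-1].strip()  # assumed answer delimiter
    return 1.0 if answer == reference_answer.strip() else 0.0


def train_r1_zero(base_model, prompts, references, rl_step):
    # Pure RL from the base model: sample, score with the rule-based
    # reward, update. No supervised warm-up at all.
    model = base_model
    for prompt, ref in zip(prompts, references):
        sample = model(prompt)  # hypothetical generate call
        model = rl_step(model, prompt, sample, outcome_reward(sample, ref))
    return model


def train_r1(base_model, synthetic_cot, prompts, references, sft_step, rl_step):
    # R1: first fine-tune on synthetic CoT (prompt, reasoning + answer) pairs,
    # then run the same RL loop on the warmed-up model.
    model = base_model
    for prompt, target in synthetic_cot:
        model = sft_step(model, prompt, target)  # supervised warm-up
    return train_r1_zero(model, prompts, references, rl_step)
```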