Yes, you're right. In their paper I think they say that sampling multiple traces and then taking relative rewards is supposed to Monte Carlo approximate the value network. I don't fully have the intuition for that, but it does make sense that rather than simply nudging probabilities toward the trace with the highest absolute reward, you want to favor the trace with the best reward relative to the current state. For a quick intuition: if the absolute rewards for the traces were {0, 0, 0, 0.01}, then using absolute rewards would give only a weak signal for the last trace (a nudge proportional to 0.01 * logprob), whereas using relative rewards (z-scores) gives it a much stronger one (roughly 1.5 * logprob).
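To make that arithmetic concrete, here's a minimal sketch (assuming the usual group-normalization formula, (r - mean) / std; the variable names are just illustrative, not from the paper) contrasting the two weightings on that reward set:

```python
import numpy as np

rewards = np.array([0.0, 0.0, 0.0, 0.01])  # one reward per sampled trace

# Absolute rewards: each trace's logprob gradient is weighted by its raw reward,
# so the only non-zero nudge is proportional to 0.01 * logprob.
abs_weights = rewards

# Group-relative rewards: normalize within the group of traces sampled from the
# same prompt, so the best trace is rewarded for beating its siblings.
advantages = (rewards - rewards.mean()) / (rewards.std(ddof=1) + 1e-8)

print("absolute weights:", abs_weights)    # [0.   0.   0.   0.01]
print("relative advantages:", advantages)  # ~ [-0.5 -0.5 -0.5  1.5]
```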
Not only that: with {0, 0, 0, 0.01}, the probability of getting any reward in a single shot would be very low. I also have the intuition that rewarding traces that are right at the edge of being correct is more efficient, because the model only needs a small perturbation to get them right. If you gave negative rewards to traces that are very far from being right, the model might be steered in the wrong direction.