Yes, you're right. In their paper I think they say that sampling multiple traces and then taking relative rewards is supposed to Monte Carlo approximate the value network. I don't fully have the intuition for that, but it does make sense that rather than simply nudging probabilities toward the trace with the highest absolute reward, you want to favor the trace with the best reward relative to the current state. For a quick intuition: if the absolute rewards for the traces were {0, 0, 0, 0.01}, then using absolute rewards would give only a weak signal for the last trace (a nudge proportional to 0.01 * logprob), whereas using relative rewards (z-scores) gives it a much stronger one (roughly 1.5 * logprob).
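To make that arithmetic concrete, here's a minimal sketch (assuming the usual group-normalization formula, (r - mean) / std; the variable names are just illustrative, not from the paper) contrasting the two weightings on that reward set:

```python
import numpy as np

rewards = np.array([0.0, 0.0, 0.0, 0.01])  # one reward per sampled trace

# Absolute rewards: each trace's logprob gradient is weighted by its raw reward,
# so the only non-zero nudge is proportional to 0.01 * logprob.
abs_weights = rewards

# Group-relative rewards: normalize within the group of traces sampled from the
# same prompt, so the best trace is rewarded for beating its siblings.
advantages = (rewards - rewards.mean()) / (rewards.std(ddof=1) + 1e-8)

print("absolute weights:", abs_weights)    # [0.   0.   0.   0.01]
print("relative advantages:", advantages)  # ~ [-0.5 -0.5 -0.5  1.5]
```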
Not only that: with {0, 0, 0, 0.01}, the probability of getting any reward in a single shot would be very low. I also have the intuition that rewarding traces that are right at the edge of being correct is more efficient, because the model only needs a small perturbation to get them right. If you gave negative rewards to traces that are very far from being right, the model might be steered in the wrong direction.