
Not only that: if you have {0,0,0,0.01}, the probability of getting any reward in a single shot is very low. I also have the intuition that rewarding traces at the edge is more efficient, because the model needs only a small perturbation to get them right. If you gave negative rewards to traces that are very far from being right, the model might be steered in the wrong direction.
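
To put rough numbers on the first point, here is a minimal sketch. It assumes {0,0,0,0.01} is read as per-task success probabilities and that attempts are independent; both are my assumptions, not something established above. The `edge` values and the function name `p_any_reward` are made up for illustration.

```python
from math import prod

def p_any_reward(success_probs, attempts=1):
    """Chance that at least one task yields a reward within `attempts` rounds.

    Assumes each task is tried once per round and all tries are independent.
    """
    # Probability that every task fails in a single round.
    p_all_fail = prod(1.0 - p for p in success_probs)
    # Probability that not every round is a total failure.
    return 1.0 - p_all_fail ** attempts

hard = [0.0, 0.0, 0.0, 0.01]   # the set from the comment, read as probabilities
edge = [0.3, 0.4, 0.5, 0.6]    # hypothetical "at the edge" tasks

print(p_any_reward(hard))                # roughly 0.01
print(p_any_reward(edge))                # roughly 0.92
print(p_any_reward(hard, attempts=100))  # roughly 0.63
```

Under those assumptions the hard set gives a reward signal about 1% of the time per round, while the hypothetical edge set gives one most of the time, which is the efficiency gap I have in mind.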


