
My hot take is that if you don't do the trick, you basically get the mean of all vectors in the value matrix when all x are very small. The next sequence of linear layers will then probably be able to interpret that the same way as if you had done the +1 trick and produced a 0?
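
To make that concrete, here is a rough sketch in plain Python. It assumes the "+1 trick" means adding 1 to the softmax denominator, and it uses the idealized case where all scores are very negative and exactly equal; with unequal scores, plain softmax only depends on their differences, so the result would not be the exact mean. The function names and the toy value matrix are made up for illustration.

```python
import math

def softmax(xs):
    # Standard softmax, shifted by the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_plus_one(xs):
    # Assumed form of the "+1 trick": exp(x_i) / (1 + sum_j exp(x_j)).
    # Shifting by m = max(0, max(xs)) keeps the exponentials from overflowing.
    m = max(0.0, max(xs))
    exps = [math.exp(x - m) for x in xs]
    total = math.exp(-m) + sum(exps)
    return [e / total for e in exps]

def attend(weights, values):
    # Weighted sum of the value vectors (one attention output row).
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy value matrix (hypothetical numbers) and equal, very negative scores.
values = [[1.0, 0.0], [0.0, 1.0], [3.0, 2.0]]
scores = [-20.0, -20.0, -20.0]

plain = attend(softmax(scores), values)
plus_one = attend(softmax_plus_one(scores), values)
mean = [sum(col) / len(values) for col in zip(*values)]

print("plain softmax:", plain)     # matches the mean of the value vectors
print("mean of values:", mean)
print("+1 softmax:", plus_one)     # very close to the zero vector
```

So without the trick the head outputs the average value vector, and with it the head outputs roughly zero. Whether the following linear layers can treat that average as "nothing here" is the open part of the question; the sketch does not test that.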


