My hot take is that if you don't do the trick and all x are very small (i.e. very negative and roughly equal), you basically get the mean of all the vectors in the value matrix. The next sequence of linear layers can then probably learn to interpret that mean the same way it would interpret the 0 you'd produce with the +1 trick?
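To make that concrete, here is a minimal pure-Python sketch of the two cases. It assumes the "+1 trick" means adding 1 to the softmax denominator, i.e. exp(x_i) / (1 + sum_j exp(x_j)). The helper names (`softmax_plus_one`, `attend`) and the toy values are made up for illustration, not taken from any library.

```python
import math

def softmax(xs):
    # Standard softmax, shifted by the max for numerical stability.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def softmax_plus_one(xs):
    # Assumed form of the "+1 trick": exp(x_i) / (1 + sum_j exp(x_j)).
    # Shifting by m = max(max(xs), 0) avoids overflow; the 1 becomes exp(-m).
    m = max(max(xs), 0.0)
    es = [math.exp(x - m) for x in xs]
    total = math.exp(-m) + sum(es)
    return [e / total for e in es]

def attend(weights, values):
    # Weighted sum of the value vectors (one attention output row).
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy data (made up): three value vectors, all logits very negative and equal.
values = [[1.0, 2.0], [3.0, -4.0], [5.0, 6.0]]
logits = [-20.0, -20.0, -20.0]

out_plain = attend(softmax(logits), values)              # uniform weights
out_plus_one = attend(softmax_plus_one(logits), values)  # weights near 0
mean_value = [sum(col) / len(values) for col in zip(*values)]

print(out_plain, mean_value, out_plus_one)
```

With equal logits the plain softmax output matches the mean of the value vectors, while the +1 variant is close to the zero vector. If the logits are very negative but not equal, plain softmax still normalizes them, so the result is a weighted average rather than the plain mean.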