> First, if an attention node has low confidence, it can already assign similar scores pre-softmax. Then we get what looks like a uniform distribution as output.
Disagree here: I think neural nets are quite bad at implicitly learning low-entropy transforms, similar to how they struggle to model the identity function, which is what necessitates residual connections. In both cases the change doesn't increase expressivity, but it does bake these needle-in-a-haystack transformations into the model, where they would otherwise be hard to reach with gradient descent.
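A minimal sketch of the identity-function point (plain NumPy, all names mine): with a residual connection, "do nothing" corresponds to the residual branch outputting zero, which is trivial to represent, whereas a plain layer has to land its weights exactly on the identity matrix.

```python
import numpy as np

d = 8
x = np.random.randn(d)

# Plain linear layer: representing the identity requires W == I exactly,
# a single point in weight space that gradient descent has to find.
W_plain = np.eye(d)
assert np.allclose(W_plain @ x, x)

# Residual block: the identity is the zero of the residual branch, so
# "do nothing" is the easiest thing for the block to represent.
W_res = np.zeros((d, d))
assert np.allclose(x + W_res @ x, x)
```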
Can't speak to how useful it is though.