
While all other tokens are considered, the attention mechanism puts an individual weight on each one, in a way "paying more attention" to some than others.
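A minimal NumPy sketch of this idea (dimensions and values are illustrative, not from any particular model): scaled dot-product attention computes one weight per token, and each row of the weight matrix sums to 1.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    seq_len, d = 4, 8                  # toy example: 4 tokens, 8-dim vectors
    Q = rng.normal(size=(seq_len, d))  # queries
    K = rng.normal(size=(seq_len, d))  # keys
    V = rng.normal(size=(seq_len, d))  # values

    scores = Q @ K.T / np.sqrt(d)      # similarity of each query to each key
    weights = softmax(scores)          # one weight per token; rows sum to 1
    output = weights @ V               # value vectors mixed per those weights

    print(weights[0])                  # how strongly token 0 attends to each token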



Yup, if you follow this definition of attention, it makes sense.

The mixing step is computed with trained weights, meaning the model learns on its own when/what to emphasise in this "mixing" process.

Hypothetically speaking, if a token does not matter, its mixing weight could end up being essentially 0 (or at least something very close to it).
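A quick standalone illustration with made-up scores: a strongly negative score collapses a token's softmax weight toward 0, so its value vector contributes almost nothing to the mix. Strictly speaking softmax can never output exactly 0, since exp(x) > 0, but it gets arbitrarily close.

    import numpy as np

    scores = np.array([2.0, 1.5, -30.0, 0.5])  # token 2 scored as irrelevant
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    print(weights)  # token 2's weight is ~1e-14: effectively zero in the mix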



