The mixing step, is computed with the trained weights, meaning the model does learn on its own, when/what to emphasise in this "mixing" process.
Hypothetically speaking, if a token does not matter, it mixing weights could end up being literally 0 (or something close to it)