I don't see any results; the argument would be more impactful and convincing with numbers supplementing the theory. It isn't that hard to finetune an existing LM on a small dataset and verify that the method works.
I am, however, of a similar opinion that there could be better attention formulations. A paper from 2020 (https://arxiv.org/abs/2005.09561) helped a lot in one of the transformer models I trained (not a vanilla LM, but a specialised multi-modal graph model).
It proposes normalised attention, which, if I'm not mistaken, should help with the quantisation problem too.
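For intuition, here's a rough sketch of what normalising attention can look like: replacing softmax with row-wise L2 normalisation of the scores, so the mixing weights stay bounded regardless of score magnitude (which is roughly why it might play nicely with quantisation). This is my own illustrative simplification, not necessarily the paper's exact formulation:

```python
import numpy as np

def normalized_attention(q, k, v, eps=1e-6):
    """Toy single-head attention where softmax is swapped for
    L2 normalisation of the score rows (illustrative only)."""
    d = q.shape[-1]
    # Standard scaled dot-product scores.
    scores = q @ k.T / np.sqrt(d)
    # L2-normalise each row instead of applying softmax, keeping
    # every weight vector on the unit sphere (bounded activations).
    weights = scores / (np.linalg.norm(scores, axis=-1, keepdims=True) + eps)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
out = normalized_attention(q, k, v)  # shape (4, 8), bounded weights
```

Because the weight rows have unit norm, the output magnitude is capped by the value magnitudes, which is the kind of bounded dynamic range quantisation schemes tend to like.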