Though I agree that MLPs are, in theory, more "capable" than transformers, I think seeing transformers as just a parameter-reduction technique is also excessively reductive.
People have tried to build deep and wide MLPs for a long time, but past a certain point adding more parameters stopped improving performance.
In contrast, transformers became so popular because their modelling power just kept scaling with more data and more parameters. It seems the 'restriction' imposed on transformers (the attention structure) is a very good functional form for modelling language (and, increasingly, some tasks in vision and audio); a toy sketch is below.
They did not become popular because they were modest in the number of parameters they use.
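To make that 'restriction' concrete, here is a minimal single-head self-attention sketch in NumPy. The shapes, variable names, and the toy usage at the end are my own illustrative choices, not any particular library's API; the point is just that the same small weight matrices (Wq, Wk, Wv) are shared across every position, and tokens interact only through a data-dependent weighted average, whereas an unrestricted MLP over the flattened sequence would need its own weights for every pair of positions.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_model), shared across all positions
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project every token with the SAME matrices
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # pairwise token similarities, scaled
    weights = softmax(scores, axis=-1)         # each row: how much one token attends to the others
    return weights @ V                         # data-dependent mixing of value vectors

# toy usage: 5 tokens, model width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one mixed vector per token
```

Whether that weight sharing and pairwise mixing is exactly why the architecture keeps scaling is an empirical question, but it is the inductive bias I mean by the 'restriction' above.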