It looks good to me, but is it right that the biggest model they tried was CoLT5-XL, at only about 5B parameters, and that even this count includes parameters that are zero due to sparsity? My understanding is that this isn't enough for some of the amazing emergent behaviors the original transformer formulation can show, but maybe the ideas behind CoLT5 will still scale to truly large language models?
The most interesting aspect of this research is that Google published it. They have a huge competitor in this area; it would make sense to stop publishing unless they think OpenAI is already doing this.
This may seem counter-intuitive, but model architecture is not the most significant differentiator in LLMs. Not publishing creates negative externalities around collaboration, recruiting, etc., and any research organization worth its salt knows better than to stop publishing. It's similar to the fallacy that Google should never have open-sourced Kubernetes because it allowed other cloud providers to out-compete them.
Every single decision Google makes is about avoiding being seen by the government as a web-advertising monopoly. Being seen as a competitor to adjacent tech companies is exactly what they are after, since they have no competitor in display advertising.
Dynamic allocation of Transformer layers per step is intuitively and computationally appealing. I can't believe this idea has been under-explored for years.
This still isn't technically dynamic allocation, since it always routes the top-k (constant k) tokens from the sequence, so it's more like dynamic routing, which has been explored in Mixture-of-Experts models, though only in the feed-forward blocks and with a different routing scheme.
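To make the distinction concrete, here is a minimal sketch of that kind of constant-k conditional routing in PyTorch. The module name, layer sizes, single-linear scorer, and softmax gating below are illustrative assumptions, not the paper's exact design: every token passes through a light branch, and only the k highest-scoring tokens additionally pass through a heavy branch.

```python
import torch
import torch.nn as nn

class TopKRoutedFFN(nn.Module):
    """Sketch of conditional computation: all tokens take a cheap path,
    the top-k scoring tokens also take an expensive path."""
    def __init__(self, d_model=512, d_light=128, d_heavy=2048, k=16):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(d_model, 1)  # learned routing scores
        self.light = nn.Sequential(nn.Linear(d_model, d_light), nn.ReLU(),
                                   nn.Linear(d_light, d_model))
        self.heavy = nn.Sequential(nn.Linear(d_model, d_heavy), nn.ReLU(),
                                   nn.Linear(d_heavy, d_model))

    def forward(self, x):                        # x: [batch, seq, d_model]
        scores = self.scorer(x).squeeze(-1)      # [batch, seq]
        topk = scores.topk(self.k, dim=-1)       # constant k per sequence
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = x.gather(1, idx)              # only the routed tokens
        out = self.light(x)                      # every token: cheap path
        # gate the heavy output by the (normalized) routing score so the
        # router gets gradients; softmax gating is an assumption here
        heavy_out = self.heavy(selected) * topk.values.softmax(-1).unsqueeze(-1)
        return out.scatter_add(1, idx, heavy_out)  # merge back at routed positions
```

Note that k is fixed per sequence, so the compute budget is constant no matter how many tokens actually "deserve" the heavy path, which is exactly why this reads as routing rather than dynamic allocation.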
One can also let the model learn the necessary context length for each layer and head, which saves a huge amount of FLOPs: https://arxiv.org/abs/1905.07799
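Roughly, that paper learns a soft mask over attention distances so each head can shrink its own span, with a penalty on the learned spans pushing them to be short. A small sketch of the idea (parameter names, the 0..1 span parameterization, and the renormalization wiring are my assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Each head learns a span; attention weight at distance r is scaled by
    clamp((span + ramp - r) / ramp, 0, 1), then renormalized."""
    def __init__(self, n_heads, max_span=1024, ramp=32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp                              # softness of the mask edge
        self.z = nn.Parameter(torch.zeros(n_heads))   # learnable span fraction per head

    def forward(self, attn_weights):
        # attn_weights: [batch, heads, query, key], key index = distance to query
        distances = torch.arange(attn_weights.size(-1), device=attn_weights.device)
        span = self.z.clamp(0, 1).unsqueeze(-1) * self.max_span           # [heads, 1]
        mask = ((span + self.ramp - distances) / self.ramp).clamp(0, 1)   # [heads, key]
        masked = attn_weights * mask.unsqueeze(0).unsqueeze(2)
        return masked / masked.sum(dim=-1, keepdim=True).clamp(min=1e-8)

    def span_penalty(self):
        # add to the loss to encourage each head to shrink its span
        return self.z.clamp(0, 1).sum() * self.max_span
```

Heads that only need local context learn a small span, so their attention (and the FLOPs spent computing it) can be truncated to a short window.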
Google has released a lot of the T5-family models, for which it gets insufficient credit, so CoLT5 may well be released at some point. But the process can take a while when it has to be run past lawyers and management, so check back in half a year...