It looks good to me, but is it right that the biggest model they tried was CoLT5-XL, at only about 5B parameters, and that even this count includes parameters that are zero due to sparsity? My understanding is that this isn't enough for some of the amazing emergent behaviors the original transformer formulation can show, but maybe the ideas behind CoLT5 will still scale to truly large language models?
The most interesting aspect of this research is that Google published it. They have a huge competitor in this area; it would make sense to stop publishing unless they think OpenAI is already doing this.
This may seem counter-intuitive, but model architecture is not the most significant differentiator in LLMs. Not publishing creates negative externalities around collaboration, recruiting, etc., and any research organization worth its salt knows better than to stop publishing. It's similar to the fallacy that Google should never have open-sourced Kubernetes because it allowed other cloud providers to out-compete them.
Every single decision Google makes is about avoiding being seen by the government as a web-advertising monopoly. Being seen as a competitor to adjacent tech companies is exactly what they are after, since they have no competitor in display advertising.
Dynamic allocation of Transformer layers per step is intuitively and computationally appealing. I can't believe this idea has been under-explored for years.
This still isn't technically dynamic allocation, since it always routes the top-k (constant k) tokens from the sequence, so it's more like dynamic routing, which has been explored in Mixture-of-Experts models, though only in the feed-forward blocks and with a different routing scheme.
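To make the distinction concrete, here is a minimal sketch of that kind of constant-k conditional routing in PyTorch. The module name, layer sizes, single-linear scorer, and softmax gating below are illustrative assumptions, not the paper's exact design: every token passes through a light branch, and only the k highest-scoring tokens additionally pass through a heavy branch.

```python
import torch
import torch.nn as nn

class TopKRoutedFFN(nn.Module):
    """Sketch of conditional computation: all tokens take a cheap path,
    the top-k scoring tokens also take an expensive path."""
    def __init__(self, d_model=512, d_light=128, d_heavy=2048, k=16):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(d_model, 1)  # learned routing scores
        self.light = nn.Sequential(nn.Linear(d_model, d_light), nn.ReLU(),
                                   nn.Linear(d_light, d_model))
        self.heavy = nn.Sequential(nn.Linear(d_model, d_heavy), nn.ReLU(),
                                   nn.Linear(d_heavy, d_model))

    def forward(self, x):                        # x: [batch, seq, d_model]
        scores = self.scorer(x).squeeze(-1)      # [batch, seq]
        topk = scores.topk(self.k, dim=-1)       # constant k per sequence
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = x.gather(1, idx)              # only the routed tokens
        out = self.light(x)                      # every token: cheap path
        # gate the heavy output by the (normalized) routing score so the
        # router gets gradients; softmax gating is an assumption here
        heavy_out = self.heavy(selected) * topk.values.softmax(-1).unsqueeze(-1)
        return out.scatter_add(1, idx, heavy_out)  # merge back at routed positions
```

Note that k is fixed per sequence, so the compute budget is constant no matter how many tokens actually "deserve" the heavy path, which is exactly why this reads as routing rather than dynamic allocation.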
One can also let the model learn the necessary context length for each layer and head, which saves a huge amount of FLOPs: https://arxiv.org/abs/1905.07799
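Roughly, that paper learns a soft mask over attention distances so each head can shrink its own span, with a penalty on the learned spans pushing them to be short. A small sketch of the idea (parameter names, the 0..1 span parameterization, and the renormalization wiring are my assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Each head learns a span; attention weight at distance r is scaled by
    clamp((span + ramp - r) / ramp, 0, 1), then renormalized."""
    def __init__(self, n_heads, max_span=1024, ramp=32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp                              # softness of the mask edge
        self.z = nn.Parameter(torch.zeros(n_heads))   # learnable span fraction per head

    def forward(self, attn_weights):
        # attn_weights: [batch, heads, query, key], key index = distance to query
        distances = torch.arange(attn_weights.size(-1), device=attn_weights.device)
        span = self.z.clamp(0, 1).unsqueeze(-1) * self.max_span           # [heads, 1]
        mask = ((span + self.ramp - distances) / self.ramp).clamp(0, 1)   # [heads, key]
        masked = attn_weights * mask.unsqueeze(0).unsqueeze(2)
        return masked / masked.sum(dim=-1, keepdim=True).clamp(min=1e-8)

    def span_penalty(self):
        # add to the loss to encourage each head to shrink its span
        return self.z.clamp(0, 1).sum() * self.max_span
```

Heads that only need local context learn a small span, so their attention (and the FLOPs spent computing it) can be truncated to a short window.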
Google has released a lot of the T5-family models, for which it gets insufficient credit, so CoLT5 may well be released at some point. But the process can take a while when it has to be run past lawyers and management, so check back in half a year...