It looks good to me, but is it right that the largest model they tried was CoLT5-XL, at only about 5B parameters, some of which are effectively zero due to sparsity? My understanding is that this scale is not enough for some of the amazing emergent abilities that the original transformer formulation exhibits, but perhaps the ideas behind CoLT5 would still scale up to large language models?