BERT didn’t go anywhere, and I have seen fine-tuned BERT backbones everywhere. They are useful for generating embeddings to be used downstream, and small enough to run on consumer (pre-Ampere) hardware. One of the trends I have seen is scaling BERT down rather than up: since BERT already gave good performance, we want to be able to do the same thing faster and cheaper. That gave rise to RoBERTa, ALBERT, and DistilBERT.
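For anyone who hasn’t done it, pulling embeddings out of a BERT-family backbone is only a few lines with the transformers library. Here’s a rough sketch; the model name and mean pooling are my own illustrative choices, not a description of anyone’s production setup:

```python
# Minimal sketch: sentence embeddings from a DistilBERT backbone.
# Model choice and mean pooling are assumptions for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()

sentences = [
    "BERT backbones are still useful for embeddings.",
    "Decoder-only models dominate text generation.",
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 768) for distilbert-base-uncased
```

You’d then feed those vectors into whatever downstream task you have (classification, clustering, retrieval), which is where the small fine-tuned backbones really shine on modest hardware.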
T5 I have worked with less, but I would be curious about its head-to-head performance against decoder-only models these days. My guess is the downsides from before (context window limitations) are less of a factor than they used to be.
I tried some large-scale translation tasks with T5 and the results were iffy at best. I’m going to try the same task with the newest Mistral small models and compare; my guess is Mistral will be better.
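For reference, the original T5 checkpoints are prompted for translation with a task prefix rather than chat-style instructions, which may partly explain the mixed results. A quick sketch of the standard usage (model size and language pair are just examples on my end):

```python
# Hedged sketch of T5 translation via its task prefix with Hugging Face
# transformers. "t5-base" and English->German are illustrative choices;
# the original T5 only covers a handful of language pairs out of the box.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

text = "translate English to German: The results were mixed at best."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```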
Translation with Mistral 7B has been eye-opening. It’s a bit sad that it doesn’t cover all languages, but for the languages it does support it’s been awesome. It’s exciting to think where everything will be in a few years.