What happened to BERT and T5? (yitay.net)
251 points by fzliu 5 months ago | 68 comments



Maybe in SOTA ML/NLP research, but in the world of building useful tools and products, BERT models are dead simple to tune, work great if you have decent training data, and, most importantly, are very, very fast and very, very cheap to run.

I have a small Swiss-army collection of custom BERT fine-tunes that are equal to or better than the best LLMs and execute document classification tasks in 2.4 ms. Find me an LLM that can do anything in 2.4 ms.
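For a concrete sense of that kind of setup, here's a minimal sketch (the model path is hypothetical and this is just the standard transformers text-classification pipeline, not my actual stack):

    import time
    from transformers import pipeline

    # Hypothetical fine-tuned checkpoint; any BERT-sized classifier works the same way.
    clf = pipeline("text-classification", model="./my-bert-doc-classifier", device=0)

    start = time.perf_counter()
    print(clf("Quarterly report: revenue grew 12% year over year."))
    print(f"{(time.perf_counter() - start) * 1000:.1f} ms")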


Latency, throughput and cost are still very important for many applications.

Also the output of a purpose-built encoder model is preferable to natural language. Not only is it unambiguous, but scores are often an important part of the result.

Lastly, if you need to get into more advanced training methods, like pseudo-labeling and semi-supervised learning, there are different options and outlets for utilizing real-world datasets.

That said, I’m not sure there’s much value in scaling up current encoder models. It seems like there’s already a point of diminishing returns.


Scores are also interesting in that you can get a 1.0 match on a classification task, but if the model is dog shit, it's 1.0 of dog shit.

I’m still struggling with the degree to which I want to expose raw scores to users for that reason.

On the other hand, sometimes a document that's slightly above an arbitrary threshold isn't great, or a document slightly below the threshold may be fine.

I'm excited about how easy it is to do this stuff; the tooling is sophisticated enough now that you don't need to know too much about the underlying mechanisms to do things that are useful. Once you get into those very fine distinctions, though, it's still very difficult work.


Want to share your collection with the class so we can all learn? Seems useful.


Product in stealth for a little bit longer, so can’t say much. :-)


What does your Swiss-army collection do?


Document classification in a highly ambiguous contextual space. Solving some specific large-scale classification tasks, so multi-million-document sets.


What technique do you use to get BERT to work on longer documents?


512 tokens has been sufficient to solve my problems. I made some initial attempts with BigBird that weren't going well, but then realized I didn't really need it.

I may revisit at some point.


Yeah, pretty much. When you have 2B files you need to trawl through, good luck using anything but a vector database. Once you do a level or two of pruning of the results, you can feed them into an LLM for final classification.


BERT didn't go anywhere, and I have seen fine-tuned BERT backbones everywhere. They are useful for generating embeddings to be used downstream, and small enough to be handled on consumer (pre-Ampere) hardware. One of the trends I have seen is scaling BERT down rather than up: since BERT already gave good performance, we want to be able to get it faster and cheaper. That gave rise to RoBERTa, ALBERT, and DistilBERT.

T5 I have worked with less, but I would be curious about its head-to-head performance against decoder-only models these days. My guess is the downsides from before (context window limitations) are less of a factor than they used to be.


Aside from being used alone, T5 is also used as the text encoder of some recent multimodal models.

https://stability.ai/news/stable-diffusion-3-research-paper

https://t5tts.github.io/

Related discussion

https://www.reddit.com/r/StableDiffusion/comments/1c0by2y/wh...


I tried some large scale translation tasks with T5 and results were iffy at best. I’m going to try the same task with the newest Mistral small models and compare. My guess is Mistral will be better.


Translation with Mistral 7B has been eye-opening. It's kind of sad that it's not available for all languages, but for the languages it does support it's been awesome. Kind of exciting to think where everything will be in a few years.


T5 is not BERT, and translation is not embedding.


The article mentions T5, and translation is something T5 is supposedly good at - just sharing that I was less than impressed.


For people like me who gave up trying to follow arXiv ML papers 3+ years ago, articles like these are gold. I would love a YouTube channel or blog that does retrospectives on "big" papers of the last decade (those that everyone paid attention to at the time) and looks at where the ideas are today.


Yes. My picks of YouTubers (some news, but some paper reviews too) are here: https://github.com/swyxio/ai-notes/blob/main/Resources/Good%...


All you need is uninterrupted attention


Maybe one needs additional heads of attention?


Flash attention


BERT is still the most downloaded LM on Hugging Face, with 46M downloads last month. XLM-RoBERTa has 24M and DistilBERT is at 15M. I feel like BERTs are doing okay.


I'm a bit embarrassed to admit, but I still don't understand decoder vs encoder vs decoder/encoder models.

Is the input/output of these models any different? Are they all just "text context goes in, scores for all tokens in the vocabulary come out"? Is the difference only in how they achieve this output?


Encoder: Text tokens -> Fixed representation vector

Decoder: Fixed representation vector + N decoded text tokens -> N+1th text token

Encoder/Decoder architecture: You take some tokenized text, run an encoder on it to get a fixed representation vector, and then recursively apply the decoder to your fixed representation vector and the 0...N tokens you've already produced to produce the N+1th token.

Decoder-only architecture: You take some tokenized text, and recursively apply a decoder to the 0...N tokens you've already produced to produce the N+1th token (without ever using an encoded representation vector).

Basically, an encoder produces this intermediate output which a decoder knows how to combine with some existing output to create more output (imagine, e.g., encoding a sentence in French, and then feeding a decoder the vector representation of that sentence plus the three words you've translated so far, so that it can figure out the next word in the translation). A decoder can be made to require an intermediate context vector, or (this is how it's done in decoder-only architectures) it can be made to require only the text produced so far.
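To make that encode-once, decode-step-by-step loop concrete, here is a rough sketch using a small public T5 checkpoint (greedy decoding with no bells and whistles; in practice model.generate() does all of this for you):

    import torch
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tok = AutoTokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    enc = tok("translate English to German: The house is small.", return_tensors="pt")
    encoder_out = model.encoder(**enc)                 # run the encoder once

    decoded = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(20):                                # greedy decoding, one token at a time
        logits = model(encoder_outputs=encoder_out,
                       attention_mask=enc["attention_mask"],
                       decoder_input_ids=decoded).logits
        next_id = logits[0, -1].argmax()
        decoded = torch.cat([decoded, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break

    print(tok.decode(decoded[0], skip_special_tokens=True))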


An encoder in the T5 sense doesn't produce a fixed vector; it produces one encoded vector for every step of the input, and all of that is given to the decoder.

The only difference between encoder/decoder and decoder-only is masking:

In an encoder, none of the tokens are masked at any step, and are all visible in both directions to the encoder. Each output of the encoder can attend to any input of the encoder.

In the decoder, the tokens are masked causally: the (N+1)th token can only attend to the previous N tokens.
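The masks themselves are trivial to picture (a toy illustration, not tied to any particular model):

    import torch

    seq_len = 5
    encoder_mask = torch.ones(seq_len, seq_len)               # every token attends to every token
    decoder_mask = torch.tril(torch.ones(seq_len, seq_len))   # token i attends only to tokens <= i

    print(decoder_mask)
    # tensor([[1., 0., 0., 0., 0.],
    #         [1., 1., 0., 0., 0.],
    #         [1., 1., 1., 0., 0.],
    #         [1., 1., 1., 1., 0.],
    #         [1., 1., 1., 1., 1.]])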


You can think of encoder/decoder models as specifically addressing the translation problem. They are also known as sequence-to-sequence models.

Take the task of translation. A translator needs to keep in mind the original text and the translation so far in order to predict the next translated token. The original text is encoded, and the translation so far is passed into the decoder to generate the next translated token. The next token is appended to the translation and the process repeats autoregressively.

Decoder-only models use just the decoder architecture of encoder/decoders. They are prompted and generate completions autoregressively.

Encoder-only models use just the encoder architecture, which you can think of as similar to embedding. A task here is producing vectors where vector distance is related to the semantic similarity of the input documents. This can be useful for retrieval tasks, among other things.

You can of course translate using just the decoder, by constructing a "please translate this from A to B, <original text>" prompt and generating tokens just using the decoder. I'll leave it to people with more expertise than me to describe the pros and cons of these.
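The decoder-only version of the same task really is just a prompt; something like the sketch below (the checkpoint name is only one example of an instruction-tuned model, not a recommendation):

    from transformers import pipeline

    generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
    prompt = "Translate the following sentence from English to French:\nThe house is small.\nFrench:"
    print(generator(prompt, max_new_tokens=30)[0]["generated_text"])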


The biggest difference is that when you feed a sequence into a decoder-only model, it will only attend to previous tokens when computing hidden states for the current token. So the hidden states for the nth token are only based on tokens < n. This is where you hear the talk about "causal masking", as the attention matrix is masked to achieve this restriction. Encoder architectures, on the other hand, allow each position in the sequence to attend to every other position in the sequence.

Encoder architectures have been used for semantic analysis and feature extraction of sequences, and decoder-only architectures for generation (i.e., next-token prediction).


Don't be embarrassed. This article makes the mistake of _saying_ it's going to catch the under-informed up to speed but then immediately dives all the way into the deep end.


The key to understanding the difference is that transformers are attention models where tokens can "attend" to different tokens.

Encoder models allow all tokens to attend to every other token. This increases the number of connections and makes it easier for the model to reason, but requires all tokens at once to produce any output. These models generally can't generate text.

Decoder models only allow tokens to attend to previous tokens in the sequence. This decreases the number of connections, but allows the model to be run incrementally, one token at a time. This incremental processing is key to allowing the models to generate text.


This is wrong.

The term for models that look only at previous tokens in the sequence is auto-regressive.

Encoder and decoder have nothing to do with this.


Aren't a lot of transformers built in a way where attention is only applied to previous tokens in the sequence, even though it's fully possible to apply it both ways?


That's the autoregressive aspect. The decoder aspect is that the last layer converts representations into output sequences (and the generation happens autoregressively, one at a time). Similarly at the last layer an encoder outputs a representation/embedding (while being able to attend to the entire sequence).


But this has nothing to do with encoding and decoding.


If you look at the classical [transformer architecture picture](https://en.wikipedia.org/wiki/Transformer_(deep_learning_arc...) there is an "encoder" tower on the left and a "decoder" tower on the right.

- BERT is encoder-only.

- GPT is decoder-only.

- T5 uses both the encoder and the decoder.


I think the big reason why BERT and T5 have fallen out of favor is the lack of zero-shot (or few-shot) ability.

When you have hundreds or thousands of examples, BERT works great. But that is very restricting.


Yes, but you can use an LLM to label data and then train a BERT model, which then costs a small fraction of the time and money to run compared with the original LLM.
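A minimal sketch of that bootstrap (toy texts and a made-up three-class label set, pretending the labels came back from an LLM; the fine-tuning part is the standard Hugging Face sequence-classification recipe):

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              DataCollatorWithPadding, Trainer, TrainingArguments)

    # Pretend these labels came back from prompting a big model over unlabeled docs.
    data = Dataset.from_dict({
        "text": ["Payment due within 30 days.",
                 "This agreement is governed by the laws of...",
                 "Lunch menu for Friday"],
        "label": [0, 1, 2],   # 0=invoice, 1=contract, 2=other (hypothetical classes)
    })

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    data = data.map(lambda x: tok(x["text"], truncation=True), batched=True)

    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-distilled", num_train_epochs=3),
        train_dataset=data,
        data_collator=DataCollatorWithPadding(tok),
    )
    trainer.train()   # the result runs in milliseconds per doc instead of LLM-call latency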


Shhh, don’t tell everybody the secret. ;-)


Lol, isn't everyone doing it? That's how I bootstrapped my BERT fine-tunes.


I would say everybody smart is doing that, but a lot of the dumb money in AI right now is just wrappers on the GPT API that make for a flashy demo with no underlying substance or expertise.


Is the encoder-style architecture better at representing classification tasks at a given compute budget than a causal LM?

Is this because the final representation in BERT-style models is more globally focused, rather than being optimized for next-token prediction?


They are 100% better for classification at a given compute budget. They can account for information both before and after a token (e.g., for token classification) and use that information to classify.


Yes, no zero-shot. Few-shot is possible for some use cases with SetFit: https://github.com/huggingface/setfit and the very recent FastFit: https://github.com/IBM/fastfit (https://arxiv.org/pdf/2404.12365).
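For reference, the SetFit README boils down to roughly this (exact class names have shifted a bit between setfit versions; the texts and labels below are made up):

    from datasets import Dataset
    from setfit import SetFitModel, SetFitTrainer

    train_ds = Dataset.from_dict({
        "text": ["great build quality", "arrived broken", "does what it says",
                 "total waste of money", "exceeded expectations", "would not buy again"],
        "label": [1, 0, 1, 0, 1, 0],   # 1 = positive, 0 = negative
    })

    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
    trainer = SetFitTrainer(model=model, train_dataset=train_ds)
    trainer.train()

    print(model.predict(["the screen cracked after a week"]))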


They are there; you just have to look: Tasksource, NuNER, Flan, T0. There's not a lot, but there are still at least a few good zero-shot models in both architectures.


It's because you need to mess with embeddings or even train new heads on top of a network to use it. LLMs are just tokens-in, tokens-out; they don't classify with a softmax over classes, they softmax over vocabulary tokens. LLMs are more convenient.
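You can see that difference directly in the output shapes (a quick check with two small public checkpoints):

    from transformers import (AutoModelForCausalLM,
                              AutoModelForSequenceClassification, AutoTokenizer)

    # Encoder + classification head: one logit per class.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    clf = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
    print(clf(**tok("route this ticket", return_tensors="pt")).logits.shape)    # [1, 4]

    # Causal LM: one logit per vocabulary token at every position.
    tok2 = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    print(lm(**tok2("route this ticket", return_tensors="pt")).logits.shape)    # [1, seq_len, 50257]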


What happened is that "transformers go whrrrrrr." (yes, that's the academic term)

In the end, LLMs using causal language modeling or masked language modeling learn to best solve their objectives by creating an efficient global model of language patterns: CLM is actually a harder problem to solve since MLM can leak information through surrounding context, and with transformer scaling law research post-BERT/GPT it's not a surprise CLM won out in the long run.
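A toy illustration of the two objectives (arbitrary token ids, no model): MLM only gets a loss signal at the ~15% of positions that were masked, while CLM predicts, and gets gradients for, every next token.

    import torch

    tokens = torch.tensor([101, 2023, 2003, 1037, 7953, 102])   # arbitrary ids

    # MLM: mask ~15% of positions; labels are -100 (ignored) everywhere else.
    mlm_input = tokens.clone()
    mlm_labels = torch.full_like(tokens, -100)
    masked_pos = torch.tensor([2])          # pretend position 2 was sampled for masking
    mlm_labels[masked_pos] = tokens[masked_pos]
    mlm_input[masked_pos] = 103             # [MASK] id in BERT's vocab

    # CLM: inputs are tokens[:-1], labels are tokens[1:], every position is scored.
    clm_input, clm_labels = tokens[:-1], tokens[1:]

    print(mlm_input, mlm_labels)
    print(clm_input, clm_labels)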


I believe many high-quality embedding models are still based on BERT, even recent ones, so I don't think it's entirely fair to characterize it as "deprecated".


DNABERT-S came out half a year ago: seems like xBERT is still useful in genomics/DNA? https://arxiv.org/abs/2402.08777


Feels like large language models sucked all the air out of the room because it was a lot easier to scale compute and data, and after RoBERTa, no one was willing to continue exploring.


No, there are mathematical reasons LLMs are better. They are trained with a multi-objective loss (coding skills, translation skills, etc.), so they understand the world much better than MLMs. The original post discusses that, but with more words and points than necessary.


GPTs also get gradients from all tokens, while BERT only gets them on the 15% of tokens that are masked. GPTs are more effective.


Call it CLM vs. MLM, not LLM vs. MLM. Soon LMLMs will exist, which will be LLMs too...


T5 is an LLM, I think the first one of them.


> It is also worth to note that, generally speaking, an Encoder-Decoders of 2N parameters has the same compute cost as a decoder-only model of N parameters which gives it a different FLOP to parameter count ratio.

Can someone explain this to me? I'm not sure how the compute costs are the same between the 2N and N nets.


You can break your sequence into two parts. One part goes through the encoder and the other goes through the decoder, so each token only goes through one transformer stack.
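A back-of-the-envelope version of that, using the rough ~2 * params FLOPs-per-token estimate for a forward pass (ignoring the attention terms and cross-attention, so only a sketch):

    N = 1e9                      # parameters per stack
    L_in, L_out = 1000, 1000     # hypothetical input/output lengths in tokens

    # Decoder-only model with N parameters: every token goes through the same stack.
    decoder_only_flops = 2 * N * (L_in + L_out)

    # Encoder-decoder with an N-param encoder + N-param decoder (2N parameters total):
    # input tokens pass through the encoder only, output tokens through the decoder only.
    enc_dec_flops = 2 * N * L_in + 2 * N * L_out

    print(decoder_only_flops == enc_dec_flops)   # True: same compute, twice the parameters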


I think it's because most of the compute comes from the decoding, since you're doing it autoregressively, while for the encoder you just feed the input through once and get the embedding. So really all it's saying is that the decoder, with N parameters, is the compute bottleneck; hence an encoder-decoder with N+N parameters has a similar order of compute cost as a decoder with N.


Wasn't there a recent paper that demonstrated BERT models are still competitive with, or beat, LLMs in many tasks?


Yi is a good source in this area, and a good follow on Twitter.


We did the first podcast interview with him a few weeks ago, if you're interested in learning more: https://www.latent.space/p/yitay


IMO GenAI gets all the hype, but in industry, the robustness (i.e., it does not hallucinate) of extractive models is much appreciated.


> If BERT worked so well, why not scale it?

I mean, the scaling already happened in 2019 with RoBERTa. My guess is that these models are already good enough at what they need to do (creating meaningful text embeddings), and making them extremely large wasn't feasible for deployment.


For text classification/clustering/retrieval I am pretty happy with BERT-family models. It's only in the last few months that I've seen better models come out that are practical (i.e., you don't have to sell all your children to OpenAI to afford them).


What would you say are the better models nowadays that are practical?


For my recommender/object sorter I have not been in a hurry to upgrade because I have other things to think about. This table should give you some idea of the time-space-accuracy trade-offs:

https://huggingface.co/spaces/mteb/leaderboard

In a lot of cases you will see two models with a huge difference in size but a tiny difference in accuracy. I could fit either the big or small Stella on my 4080.


Classification is just too damn convenient with LLMs.


* without already labelled training data (assuming you're referring to causal LLMs).

If you have labelled training data (or semi-labelled), BERT takes the cake, both in terms of accuracy and efficiency. In fact, you can have luck with getting a CLM to generate noisy labels and then training BERT/RoBERTa on them to get a robust, strong classifier.


Between my experience and the arXiv papers I've read, I'd say this:

Personally I am willing to label 1000-2000 documents to create a training set. It's reasonable to make about 2000 simple judgements in an 8-hour "day" so it is something you could do at work or in your spare time in a few days if you want.

You can compute an embedding and then use classical ML algorithms from scikit-learn such as the support vector machine. My recommender can train a new model in about 3 minutes and that includes building about 20 models and testing them against each other to produce a model which is well tested and probability calibrated. This process is completely industrial, can run unattended, and always makes a good model if it gets good inputs. Running it every day is no sweat.
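A compact sketch of that pipeline (sentence-transformers is my stand-in for whichever embedding model you use; the texts and labels are placeholders, and cv is tiny only so the toy example runs):

    from sentence_transformers import SentenceTransformer
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.svm import LinearSVC

    texts = ["an article I liked ...", "an article I skipped ...",
             "another good one ...", "another skip ..."]
    labels = [1, 0, 1, 0]

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    X = encoder.encode(texts)                        # one fixed-size vector per document

    clf = CalibratedClassifierCV(LinearSVC(), cv=2)  # SVM + probability calibration
    clf.fit(X, labels)

    print(clf.predict_proba(encoder.encode(["a brand-new article ..."])))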

You can also "fine-tune" a model, actually changing the parameters of the deep model. I've fine-tuned BERT-family models for my recommender; it takes at least 30 minutes and the training process is not completely reliable. A reliable model builder would probably do that 20 or so times with different parameters (learning rate, how long to train the model, etc.) and pick out the best model. As it is, the best model from it is about as good as my simpler models, and a bad one is worse. I can picture industrializing it, but I'm not sure it's worth it. In a lot of papers people just seem to copy a recipe from another paper and don't do any model selection.

My problem is fuzzy: the answer to "Do I like this article?" could vary from day to day. If I had a more precise problem, "fine-tuning" might pull ahead. Some people do get a significant improvement, which would make it worth it, particularly if you don't expect to retrain frequently.

I see papers where somebody does the embedding transformation, but instead of pooling over the tokens (averaging the vectors) they input the vectors into an LSTM, GRU, or something like that and train the recurrent model. This kind of model does great when word order matters, as in sentiment analysis. I found that kind of model was easy to train in a repeatable way a decade ago, so that's an option I'd consider.


You're better off trying to figure out which features you like about articles that are less ambiguous as a signal. You would then be able to fine-tune models to classify whether those features are present, whether that's classification of chunks, sentences, or tokens. For these, a BERT model could be fine-tuned to detect them efficiently.


nit: I find the writing in this post very distracting. (Grammar and style pet peeves)

Luckily, it is now trivial to drop the post into Claude and say "Re-write this without <list of things that bother me>"

So, just in case you also felt like you were driving over a road filled with potholes trying to read this post, don't just click away; have your handy LLM take a pass at it. There's good stuff to be found.



