Maybe in SOTA ml/nlp research, but in the world of building useful tools and products, BERT models are dead simple to tune, work great if you have decent training data, and most importantly are very very fast and very very cheap to run.
I have a small Swiss army collection of custom BERT fine tunes that are equal or better than the best LLM and execute document classification tasks in 2.4ms. Find me an LLM that can do anything in 2.4ms.
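For a sense of what that looks like in practice, here's a minimal sketch of serving a fine-tuned BERT classifier with Hugging Face transformers (the checkpoint name is hypothetical, and getting into the low-millisecond range usually also means batching, a distilled backbone, and/or an ONNX/TensorRT export):

```python
# Minimal sketch: classify a document with a fine-tuned BERT checkpoint.
# "my-bert-classifier" is a hypothetical local fine-tune, not a real model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-bert-classifier")
model = AutoModelForSequenceClassification.from_pretrained("my-bert-classifier")
model.eval()

@torch.inference_mode()
def classify(text: str) -> int:
    # Truncate to BERT's 512-token window and take the argmax class.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    return int(model(**inputs).logits.argmax(dim=-1))
```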
Latency, throughput and cost are still very important for many applications.
Also the output of a purpose-built encoder model is preferable to natural language. Not only is it unambiguous, but scores are often an important part of the result.
Lastly, if you need to get into more advanced training methods, like pseudo-labeling and semi-supervised learning, there are different options and outlets for making use of real-world datasets.
That said, I’m not sure there’s much value in scaling up current encoder models. It seems like there’s already a point of diminishing returns.
Scores are also interesting in that you can get 1.0 match on a classification task, but if the model is dog shit it’s 1.0 of dog shit.
I’m still struggling with the degree to which I want to expose raw scores to users for that reason.
On the other hand, sometimes a document that's slightly above an arbitrary threshold isn't great, or a document slightly below an arbitrary threshold may be fine.
I’m excited about how easy it is to do this stuff; the tooling is sophisticated enough now that you don’t need to know too much about the underlying mechanisms to do useful things. Once you get into those very fine distinctions, though, it’s still very difficult work.
512 has been sufficient to solve my problems. I had done some initial attempts with BigBird that weren’t going well, but then realized I didn’t really need it.
Yeah, pretty much. When you have 2B files you need to trawl through, good luck using anything but a vector database. Once you do a level or two of pruning of the results, then you can feed it into an LLM for final classification.
BERT didn’t go anywhere and I have seen fine-tuned BERT backbones everywhere. They are useful for generating embeddings to be used downstream, and small enough to be handled on consumer (pre-Ampere) hardware. One of the trends I have seen is scaling BERT down rather than up: since BERT already gave good performance, we want to be able to do it faster and cheaper. That gave rise to ALBERT and DistilBERT (and to better-trained variants like RoBERTa).
T5 I have worked less with but I would be curious about its head to head performance with decoder-only models these days. My guess is the downsides from before (context window limitations) are less of a factor than they used to be.
I tried some large scale translation tasks with T5 and results were iffy at best. I’m going to try the same task with the newest Mistral small models and compare. My guess is Mistral will be better.
Translation with Mistral 7B has been eye-opening. It’s kind of sad it’s not for all languages, but for the languages it does support it’s been awesome. Kind of exciting to think where everything will be in a few years.
For people like me who gave up trying to follow Arxiv ML papers 3+ years ago, articles like these are gold. I would love a Youtube channel or blog which does retrospectives on "big" papers of the last decade (those that everyone paid attention to at the time) and look at where the ideas are today.
BERT is still the most downloaded LM at huggingface with 46M downloads last month. XLM Roberta has 24M and Distilbert is at 15M. I feel like BERTs are doing okay.
I'm a bit embarrassed to admit, but I still don't understand decoder vs encoder vs decoder/encoder models.
Is the input/output of these models any different? Are they all just "text context goes in, scores for all tokens in the vocabulary come out" ? Is the difference only in how they achieve this output?
Encoder: Text tokens -> Fixed representation vector
Decoder: Fixed representation vector + N decoded text tokens -> N+1th text token
Encoder/Decoder architecture: You take some tokenized text, run an encoder on it to get a fixed representation vector, and then recursively apply the decoder to your fixed representation vector and the 0...N tokens you've already produced to produce the N+1th token.
Decoder-only architecture: You take some tokenized text, and recursively apply a decoder to the 0...N tokens you've already produced to produce the N+1th token (without ever using an encoded representation vector).
Basically, an encoder produces this intermediate output which a decoder knows how to combine with some existing output to create more output (imagine, e.g., encoding a sentence in French, and then feeding a decoder the vector representation of that sentence plus the three words you've translated so far, so that it can figure out the next word in the translation). A decoder can be made to require an intermediate context vector, or (this is how it's done in decoder-only architectures) it can be made to require only the text produced so far.
An encoder in the T5 sense doesn't produce a fixed vector; it produces one encoded vector for every step of input, and all of that is given to the decoder.
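A quick way to see this, sketched with the Hugging Face T5 classes (model size arbitrary):

```python
# The T5 encoder returns one hidden state per input token, not a single vector.
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state
print(hidden.shape)  # (batch, num_input_tokens, d_model): one vector per token
```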
The only difference between encoder/decoder and decoder-only is masking:
In an encoder, none of the tokens are masked at any step, and are all visible in both directions to the encoder. Each output of the encoder can attend to any input of the encoder.
In the decoder, the tokens are masked causally: the (N+1)th token can only attend to the previous N tokens.
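In code, the difference really is just the shape of the attention mask; a minimal sketch:

```python
# Encoder vs. decoder masking for a 4-token sequence.
import torch

n = 4
encoder_mask = torch.ones(n, n)              # every token can attend to every token
decoder_mask = torch.tril(torch.ones(n, n))  # token i attends only to tokens <= i

print(decoder_mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
# In practice the zeroed positions get set to -inf before the attention softmax,
# so they receive zero weight.
```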
You can think of encoder/decoder models as specifically addressing the translation problem. They are also known as sequence-to-sequence models.
Take the task of translation. A translator needs to keep in mind the original text and the translation so far in order to predict the next translated token. The original text is encoded, and the translation so far is passed into the decoder to generate the next translated token. The next token is appended to the translation and the process repeats autoregressively.
Decoder-only models use just the decoder architecture of encoder/decoders. They are prompted and generate completions autoregressively.
Encoder-only models use just the encoder architecture, which you can think of similarly to embedding. A typical task here is producing vectors where vector distance relates to the semantic similarity of the input documents, as in the sketch below. This can be useful for retrieval tasks, among other things.
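As a rough sketch of that embedding use case (the model name is just one publicly available example):

```python
# Vector distance tracks semantic similarity for encoder-only embedding models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["The cat sat on the mat.",
        "A feline rested on a rug.",
        "Quarterly earnings rose 8%."]
emb = model.encode(docs, convert_to_tensor=True)
print(util.cos_sim(emb[0], emb[1]))  # paraphrase pair: should score high
print(util.cos_sim(emb[0], emb[2]))  # unrelated pair: should score lower
```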
You can of course translate using just the decoder, by constructing a "please translate this from A to B, <original text>" prompt and generating tokens just using the decoder. I'll leave it to people with more expertise than I have to describe the pros and cons of these.
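For concreteness, the encoder/decoder route looks roughly like this with T5 (which was trained with task prefixes like the one below); the decoder-only route is the same idea minus the separate encode step, with the instruction folded into the prompt:

```python
# Translation with an encoder/decoder model: encode the source once,
# then decode the translation autoregressively.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The weather is nice today.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```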
The biggest difference is that when you feed a sequence into a decoder-only model, it will only attend to previous tokens when computing hidden states for the current token. So the hidden states for the nth token are only based on tokens < n. This is where you hear the talk about "causal masking", as the attention matrix is masked to achieve this restriction. Encoder architectures, on the other hand, allow each position in the sequence to attend to every other position in the sequence.
Encoder architectures have been used for semantic analysis and feature extraction of sequences, and decoder-only architectures for generation (i.e. next-token prediction).
Don't be embarrassed. This article makes the mistake of _saying_ it's going to catch the under-informed up to speed but then immediately dives all the way into the deep end.
The key to understanding the difference is that transformers are attention models where tokens can "attend" to different tokens.
Encoder models allow all tokens to attend to every other token. This increases the number of connections and makes it easier for the model to reason, but requires all tokens at once to produce any output. These models generally can't generate text.
Decoder models only allow tokens to attend to previous tokens in the sequence. This reduces the number of connections, but allows the model to be run incrementally, one token at a time. This incremental processing is key to allowing the models to generate text.
Aren't a lot of transformers built in a way where attention is only applied to previous tokens in the sequence, even though it's fully possible to apply it both ways?
That's the autoregressive aspect. The decoder aspect is that the last layer converts representations into output sequences (and the generation happens autoregressively, one at a time). Similarly at the last layer an encoder outputs a representation/embedding (while being able to attend to the entire sequence).
I would say everybody smart is doing that, but a lot of the dumb money in AI right now is just wrappers on the GPT API. That makes for a flashy demo with no underlying substance or expertise.
They are 100% better for classification at a given compute budget. They can account for information both before and after a token (e.g. for token classification) and use that information to classify.
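A minimal sketch of that token-classification case (the checkpoint name is just one publicly available example):

```python
# Token classification with a bidirectional (BERT-style) backbone: each token's
# label is predicted using context from both directions.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")
print(ner("Angela Merkel visited the Hugging Face office in Paris."))
```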
They are there, you just have to look. Tasksource, NuNER, Flan, T0. There’s not a lot, but still at least a few good zero shot models in both architectures.
It's because you need to mess with embeddings or even train new heads on top of a network to use it. LLMs are just tokens-in, tokens-out; they don't classify with a softmax over classes, they softmax over vocabulary tokens. LLMs are more convenient.
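Concretely, "a new head on top" is just a small classification layer over the encoder's output, softmaxed over your classes rather than the vocabulary; a sketch:

```python
# A fresh (randomly initialized) classification head on top of BERT.
# It only becomes useful after fine-tuning on labelled data.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,  # size of the new head = number of your classes
)
```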
What happened is that "transformers go whrrrrrr." (yes, that's the academic term)
In the end, LLMs using causal language modeling (CLM) or masked language modeling (MLM) learn to best solve their objectives by creating an efficient global model of language patterns. CLM is actually a harder problem to solve, since MLM can leak information through the surrounding context, and with the transformer scaling-law research post-BERT/GPT it's not a surprise CLM won out in the long run.
I believe many high-quality embedding models are still based on BERT, even recent ones, so I don't think it's entirely fair to characterize it as "deprecated".
Feels like large language models sucked all the air out of the room because it was a lot easier to scale compute and data, and after RoBERTa, no one was willing to continue exploring.
No, there are mathematical reasons LLMs are better. They are trained with a multi-objective loss (coding skills, translation skills, etc.), so they understand the world much better than MLM-only models. The original post discusses that, but with more words and points than necessary.
> It is also worth to note that, generally speaking, an Encoder-Decoders of 2N parameters has the same compute cost as a decoder-only model of N parameters which gives it a different FLOP to parameter count ratio.
Can someone explain this to me? I'm not sure how the compute costs are the same between the 2N and N nets.
You can break your sequence into two parts. One part goes through the encoder and the other goes through the decoder, so each token only goes through one transformer stack.
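A rough back-of-envelope, ignoring the attention and cross-attention terms:

```python
# A dense forward pass costs roughly 2 * params FLOPs per token that passes through it.
def flops(params, tokens):
    return 2 * params * tokens

N = 1e9                    # decoder-only parameter count
n_in, n_out = 1000, 1000   # input and output tokens

# Decoder-only (N params): every token goes through the full stack.
dec_only = flops(N, n_in + n_out)

# Encoder-decoder (2N params, N per side): input tokens go through the encoder,
# output tokens through the decoder, so each token still touches only ~N params.
enc_dec = flops(N, n_in) + flops(N, n_out)

print(dec_only == enc_dec)  # True, to first order
```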
I think it's because most of the compute comes from the decoding, since you're doing it autoregressively, while with the encoder you just feed it through once and get the embedding. So really all it's saying is that the decoder, with N parameters, is the compute bottleneck; hence an encoder-decoder with N+N has a similar order of compute cost as a decoder with N.
I mean, the scaling already happened in 2019 with RoBERTa, my guess is that these models are already good enough at what they need to do (creating meaningful text embeddings), and making them extremely large wasn't feasible for deployment.
For text classification/clustering/retrieval I am pretty happy with BERT-family models. It's only in the last few months that I've seen better models come out that are practical (i.e. you don't have to sell all your children to OpenAI to afford them).
For my recommender/object sorter I have not been in a hurry to upgrade because I have other things to think about. This table should give you some idea of the time-space-accuracy trade-offs.
In a lot of cases you will see two models with a huge difference in size but a tiny difference in accuracy. I could fit either the big or small Stella on my 4080.
* without already labelled training data (assuming you're referring to causal LLMs).
If you have labelled training data (or semi-labelled), BERT takes the cake, both in terms of accuracy and efficiency. In fact, you can have luck with getting a CLM to generate noisy labels and then training BERT/RoBERTa on that to get a robust strong classifier.
Between my experience and arXiv papers I've read I'd say this:
Personally I am willing to label 1000-2000 documents to create a training set. It's reasonable to make about 2000 simple judgements in an 8-hour "day" so it is something you could do at work or in your spare time in a few days if you want.
You can compute an embedding and then use classical ML algorithms from scikit-learn such as the support vector machine. My recommender can train a new model in about 3 minutes and that includes building about 20 models and testing them against each other to produce a model which is well tested and probability calibrated. This process is completely industrial, can run unattended, and always makes a good model if it gets good inputs. Running it every day is no sweat.
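A minimal sketch of that pipeline (the model name and toy data are placeholders; the pattern is the point):

```python
# Embed documents with a sentence encoder, then train a calibrated classical classifier.
from sentence_transformers import SentenceTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

texts = [
    "deep dive on BERT fine-tuning", "thoughtful database internals post",
    "rust compiler error messages explained", "10 gadgets you must buy",
    "celebrity gossip roundup", "you won't believe this one trick",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = liked, 0 = not liked

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

# Calibration yields usable probabilities rather than just hard labels.
clf = CalibratedClassifierCV(LinearSVC(), cv=3)
clf.fit(X, labels)
print(clf.predict_proba(encoder.encode(["a long post about Postgres query planning"])))
```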
You can also "fine-tune" a model, actually changing the parameters of the deep model. I've fine-tuned BERT-family models for my recommender; it takes at least 30 minutes and the training process is not completely reliable. A reliable model builder would probably do that 20 or so times with different parameters (learning rate, how long to train the model, etc.) and pick out the best model. As it is, the best model from it is about as good as my simpler models, and a bad one is worse. I can picture industrializing it but I'm not sure it's worth it. In a lot of papers people just seem to copy a recipe from another paper and don't seem to do any model selection.
My problem is fuzzy: the answer to "Do I like this article?" could vary from day to day. If I had a more precise problem "fine tuning" might pull ahead. Some people do get a significant improvement which would make it worth it, particularly if you don't expect to retrain frequently.
I see papers where somebody does the embedding transformation but instead of pooling over the tokens (averaging the vectors) they input the vectors into an LSTM, GRU or something like that and train the recurrent model. This kind of model does great when word order matters as in sentiment analysis. I found that kind of model was easy to train in a repeatable way a decade ago so that's an option I'd consider.
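Something like this, as a rough sketch (dimensions are illustrative; the per-token vectors would come from a frozen BERT-family encoder):

```python
# A small GRU head over per-token embeddings, instead of mean-pooling them.
import torch
import torch.nn as nn

class GRUHead(nn.Module):
    def __init__(self, hidden=768, classes=2):
        super().__init__()
        self.gru = nn.GRU(input_size=hidden, hidden_size=128, batch_first=True)
        self.out = nn.Linear(128, classes)

    def forward(self, token_vectors):        # (batch, seq_len, hidden) from the encoder
        _, h_n = self.gru(token_vectors)     # final hidden state: (1, batch, 128)
        return self.out(h_n.squeeze(0))      # (batch, classes)

head = GRUHead()
print(head(torch.randn(4, 512, 768)).shape)  # torch.Size([4, 2])
```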
You're better off trying to figure out what features you like about articles that are less ambiguous as a signal. You would then be able to fine-tune models to classify whether those features are present, whether it's classification of chunks/sentences/tokens. For these, a BERT model could be fine-tuned to detect it efficiently.
nit: I find the writing in this post very distracting. (Grammar and style pet peeves)
Luckily, it is now trivial to drop the post into Claude and say "Re-write this without <list of things that bother me>"
So, just in case you also felt like you were driving over a road filled with potholes trying to read this post, don't just click away, have your handy LLM take a pass at it. There's good stuff to be found.