I help teams run transformers in their production systems on CPU, using my product based on ONNX Runtime.
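For anyone who wants to see what the plain ONNX Runtime path looks like, here's a rough sketch (the model name, opset, and example input are illustrative placeholders, not what my product does under the hood):

```python
# Sketch: export a Hugging Face BERT-style classifier to ONNX and run it on CPU
# with ONNX Runtime. Model name and example input are placeholders.
import torch
import onnxruntime as ort
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
# torchscript=True makes the model return plain tuples, which export cleanly.
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True).eval()

inputs = tokenizer("The latency on CPU is fine.", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    # Dynamic axes so the session accepts any batch size / sequence length.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)

# Run inference on the CPU execution provider.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(
    ["logits"],
    {
        "input_ids": inputs["input_ids"].numpy(),
        "attention_mask": inputs["attention_mask"].numpy(),
    },
)[0]
print(logits)
```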
This is a great article, but if you’re using something based on BERT or RoBERTa, you don’t need to do much. Distillation is usually the only step you need to take if you’re really picky, or if your scale is millions of requests per day and you’re not making enough money to support the infrastructure.
I have had mixed results with quantization and sparsification, but IMO they're just not worth it, as they can be unstable.
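For reference, this is the kind of post-training dynamic quantization I'm talking about (a sketch; the file names are placeholders, and you have to re-check accuracy afterwards, which is where the instability tends to bite):

```python
# Sketch: post-training dynamic INT8 quantization of an exported ONNX model.
# Weights are stored as int8; activations are quantized on the fly at runtime.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",        # placeholder path to the FP32 model
    model_output="model.int8.onnx",  # placeholder path for the quantized model
    weight_type=QuantType.QInt8,
)
```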
Using blocks makes it possible to keep good performance on GPUs while giving some flexibility in the pruning pattern. And once entirely empty rows and columns are removed, the pruned matrices are actually pretty dense, so the approach is competitive with structured pruning for speedup, but less "aggressive" on the network during the pruning process.
Disclaimer: I am the main co-author.
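If it helps, here's a toy sketch of the row/column-removal idea on a single weight matrix. It uses magnitude-based block scores purely for illustration; the actual method learns which blocks to prune during fine-tuning.

```python
# Toy illustration of block pruning: score 32x32 blocks of a weight matrix,
# zero out the lowest-scoring ones, then drop rows/columns that ended up
# entirely empty so what remains is a smaller, dense matrix.
import numpy as np

def block_prune(weight, block=32, keep_ratio=0.5):
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    # L2 norm of each block as its importance score.
    blocks = weight.reshape(rows // block, block, cols // block, block)
    scores = np.linalg.norm(blocks, axis=(1, 3))
    threshold = np.quantile(scores, 1.0 - keep_ratio)
    mask = (scores >= threshold).astype(weight.dtype)
    # Expand the block mask to element resolution and apply it.
    full_mask = np.kron(mask, np.ones((block, block), dtype=weight.dtype))
    pruned = weight * full_mask
    # Strip rows/columns that are entirely zero: the survivors form a dense matrix.
    keep_rows = ~np.all(pruned == 0, axis=1)
    keep_cols = ~np.all(pruned == 0, axis=0)
    return pruned[np.ix_(keep_rows, keep_cols)]

w = np.random.randn(768, 3072).astype(np.float32)
w_pruned = block_prune(w, block=32, keep_ratio=0.25)
print(w.shape, "->", w_pruned.shape)
# On random weights few rows/columns empty out completely; on actually pruned
# checkpoints whole heads and FFN dimensions do, which is what yields the
# dense, smaller matrices mentioned above.
```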
Hey all, I used most of the tactics here to optimise Transformers so y'all don't have to :-)
try https://text-generator.io
It also works with code and in multiple languages, and it's OpenAI compatible, so switching over is a one-line change. There are a few other things it does too, like linked image understanding.
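The "one line change" is roughly this, sketched with the standard OpenAI Python client (the base URL, key, and model name below are placeholders; check the docs for the real endpoint):

```python
# Sketch: point an OpenAI-compatible client at a different backend.
# base_url, api_key, and model are placeholders, not the real values.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_TEXT_GENERATOR_KEY",            # placeholder
    base_url="https://text-generator.io/api/v1",  # placeholder endpoint
)
response = client.chat.completions.create(
    model="your-model-name",  # placeholder
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```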
Are any of the companies with the largest, most capable models doing these things? Maybe OpenAI has used some of them for GPT-4.
But maybe there is also another company using a very large dataset and some optimizations. I would love to have an alternative so that I'm not 100% reliant on OpenAI.
They're trained and focused on language data, actually, not code specifically. There are both generation models and multilingual text embedding models (100+ languages, single model).
The author of this paper runs the inference team at OpenAI. So I would imagine that this is a record of her experiments without necessarily revealing which ones they're actually using in production.
Sure, admittedly I was being overly hyperbolic and a bit snarky.
However, I am genuinely curious: what sort of industrial "real-world task" is there that requires edge inference on GPT-3.5 or PaLM-sized models, where you would run into this problem without having the infrastructure, and therefore need these potentially unstable tricks?
The point I was alluding to is that LLMs of this size are overkill for most commercial use cases (e.g. NER, document classification, semantic search, chatbots).
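To make that concrete, most of those tasks are served perfectly well by a small fine-tuned encoder on CPU; a minimal sketch (the checkpoint is just a stock example, not a recommendation):

```python
# Sketch: a typical commercial task (document/sentiment classification) served
# by a small fine-tuned encoder on CPU -- no GPT-3.5/PaLM-sized model required.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1,  # CPU
)
print(classifier("The discharge summary is missing the medication list."))
```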
Maybe I'm missing the point of the article then. What's the low-resource scenario where inference speed is the bottleneck for transformer adoption at scale?
New AI tasks are being unlocked by (large-scale) foundation models (Liang, 2022).
Fine-tuning in low-resource (few-shot) scenarios is now possible for many new applications.
However, these new AI applications relied on a huge pretrained model to get there, because the old approach of training from scratch on 100 labeled examples didn't work well.
Thus, we want to distill the knowledge so that the model can be deployed in low-resource scenarios.
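A minimal sketch of what that distillation step looks like: a temperature-scaled KL term against the teacher's soft predictions blended with the usual label loss (alpha and T here are illustrative hyperparameters):

```python
# Minimal knowledge-distillation loss: match the teacher's softened predictions
# while still fitting the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients, as in Hinton et al.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Usage inside a training loop (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(**batch).logits
# loss = distillation_loss(student(**batch).logits, teacher_logits, batch["labels"])
```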
[edit: I see your comment below about transformer cost. Agreed. This is one of the many concerns around foundation models that must be understood. The happy path is that training the foundation model is a one-time cost that pays dividends across the many tasks it unlocks. However, you are correct that the research to get there is quite spendy. I encourage you to skim this paper. It's long but very accessible: https://arxiv.org/pdf/2108.07258.pdf
ps the reason commercial use cases all use simple models is not that simple models are ipso facto more commercially valuable. It's just that industry practitioners are too overworked to do fancy bleeding-edge stuff. Thus, the dirty secret is that most fancy ML companies are just using logistic regression for everything. Foundation models allow industry practitioners to rapidly train powerful, accurate models. The question now is how to deploy them.]
Perhaps I wasn't clear. I fully believe in transformers and foundation models, my criticism is more on creeping model size and whether trying to use huge models is even the right approach for someone seeking to deploy a transformer at scale.
Conveniently, I'm decently familiar with Dr. Liang's work, as he has done some really great stuff in the biomedical domain recently. Using his publications as an example and considering my domain (medical), isn't he showing that smaller models with different architectures (such as DRAGON), or models pretrained on in-domain text (such as PubMedGPT), are effective ways to get increased performance, rather than simply scaling a more general LLM to unusable sizes?
My experience thus far has been that fine-tuning BERT-large-sized models works really well; the resulting models can be deployed on hospital infrastructure for inference (granted, the workstations in the radiology department all have decent GPUs) and don't need much in terms of inference optimization.
Perhaps I'm missing something here, appreciate your input.
I appreciate your openness here. Based upon my background, I'll do a little bit of handwaving, so we can read the tea leaves and see where the puck is going, while not overly mixing metaphors.
Smaller models are a stop-gap solution because they are task-specific and can incorporate expert knowledge. The thrust of ML research over the past decade has been consolidation of effort and huge-scale training to replace expert knowledge (or using expert knowledge as micro-tasks to condition the huge-scale training). I bet a dollar to a dime that in several years these smaller models will be replaced by foundation models that are fine-tuned and possibly distilled, as the field does the following:
* Build foundation models.
* Discover weaknesses and blind-spots.
* Patch them using either more data or micro-tasks.
Fair enough. I guess I'm biased by my working environment and current belief that we're scaling transformer models unnecessarily, but I guess that is also partly influenced by their cost.