* BERT used GPT's architecture but trained it in a different way. Instead of training a left-to-right language model, they trained the model to predict masked-out "holes" in a text and to predict whether two sentences follow one another (the masking step is sketched below). (https://arxiv.org/abs/1810.04805)
* The paper in the top post found that if we fine-tune a single BERT-style model on several tasks at the same time, we get an improvement on each of the fine-tuned tasks.
The BERT paper also introduced BERT Base, which has 12 layers and approximately the same number of parameters as GPT, but still outperforms GPT on GLUE.
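For intuition, here's a minimal sketch of the masked-LM corruption the BERT paper describes: 15% of positions become prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The function name and toy vocabulary are mine, not from the paper:

    import random

    def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
        # Pick ~15% of positions as prediction targets (Devlin et al., 2018).
        # Of those: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
        targets = {}  # position -> original token the model must predict
        for i, tok in enumerate(tokens):
            if random.random() < mask_prob:
                targets[i] = tok
                r = random.random()
                if r < 0.8:
                    tokens[i] = mask_token
                elif r < 0.9:
                    tokens[i] = random.choice(vocab)
        return tokens, targets

    corrupted, targets = mask_tokens(
        ["the", "cat", "sat", "on", "the", "mat"],
        vocab=["the", "cat", "sat", "on", "mat", "dog"])

The model is then trained to recover `targets` from `corrupted`; the next-sentence objective is just a binary classifier over sentence pairs.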
I’m interested in this space and the technical aspects. However, can someone enlighten me as to the specific problems these models solve? Any real-world implementations? E.g. we had this problem, we used this tool, and these were the outcomes?
Document ranking, question answering, caption creation, keyword generation. Basically, these models are super useful to search engines or any kind of engine that performs some kind of reasoning over text.
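To make the ranking case concrete: a common recipe is a BERT cross-encoder that scores each (query, passage) pair and sorts by score. Here's a sketch using the Hugging Face transformers API; the generic bert-base-uncased checkpoint is a placeholder whose classification head is untrained, so in practice you'd fine-tune it on relevance labels first:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    model.eval()

    query = "what is multi-task learning"
    passages = [
        "Multi-task learning trains one model on several tasks at once.",
        "The 2018 World Cup was held in Russia.",
    ]

    # Encode each (query, passage) pair jointly and score it with the
    # classifier's "relevant" logit, then sort passages by that score.
    with torch.no_grad():
        enc = tokenizer([query] * len(passages), passages,
                        padding=True, truncation=True, return_tensors="pt")
        scores = model(**enc).logits[:, 1]

    ranked = sorted(zip(scores.tolist(), passages), reverse=True)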
"Microsoft will release the code and pre-trained models.", though there is no pointer to where the release will happen. Training gargantuan language models is getting quite expensive, so releasing code + pre-trained models is significant.
The architecture is a derivative of the PyTorch BERT implementation [0], with a multi-task learning (MTL) loss function on top.
It's a relatively small modification of BERT, with multi-task fine-tuning and slightly different output heads. It should be easy for any NLP researcher to replicate.
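A minimal sketch of that shape: one shared encoder, one linear head per task, and a fine-tuning loop that samples a task per batch. All names are illustrative rather than MT-DNN's actual code, and the encoder is assumed to return a (batch, seq, hidden) tensor:

    import torch
    import torch.nn as nn

    class MultiTaskModel(nn.Module):
        def __init__(self, encoder, hidden_size, task_num_labels):
            super().__init__()
            self.encoder = encoder  # e.g. a pre-trained BERT encoder
            self.heads = nn.ModuleDict({
                task: nn.Linear(hidden_size, n)
                for task, n in task_num_labels.items()})

        def forward(self, task, input_ids):
            # Pool with the first-token vector, as BERT does with [CLS].
            hidden = self.encoder(input_ids)          # (batch, seq, hidden)
            return self.heads[task](hidden[:, 0, :])  # (batch, num_labels)

    # Fine-tuning loop sketch: sample a task, take one of its batches,
    # and backprop the task loss through the shared encoder:
    #   task = random.choice(list(loaders))
    #   input_ids, labels = next(loaders[task])
    #   loss = nn.functional.cross_entropy(model(task, input_ids), labels)
    #   loss.backward(); optimizer.step(); optimizer.zero_grad()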
TPUs are also slow for this; they used a pod with 64 TPU chips to train BERT.
You can probably achieve a similar result using distributed training across multiple GPU machines.
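Something like PyTorch's DistributedDataParallel, with one process per GPU. A bare skeleton (launcher details omitted; the model and batches below are stand-ins, not BERT):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")  # one process per GPU, e.g. via torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(768, 2).cuda()  # stand-in for a BERT-sized model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=5e-5)

    for _ in range(10):
        x = torch.randn(32, 768).cuda()        # stand-in batch
        y = torch.randint(0, 2, (32,)).cuda()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()  # DDP all-reduces gradients across workers here
        opt.step(); opt.zero_grad()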
>The training procedure of MT-DNN consists of two stages: pretraining and multi-task fine-tuning. The pretraining stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction.
and this:
>Our implementation of MT-DNN is based on the PyTorch implementation of BERT. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Following (Liu et al., 2018a), we set the number of steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm within 1. All the texts were tokenized using wordpieces, and were chopped to spans no longer than 512 tokens.
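Those hyperparameters map onto standard PyTorch pieces. A sketch (the total step count and the tiny model are placeholders; everything else follows the quoted values):

    import torch

    model = torch.nn.Linear(768, 2)  # placeholder for the actual model

    optimizer = torch.optim.Adamax(model.parameters(), lr=5e-5)

    # Linear warm-up over the first 10% of steps, then linear decay to zero.
    total_steps = 10_000  # placeholder: epochs * batches_per_epoch
    warmup_steps = int(0.1 * total_steps)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # Per training step, after loss.backward():
    #   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    #   optimizer.step(); scheduler.step(); optimizer.zero_grad()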
>We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
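The arithmetic behind those numbers (the paper rounds 131,072 down to 128,000, and treats tokens and words interchangeably for the epoch estimate):

    tokens_per_batch = 256 * 512                 # 131,072 (~128k as quoted)
    total_tokens = tokens_per_batch * 1_000_000  # ~1.3e11 tokens seen in training
    epochs = total_tokens / 3.3e9                # passes over the 3.3B-word corpus
    print(tokens_per_batch, round(epochs, 1))    # 131072 39.7, i.e. ~40 epochs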