* BERT used GPT's architecture but trained it in a different way. Instead of training a left-to-right language model, they trained the model to predict masked-out "holes" in a text and to predict whether two sentences follow one another (the masking step is sketched below). (https://arxiv.org/abs/1810.04805)
* The paper in the top post found that if we fine-tune a single BERT-style model on several tasks at the same time, we get an improvement on each of the fine-tuned tasks.
The BERT paper also introduced BERT Base, which has 12 layers and approximately the same number of parameters as GPT, but still outperforms GPT on GLUE.
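For intuition, here's a minimal sketch of the masked-LM corruption the BERT paper describes: 15% of positions become prediction targets, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The function name and toy vocabulary are mine, not from the paper:

    import random

    def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
        # Pick ~15% of positions as prediction targets (Devlin et al., 2018).
        # Of those: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
        targets = {}  # position -> original token the model must predict
        for i, tok in enumerate(tokens):
            if random.random() < mask_prob:
                targets[i] = tok
                r = random.random()
                if r < 0.8:
                    tokens[i] = mask_token
                elif r < 0.9:
                    tokens[i] = random.choice(vocab)
        return tokens, targets

    corrupted, targets = mask_tokens(
        ["the", "cat", "sat", "on", "the", "mat"],
        vocab=["the", "cat", "sat", "on", "mat", "dog"])

The model is then trained to recover `targets` from `corrupted`; the next-sentence objective is just a binary classifier over sentence pairs.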
I’m interested in this space and the technical aspects. However, can someone enlighten me as to the specific problems these models solve? Any real-world implementations? E.g. we had this problem, we used this tool, and these were the outcomes?
Document ranking, question answering, caption creation, keyword generation. Basically, these models are super useful to search engines or any kind of engine that performs some kind of reasoning over text.
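To make the ranking case concrete: a common recipe is a BERT cross-encoder that scores each (query, passage) pair and sorts by score. Here's a sketch using the Hugging Face transformers API; the generic bert-base-uncased checkpoint is a placeholder whose classification head is untrained, so in practice you'd fine-tune it on relevance labels first:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
    model.eval()

    query = "what is multi-task learning"
    passages = [
        "Multi-task learning trains one model on several tasks at once.",
        "The 2018 World Cup was held in Russia.",
    ]

    # Encode each (query, passage) pair jointly and score it with the
    # classifier's "relevant" logit, then sort passages by that score.
    with torch.no_grad():
        enc = tokenizer([query] * len(passages), passages,
                        padding=True, truncation=True, return_tensors="pt")
        scores = model(**enc).logits[:, 1]

    ranked = sorted(zip(scores.tolist(), passages), reverse=True)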
"Microsoft will release the code and pre-trained models.", though there is no pointer to where the release will happen. Training gargantuan language models is getting quite expensive, so releasing code + pre-trained models is significant.
The architecture is a derivative of the PyTorch BERT implementation [0], with a multi-task learning (MTL) loss function on top.
It's a relatively small modification of BERT, with multi-task fine-tuning and slightly different output heads. It should be easy for any NLP researcher to replicate.
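A minimal sketch of that shape: one shared encoder, one linear head per task, and a fine-tuning loop that samples a task per batch. All names are illustrative rather than MT-DNN's actual code, and the encoder is assumed to return a (batch, seq, hidden) tensor:

    import torch
    import torch.nn as nn

    class MultiTaskModel(nn.Module):
        def __init__(self, encoder, hidden_size, task_num_labels):
            super().__init__()
            self.encoder = encoder  # e.g. a pre-trained BERT encoder
            self.heads = nn.ModuleDict({
                task: nn.Linear(hidden_size, n)
                for task, n in task_num_labels.items()})

        def forward(self, task, input_ids):
            # Pool with the first-token vector, as BERT does with [CLS].
            hidden = self.encoder(input_ids)          # (batch, seq, hidden)
            return self.heads[task](hidden[:, 0, :])  # (batch, num_labels)

    # Fine-tuning loop sketch: sample a task, take one of its batches,
    # and backprop the task loss through the shared encoder:
    #   task = random.choice(list(loaders))
    #   input_ids, labels = next(loaders[task])
    #   loss = nn.functional.cross_entropy(model(task, input_ids), labels)
    #   loss.backward(); optimizer.step(); optimizer.zero_grad()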
TPUs are also slow for this; they used a pod with 64 TPU chips to train BERT.
You can probably achieve a similar result using distributed training across multiple GPU machines.
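Something like PyTorch's DistributedDataParallel, with one process per GPU. A bare skeleton (launcher details omitted; the model and batches below are stand-ins, not BERT):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")  # one process per GPU, e.g. via torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(768, 2).cuda()  # stand-in for a BERT-sized model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=5e-5)

    for _ in range(10):
        x = torch.randn(32, 768).cuda()        # stand-in batch
        y = torch.randint(0, 2, (32,)).cuda()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()  # DDP all-reduces gradients across workers here
        opt.step(); opt.zero_grad()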
>The training procedure of MT-DNN consists of two stages: pretraining and multi-task fine-tuning. The pretraining stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction.
and this:
>Our implementation of MT-DNN is based on the PyTorch implementation of BERT. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Following (Liu et al., 2018a), we set the number of steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm within 1. All the texts were tokenized using wordpieces, and were chopped to spans no longer than 512 tokens.
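Those hyperparameters map onto standard PyTorch pieces. A sketch (the total step count and the tiny model are placeholders; everything else follows the quoted values):

    import torch

    model = torch.nn.Linear(768, 2)  # placeholder for the actual model

    optimizer = torch.optim.Adamax(model.parameters(), lr=5e-5)

    # Linear warm-up over the first 10% of steps, then linear decay to zero.
    total_steps = 10_000  # placeholder: epochs * batches_per_epoch
    warmup_steps = int(0.1 * total_steps)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # Per training step, after loss.backward():
    #   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    #   optimizer.step(); scheduler.step(); optimizer.zero_grad()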
>We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
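The arithmetic behind those numbers (the paper rounds 131,072 down to 128,000, and treats tokens and words interchangeably for the epoch estimate):

    tokens_per_batch = 256 * 512                 # 131,072 (~128k as quoted)
    total_tokens = tokens_per_batch * 1_000_000  # ~1.3e11 tokens seen in training
    epochs = total_tokens / 3.3e9                # passes over the 3.3B-word corpus
    print(tokens_per_batch, round(epochs, 1))    # 131072 39.7, i.e. ~40 epochs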