
You can't even fit BERT_large in memory on a 12 or 16 GB GPU, and on a single ~15 TFLOPS GPU training might take a year. GPUs are too slow :-(



TPUs are also slow; they used a pod of 64 TPUs to train BERT. You could probably achieve a similar result with distributed training across multiple GPU machines.
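A toy sketch of the data-parallel idea behind such multi-GPU training, assuming nothing beyond plain PyTorch: each "worker" computes gradients on its own shard of the batch, and the gradients are averaged before the optimizer step. Real setups would use `torch.nn.parallel.DistributedDataParallel`; the replicas here are illustrative.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
workers = [copy.deepcopy(model) for _ in range(4)]  # one replica per "GPU"
data = torch.randn(4, 16, 8)                        # one shard per worker

# Each replica does its own forward/backward on its shard.
for w, shard in zip(workers, data):
    w(shard).pow(2).mean().backward()

# "All-reduce": average the replica gradients onto the shared model.
for p, *replica_ps in zip(model.parameters(),
                          *(w.parameters() for w in workers)):
    p.grad = torch.stack([rp.grad for rp in replica_ps]).mean(dim=0)

print(model.weight.grad.shape)  # torch.Size([1, 8])
```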


You can, but it will be really slow. You can load just part of the model at a time and keep the rest on disk or in host memory :-)
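A minimal sketch of that layer-by-layer offloading, assuming a PyTorch model: keep all layers in host (CPU) memory and move only the active layer to the GPU for its forward pass, then evict it again. The function name is illustrative, not from any library; the transfers are exactly the penalty being discussed.

```python
import torch
import torch.nn as nn

def offloaded_forward(layers, x, device):
    """Run `x` through `layers`, holding only one layer on `device` at a time."""
    for layer in layers:
        layer.to(device)       # page this layer's weights onto the GPU
        x = layer(x.to(device))
        layer.to("cpu")        # evict it to free GPU memory for the next one
    return x

# Toy example; "cpu" stands in for a GPU device here.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
out = offloaded_forward(layers, torch.randn(2, 8), torch.device("cpu"))
print(out.shape)  # torch.Size([2, 8])
```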


Technically correct ;-)


It actually isn't that bad. TensorFlow and PyTorch both have support for it, but the performance penalty will be quite large.
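One concrete form of that support is PyTorch's gradient checkpointing (`torch.utils.checkpoint`), which trades compute for memory by discarding intermediate activations and recomputing them during the backward pass; a hedged sketch, noting this saves activation memory rather than offloading weights:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

x = torch.randn(4, 16, requires_grad=True)
# Activations inside `block` are not stored; they are recomputed on backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([4, 16])
```

The recomputation is the penalty: each checkpointed segment runs its forward pass twice.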



