
You can't even fit BERT_large in memory on a 12 or 16 GB GPU, and on a single ~15 TFLOPS GPU training might take a year. GPUs are too slow :-(



TPUs are also slow; they used a pod of 64 TPUs to train BERT. You could probably achieve a similar result with distributed training across multiple GPU machines.
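A toy sketch of the data-parallel idea behind such multi-GPU training, assuming nothing beyond plain PyTorch: each "worker" computes gradients on its own shard of the batch, and the gradients are averaged before the optimizer step. Real setups would use `torch.nn.parallel.DistributedDataParallel`; the replicas here are illustrative.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
workers = [copy.deepcopy(model) for _ in range(4)]  # one replica per "GPU"
data = torch.randn(4, 16, 8)                        # one shard per worker

# Each replica does its own forward/backward on its shard.
for w, shard in zip(workers, data):
    w(shard).pow(2).mean().backward()

# "All-reduce": average the replica gradients onto the shared model.
for p, *replica_ps in zip(model.parameters(),
                          *(w.parameters() for w in workers)):
    p.grad = torch.stack([rp.grad for rp in replica_ps]).mean(dim=0)

print(model.weight.grad.shape)  # torch.Size([1, 8])
```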


You can, but it will be really slow. You can load just part of the model at a time and keep the rest on disk or in host memory :-)
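A minimal sketch of that layer-by-layer offloading, assuming a PyTorch model: keep all layers in host (CPU) memory and move only the active layer to the GPU for its forward pass, then evict it again. The function name is illustrative, not from any library; the transfers are exactly the penalty being discussed.

```python
import torch
import torch.nn as nn

def offloaded_forward(layers, x, device):
    """Run `x` through `layers`, holding only one layer on `device` at a time."""
    for layer in layers:
        layer.to(device)       # page this layer's weights onto the GPU
        x = layer(x.to(device))
        layer.to("cpu")        # evict it to free GPU memory for the next one
    return x

# Toy example; "cpu" stands in for a GPU device here.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
out = offloaded_forward(layers, torch.randn(2, 8), torch.device("cpu"))
print(out.shape)  # torch.Size([2, 8])
```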


Technically correct ;-)


It actually isn't that bad. TensorFlow and PyTorch both have support for it, but the performance penalty will be quite large.
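One concrete form of that support is PyTorch's gradient checkpointing (`torch.utils.checkpoint`), which trades compute for memory by discarding intermediate activations and recomputing them during the backward pass; a hedged sketch, noting this saves activation memory rather than offloading weights:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

x = torch.randn(4, 16, requires_grad=True)
# Activations inside `block` are not stored; they are recomputed on backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([4, 16])
```

The recomputation is the penalty: each checkpointed segment runs its forward pass twice.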



