Hacker News
bitL on Feb 16, 2019 | on: Microsoft’s New MT-DNN Outperforms Google BERT
You can't even train BERT_large on a 12/16 GB GPU, and on a single 15 TFLOPS GPU it might take a year to train. GPUs are too slow :-(
riku_iki on Feb 16, 2019
A TPU is also slow; they used a pod with 64 TPUs to train BERT. You can probably achieve a similar result using distributed training on multiple GPU machines.
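The year-long estimate above is roughly consistent with a back-of-envelope calculation using the 64-chip pod mentioned here. The training duration (~4 days, as reported for BERT-large pre-training) and the per-chip TPU throughput are assumptions, not figures from this thread:

```python
# Back-of-envelope estimate (all constants are assumptions, not
# measurements): BERT-large reportedly pre-trained in ~4 days on a
# 64-chip TPU v2 pod; a TPU v2 chip peaks around 45 TFLOPS, vs the
# ~15 TFLOPS GPU mentioned above.
tpu_chips = 64
tpu_days = 4
tflops_per_tpu_chip = 45   # assumed peak, TPU v2
tflops_per_gpu = 15        # from the comment above

chip_days = tpu_chips * tpu_days                    # 256 TPU-chip-days
gpu_days = chip_days * tflops_per_tpu_chip / tflops_per_gpu
print(f"~{gpu_days:.0f} GPU-days, about {gpu_days / 365:.1f} years")
# → ~768 GPU-days, about 2.1 years
```

Assuming similar utilization on both device types, that lands in the "about a year or more" ballpark.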
solomatov on Feb 17, 2019
You can, but it will be really slow. You can load just parts of the model at a time and store the rest on disk / in memory :-)
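A minimal sketch of that idea: keep each layer's weights on disk and stream them in one at a time during the forward pass, so only a single layer is ever resident in (simulated device) memory. Plain Python with made-up layer shapes; `OffloadedModel` and everything in it is hypothetical, not a real framework API:

```python
# Hypothetical sketch of layer-by-layer offloading: each layer's
# weights live on disk and are loaded only while that layer runs.
import os
import pickle
import tempfile

def matvec(W, x):
    # plain-Python matrix-vector product (stand-in for a GPU matmul)
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class OffloadedModel:
    def __init__(self, layers, directory):
        self.paths = []
        for i, W in enumerate(layers):
            path = os.path.join(directory, f"layer{i}.pkl")
            with open(path, "wb") as f:
                pickle.dump(W, f)          # weights stay on disk
            self.paths.append(path)

    def forward(self, x):
        for path in self.paths:
            with open(path, "rb") as f:
                W = pickle.load(f)         # stream one layer in...
            x = matvec(W, x)               # ...use it...
            del W                          # ...and evict it again
        return x

with tempfile.TemporaryDirectory() as d:
    # two tiny 2x2 layers: identity, then scaling by 2
    model = OffloadedModel([[[1, 0], [0, 1]], [[2, 0], [0, 2]]], d)
    result = model.forward([3, 4])
print(result)  # → [6, 8]
```

Only one layer's weights are in memory at any moment, at the cost of one disk read per layer per forward pass.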
bitL on Feb 17, 2019
Technically correct ;-)
solomatov on Feb 17, 2019
It actually isn't that bad. TensorFlow and PyTorch both have support for it, but the penalty will be quite large.
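The size of that penalty is easy to ballpark: every training step has to shuttle the offloaded tensors over the PCIe bus. The parameter count, fp32 precision, and effective bandwidth below are all assumed round figures, not measurements:

```python
# Rough transfer cost of swapping weights over PCIe each step (all
# constants are assumptions): BERT-large has ~340M parameters,
# fp32 = 4 bytes each, effective PCIe 3.0 x16 bandwidth ~12 GB/s.
params = 340e6
bytes_per_param = 4
pcie_bytes_per_sec = 12e9

one_way = params * bytes_per_param / pcie_bytes_per_sec  # params in
overhead = 2 * one_way                                   # plus grads out
print(f"~{overhead:.2f} s of transfer per step, before any compute")
# → ~0.23 s of transfer per step, before any compute
```

Roughly a quarter of a second of pure data movement per step, on top of compute, and before optimizer state is accounted for — hence the large penalty.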