How to train large models on many GPUs? (lilianweng.github.io)
108 points by picture on Sept 27, 2021 | 9 comments



DeepSpeed [1] is an amazing tool for enabling different kinds of parallelism and optimization on your model. I would definitely not recommend reimplementing everything yourself; see the sketch below the links for roughly what the setup looks like.

Probably FairScale [2] as well, though I've never tried it myself.

[1]: https://github.com/microsoft/DeepSpeed

[2]: https://github.com/facebookresearch/fairscale
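
To make that concrete, here's a minimal sketch (my own, not from the article or the thread) of handing a PyTorch model to DeepSpeed with ZeRO stage 2 sharding; the toy model and the config values are placeholder assumptions:

    import torch
    import deepspeed

    # Stand-in for a real model; anything that's an nn.Module works.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    )

    ds_config = {
        "train_batch_size": 32,
        "fp16": {"enabled": True},
        # ZeRO stage 2: shard optimizer state and gradients across ranks.
        "zero_optimization": {"stage": 2},
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    }

    # Returns a wrapped engine that owns the distributed details.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

In the training loop you then call model_engine(batch) for the forward pass, model_engine.backward(loss), and model_engine.step(); you launch the script across GPUs with the deepspeed CLI.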


Any suggestions on what GPU to use to train large models?


Really depends on what you mean by large. If you mean truly large, you will need a cluster to train it in any reasonable amount of time; you'd probably want to look at servers built on the HGX platform (8x A100 per server). We use servers leased in bulk from traditional server providers (think Dell, HP, etc.). If you mean more like "as large as personally affordable", then you'd probably want to look at something like the RTX 3090: if you can get lucky and find it at MSRP, it has 24 GB of memory. Nvidia also has workstation cards with up to 48 GB if I remember correctly, but if I were buying cards for myself, I would wait until I could get two 3090s somewhere close to MSRP instead of paying the markup on the workstation cards (unless you want more than two in a workstation, in which case you'd need to go for those).
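
For the "personally affordable" case, here's a rough back-of-the-envelope sketch (mine, with assumed byte counts) for whether a model's weights alone fit in a 24 GB card; note that activations, gradients, and optimizer state usually add several times this during training:

    import torch

    def param_gib(model: torch.nn.Module, bytes_per_param: int = 2) -> float:
        """Approximate weight memory in GiB (2 bytes/param for fp16)."""
        n_params = sum(p.numel() for p in model.parameters())
        return n_params * bytes_per_param / 1024**3

    if torch.cuda.is_available():
        total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU 0 has {total:.1f} GiB")  # ~24 GiB on a 3090

At fp16, roughly 12B parameters of raw weights would already fill 24 GB before any training overhead, which is why truly large models need the cluster route.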


Totally depends on your budget. The DGX A100 [1] is quite good if you have a fat wallet.

[1] https://www.nvidia.com/en-us/data-center/dgx-a100/


2 x 3090FE is the best bang for your buck.


Do you need watercooling to keep them from running too hot?


You can tweak the power limit settings for your application. In many cases you can drop the power consumption (and the heat generated) while still maintaining >90% of the performance, though this will depend on your actual use case [0].

In my experience, for many models you can reduce the power limit even further than what was tested in that guide while barely impacting performance. A sketch of doing this programmatically follows the link below.

[0] https://timdettmers.com/2020/09/07/which-gpu-for-deep-learni...
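
For reference, this is the same knob as running sudo nvidia-smi -pl <watts>. Here's a hedged sketch of doing it via NVML from Python (requires root and the nvidia-ml-py/pynvml package; the 250 W target is just an example for a 350 W 3090):

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # NVML reports power limits in milliwatts.
    current_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)
    min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
    print(f"limit: {current_mw / 1000:.0f} W "
          f"(allowed {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W)")

    # Example: cap the card at 250 W (needs root privileges).
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, 250_000)
    pynvml.nvmlShutdown()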


I use 2x 3090s to train large language models, and mine don't thermal-throttle with air cooling even though they're right next to each other. Eth mining does generate too much heat, though.
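
If you want to verify that on your own setup, here's a small sketch (my own assumption, using the nvidia-ml-py/pynvml package) that polls every card's temperature while a training job runs:

    import time
    import pynvml

    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]

    for _ in range(10):  # poll for a short window while the job is running
        temps = [pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
                 for h in handles]
        print(" ".join(f"GPU{i}: {t}C" for i, t in enumerate(temps)))
        time.sleep(5)
    pynvml.nvmlShutdown()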


For ML? Nope. I think overheating issues are mostly a mining thing. I run models and do 3D rendering quite a bit and have never run into problems.



