DeepSpeed [1] is an amazing tool for enabling different kinds of parallelism and optimization on your model. I would definitely not recommend reimplementing everything yourself.
Probably FairScale [2] too, but never tried it myself.
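To give a flavor of what DeepSpeed handles for you: you mostly describe the parallelism/optimization you want in a config and let the engine do the rest. Here's a minimal sketch of such a config as a Python dict (the specific values are illustrative assumptions, not a tuned recipe; the `deepspeed.initialize` call shown in the comment is the standard entry point, but everything else is just an example):

```python
# Illustrative DeepSpeed config: ZeRO stage 2 partitions optimizer state
# and gradients across GPUs, fp16 halves activation/weight memory, and
# optimizer offload spills optimizer state to CPU RAM.
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

# With DeepSpeed installed, you would then wrap your model roughly like:
#   model_engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
# and call model_engine.backward(loss) / model_engine.step() in the loop.
print(ds_config["zero_optimization"]["stage"])
```

The point is that sharding, mixed precision, and offload are config toggles rather than code you write and debug yourself.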
Really depends on what you mean by large. If you mean truly large, you will need a cluster to train it in any reasonable amount of time. You'd probably want to look at servers built on the HGX platform (8 A100s per server); we use servers leased in bulk from traditional server providers (think Dell, HP, etc.). If you mean more like "as large as personally affordable", then you'd probably want to look at something like the RTX 3090: if you get lucky and find it at MSRP, it has 24 GB of memory. Nvidia also has workstation cards with up to 48 GB, if I remember correctly, but if I were buying cards for myself, I would wait until I could get two 3090s somewhere close to MSRP instead of paying the markup on the workstation cards (unless you want more than two in a workstation, in which case you'd need to go for those).
You can tweak the power limit settings for your application. In many cases you can drop the power consumption (and heat generated) while still maintaining >90% performance, but this will depend on your actual use case [0]. In my experience, for many models you can reduce the power limit even further than what was tested in that guide while barely impacting performance.
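For reference, power limiting is a couple of `nvidia-smi` commands (the wattage numbers below are just examples; check your card's supported range first):

```shell
# Show current, default, and min/max supported power limits per GPU
nvidia-smi -q -d POWER

# Cap GPU 0 at 250 W (a 3090's default is ~350 W); requires root
sudo nvidia-smi -i 0 -pl 250
```

Note that the limit resets on reboot, so you'd typically reapply it from a startup script or systemd unit.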
I use 2x 3090s to train large language models, and mine don't thermal-throttle with air cooling even though they're right next to each other. Eth mining does generate too much heat, though.
[1]: https://github.com/microsoft/DeepSpeed
[2]: https://github.com/facebookresearch/fairscale