DeepSpeed [1] is an amazing tool for enabling different kinds of parallelism and optimization on your model. I would definitely not recommend reimplementing everything yourself.
Probably FairScale [2] too, but never tried it myself.
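To give a flavor of what DeepSpeed handles for you: you mostly describe the parallelism/optimization you want in a config and let the engine do the rest. Here's a minimal sketch of such a config as a Python dict (the specific values are illustrative assumptions, not a tuned recipe; the `deepspeed.initialize` call shown in the comment is the standard entry point, but everything else is just an example):

```python
# Illustrative DeepSpeed config: ZeRO stage 2 partitions optimizer state
# and gradients across GPUs, fp16 halves activation/weight memory, and
# optimizer offload spills optimizer state to CPU RAM.
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

# With DeepSpeed installed, you would then wrap your model roughly like:
#   model_engine, optimizer, _, _ = deepspeed.initialize(
#       model=model, model_parameters=model.parameters(), config=ds_config)
# and call model_engine.backward(loss) / model_engine.step() in the loop.
print(ds_config["zero_optimization"]["stage"])
```

The point is that sharding, mixed precision, and offload are config toggles rather than code you write and debug yourself.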
Really depends on what you mean by large. If you mean truly large, you will need a cluster to train it in any reasonable amount of time. You'd probably want to look at servers built on the HGX platform (8 A100s per server); we use servers leased in bulk from traditional server providers (think Dell, HP, etc.). If you mean more like "as large as personally affordable", then you'd probably want to look at something like the RTX 3090: if you get lucky and find it at MSRP, it has 24 GB of memory. Nvidia also has workstation cards with up to 48 GB, if I remember correctly, but if I were buying cards for myself, I would wait until I could get two 3090s somewhere close to MSRP instead of paying the markup on the workstation cards (unless you want more than two in a workstation, in which case you'd need to go for those).
You can tweak the power limit settings for your application. In many cases you can drop the power consumption (and heat generated) while still maintaining >90% performance, but this will depend on your actual use case [0]. In my experience, for many models you can reduce the power limit even further than what was tested in that guide while barely impacting performance.
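For reference, power limiting is a couple of `nvidia-smi` commands (the wattage numbers below are just examples; check your card's supported range first):

```shell
# Show current, default, and min/max supported power limits per GPU
nvidia-smi -q -d POWER

# Cap GPU 0 at 250 W (a 3090's default is ~350 W); requires root
sudo nvidia-smi -i 0 -pl 250
```

Note that the limit resets on reboot, so you'd typically reapply it from a startup script or systemd unit.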
I use 2x 3090s to train large language models, and mine don't thermal-throttle with air cooling even though they're right next to each other. Eth mining does generate too much heat, though.
[1]: https://github.com/microsoft/DeepSpeed
[2]: https://github.com/facebookresearch/fairscale