
The AWS premium for GPU instances is absolutely not worth it. You don't hear about people running local GPU compute clusters because it's not newsworthy -- it's obvious. Put a few workstations behind a switch, fire up torch.distributed (sketch below), and you're done. After two months you've beaten the AWS spend for the same amount of GPU compute, even if the machines only spend 50% of their time training. Timesharing is done with shell accounts and asking nicely. You do not need the huge technical complexity of the cloud: it gets in the way, on top of costing more!
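
For concreteness, here is a minimal sketch of that setup, assuming torchrun and a couple of workstations that can reach each other over the LAN. The model, batch, node counts, and addresses are placeholders for illustration, not anyone's real workload:

    # sketch only: a stand-in model trained with DistributedDataParallel
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")  # torchrun supplies RANK/WORLD_SIZE
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank),
                    device_ids=[local_rank])        # placeholder model
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)

        for _ in range(100):
            x = torch.randn(32, 1024, device=local_rank)  # placeholder batch
            loss = model(x).square().mean()
            opt.zero_grad()
            loss.backward()  # DDP all-reduces gradients across the workstations
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched on each box with something like the following (node count, process count, and port are assumptions):

    torchrun --nnodes=2 --nproc_per_node=4 \
        --rdzv_backend=c10d --rdzv_endpoint=<head-node-ip>:29500 train.py

That's the entire "cluster".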



What if you want 10x the GPU for one month to build a model?


That's the only scenario I can think of where it comes out clearly in favor of AWS -- you've tested your model at small scale on Colab, you're confident you'll need only a few training runs, you can schedule them in us-east, you can run inference on CPU, and you won't need to rebuild for another eight months (by which point the cards you'd have purchased would be outdated anyway).

It's not an impossible scenario... But imagine the sort of company that trains its own model instead of fine-tuning something from Hugging Face or using an off-the-shelf distillation. (Those can be done reasonably on an average gaming PC, no need for a cluster -- see the sketch below.) Such a company has expensive human resources. They bothered to hire a data scientist and at least a research engineer, if not a full researcher. Were they hired on six-month contracts as well? This is a huge expense, so building a custom model must be an important differentiator -- and it's one-and-done? I don't see it. I think it's going to be an ongoing project, or it shouldn't have been approved in the first place.
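
For scale, a "refinement" of that sort is roughly this much code and fits on one consumer GPU. The model, dataset, and hyperparameters here are placeholder choices for illustration, not a recommendation (fp16 assumes a CUDA card):

    # sketch: fine-tuning a small pretrained model with Hugging Face transformers
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "distilbert-base-uncased"           # placeholder model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                               num_labels=2)

    ds = load_dataset("imdb")                        # placeholder dataset
    ds = ds.map(lambda b: tok(b["text"], truncation=True), batched=True)

    args = TrainingArguments(output_dir="out", per_device_train_batch_size=16,
                             num_train_epochs=1, fp16=True)
    Trainer(model=model, args=args,
            train_dataset=ds["train"].shuffle(seed=0).select(range(10_000)),
            tokenizer=tok).train()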



