Sure, you probably don't want to do full training runs locally, but there's a lot you can do locally that carries a lot of added friction on a GPU cluster or other remote compute resource.
I like to start a new project by prototyping and debugging my training code and run configs, setting up the data loading and evaluation pipeline, and hacking around with some baseline models to make sure they can overfit a small subset of my data.
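That overfit check is cheap to wire up, too. Here's a rough sketch of the idea (assuming PyTorch, with a placeholder model and fake data standing in for a real project), all of which runs fine on a laptop CPU:

```python
# A minimal sketch of the "overfit a tiny subset" sanity check, assuming PyTorch.
# The dataset, model, and hyperparameters are placeholders; swap in your own.
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in dataset: 1000 examples, 32 features, 4 classes.
full_dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 4, (1000,)))
tiny = Subset(full_dataset, range(32))  # a few dozen examples is plenty
loader = DataLoader(tiny, batch_size=8, shuffle=True)

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# If the model and training loop are wired up correctly, the loss on this
# tiny subset should drop to near zero within a few hundred steps on CPU.
for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    if epoch % 50 == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")
```

If the loss won't go to zero even on 32 examples, there's a bug somewhere, and you've found it without touching the cluster.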
After all that's done, it's finally time to scale out to the GPU cluster. But I still do a lot of debugging locally.
Maybe this kind of workflow isn't as necessary if your task is pretty plug-and-play, like image classification, but for nonstandard tasks I think there's a lot of prototyping work that doesn't require hardware acceleration.
Being able to code locally is a must for me too, because the cluster I have access to has pretty serious wait times (up to a couple of hours on busy days). Imagine only being able to run the code you're writing a few times a day at most! Iterative development and making a lot of mistakes is how I code; I don't want to go back to the punch-card days of waiting and waiting only to end up with a silly error.