
This reminds me of LLM pretraining: there are so many points at which the program can fail that you need clever solutions to keep uptime high. And you can't just fix the bugs; GPUs will often simply crash. (In graphics, a pixel flipping to the wrong color for one frame is fine, whereas the same kind of bit flip can cause numerical instability in deep learning, so ECC catches it and takes the GPU down.) You also typically have a fixed-size cluster whose utilization you want to maximize.

So improving uptime involves holding out a pool of spare GPUs to swap in for failed ones while they reboot. The whole run can also just deadlock at random, which you might handle by watching the logs and restarting after a certain period of inactivity. And you have to be clever about how you save and load checkpoints, since that can become a huge bottleneck.
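
A minimal sketch of that kind of inactivity watchdog (the command, log path, and timeout here are just placeholders, not any particular production setup):

    # Relaunch training whenever the log goes quiet for too long.
    import os, subprocess, time

    TRAIN_CMD = ["python", "train.py"]   # hypothetical launcher; replace with your own
    LOG_PATH = "train.log"               # hypothetical log file the run writes to
    TIMEOUT_S = 30 * 60                  # treat 30 minutes of silence as a hang

    while True:
        log = open(LOG_PATH, "ab")
        proc = subprocess.Popen(TRAIN_CMD, stdout=log, stderr=log)
        while proc.poll() is None:
            time.sleep(60)
            idle = time.time() - os.path.getmtime(LOG_PATH)
            if idle > TIMEOUT_S:         # no new log output for too long: assume deadlock
                proc.kill()              # on relaunch, training resumes from the last checkpoint
                proc.wait()
                break
        log.close()
        time.sleep(10)                   # brief pause before relaunching

The real thing obviously needs to distinguish hangs from clean exits and to page someone after repeated failures, but restart-on-silence catches a surprising fraction of deadlocks.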

After many layers of self-healing, we managed to take a vacation for a few days without any calls :)



