The 350M model I trained last night took 30B tokens, 14 hours, and ~$200.
Conveniently, 300B is exactly 10X the tokens, so ~$2K would be the estimate. You'd have to wait 140 hours on one box, though. Switching from an A100 box to an H100 box would already cut the training time by probably 2-3X, for free, even without going to fp8 (which we do plan to support).
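A minimal back-of-envelope sketch of the scaling arithmetic above, assuming time and cost scale linearly with token count from the quoted run (30B tokens, 14 hours, ~$200 on one A100 box); the H100 speedup factor is just the 2-3X range mentioned, not a benchmark:

```python
# Numbers from the 350M run quoted above.
BASE_TOKENS = 30e9   # tokens
BASE_HOURS = 14.0    # wall-clock hours on one A100 box
BASE_COST = 200.0    # approximate cost in USD

def estimate(target_tokens: float, speedup: float = 1.0) -> tuple[float, float]:
    """Linearly scale hours and cost with tokens, dividing hours by any hardware speedup."""
    scale = target_tokens / BASE_TOKENS
    return BASE_HOURS * scale / speedup, BASE_COST * scale

hours_a100, cost = estimate(300e9)             # ~140 hours, ~$2K on the same A100 box
hours_h100, _ = estimate(300e9, speedup=2.5)   # assumed midpoint of the 2-3X H100 range
print(f"A100: ~{hours_a100:.0f} h, ~${cost:.0f}")
print(f"H100 (assumed 2.5X): ~{hours_h100:.0f} h")
```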
So TLDR: at this model scale, llm.c is already there functionally, I think; it's a matter of compute resources and patience. I currently have this one box from Lambda, and I have to look around for a few more boxes and merge the pending PR for multi-node training support. Getting all of this into a nice, stable state is probably a good chunk of the pending work right now.