Where are you getting that from? As far as I can tell, the Chinchilla paper is purely about getting the highest quality from a fixed training budget. Inference is only mentioned a couple of times in passing as a side effect of smaller models, not as the goal nor as an input to the formula. (And just to be clear: the Chinchilla paper was arguing for smaller models trained for longer, while you seem to be saying that they were arguing for larger models since the inference cost is insignificant.)
> I believe in one of the examples they pointed out that even the whole internet inferring with a model for years was still only 20% of the cost of training that model.
Yeah, this is correct and I'm not sure what paper GP was thinking of – Chinchilla is only about finding the point at which it would be more useful to scale the model rather than training longer.
Chinchilla-optimal scaling is not useful if you actually want to use the model, only if you want to beat some other model on some benchmark for minimal training cost.
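For reference, the Chinchilla rule of thumb can be sketched in a few lines. This is a rough approximation, not the paper's full fitted scaling law: it assumes training compute C ≈ 6·N·D (N parameters, D training tokens) and that compute-optimal models use roughly 20 tokens per parameter. Both constants are approximations of the paper's results.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into a compute-optimal model size and token count,
    assuming C ~ 6*N*D and ~20 tokens per parameter (rough approximations)."""
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Plugging in Chinchilla's own budget (~5.76e23 FLOPs) recovers roughly
# its published configuration: ~70B parameters trained on ~1.4T tokens.
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Note that inference cost appears nowhere in this calculation, which is the point: the formula allocates a fixed training budget and says nothing about how often the model will later be served.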
> I believe in one of the examples they pointed out that even the whole internet inferring with a model for years was still only 20% of the cost of training that model.
I do not see any such example in the paper