I think the concept makes sense. The basic insight, that the right batch size depends on the difficulty and noisiness of a task, is already used in practice by training teams. For example, the PaLM paper from last week increased the batch size as training progressed.
But as far as I know, the more precise predictions of optimal batch size aren't used much, probably because it's expensive to measure accurately, or because the predictive equation isn't accurate enough to begin with. I wonder if we can "transfer" the optimal batch size from a smaller setting (smaller model or data) to the full setting, like in our paper. This would make it much more practical.
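For concreteness, here is what measuring that prediction could look like. I'm assuming the predictive equation in question is the "simple" gradient noise scale from McCandlish et al.'s "An Empirical Model of Large-Batch Training" (the text doesn't name it, so this is my guess); a minimal sketch of their two-batch-size estimator:

```python
import numpy as np

def simple_noise_scale(g_small, g_big, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2, the 'simple' gradient noise
    scale, from gradients g_small and g_big averaged over two batches of
    sizes b_small < b_big.

    Intuition: E[|G_B|^2] = |G|^2 + tr(Sigma) / B, so gradient norms at
    two batch sizes let us solve for |G|^2 and tr(Sigma) separately.
    """
    sq_small = float(np.dot(g_small, g_small))  # |G_{b_small}|^2
    sq_big = float(np.dot(g_big, g_big))        # |G_{b_big}|^2
    # Unbiased estimate of the true (noise-free) squared gradient norm.
    g2 = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    # Unbiased estimate of tr(Sigma), the per-example gradient variance.
    trace_sigma = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g2

# Toy usage with a synthetic "true" gradient plus per-example noise,
# where the true noise scale is tr(Sigma) / |G|^2 ~= 1.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=1000)

def batch_grad(b):
    # Mean of b noisy per-example gradients: noise shrinks as 1/sqrt(b).
    return true_grad + rng.normal(size=1000) / np.sqrt(b)

print(simple_noise_scale(batch_grad(64), batch_grad(1024), 64, 1024))
```

The appeal of this estimator is that it only needs gradient norms at two batch sizes, but the per-measurement estimate is noisy, which is part of why measuring it accurately over a long run gets expensive.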