Author here — thanks for your interest! So far we do not know of a way to estimate the noise scale without any training. You can generally get a rough picture of the noise scale from a small fraction of a training run, but the noise scale tends to increase over time, so to get a fully accurate measurement you do need to do at least one full training run. You also need the hyperparameters (primarily the learning rate) to be reasonably well chosen, but the choice of batch size isn’t important — and the batch size is what you’re trying to determine.
We also show that it's possible to compute the noise scale as you train (without any extra overhead), and this can be used to adjust the batch size in real time, so that in theory you can get it right the first time and in a way that adapts over the training run. However, those experiments are still preliminary (see Appendix D of the paper).
Another author here. I'll add to Jared's comment above that for long-running experiments (like the ones our Dota team runs), it can be useful to track this statistic in real time to see whether it would be worth scaling up the experiment.
What's a cheap and unobtrusive way to estimate the B_simple version of the noise scale in real time? Piggyback on Adam's moving mean and variance estimates? Edit: I see that Appendix A has a method for the multi-device training setting, but I'm thinking of single-device training.
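For readers with the same single-device question, here is a minimal sketch of one possible approach. It is not an answer from the authors. It relies on the identity E[|G_est|^2] = |G|^2 + tr(Sigma) / B for a gradient averaged over B examples, measured at two batch sizes, and assumes you can get those two measurements on one device by splitting each batch into micro-batches: the squared norm of a single micro-batch gradient is the small-batch measurement, and the squared norm of the averaged gradient is the big-batch one. The names `estimate_simple_noise_scale` and `NoiseScaleTracker` and the `decay` parameter are hypothetical, chosen for illustration.

```python
def estimate_simple_noise_scale(sq_norm_small, sq_norm_big, b_small, b_big):
    """Estimate |G|^2 and tr(Sigma) from squared gradient norms at two batch sizes.

    Solves E[|G_est|^2] = |G|^2 + tr(Sigma) / B for the two unknowns.
    """
    if not 0 < b_small < b_big:
        raise ValueError("need 0 < b_small < b_big")
    # Estimate of the true squared gradient norm |G|^2.
    g2 = (b_big * sq_norm_big - b_small * sq_norm_small) / (b_big - b_small)
    # Estimate of the per-example gradient variance tr(Sigma).
    s = (sq_norm_small - sq_norm_big) / (1.0 / b_small - 1.0 / b_big)
    return g2, s


class NoiseScaleTracker:
    """Hypothetical helper: smooths both estimates, then takes their ratio."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.ema_g2 = None
        self.ema_s = None

    def update(self, sq_norm_small, sq_norm_big, b_small, b_big):
        g2, s = estimate_simple_noise_scale(
            sq_norm_small, sq_norm_big, b_small, b_big
        )
        if self.ema_g2 is None:
            # First step: initialize the averages with the raw estimates.
            self.ema_g2, self.ema_s = g2, s
        else:
            self.ema_g2 = self.decay * self.ema_g2 + (1.0 - self.decay) * g2
            self.ema_s = self.decay * self.ema_s + (1.0 - self.decay) * s
        # Single-step estimates can be negative; report nothing until stable.
        if self.ema_g2 <= 0:
            return None
        return self.ema_s / self.ema_g2  # B_simple = tr(Sigma) / |G|^2
```

The two estimates are smoothed separately before dividing because the ratio of two noisy single-step estimates is itself very noisy. Note that the micro-batch approach does cost an extra norm computation per step, so it is cheap rather than free.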