How AI Training Scales

taliesinb · on Dec 14, 2018

This is exciting work.

Gradient noise scale would be interesting to compute individually for each weight/bias. I'm curious how it scales as a function of layer depth, for example -- it could well be that different layers require very different treatment in terms of learning rate. Supporting this is that gradient magnitudes are typically very different for different layers, although they evolve similarly in time. Here's something I just whipped up in Mathematica:
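
To make the per-layer idea concrete, here is a minimal sketch in plain Python. It assumes the simple noise scale from the paper, B_simple = tr(Σ) / |G|², and assumes per-example gradients are available for each layer. The function names and the `grads_by_layer` layout are hypothetical. The plug-in estimate of |G|² is biased upward for small samples, so this is an illustration rather than the authors' measurement procedure.

```python
from statistics import fmean


def simple_noise_scale(per_example_grads):
    """Naive plug-in estimate of B_simple = tr(Sigma) / |G|^2 for one parameter group.

    per_example_grads: list of gradient vectors, one per example (lists of floats).
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    # Mean gradient G, one component at a time.
    mean = [fmean(g[i] for g in per_example_grads) for i in range(dim)]
    # tr(Sigma): sum of the unbiased per-component sample variances.
    trace_cov = sum(
        sum((g[i] - mean[i]) ** 2 for g in per_example_grads) / (n - 1)
        for i in range(dim)
    )
    grad_sq_norm = sum(m * m for m in mean)
    if grad_sq_norm == 0.0:
        return float("inf")  # no signal at all: the noise scale is unbounded
    return trace_cov / grad_sq_norm


def per_layer_noise_scale(grads_by_layer):
    """Apply the estimate separately to each layer (hypothetical dict layout)."""
    return {name: simple_noise_scale(g) for name, g in grads_by_layer.items()}


# Toy data: two examples, two parameters per layer.
toy = {
    "layer1": [[1.0, 0.0], [3.0, 0.0]],  # examples disagree: noisy layer
    "layer2": [[1.0, 1.0], [1.0, 1.0]],  # examples agree: noise-free layer
}
result = per_layer_noise_scale(toy)
```

Comparing the values across layers (and over training time) would be one way to see whether depth matters.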

https://imgur.com/a/HhCnlDL

samsamoa · on Dec 14, 2018

Author here— Yes, this is a great question! I think there's a lot more interesting work to be done here. It would be great if we could understand e.g. why layer-wise learning rates help with large-batch training of ImageNet (https://arxiv.org/abs/1708.03888), and maybe the per-component noise scale has something to do with it.

yaroslavvb · on Dec 14, 2018

This kind of scaling was rigorously shown for a related metric called "gradient diversity" in https://arxiv.org/abs/1706.05699

samsamoa · on Dec 14, 2018

One of the authors here. Thanks for the comment! Yes, we mention this work and a number of others in the blog post and paper. This isn't the first (or the last) paper on the topic, but I think we've clarified the large-batch training situation significantly by connecting gradient noise directly to the speed of training, and by measuring it systematically on a bunch of ML tasks and characterizing its behavior.

TTPrograms · on Dec 14, 2018

Similarly I think the theory developed in the Three Factors paper predicts this scaling law, might be worth citing: https://arxiv.org/abs/1711.04623

jamesblonde · on Dec 14, 2018

It looks like "gradient noise scale" will be a useful hyperparameter for distributed training. However, how do you efficiently compute it without training a network? Do you need to first train the network to compute it to find out the scaling factor for the network? Is there a more efficient way to compute it from a subset of a large dataset?

jareddk · on Dec 14, 2018

Author here — thanks for your interest! So far we do not know of a way to estimate the noise scale without any training. You can generally get a rough picture of the noise scale from a small fraction of a training run, but the noise scale tends to increase over time, so to get a fully accurate measurement you do need to do at least one full training run. You also need the hyperparameters (primarily the learning rate) to be reasonably well chosen, but the choice of batch size isn’t important — and the batch size is what you’re trying to determine.

We also show that it's possible to compute the noise scale as you train (without any extra overhead), and this can be used to adjust the batch size in real-time, so that in theory you can get it right the first time and in a way that adapts over the training run. However, those experiments are still preliminary (see appendix D of the paper).

samsamoa · on Dec 14, 2018

Another author here– I'll add to Jared's comment above that for long-running experiments (like the ones our Dota team runs), it can be useful to track this statistic in real time to see whether or not it would be useful to scale up the experiment.

taliesinb · on Dec 14, 2018

What's a cheap and unobtrusive way to estimate the B_simple version of the noise scale in real time? Piggyback on Adam's moving mean and variance estimates? Edit: I see that Appendix A has a method for the multi-device training setting, but I'm thinking of single-device training.
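
For reference, a rough sketch of how the Appendix A idea might carry over to one device. It relies on the identity E[|G_B|²] = |G|² + tr(Σ)/B, so squared gradient norms measured at two batch sizes give unbiased estimates of both |G|² and tr(Σ). Treating micro-batches of one larger batch as the "small" batches is an assumption made here, not something the paper prescribes, and all function names are hypothetical.

```python
def sq_norm(v):
    """Squared Euclidean norm of a gradient vector."""
    return sum(x * x for x in v)


def estimate_from_two_batch_sizes(g_small_sq, g_big_sq, b_small, b_big):
    """Return unbiased estimates (|G|^2, tr(Sigma)).

    g_small_sq, g_big_sq: squared norms of gradients averaged over
    b_small and b_big examples respectively.
    """
    true_grad_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    trace_cov = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
    return true_grad_sq, trace_cov


def single_device_estimate(micro_grads, micro_batch_size):
    """Assumed single-device variant: micro-batch gradients play the role
    of the per-device gradients in the multi-device setting."""
    k = len(micro_grads)
    dim = len(micro_grads[0])
    # Average squared norm over the small (micro) batches.
    g_small_sq = sum(sq_norm(g) for g in micro_grads) / k
    # Gradient of the full batch is the mean of the micro-batch gradients.
    full = [sum(g[i] for g in micro_grads) / k for i in range(dim)]
    g_big_sq = sq_norm(full)
    return estimate_from_two_batch_sizes(
        g_small_sq, g_big_sq, micro_batch_size, k * micro_batch_size
    )


# Toy check: two micro-batches of one example each.
grad_sq, trace_cov = single_device_estimate([[1.0, 0.0], [3.0, 0.0]], 1)
b_simple = trace_cov / grad_sq
```

Each per-step estimate is noisy and the ratio of two unbiased estimates is not itself unbiased, so in practice you would probably smooth `grad_sq` and `trace_cov` separately (for example with moving averages) before dividing.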

sjroot · on Dec 14, 2018

This was a great read made even better by the simple yet high-quality visualizations. OpenAI does a great job of using UX to make their research more accessible, and I'd love to see more researchers follow suit.

cracker_jacks · on Dec 14, 2018

So this blog post posits that there is a growing upper bound on the effective batch size. Is there a similar analysis checking whether there is also a growing lower bound? Or does batchsize=1 training work for all these tasks?

staticfloat · on Dec 15, 2018

Batch size 1 should always work. The trade-off is always “bigger batch size/more efficient” versus “smaller batch size/better results”. By using larger batches you are able to reduce overhead, take better advantage of memory caches, etc., which can give very large speedups. But mathematically it can slow down your training convergence (check out the plots showcasing how smaller batches converge in fewer optimizer steps). This is because when you average over an entire batch, you throw away a little bit of information that could otherwise be used to better learn your model. For intuition on this, imagine that you are doing homework, and you are forced to finish 1000 practice problems before being allowed to look at the answers; if you had looked earlier, you might have been able to correct your mistake sooner and not gotten all 1000 wrong. However, it would have taken you more time to go through those 1000 practice problems, because you would have been looking up the answers and mentally adjusting your model after each problem.

tarlinian · on Dec 15, 2018

I think you're confused about the benefits/drawbacks of increasing batch size. The training progress per optimizer step increases with batch size (otherwise the smallest batch would always be optimal). Increasing the batch size improves the quality of a single optimizer step, but the improvement does not scale linearly (i.e., large-batch training requires more samples to converge than small-batch training, even though fewer steps are required). Depending on the problem scale, the computational efficiency of larger batches makes this tradeoff worth it: even though you need to process more samples to converge, it will take less wall-clock time due to improved efficiency.
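
The paper puts numbers on this tradeoff with the relation (S/S_min - 1)(E/E_min - 1) = 1, where S is optimizer steps and E = B·S is examples processed. A small sketch of what that implies, with made-up values for S_min and E_min (they are illustrative assumptions, not measurements):

```python
def steps_and_examples(batch_size, s_min, e_min):
    """Steps and examples needed at a given batch size under the
    (S/S_min - 1)(E/E_min - 1) = 1 tradeoff curve.

    s_min: steps needed in the very-large-batch limit.
    e_min: examples needed in the very-small-batch limit.
    """
    b_crit = e_min / s_min  # critical batch size
    steps = s_min * (1 + b_crit / batch_size)
    examples = e_min * (1 + batch_size / b_crit)
    return steps, examples


# Hypothetical task: S_min = 1000 steps, E_min = 64000 examples, so B_crit = 64.
for b in (1, 64, 4096):
    s, e = steps_and_examples(b, s_min=1000, e_min=64000)
    print(f"B={b:5d}  steps={s:10.1f}  examples={e:12.1f}")
```

At B = 1 you are near the minimum number of examples but need about 65x the minimum number of steps; at B = 4096 it is the other way around; at B = B_crit both cost twice their minimum.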

staticfloat · on Dec 28, 2018

Yes, we're saying the same thing. It takes longer for the optimizer to converge when using larger batch sizes, in terms of number of samples pushed through the model. It takes less time in terms of wall clock time due to increased efficiency.

gwern · on Dec 15, 2018

I think it would be difficult to even test batch sizes of 1 in a reasonable time, assuming there exists a sufficiently small learning rate to allow progress. How many wall-clock centuries would it take to verify that a minibatch of 1 works for the OA5 Dota bot?