Author here. Yes, this is a great question! I think there's a lot more interesting work to be done here. It would be great to understand, for example, why layer-wise learning rates help with large-batch training on ImageNet (https://arxiv.org/abs/1708.03888); maybe the per-component noise scale has something to do with it.