Probably drawing an analogy to how causally pretrained models go through stages of understanding language: words -> grammar -> meaning. Gwern mentions this experience when training character-level RNNs. https://gwern.net/scaling-hypothesis#why-does-pretraining-wo...
Totally. Also basic temporal scales or cyclic properties. It's kind of mind-blowing that the shape of most recorded human patterns is reducible in this way.