The unanswered question is this: is the logarithmic performance improvement the result of better sampling of the underlying distribution over time, or just of doing more training with slight variations, which effectively regularizes the model so it generalizes better? If it's the former, that suggests we could build small models that are every bit as smart as large ones in limited domains, and if that's the case, it radically changes what an optimal model architecture looks like.
I suspect from the success of Phi3 that it is in fact the former.
That performance vs. training data is logarithmic rather than linear doesn't exactly come as a surprise.
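To make the shape of that claim concrete, here's a minimal sketch of what "logarithmic" means in practice: fit score ≈ a + b·ln(tokens) to a few data points and see what a 10x increase in data buys you. All numbers below are made up for illustration, not measurements from any real model.

```python
# Hypothetical illustration: checking whether performance scales
# logarithmically with training-set size by fitting
# score ≈ a + b * ln(tokens) to a few observed points.
import numpy as np

tokens = np.array([1e9, 1e10, 1e11, 1e12])   # training tokens (made up)
score  = np.array([42.0, 51.0, 59.5, 68.0])  # benchmark score (made up)

# Least-squares fit of score = a + b * ln(tokens);
# polyfit returns coefficients highest-degree first.
b, a = np.polyfit(np.log(tokens), score, 1)
print(f"fit: score ~ {a:.1f} + {b:.2f} * ln(tokens)")

# Under a log fit, every 10x more data buys roughly b * ln(10) points,
# i.e. constant absolute gains for exponentially more data.
print(f"gain per 10x data ~ {b * np.log(10):.1f} points")
```

If that's roughly the curve, the interesting part isn't the shape itself but which of the two explanations above produces it.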