This is also bad because the risk of AI "inbreeding" is real. I have seen invisible artifact amplification happen within a single generation of training ESRGAN on its own outputs.
Maybe it won't happen in a single LLM generation, but perhaps gen 3 or 5 will start having really weird speech patterns or hallucinations because of this.
Worst case scenario, they just go back to training only on pre-2020 data and then fine-tuning on a dataset they somehow know to be "clean".
In practice, though, I doubt AI contamination is actually a problem. Otherwise, how would e.g. AlphaZero work so well, given that it's effectively trained only on its own data?
The problem is you need some sort of arbiter of who has "won" a conversation. But if the arbiter is just another transformer emitting a score, the models will compete to match the incomplete picture of reasoning the arbiter provides, as in the sketch below.
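To make that concrete, here's a toy sketch of what "competing to match the arbiter" looks like. It's purely hypothetical (nothing to do with anyone's actual training setup, and every name in it is made up): the policy only ever sees the arbiter's scalar score, and the arbiter only sees part of what actually makes a response good.

```python
# Toy illustration, not a real training loop: a policy optimized purely against
# a learned arbiter's score. The arbiter only "sees" part of what makes a
# response good, so the policy improves on exactly that part and nothing else.
import numpy as np

rng = np.random.default_rng(0)

DIM, ARBITER_DIM = 16, 4                 # made-up sizes for illustration
true_w = rng.normal(size=DIM)            # what a human would actually value
arbiter_w = true_w.copy()
arbiter_w[ARBITER_DIM:] = 0.0            # the arbiter's incomplete picture

def true_quality(x):
    return x @ true_w

def arbiter_score(x):
    return x @ arbiter_w

# "Policy" = a mean response vector, nudged by a crude hill-climb on the
# arbiter's score only (a stand-in for RL against a reward model).
policy = np.zeros(DIM)
for _ in range(500):
    candidates = policy + 0.1 * rng.normal(size=(32, DIM))
    policy = candidates[np.argmax(candidates @ arbiter_w)]

print("arbiter score:", round(float(arbiter_score(policy)), 2))
print("true quality :", round(float(true_quality(policy)), 2))
# The arbiter score climbs steadily; the dimensions the arbiter ignores just
# drift, so "winning" conversations stops tracking actual quality.
```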
It could degrade the model in ways that evade the metrics they use to gauge quality.
The distortions that showed up in ESRGAN (for instance) didn't seem to affect SSIM at all (and in fact it was trained with an MS-SSIM loss), but the "noise splotches" and "swirlies", as I call them, were noticeable in some of the output. You had to go back and look really hard at the initial dataset to spot what it was picking up on, and sometimes, even after cleaning, it felt like whatever it was picking up on was completely invisible.
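For what it's worth, SSIM barely moving isn't surprising once you remember it's an average over local windows. A minimal sketch (assuming scikit-image; this is not the original ESRGAN training code, and the image and splotch here are fabricated for illustration):

```python
# Sketch of why a global SSIM score can miss a small localized artifact:
# SSIM averages over local windows, so a splotch covering <1% of the image
# barely moves the mean even when it's easy to see.
import numpy as np
from skimage.metrics import structural_similarity as ssim

x = np.linspace(0.0, 1.0, 256)
clean = np.outer(x, x)                   # smooth stand-in for a real image

rng = np.random.default_rng(0)
corrupted = clean.copy()
corrupted[100:120, 100:120] += 0.5 * rng.standard_normal((20, 20))  # one "splotch"
corrupted = np.clip(corrupted, 0.0, 1.0)

print("SSIM with a visible 20x20 splotch:",
      round(ssim(clean, corrupted, data_range=1.0), 4))
# Only the windows overlapping the splotch score badly; the other ~99% score
# close to 1.0, so the global average still looks fine.
```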
TLDR: Google may not even notice the inbreeding until it's already a large issue, and they may be reluctant to scrap so much work on the model.