This is also bad because the risk of AI "inbreeding" is real. I have seen invisible artifact amplification happen within a single generation of training ESRGAN on its own outputs.
Maybe it won't happen in a single LLM generation, but perhaps gen 3 or 5 will start having really weird speech patterns or hallucinations because of this.
Worst case scenario, they just go back to training only on pre-2020 data and then fine-tuning on a dataset they somehow know to be "clean".
In practice, though, I doubt AI contamination is actually a problem. Otherwise, how would e.g. AlphaZero work so well, given that it's effectively trained only on its own data?
The problem is you need some sort of arbiter of who has "won" a conversation. But if the arbiter is just another transformer emitting a score, the models will compete to match the incomplete picture of reasoning the arbiter provides, as in the sketch below.
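To make that concrete, here's a toy sketch of what "competing to match the arbiter" looks like. It's purely hypothetical (nothing to do with anyone's actual training setup, and every name in it is made up): the policy only ever sees the arbiter's scalar score, and the arbiter only sees part of what actually makes a response good.

```python
# Toy illustration, not a real training loop: a policy optimized purely against
# a learned arbiter's score. The arbiter only "sees" part of what makes a
# response good, so the policy improves on exactly that part and nothing else.
import numpy as np

rng = np.random.default_rng(0)

DIM, ARBITER_DIM = 16, 4                 # made-up sizes for illustration
true_w = rng.normal(size=DIM)            # what a human would actually value
arbiter_w = true_w.copy()
arbiter_w[ARBITER_DIM:] = 0.0            # the arbiter's incomplete picture

def true_quality(x):
    return x @ true_w

def arbiter_score(x):
    return x @ arbiter_w

# "Policy" = a mean response vector, nudged by a crude hill-climb on the
# arbiter's score only (a stand-in for RL against a reward model).
policy = np.zeros(DIM)
for _ in range(500):
    candidates = policy + 0.1 * rng.normal(size=(32, DIM))
    policy = candidates[np.argmax(candidates @ arbiter_w)]

print("arbiter score:", round(float(arbiter_score(policy)), 2))
print("true quality :", round(float(true_quality(policy)), 2))
# The arbiter score climbs steadily; the dimensions the arbiter ignores just
# drift, so "winning" conversations stops tracking actual quality.
```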
It could degrade the model in ways that evade the metrics they use to gauge quality.
The distortions that showed up in ESRGAN (for instance) didn't seem to affect SSIM at all (and in fact it was trained with an MS-SSIM loss), but the "noise splotches" and "swirlies", as I call them, were noticeable in some of the output. You had to go back and look really hard at the initial dataset to spot what it was picking up on, and sometimes, even after cleaning, it felt like whatever it was picking up on was completely invisible.
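For what it's worth, SSIM barely moving isn't surprising once you remember it's an average over local windows. A minimal sketch (assuming scikit-image; this is not the original ESRGAN training code, and the image and splotch here are fabricated for illustration):

```python
# Sketch of why a global SSIM score can miss a small localized artifact:
# SSIM averages over local windows, so a splotch covering <1% of the image
# barely moves the mean even when it's easy to see.
import numpy as np
from skimage.metrics import structural_similarity as ssim

x = np.linspace(0.0, 1.0, 256)
clean = np.outer(x, x)                   # smooth stand-in for a real image

rng = np.random.default_rng(0)
corrupted = clean.copy()
corrupted[100:120, 100:120] += 0.5 * rng.standard_normal((20, 20))  # one "splotch"
corrupted = np.clip(corrupted, 0.0, 1.0)

print("SSIM with a visible 20x20 splotch:",
      round(ssim(clean, corrupted, data_range=1.0), 4))
# Only the windows overlapping the splotch score badly; the other ~99% score
# close to 1.0, so the global average still looks fine.
```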
TLDR: Google may not even notice the inbreeding until it's already a large issue, and they may be reluctant to scrap so much work on the model.