Maybe, but we are fast approaching the point (or more likely have already crossed it) where distinguishing between human- and AI-generated data isn't really possible. If Google indexes a blog, how does it know whether it was written with AI assistance and therefore shouldn't be used for training? Heck, how does OpenAI itself prevent such a feedback loop from its own output (or that of other LLMs)?
I'm only half joking... I think we will likely end up with flags for human-generated/curated content (and it will have to be that way round, since I can't imagine spammers bothering to flag AI-generated stuff), and we probably should already have an equivalent of the robots.txt protocol that lets users specify which parts of their website they would and wouldn't like used for training LLMs.
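Just to make that concrete, a sketch of what such a file might look like, borrowing robots.txt conventions (the filename and directive names are invented for illustration, not an existing standard):

    # llm.txt (hypothetical; name and directives are illustrative only)
    # Served from the site root, alongside robots.txt
    User-agent: *
    Disallow-training: /blog/drafts/
    Allow-training: /docs/

Crawlers that respect robots.txt could fetch and honor this the same way; the open question is whether training crawlers would bother.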
If content with a "human-generated" flag is rated more highly in some way -- e.g. search results -- then of course spammers will automatically add that flag to their AI-generated garbage. How do you propose to prevent them?
I think something like this will definitely happen, and your suggestion is the cleanest implementation idea I've seen for it. I imagine there will be a service provided by Google and OpenAI where they verify your identity as a human and then grant you a token to put into your meta tags (wait a second... this sounds like sama's worldcoin idea...).
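For illustration only, the tag might look something like this (the attribute names and issuer are made up; no such service exists today):

    <meta name="human-content-attestation"
          content="issuer=verifier.example.com; token=SIGNED_TOKEN_FROM_VERIFIER">

The idea being that a crawler could check the signed token against the issuer to confirm a verified human vouched for the page.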
It will need to be based somewhat on the honor system (just because someone's proved they're a human doesn't mean they won't put their attestation on auto-generated text), but it definitely sounds better than nothing.
They'll still need to incentivize it somehow, though. Why do I, as a human, want to add that meta tag? If the answer is "better search ranking", then it renders the whole scheme mostly pointless, because obviously spammers will want to acquire the attestation and attach it to their auto-generated content.
Your argument would have a lot more force if we were already past that point rather than just fast approaching it. Concerns about training-data errors being compounded matter much more when you're talking about the bleeding edge.
And your question about how OpenAI prevents its own training data from being corrupted is one we should be asking as well!