Maybe, but we are fast approaching the point (or more likely have already crossed it) where distinguishing between human- and AI-generated data isn't really possible. If Google indexes a blog, how does it know whether it was written with AI assistance and therefore shouldn't be used for training? Heck, how does OpenAI itself prevent such a feedback loop from its own output (or that of other LLMs)?
I'm only half joking... I think we will likely end up with flags for human-generated/curated content (and it will have to be that way round, since I can't imagine spammers bothering to flag AI-generated stuff), and we probably should already have an equivalent of the robots.txt protocol that lets users specify which parts of their website they would and wouldn't like used for training LLMs.
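Just to make that concrete, a sketch of what such a file might look like, borrowing robots.txt conventions (the filename and directive names are invented for illustration, not an existing standard):

    # llm.txt (hypothetical; name and directives are illustrative only)
    # Served from the site root, alongside robots.txt
    User-agent: *
    Disallow-training: /blog/drafts/
    Allow-training: /docs/

Crawlers that respect robots.txt could fetch and honor this the same way; the open question is whether training crawlers would bother.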
If content with a "human-generated" flag is rated more highly in some way -- e.g. search results -- then of course spammers will automatically add that flag to their AI-generated garbage. How do you propose to prevent them?
I think something like this will definitely happen, and your suggestion is the cleanest implementation idea I've seen for it. I imagine there will be a service provided by Google and OpenAI where they verify your identity as a human and then grant you a token to put into your meta tags (wait a second... this sounds like sama's worldcoin idea...).
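For illustration only, the tag might look something like this (the attribute names and issuer are made up; no such service exists today):

    <meta name="human-content-attestation"
          content="issuer=verifier.example.com; token=SIGNED_TOKEN_FROM_VERIFIER">

The idea being that a crawler could check the signed token against the issuer to confirm a verified human vouched for the page.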
It will need to be based somewhat on the honor system (just because someone's proved they're a human doesn't mean they won't put their attestation on auto-generated text), but it definitely sounds better than nothing.
They'll still need to incentivize it somehow, though. Why do I, as a human, want to add that meta tag? If the answer is "better search ranking", then it renders the whole scheme mostly pointless, because obviously spammers will want to acquire the attestation and attach it to their auto-generated content.
Your argument would have a lot more force if we were already past that point rather than just fast approaching it. Concerns about training-data errors being compounded matter much more when you're talking about the bleeding edge.
And your question about how OpenAI prevents its own training data from being corrupted is one we should be asking as well!