
Right, but training an LLM on the output of another LLM can certainly exacerbate these issues.


Maybe, but we are fast approaching the point (or more likely have crossed it already) where distinguishing between human- and AI-generated data isn't really possible. If Google indexes a blog, how does it know whether it was written with AI assistance and therefore should not be used for training? Heck, how does OpenAI itself prevent such a feedback loop from its own output (or that of other LLMs)?


> If Google indexes a blog, how does it know whether it was written with AI assistance and therefore should not be used for training

Yes, this is an existential problem for Google and training future LLMs.

See also, https://www.theverge.com/23642073/best-printer-2023-brother-... and https://searchengineland.com/verge-best-printer-2023-394709


Or Google can just materialize the expected page into existence at search time.

... it's uncanny how it always finds what you thought you were looking for!


<meta name="generator" content="human brain">

I'm only half joking... I think we will likely end up with flags for human-generated/curated content (and it will have to be that way round, as I can't imagine spammers bothering to put flags on AI-generated stuff). We probably should already have an equivalent of the robots.txt protocol that lets users specify which parts of their website they would and wouldn't like used in the training of LLMs.
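Roughly the kind of thing I have in mind, as a sketch only (the directive names and the meta value below are made up for illustration, not any existing standard):

    # hypothetical robots.txt-style training opt-out (invented directives)
    User-agent: *
    Disallow-Training: /blog/
    Allow-Training: /docs/

    <!-- hypothetical page-level equivalent, alongside the generator tag -->
    <meta name="ai-training" content="disallow">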


If content with a "human-generated" flag is rated more highly in some way -- e.g. search results -- then of course spammers will automatically add that flag to their AI-generated garbage. How do you propose to prevent them?


I assume that, like the actual meta generator tag, it wouldn't actually be a massive boon for regular search results.


And if it's not: why bother.


I think something like this will definitely happen, and your suggestion is the cleanest implementation idea I've seen for it. I imagine there will be a service provided by Google and OpenAI where they verify your identity as a human and then grant you a token to put into your meta tags (wait a second... this sounds like sama's worldcoin idea...).

It will need to be based somewhat on the honor system (just because someone's proved they're a human doesn't mean they won't put their attestation on auto-generated text), but it definitely sounds better than nothing.

They'll still need to incentivize it somehow, though. Why do I as a human want to add that meta tag? If the answer is "better search ranking" then it renders the whole scheme mostly pointless because obviously spammers will want to acquire the attestation and attach it to their auto-generated content.
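As a sketch of how the attestation could work (the tag name, issuer, and token format below are all invented for illustration): the verifier signs something tied to the page, publishes its public key, and crawlers check the signature.

    <!-- hypothetical attestation tag; "verifier.example" and the format are made up -->
    <meta name="human-attestation"
          content="issuer=verifier.example; subject=https://myblog.example; sig=BASE64_SIG">

    # sketch of a crawler-side check, assuming the issuer publishes an Ed25519 public key
    import base64
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    def attestation_valid(issuer_pubkey: bytes, subject: str, sig_b64: str) -> bool:
        key = Ed25519PublicKey.from_public_bytes(issuer_pubkey)
        try:
            key.verify(base64.b64decode(sig_b64), subject.encode())
            return True
        except InvalidSignature:
            return False

None of which fixes the incentive problem, of course.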


Reminds me of the old “evil bit” RFC[1]

[1] https://www.ietf.org/rfc/rfc3514.txt


> Heck, how does OpenAI itself prevent such a feedback loop from its own output (or that of other LLMs)?

Seems trivial. Only use old data for the bulk? Feed in some carefully curated new data?
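For the "old data" part, even a crude date cutoff would go a long way (a sketch; the document schema here is made up):

    # sketch: keep only documents crawled before a pre-LLM cutoff
    from datetime import datetime, timezone

    CUTOFF = datetime(2022, 11, 30, tzinfo=timezone.utc)  # around ChatGPT's public release

    def pre_llm_only(docs):
        # docs: iterable of dicts with a tz-aware 'crawled_at' datetime (hypothetical schema)
        return [d for d in docs if d["crawled_at"] < CUTOFF]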


Future job: token selector / archiving


Pre-AI data is going to become like pre-nuclear steel.


Your argument would have a lot more force if we were already past that point rather than just fast approaching it. Concerns about training-data errors being compounded matter much more when you're talking about the bleeding edge.

And your question about how OpenAI prevents its own training data from being corrupted is one we should be asking as well!



