Speaking of AI generated pages, I wonder how OpenAI filter these low quality web...

wwweston · on April 5, 2023

> I wonder how OpenAI filter these low quality web pages out of their training set as they continue training.

This. The value proposition is very clearly tied to the quality of the training data, and if there's secret sauce for automatically determining information quality that's obviously huge. Google was built in part on such insights. I suspect they do have something. I'd be utterly astonished if quality sorting were an emergent property of LLMs (especially given it's iffy in humans).

The problem, of course, is that if they do have a way of privileging data for training, that information is going to be the center of the usual arms race for attention and thinking. It can't be truly public or it's dead.

jazzyjackson · on April 3, 2023

Yeah, I'm kind of shocked none of these models implement any kind of fingerprinting, something encoded in zero-width spaces or other invisible Unicode. It would be trivial to delete, but in the vast majority of cases it would allow content to be flagged as "model output, do not ingest".
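
To make the idea concrete, here is a minimal sketch of that kind of invisible fingerprint in Python. Everything in it is an assumption for illustration: the function names, the choice of U+200B and U+200C as the two "bit" characters, and appending the mark at the end of the text. No model vendor is known to do this, and text that already contains those characters would confuse the detector.

```python
from typing import Optional

# Hypothetical encoding: two invisible code points stand in for bits.
ZERO = "\u200b"  # ZERO WIDTH SPACE      -> bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER -> bit 1


def embed_fingerprint(text: str, tag: str) -> str:
    """Append `tag` to `text` as a run of invisible characters."""
    bits = "".join(f"{byte:08b}" for byte in tag.encode("utf-8"))
    mark = "".join(ONE if bit == "1" else ZERO for bit in bits)
    return text + mark


def extract_fingerprint(text: str) -> Optional[str]:
    """Return the hidden tag, or None if no well-formed mark is found."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    if not bits or len(bits) % 8 != 0:
        return None
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def strip_fingerprint(text: str) -> str:
    """Remove the mark, which shows how trivial deletion is."""
    return text.replace(ZERO, "").replace(ONE, "")


marked = embed_fingerprint("hello world", "llm")
print(extract_fingerprint(marked))            # tag recovered by a crawler
print(strip_fingerprint(marked) == "hello world")
```

A crawler could call `extract_fingerprint` on each page and skip anything that returns a tag. As noted above, a single `strip_fingerprint` pass defeats it, so this only catches content nobody bothered to clean.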

lupire · on April 3, 2023

If they aren't using Bing as a quality filter, they are crazy or stupid.