
Access to proprietary training data: Search, YouTube, and Google Books might give Google some moat.

We have Common Crawl, which is also scraped web data for training LLMs, provided for free by a non-profit.

Common Crawl is going to become increasingly contaminated with LLM output, and training data that is less likely to contain LLM output will become more valuable.

I see this misconception all the time. Filtering out LLM slop is not much different from filtering out human slop. If anything, LLM-generated output is of higher quality than a lot of the human-written text you'd randomly find on the internet. It's no coincidence that state-of-the-art LLMs increasingly use synthetic data generated by LLMs themselves. So, no, just because training data was produced by a human doesn't make it inherently more valuable; the only thing that matters is the quality of the data, and the internet is full of garbage that you need to filter out one way or another.

But the signals used to filter out human garbage are not the same as the signals needed to filter out LLM garbage. LLMs generate text that looks high-quality at a glance but may be factually inaccurate. For example, an LLM can generate a codebase that is well-formatted and contains docstrings, comments, maybe even tests, yet uses a non-existent library or is logically incorrect.
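One cheap signal along these lines: surface-level polish can't be trusted, but whether an import actually resolves can be checked mechanically. This is a minimal illustrative sketch (the function name and the sample snippet are made up, not from any real pipeline); it parses a code sample and flags top-level imports that don't exist in the current environment:

```python
import ast
import importlib.util

def missing_imports(source: str) -> list[str]:
    """Return top-level imported module names that cannot be resolved.

    Unresolvable imports in otherwise clean-looking code are a cheap
    signal of hallucinated (LLM-generated) libraries.
    """
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    # find_spec returns None when the module can't be located
    return sorted(n for n in names if importlib.util.find_spec(n) is None)

sample = '''
import json
import totally_made_up_pkg  # plausible-looking but non-existent

def load(path):
    with open(path) as f:
        return json.load(f)
'''
print(missing_imports(sample))  # ['totally_made_up_pkg']
```

Of course this only catches one failure mode; logically-wrong-but-syntactically-fine code needs much heavier checks (running the tests, type checking, etc.).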

LLM output is uniquely harmful because LLMs trained on LLM output are subject to model collapse:

https://www.nature.com/articles/s41586-024-07566-y


The problem with filtering is that LLMs can generate a few orders of magnitude more slop than humans can.

Are the differences between Google Books and LibGen documented anywhere? I believe most models outside of Google are trained on the latter.




