
Access to proprietary training data: Search, YouTube, and Google Books might give Google some moat.

We have Common Crawl, which is also scraped web data for training LLMs, provided for free by a non-profit.

Common Crawl is going to become increasingly contaminated with LLM output, and training data that is less likely to contain LLM output will become more valuable.

I see this misconception all the time. Filtering out LLM slop is not much different from filtering out human slop. If anything, LLM-generated output is of higher quality than a lot of the human-written text you'd randomly find on the internet. It's no coincidence that state-of-the-art LLMs increasingly use synthetic data generated by LLMs themselves. So, no, just because training data was produced by a human doesn't make it inherently more valuable; the only thing that matters is the quality of the data, and the internet is full of garbage that you need to filter out one way or another.

But the signals used to filter out human garbage are not the same as the signals needed to filter out LLM garbage. LLMs generate text that looks high-quality at a glance but may be factually inaccurate. For example, an LLM can generate a codebase that is well-formatted and contains docstrings, comments, maybe even tests, yet uses a non-existent library or is logically incorrect.
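One cheap signal along these lines: surface-level polish can't be trusted, but whether an import actually resolves can be checked mechanically. This is a minimal illustrative sketch (the function name and the sample snippet are made up, not from any real pipeline); it parses a code sample and flags top-level imports that don't exist in the current environment:

```python
import ast
import importlib.util

def missing_imports(source: str) -> list[str]:
    """Return top-level imported module names that cannot be resolved.

    Unresolvable imports in otherwise clean-looking code are a cheap
    signal of hallucinated (LLM-generated) libraries.
    """
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    # find_spec returns None when the module can't be located
    return sorted(n for n in names if importlib.util.find_spec(n) is None)

sample = '''
import json
import totally_made_up_pkg  # plausible-looking but non-existent

def load(path):
    with open(path) as f:
        return json.load(f)
'''
print(missing_imports(sample))  # ['totally_made_up_pkg']
```

Of course this only catches one failure mode; logically-wrong-but-syntactically-fine code needs much heavier checks (running the tests, type checking, etc.).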

LLM output is uniquely harmful because LLMs trained on LLM output are subject to model collapse:

https://www.nature.com/articles/s41586-024-07566-y


The problem with filtering is that LLMs can generate a few orders of magnitude more slop than humans can.

Are the differences between Google Books and LibGen documented anywhere? I believe most models outside of Google are trained on the latter.




