The Common Crawl is going to become increasingly contaminated with LLM output, and training data that is less likely to contain LLM output will become more valuable.
I see this misconception all the time. Filtering out LLM slop is not much different from filtering out human slop. If anything, LLM-generated output is of higher quality than a lot of the human-written text you'd randomly find on the internet. It's no coincidence that state-of-the-art LLMs increasingly train on synthetic data generated by LLMs themselves. So, no, training data isn't inherently more valuable just because a human produced it; the only thing that matters is the quality of the data, and the internet is full of garbage that you need to filter out one way or another.
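To be concrete, a lot of that filtering is just stacked heuristics that don't care who wrote the text. A minimal sketch in the spirit of C4-style cleaning, with made-up thresholds (illustrative assumptions, not values from any real pipeline):

```python
def passes_quality_heuristics(text: str,
                              min_words: int = 50,
                              max_dup_line_frac: float = 0.3) -> bool:
    """Source-agnostic quality filter: rejects documents that look like spam
    or boilerplate, regardless of whether a human or a model wrote them."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(text.split()) < min_words:
        return False  # too short to be a useful training document
    # A high fraction of duplicated lines is a classic boilerplate/spam signal.
    if lines and 1 - len(set(lines)) / len(lines) > max_dup_line_frac:
        return False
    # Most natural prose lines end in terminal punctuation (a C4 heuristic).
    punct_frac = sum(ln.endswith(('.', '!', '?', '"')) for ln in lines) / max(len(lines), 1)
    return punct_frac >= 0.5
```

Nothing in there asks whether a human or a model produced the document.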
But the signals used to filter out human garbage are not the same as the signals needed to filter out LLM garbage. LLMs generate text that looks high-quality at a glance but may be factually wrong. For example, an LLM can generate a codebase that is well-formatted and contains docstrings, comments, maybe even tests, yet imports a non-existent library or is logically incorrect.
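To illustrate the kind of LLM-specific signal I mean, here's a hypothetical sketch that statically checks whether a generated Python file's imports even resolve. Note that find_spec() only catches packages missing from whatever environment you run it in, so this is a partial signal at best:

```python
import ast
import importlib.util

def unresolvable_imports(source: str) -> list[str]:
    """Return imported module names in `source` that don't resolve locally,
    a rough proxy for hallucinated libraries in LLM-generated code."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # skip relative imports; they're project-local
        else:
            continue
        for name in names:
            # Check only the root package so missing parents don't raise.
            if importlib.util.find_spec(name.split('.')[0]) is None:
                missing.append(name)
    return missing

print(unresolvable_imports("import json\nimport totally_fake_lib"))
# ['totally_fake_lib']
```

A well-formatted codebase sails straight past the heuristics above; catching this class of garbage takes checks that actually execute or resolve the content.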