Their papers say they were using Common Crawl for the crawling. If you didn't want your pages in Common Crawl (e.g. Twitter didn't), which feeds many downstream analyses and uses beyond just OA, you could already have said so in your robots.txt.
That's not consent though. Consent is not granted until it is explicitly stated in the affirmative. Try applying "assume yes initially, until told otherwise" to entering someone's house or touching someone's body and let me know how that works out for you.
Opt out != opt in. This reminds me of the beginning of The Hitchhiker's Guide to the Galaxy, where Dent's house is being demolished but the notice had been on display in a locked basement below city hall or something. He could have objected, technically!
I don't think that comparison is valid; in fact, actually comparing them shows how reasonable robots.txt is. The HHGtG example is egregious because the notice was imposed silently, made deliberately invisible and hard to access, and discoverable only after the fact. All of those are false for robots.txt and Common Crawl. These are well-known, easy, old protocols which long predate most of the websites in question, which is completely disanalogous to the HHGtG example.

Specifically: robots.txt precedes pretty much every website in existence. It's not some last-minute addition tacked on. Further, it is straightforward: you can deny scraping to everyone with a simple 'User-agent: * / Disallow: /', or with nofollow headers (also one line in a web server like Apache or nginx; see the sketch below) - hardly burdensome, and it rules out all projects, not just Common Crawl.

Common Crawl is itself, incidentally, 15 years old and long predates many of the websites it crawls; its crawler operates in the open with a clear user-agent and no shenanigans; and you can further look up what's in it, because its index is public. (This is how I know Twitter isn't in it: when people claimed GPT-3 was stealing answers from Twitter, I could just go check.)

It is also well known: even many non-webmaster web users know about robots.txt, because it governs what you'll see in search engines and what will be downloaded by default by some agents like wget, and it is covered early on in website materials, and so on.
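For concreteness, here is what that blanket opt-out looks like: a robots.txt at the site root, and the equivalent one-line X-Robots-Tag header in an nginx server block. This is a minimal sketch (example.com is a placeholder; check your server's docs for exact directive placement):

```
# https://example.com/robots.txt
# Refuses all well-behaved crawlers, not just Common Crawl's CCBot.
User-agent: *
Disallow: /
```

```nginx
# nginx: send a noindex/nofollow header on every response, instead of
# (or in addition to) robots.txt - one add_header line in the server block.
server {
    listen 80;
    server_name example.com;
    add_header X-Robots-Tag "noindex, nofollow";
}
```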
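And "just go check" can be as simple as querying Common Crawl's public URL index at index.commoncrawl.org. A minimal Python sketch follows; the crawl ID is an assumption (pick any crawl listed on that site), and the exact no-results response may differ in detail:

```python
import requests

# Hypothetical crawl ID; the list of available crawls is public at
# https://index.commoncrawl.org/
CRAWL_ID = "CC-MAIN-2023-50"

# Ask the CDX index for any captures of twitter.com pages in that crawl.
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": "twitter.com/*", "output": "json", "limit": "5"},
    timeout=30,
)

# The index returns one JSON record per line per matching capture;
# no records means the domain is absent from that crawl.
if resp.ok and resp.text.strip():
    for line in resp.text.splitlines():
        print(line)
else:
    print("No captures found for twitter.com in", CRAWL_ID)
```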