
I don't think that comparison is valid, and in fact, actually comparing them shows how reasonable robots.txt and Common Crawl are: the HHGtG example is egregious because the notice is imposed silently, long after the fact, and made deliberately invisible and hard to access. All of those are false for robots.txt and Common Crawl. These are well-known, simple, old protocols which long predate most of the websites in question, which is completely disanalogous to the HHGtG example.

Specifically: robots.txt precedes pretty much every website in existence; it's not some last-minute addition tacked on. Further, it is straightforward: you can deny scraping to everyone with a simple 'User-agent: * / Disallow: /', or with noindex/nofollow headers (also 1 line in a web server like Apache or nginx, as sketched below) - hardly burdensome, and it rules out all projects, not just Common Crawl.

Common Crawl is itself, incidentally, 15 years old and long predates many of the websites it crawls; its crawler operates in the open with a clear user-agent and no shenanigans, and you can further look up what's in it because it's public. (This is how I know Twitter isn't in it: when people claimed GPT-3 was stealing answers from Twitter, I could just go check.) robots.txt is also well known: even many non-webmaster web users know about it, because it governs what you'll see in search engines and what some agents like wget will download by default, it is covered early on in website materials, and so on.
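For concreteness, a minimal sketch of the blanket opt-out (the nginx line is one common way to send the header; Apache has an equivalent via mod_headers, and the exact directives you want may differ by crawler):

    # robots.txt at the site root: disallow all compliant crawlers
    User-agent: *
    Disallow: /

    # nginx: send a noindex/nofollow header on every response
    add_header X-Robots-Tag "noindex, nofollow";

And because the Common Crawl index is public, checking whether a domain shows up in a given crawl is roughly a one-liner against index.commoncrawl.org (the crawl ID here is only an example; substitute a current one from the published crawl list):

    curl 'https://index.commoncrawl.org/CC-MAIN-2023-50-index?url=twitter.com&output=json'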


