
They changed it to disallow so that scrapers can't just claim the robots.txt gave them permission.



According to US courts, the robots.txt file is legally meaningless. If the server responds with a 200 status code and gives you access, then you can legally scrape it all you want. If they require that you log in, then you have to follow the terms you agreed to when creating an account. Public means public though, and if Reddit doesn't want to make the content private (put it behind a login) then we can scrape away.

Note that scraping, regardless of the level of permission, doesn't mean you can do anything you want with the content. Copyright still applies. But you can scrape it, and if your use falls under Fair Use or another exception to copyright law then you can go ahead and do it without needing any permission from the authors.


Fascinating. Where can one learn more about this?


I liked the chapter on DMCA from the 5-volume E-Commerce & Internet Law. It was super detailed.

I haven’t read volume 1, but apparently half of it is about data scraping, and I expect it to be similarly detailed. So if I were you, that’s where I’d start.

Another option is searching for “robots.txt” on Google Scholar and trying various keywords like “legality”, “scraping”, “case law”, etc.


The internet.


If you have nothing constructive to say why say anything?


That was my answer, FFS.


Independent scrapers can launder the data between Reddit and AI consumers. The only folks this hurts are users seeking info via search engines and folks willing to kowtow to rules that are potentially low-impact to evade. Next steps would be (from an adversarial perspective) browser extensions that stream back data for ingestion, similar to RECAP for PACER [1].

[1] https://free.law/recap/faq

(full disclosure: I'm assisting someone pursuing regulatory action against Reddit in the EU over an issue separate from scraping; it's a valuable resource, but the folks who own and control it are meh)


Scraping laundering. Do we have a term for this?


Yes, right in the law - "fair use"


Even more basic, it's free speech. The data itself is public domain, so your free speech is not restricted and you don't need fair use exemptions from those restrictions. Only the access through the official system is restricted.



