According to the US courts, the robots.txt file is legally meaningless. If a site responds with a 200 status code, granting you access, then you can legally scrape it all you want. If it requires you to log in, then you have to follow the terms you agreed to when creating an account. Public means public, though, and if Reddit doesn't want to make the content private (put it behind a login), then we can scrape away.
Note that scraping, regardless of the level of permission, doesn't mean you can do anything you want with the content. Copyright still applies. But you can scrape it, and if your use falls under Fair Use or another exception to copyright law, then you can go ahead and do it without needing any permission from the authors.
I liked the chapter on the DMCA from the 5-volume E-Commerce & Internet Law. It was super detailed.
I haven’t read volume 1, but apparently half of it is about data scraping, and I expect it to be similarly detailed. So if I were you, that’s where I’d start.
Another option is searching Google Scholar for “robots.txt” combined with keywords like “legality”, “scraping”, and “case law”.
Independent scrapers can launder the data between Reddit and AI consumers. The only folks this hurts are users seeking info via search engines and folks willing to kowtow to rules that cost little to evade. The next step (from an adversarial perspective) would be browser extensions that stream page data back for ingestion, similar to RECAP for PACER [1].
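Concretely, a RECAP-style extension is little more than a content script that mirrors pages the user has already loaded back to a shared archive. Here is a minimal TypeScript sketch, assuming a hypothetical ingestion endpoint (the URL and endpoint are made up for illustration):

    // content-script.ts -- runs in the page after it loads
    async function mirrorPage(): Promise<void> {
      const payload = {
        url: location.href,
        fetchedAt: new Date().toISOString(),
        html: document.documentElement.outerHTML,
      };
      // Best-effort POST to the (hypothetical) archive's ingestion API.
      await fetch("https://archive.example.org/ingest", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(payload),
      });
    }

    mirrorPage().catch(() => {
      // Ingestion is opportunistic; failures are silently dropped.
    });

As with RECAP, deduplication and normalization would presumably happen server-side, so each user only contributes pages they already had access to.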
(Full disclosure: I'm assisting someone pursuing regulatory action against Reddit in the EU over an issue separate from scraping. It's a valuable resource, but the folks who own and control it are meh.)
Even more basic: it's free speech. The data itself is in the public domain, so your free speech isn't restricted and you don't need a fair-use exemption from those restrictions. Only access through the official system is restricted.