
I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious, just not evil or bad. If you build a crawler, you should play nice. But bots A-D were easily discouraged.
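As a rough illustration of what 'playing nice' looks like - a minimal sketch, not anything from the post; the host, user agent and delays are placeholders, and reaching a real hidden service would additionally need a Tor SOCKS proxy:

    # Minimal "polite crawler" sketch: honour robots.txt and back off on 403.
    # BASE and USER_AGENT are placeholders, not values from the post.
    import time
    import urllib.robotparser
    import requests

    USER_AGENT = "example-crawler/0.1"
    BASE = "http://example.com"

    robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    robots.read()  # fetch and parse robots.txt once, up front

    def polite_get(path):
        url = BASE + path
        if not robots.can_fetch(USER_AGENT, url):
            return None                      # robots.txt says no: skip the URL
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        if resp.status_code == 403:
            time.sleep(600)                  # treat 403 as "go away": back off hard
            return None
        time.sleep(5)                        # fixed delay between requests
        return resp.text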

Eddie, however, is another problem. It overloads the network, doesn't crawl and doesn't parse the responses. This is not crawler behaviour...

The rest of the post is solid inductive reasoning (from my perspective): the bot is identifiable by its behaviour. It has a faster response time than a source-relay-source round trip. Thus the bot must originate at the relay itself.

This is supported by the fact that the anonymous relays were all set up at the same time, just before the attack, and that once the attack stopped, the majority of traffic through the relays stopped as well.
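The timing part of that argument boils down to a simple comparison; the numbers below are invented, purely to illustrate the inference:

    # Illustrative numbers only, not measurements from the post.
    observed_response_ms = 40       # hypothetical: how fast the bot answers
    min_relay_roundtrip_ms = 120    # hypothetical lower bound for source -> relay -> source

    if observed_response_ms < min_relay_roundtrip_ms:
        print("responses are too fast to come from behind the relay; bot runs at the relay")
    else:
        print("timing alone doesn't rule out a source behind the relay")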

There are also ways to keep your registration private without resorting to fraud, though a number of people probably think of fraud as the 'easy' solution.




> I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious

Most hidden services don't publish robots.txt files. The only ones that do are the proxy services (which are hidden services, but not usually 'hidden'). The purpose of the proxying is to discover and monitor what are usually illegal or malicious services.

I don't think there are legitimate crawlers on hidden services - there are a couple of drug market search engines, but they identify themselves outside of robots.txt.

It's really difficult to run a large-scale hidden service because of this - you need to be able to throttle or block connections, but you can't base that on the inbound circuit. You also need to set up guards (which the OP makes no mention of).
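One way to do that - my own sketch, not something the OP or the post describes - is to rate-limit on an application-level token such as a session cookie, since the service never sees a client IP or circuit identifier. The Flask app, cookie name and limits here are all illustrative:

    # Application-layer throttling sketch for a hidden service (assumed approach):
    # no client IP or circuit is visible, so rate-limit on a cookie we hand out.
    import time
    import uuid
    from flask import Flask, request, make_response, abort

    app = Flask(__name__)
    hits = {}        # session token -> timestamps of recent requests
    WINDOW = 60      # seconds
    LIMIT = 30       # max requests per token per window (arbitrary numbers)

    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def page(path):
        token = request.cookies.get("session")
        if token is None:
            # First contact: hand out a token instead of content. A dumb flooder
            # that never stores cookies never gets past this point.
            token = uuid.uuid4().hex
            resp = make_response("retry with cookie", 429)
            resp.set_cookie("session", token)
            return resp
        now = time.time()
        recent = [t for t in hits.get(token, []) if now - t < WINDOW]
        if len(recent) >= LIMIT:
            abort(429)                      # too many requests for this token
        recent.append(now)
        hits[token] = recent
        return "content for /%s" % path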

> It overloads the network, doesn't crawl and doesn't parse the responses.

It's likely adding those responses to a crawl queue that is tens of thousands of URLs long, to be parsed later.

Overloading the network is unintentional; usually your crawling is throttled by your circuit.
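A sketch of that 'fetch now, parse later' pattern, with placeholder names throughout - the point is just that responses pile up unparsed and the single fetch path (your circuit) is the only thing pacing the requests:

    # Fetch-now-parse-later crawl loop (illustrative; fetch_page is a stand-in
    # for whatever single circuit the crawler uses, which is what throttles it).
    from collections import deque

    frontier = deque(["http://example.com/"])   # placeholder seed URL
    fetched = []                                # raw responses, parsed later in bulk

    def fetch_page(url):
        return "<html>...</html>"               # stand-in for the real fetch

    while frontier and len(fetched) < 10_000:   # queue easily grows to tens of thousands
        url = frontier.popleft()
        fetched.append((url, fetch_page(url)))  # no parsing yet, just queue the response

    # Only a later pass extracts links from `fetched` and refills the frontier.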


> I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious, just not evil or bad. If you build a crawler, you should play nice.

https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea... :

> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.



