
I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious, just not evil or bad. If you build a crawler, you should play nice. But bots A-D were easily discouraged.
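As a rough illustration of what 'playing nice' looks like - a minimal sketch, not anything from the post; the host, user agent and delays are placeholders, and reaching a real hidden service would additionally need a Tor SOCKS proxy:

    # Minimal "polite crawler" sketch: honour robots.txt and back off on 403.
    # BASE and USER_AGENT are placeholders, not values from the post.
    import time
    import urllib.robotparser
    import requests

    USER_AGENT = "example-crawler/0.1"
    BASE = "http://example.com"

    robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
    robots.read()  # fetch and parse robots.txt once, up front

    def polite_get(path):
        url = BASE + path
        if not robots.can_fetch(USER_AGENT, url):
            return None                      # robots.txt says no: skip the URL
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        if resp.status_code == 403:
            time.sleep(600)                  # treat 403 as "go away": back off hard
            return None
        time.sleep(5)                        # fixed delay between requests
        return resp.text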

Eddie, however, is another problem. It overloads the network, doesn't crawl and doesn't parse the responses. This is not crawler behaviour...

The rest of the post is solid inductive reasoning (from my perspective): the bot is identifiable by its behaviour. It has a faster response time than a source-relay-source round trip. Thus the bot must originate at the relay itself.

This is supported by the fact that the anonymous relays were all set up at the same time, just before the attack, and that once the attack stopped, the majority of traffic through the relays stopped as well.
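The timing part of that argument boils down to a simple comparison; the numbers below are invented, purely to illustrate the inference:

    # Illustrative numbers only, not measurements from the post.
    observed_response_ms = 40       # hypothetical: how fast the bot answers
    min_relay_roundtrip_ms = 120    # hypothetical lower bound for source -> relay -> source

    if observed_response_ms < min_relay_roundtrip_ms:
        print("responses are too fast to come from behind the relay; bot runs at the relay")
    else:
        print("timing alone doesn't rule out a source behind the relay")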

There are also ways to keep your registration private without resorting to fraud, though a number of people probably think of fraud as the 'easy' solution.




> I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious

Most hidden services don't publish robots.txt files. The only ones that do are the proxy services (which are hidden services, but not usually 'hidden'). The purpose of the proxying is to discover and monitor what are usually illegal or malicious services.

I don't think there are legitimate crawlers on hidden services - there are a couple of drug market search engines, but they identify themselves outside of robots.txt.

It's really difficult to run a large-scale hidden service because of this - you need to be able to throttle or block connections, but you can't base that on the inbound circuit. You also need to set up guards (which the OP makes no mention of).
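One way to do that - my own sketch, not something the OP or the post describes - is to rate-limit on an application-level token such as a session cookie, since the service never sees a client IP or circuit identifier. The Flask app, cookie name and limits here are all illustrative:

    # Application-layer throttling sketch for a hidden service (assumed approach):
    # no client IP or circuit is visible, so rate-limit on a cookie we hand out.
    import time
    import uuid
    from flask import Flask, request, make_response, abort

    app = Flask(__name__)
    hits = {}        # session token -> timestamps of recent requests
    WINDOW = 60      # seconds
    LIMIT = 30       # max requests per token per window (arbitrary numbers)

    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def page(path):
        token = request.cookies.get("session")
        if token is None:
            # First contact: hand out a token instead of content. A dumb flooder
            # that never stores cookies never gets past this point.
            token = uuid.uuid4().hex
            resp = make_response("retry with cookie", 429)
            resp.set_cookie("session", token)
            return resp
        now = time.time()
        recent = [t for t in hits.get(token, []) if now - t < WINDOW]
        if len(recent) >= LIMIT:
            abort(429)                      # too many requests for this token
        recent.append(now)
        hits[token] = recent
        return "content for /%s" % path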

> It overloads the network, doesn't crawl and doesn't parse the responses.

It's likely adding those responses to a crawl queue that is tens of thousands of URLs long, to be parsed later.

Overloading the network is unintentional; usually your crawling is throttled by your circuit.
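A sketch of that 'fetch now, parse later' pattern, with placeholder names throughout - the point is just that responses pile up unparsed and the single fetch path (your circuit) is the only thing pacing the requests:

    # Fetch-now-parse-later crawl loop (illustrative; fetch_page is a stand-in
    # for whatever single circuit the crawler uses, which is what throttles it).
    from collections import deque

    frontier = deque(["http://example.com/"])   # placeholder seed URL
    fetched = []                                # raw responses, parsed later in bulk

    def fetch_page(url):
        return "<html>...</html>"               # stand-in for the real fetch

    while frontier and len(fetched) < 10_000:   # queue easily grows to tens of thousands
        url = frontier.popleft()
        fetched.append((url, fetch_page(url)))  # no parsing yet, just queue the response

    # Only a later pass extracts links from `fetched` and refills the frontier.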


> I'd say not holding to the standards of robots.txt and 403-Forbidden is quite malicious, just not evil or bad. If you build a crawler, you should play nice.

https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea... :

> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.



