
Some issues that appeared over the years:

Block outgoing connections to private IP ranges in your firewall. Otherwise your hosting provider might think you are trying to hack them. Apparently a lot of links out there point to hostnames that resolve to private IP addresses.
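A firewall rule is the real fix, but the crawler can also refuse such targets itself before connecting. Here is a minimal sketch using Python's standard `ipaddress` module. The function names `is_blocked_address` and `resolve_public_addresses` are made up for illustration, and the exact set of address classes to reject is an assumption you may want to adjust.

```python
import ipaddress
import socket


def is_blocked_address(ip_text):
    """Return True if the crawler should not connect to this address."""
    # getaddrinfo may return IPv6 link-local addresses with a scope suffix
    # such as "fe80::1%eth0"; drop the suffix before parsing.
    ip = ipaddress.ip_address(ip_text.split("%", 1)[0])
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_public_addresses(hostname, port=80):
    """Resolve a hostname and keep only addresses that are not blocked."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = {info[4][0] for info in infos}
    return sorted(a for a in addresses if not is_blocked_address(a))
```

If `resolve_public_addresses` returns an empty list, skip the URL. Connect to one of the returned addresses rather than resolving the name a second time, so the answer cannot change between the check and the fetch.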

Another problem with following links is that you are bound to run across some that point to malware command & control servers. My ISP received several complaints after authorities took over one and used the C&C server's domain as a honeypot. My crawler is on a whitelist now.

I had one person who vehemently complained that I was trying to hack him, because the software downloaded his robots.txt. I'm NOT kidding! :)

Make sure your robots.txt parsing is working correctly. At one point I had an undiscovered bug in the software which basically caused it to treat everything as allowed. Luckily someone was nice enough to let me know. He was really nice about it, even though he would have had every right to be angry.
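One cheap guard against that kind of "everything is allowed" bug is a self-test that feeds the parser a known robots.txt and checks that a disallowed path is actually refused. This is only a sketch using Python's standard `urllib.robotparser`, not the parser from my crawler. The user agent `ExampleBot`, the sample rules, and the helper names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules used only for the self-test.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def robots_self_test(user_agent="ExampleBot"):
    """Raise AssertionError if the parser allows a path it must refuse."""
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    assert parser.can_fetch(user_agent, "http://example.com/public/page")
    assert not parser.can_fetch(user_agent, "http://example.com/private/page")
```

Run `robots_self_test()` at crawler startup or in your test suite, so that a regression in the robots.txt handling stops the crawl instead of silently ignoring every Disallow rule.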

A major bottleneck is DNS queries. Run your own DNS server and also cache the hostname/IP pairs yourself. Do not even think about using your ISP's DNS server. If you bombard it with 100+ DNS requests/s, they WILL be angry. :)
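The "cache the pairs yourself" part could look roughly like the sketch below: an in-process cache in front of whatever resolver you use. The class name `DnsCache` and the fixed one-hour lifetime are assumptions for illustration. Real DNS records carry their own TTLs, which `socket.gethostbyname` does not expose, so a fixed lifetime is a simplification.

```python
import socket
import time


class DnsCache:
    """Cache hostname -> IP pairs for a fixed lifetime (an assumed policy)."""

    def __init__(self, ttl_seconds=3600, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl_seconds
        self._resolver = resolver  # callable: hostname -> IP string
        self._clock = clock        # callable returning seconds as a float
        self._entries = {}         # hostname -> (ip, expiry time)

    def lookup(self, hostname):
        """Return a cached IP if still fresh, otherwise resolve and store it."""
        key = hostname.lower().rstrip(".")
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[1] > now:
            return cached[0]
        ip = self._resolver(key)
        self._entries[key] = (ip, now + self._ttl)
        return ip
```

Usage is just `cache = DnsCache()` followed by `cache.lookup("example.com")`. Repeated lookups for the same host within the lifetime never reach the DNS server. Failed lookups are not cached here, so a host that does not resolve will be retried every time.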




> Run your own DNS server and even cache the hostname/IP pairs yourself.

This[1] might be a useful resource to get started:

[1] https://scans.io/

(Register and download the IPv4 Address Space data file to use as an initial cache and then append/update as you go.)


Bookmarked. Thanks!



