
I don't work at Internet Archive, but I've found Heritrix is great. Crawling is all about crazy wonky edge cases and it deals with lots of them.

Use the 1.x series, not the new version 2. 1.x is less convoluted and easier to use. I think version 2 may suffer from some second-system effect (http://en.wikipedia.org/wiki/Second-system_effect).



