I would highly recommend storing the working set of links in RAM (with checkpointing to write it out to disk periodically). A Redis Set (for visited links) + Sorted Set (for unvisited links, ordered by priority) is perfect for this, since it lets you use a full machine's RAM and Redis does the checkpointing automatically. If your crawl is too big to fit in RAM, get more machines and shard by URL hash. As others have pointed out, the fetched page content itself should go in files, ideally ones that you can write to with straight appends.
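A minimal sketch of that kind of frontier, assuming redis-py; the key names ("visited", "frontier"), the shard hostnames, and the convention that a lower score gets fetched sooner are all my own placeholders, not anything standard:

    import hashlib
    import redis

    SHARDS = [redis.Redis(host=h) for h in ("crawl-redis-0", "crawl-redis-1")]  # hypothetical hosts

    def shard_for(url):
        # Shard by URL hash so each machine owns a disjoint slice of the URL space.
        h = int(hashlib.sha1(url.encode()).hexdigest(), 16)
        return SHARDS[h % len(SHARDS)]

    def enqueue(url, priority):
        r = shard_for(url)
        # Skip URLs we've already fetched; lower score = fetched sooner.
        if not r.sismember("visited", url):
            r.zadd("frontier", {url: priority}, nx=True)

    def next_url(r):
        # Pop the best-priority URL from one shard and mark it visited.
        popped = r.zpopmin("frontier", 1)
        if not popped:
            return None
        url, _score = popped[0]
        r.sadd("visited", url)
        return url.decode()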

The reason you don't want to hit the disk for each link (as both MySQL and Postgres usually do, barring caching) is that there can be hundreds to thousands of links on a page. A disk hit takes ~10ms; if you need to run hundreds of those, it's well over a second per page just to figure out which links on it are unvisited. Accessing main memory is about 100,000 times faster; even with sharding and RPC overhead for a distributed memory cache, you end up way ahead.
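You can also keep the RPC overhead down by batching all of a page's outlink checks into one round trip. A sketch using a redis-py pipeline, assuming the same "visited" key as above and that the given links all belong to one shard (real code would group them by shard first):

    def unvisited(r, links):
        # One network round trip for the whole batch instead of one per link.
        pipe = r.pipeline()
        for url in links:
            pipe.sismember("visited", url)
        flags = pipe.execute()
        return [url for url, seen in zip(links, flags) if not seen]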

The reason to write the crawl text to an append-only log file is that disk seek times are bound by the rotation speed and head movement of the disk, which haven't changed much recently, while disk bandwidth is bound roughly by the data per track divided by the rotation time, and areal density has gone way up. So appends are much more efficient on disk than seeks are.
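One way such a log might look, as a sketch only: length-prefixed (url, body) records appended to a plain file, where the file name and record format here are just illustrative:

    import struct

    def append_record(log, url, body):
        # Fixed-size header (url length, body length), then the payloads;
        # the write is a pure append, so the disk head never seeks backward.
        u = url.encode()
        log.write(struct.pack(">II", len(u), len(body)))
        log.write(u)
        log.write(body)

    with open("crawl-00001.log", "ab") as log:  # hypothetical file name
        append_record(log, "https://example.com/", b"<html>...</html>")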



