There are a few things that I hope you just left out in your description...
- There is no mention of implementing a crawl-delay. You should always wait for several seconds (better yet, a minute) between requests to the same host (see the sketch after this comment).
- Do you follow redirects when requesting the robots.txt? You should! Some sites send you a redirect to a different URL even for robots.txt. In most cases it is just a slightly different hostname, like www.domain.com instead of domain.com. But it can redirect you to somewhere completely different in some cases.
- You probably don't want to crawl anything that ends with .jpg, .gif, and definitely not something like .avi, .wmv or .mkv. There are a LOT more file-extensions that you'll want to ignore.
I agree with cmiles74 that using a database is probably a bad idea. For a sizeable crawl (say a billion pages) this database will get pretty damn big. I doubt that you will be able to get decent performance out of anything with "SQL" in its name for such a use-case, unless you throw a ton of hardware at it. Building your own specialized solution for this would probably be a lot faster and less resource-intensive.
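To make those politeness points concrete, here is a minimal sketch in Python (no code from the article is shown in this thread, so treat the delay value and extension list as illustrative only); note that urllib's robots.txt fetch follows redirects for you:

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    CRAWL_DELAY = 10          # seconds between requests to the same host (illustrative)
    SKIP_EXTENSIONS = ('.jpg', '.gif', '.png', '.avi', '.wmv', '.mkv', '.zip', '.iso')

    last_hit = {}             # host -> timestamp of our last request to it
    robots_cache = {}         # host -> parsed robots.txt

    def allowed(url, user_agent='MyCrawler'):
        # Skip binary-looking extensions and anything robots.txt forbids.
        parts = urlparse(url)
        if parts.path.lower().endswith(SKIP_EXTENSIONS):
            return False
        if parts.netloc not in robots_cache:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url('http://%s/robots.txt' % parts.netloc)
            rp.read()         # urllib follows HTTP redirects here
            robots_cache[parts.netloc] = rp
        return robots_cache[parts.netloc].can_fetch(user_agent, url)

    def wait_for_host(url):
        # Enforce the per-host crawl delay before the next request.
        host = urlparse(url).netloc
        elapsed = time.time() - last_hit.get(host, 0)
        if elapsed < CRAWL_DELAY:
            time.sleep(CRAWL_DELAY - elapsed)
        last_hit[host] = time.time()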
>>>You should always wait for several seconds (better yet, a minute) between requests to the same host.
This a thousand times.
I learned this lesson AFTER getting several angry emails from Admins and getting outright banned from one site for not having any delay between the requests in the first crawler I built.
It's also missing some form of lock to prevent slow requests from cascading (to the same host or IP) when the next instance of your crawler comes along. I ran into this once when I had a crawler spin up from a cron job. No problem, unless the previous crawler was still running.
Speaking from experience, it's far better to lock your crawler down from the beginning than to do it after you kill someone's site. I did expose a few slow pages for some sites, but not before crashing the whole thing.
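For the cron-overlap case, even a simple advisory file lock is enough to stop two crawler runs from piling up. A sketch in Python (Unix only; the lock file path is just an example):

    import fcntl
    import sys

    LOCK_PATH = '/tmp/mycrawler.lock'   # example path, pick something sensible

    lock_file = open(LOCK_PATH, 'w')
    try:
        # Non-blocking exclusive lock: fails immediately if another run holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print('Previous crawl is still running; exiting.')
        sys.exit(0)

    # ... run the crawl; the lock is released automatically when the process exits ...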
For http://www.samuru.com we store our crawl in Google's Datastore. It gets big. By not storing the whole page (we do content extraction first) we save quite a bit; typically the content has much less markup than the template does.
We extract other things, like the Open Graph image and the author's social media links, and store them independently of the content.
I want to mention Nutch here: http://nutch.apache.org/
It has been around for a while, and a lot of thought was put into its design. For instance, while people here are discussing data stores, Nutch uses Hadoop.
The web is probably bigger than you think. Google says: "when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!" (July 2008)
You might consider just crawling certain parts of the web, or using a search engine API (like Yahoo! BOSS) to gather relevant links and crawl from there, using a depth limit. Just an idea.
I used to be on the Google indexing team. Disregarding limits on the length of URLs, the size of the visible web is already infinite. For instance, there are many calendar pages out there that will happily give you month after month ad infinitum if you keep following the "next" link.
Now, depending on how you prune your crawl to get rid of "uninteresting" content (such as infinite calendars) and how you deduplicate the pages you find, you'll come up with vastly varying estimates of how big the visible web is.
Edit: on a side note, don't crawl the web using a naive depth-first search. You'll get stuck in some uninteresting infinitely deep branch of the web.
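A sketch of that advice in Python, assuming a hypothetical fetch_and_extract_links() that does the fetching and <a>-tag parsing: breadth-first with a depth cap, so no single infinitely deep branch can swallow the crawl.

    from collections import deque

    def crawl_bfs(seeds, max_depth=5):
        # Breadth-first frontier with a per-URL depth, instead of naive DFS.
        seen = set(seeds)
        queue = deque((url, 0) for url in seeds)
        while queue:
            url, depth = queue.popleft()
            for link in fetch_and_extract_links(url):   # placeholder function
                if link not in seen and depth + 1 <= max_depth:
                    seen.add(link)
                    queue.append((link, depth + 1))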
You're right. I forgot to write it explicitly in the article, but if someone follows the instructions (extract all <a> tags and add them to the index) that behavior is implied.
You definitely want to store raw text in flat files and store metadata in a database. By metadata I'm not referring to only the values found in the html page's meta tags but other things like page hash, word count, link count, etc. It all depends on what you are doing. If it's in a database you will have a very very big file for that archives table (mongodb and mysql+innodb come to mind).
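As a rough sketch of that split (SQLite and this schema are purely for illustration; the point is just that the big blobs stay out of the database):

    import hashlib
    import sqlite3
    from pathlib import Path

    ARCHIVE_DIR = Path('archive')              # flat-file store for raw page bodies
    db = sqlite3.connect('crawl_meta.db')      # metadata only
    db.execute('''CREATE TABLE IF NOT EXISTS pages
                  (url TEXT PRIMARY KEY, body_hash TEXT,
                   word_count INTEGER, link_count INTEGER)''')

    def store_page(url, body, word_count, link_count):
        # body is the raw page as bytes.
        body_hash = hashlib.sha1(body).hexdigest()
        # Shard the flat files by hash prefix so no single directory gets huge.
        path = ARCHIVE_DIR / body_hash[:2] / body_hash
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(body)
        db.execute('INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)',
                   (url, body_hash, word_count, link_count))
        db.commit()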
What would a filesystem do better than a DB in that case?
After all, a filesystem is just a database whose primary keys are mostly filenames.
Most of them use the same data structures (e.g. B+ trees), and especially the newer ones are copy-on-write systems, like CouchDB.
I guess if you want to have backups of that table, you probably would like to keep it at a manageable size, but why not just save the html blobs in a separate table?
I respectfully disagree. If ever there was a use case for a NoSQL storage solution, web crawling certainly seems to be it. I've used Elasticsearch for indexing and Cassandra for storage; performance was more than good enough for our use cases. It was easy to scale as well.
A NoSQL solution is good because it's a DBMS (allows you to order collections). :)
A filesystem is not good because you would need to order the link files by visit time in descending order (which most filesystems don't support), and to check whether a URL is already in the index you would have to store it with its MD5 hash as the file name.
A small DBMS like SQLite is not good for obvious reasons.
I would highly recommend storing the working set of links in RAM (with checkpointing to write it out to disk periodically). A Redis Set (for visited links) + Sorted Set (for unvisited links, ordered by priority) is perfect for this, since it lets you take up one full machine's RAM and does checkpointing automatically. If your crawl is too big to fit in RAM, get more machines and shard by URL hash. As others have pointed out, the file content itself should go in files, ideally ones that you can write to with straight appends.
The reason you don't want to hit the disk for each link (as both MySQL and Postgres usually do, barring caching) is that there can be hundreds to thousands of links on a page. A disk hit takes ~10ms; if you need to run hundreds of those, it's well over a second per page just to figure out which links on it are unvisited. Accessing main memory is about 100,000 times faster; even with sharding and RPC overhead for a distributed memory cache, you end up way ahead.
The reason to write the crawl text to an append-only log file is that disk seek times are bound by the rotation speed of the disk, which hasn't changed much recently, while disk bandwidth scales with how much data fits on a track divided by the rotation time, which has gone way up along with capacity. So appends are much more efficient on disk than seeks are.
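A minimal sketch of that Redis scheme using redis-py (key names and the priority scoring are arbitrary; ZPOPMAX needs Redis 5.0+). Redis's built-in RDB/AOF persistence gives you the periodic checkpointing mentioned above for free.

    import redis

    r = redis.Redis()

    def enqueue(url, priority):
        # Add a URL to the frontier (sorted set) unless it's already visited (set).
        if not r.sismember('visited', url):
            r.zadd('frontier', {url: priority})

    def next_url():
        # Pop the highest-priority unvisited URL and mark it visited.
        popped = r.zpopmax('frontier')
        if not popped:
            return None
        url = popped[0][0].decode()
        r.sadd('visited', url)
        return url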
Udacity CS101 [1] also goes through the basics of building a web crawler. It's a lot more lightweight (no backend, etc), but it's a fun overview and can be completed pretty quickly.
The only complicated part that I've run into when writing crawlers is accidental tarpits. It's very easy to run into a situation in which you're repeatedly requesting the same content via many different URLs.
For example, when a tracking parameter can be appended to any URL within the site, you can quickly get to billions of permutations for a single site. The canonical tag solves the problem when it's there, but I still haven't seen a simple solution to the problem when it's not.
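One partial mitigation (it doesn't solve the general case either) is to normalize URLs before they enter the frontier, stripping parameters you believe are tracking-only; the list below is just a guess at common offenders:

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign', 'utm_term',
                       'utm_content', 'gclid', 'fbclid', 'ref'}

    def normalize(url):
        # Drop suspected tracking parameters and the fragment before deduplication.
        parts = urlparse(url)
        query = [(k, v) for k, v in parse_qsl(parts.query)
                 if k not in TRACKING_PARAMS]
        return urlunparse(parts._replace(query=urlencode(query), fragment=''))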
Store a hash of each file you already have. If too many items from a single website collide, throw up a warning or error, flag it for human review, or use some fuzzy method of identifying clashing URLs.
If that's too slow/space-intensive, try a bloom filter.
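A rough sketch of that idea in Python (the threshold and the flagging are made up):

    import hashlib

    seen_hashes = set()          # hashes of bodies we've already stored
    collisions_per_host = {}     # host -> how many duplicate bodies it served

    def is_duplicate(host, body, flag_threshold=1000):
        digest = hashlib.md5(body).hexdigest()
        if digest in seen_hashes:
            collisions_per_host[host] = collisions_per_host.get(host, 0) + 1
            if collisions_per_host[host] > flag_threshold:
                print('warning: %s looks like a tarpit, flagging for review' % host)
            return True
        seen_hashes.add(digest)
        return False

Swapping the in-memory set for a Bloom filter, as suggested, trades a small false-positive rate for a bounded memory footprint.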
If those URLs aren't excluded by robots.txt rules and don't have a canonical link, that's not really a problem you can solve: the bot can't know whether the pages are different or not, so it has to treat them as different pages.
This is the reason why crawlers don't try to fill in forms.
If you are really sure that the pages are the same, try checking the body content (if two or more pages have the same MD5 hash of their content, they are the same page), or look for a form that generates those URLs.
Crawlers are one of those projects that are honestly best left to someone else: fun as a hobby, but a nightmare to get right, and someone has already done the work for you. The exception is limited-use tools like Wget, which can give you practical results for small-domain retrieval but will kill you on CPU and memory and won't scale; use a better tool, or customize an existing one, if you need to support large-scale crawls.
Some of the "little things" matter much more than your content analyzer or HTTP parsing - DNS performance and multi-homing being just a few that can have drastic effects.
Just as an example of how complex it gets, here's a brief overview of some of the features all crawlers should take into account: http://en.wikipedia.org/wiki/Web_crawler
Also, because not every site has a sitemap or a well-linked structure, you may have to turn to social sources like Facebook and Twitter if you want to get everything.
A couple of additional points:
- Encoding: your input will come in different encodings and it's quite hard to guess the correct one, but you should at least try to convert everything into a single encoding (e.g. UTF-8).
- By setting CURLOPT_ENCODING to '' you don't have to worry about (un)gzipping, as curl will do this for you (or it should).
- It might be a good idea to use a hash of the URL as the URL id (e.g. CRC32).
- You should check Content-Length and Content-Type to avoid downloading huge files (see the sketch below).
Btw, your coding style is very disturbing: there shouldn't be spaces before or after ->.
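To illustrate the Content-Length/Content-Type check in Python (the article appears to use PHP's curl, but the idea is the same; the 5 MB cap is arbitrary, and not every server sends Content-Length):

    import requests

    MAX_BYTES = 5 * 1024 * 1024   # skip anything claiming to be bigger than ~5 MB

    def worth_downloading(url):
        # A HEAD request tells us the type and (sometimes) the size before we commit.
        head = requests.head(url, allow_redirects=True, timeout=10)
        ctype = head.headers.get('Content-Type', '')
        clength = int(head.headers.get('Content-Length', 0) or 0)
        return ctype.startswith('text/html') and clength <= MAX_BYTES

    def fetch(url):
        # requests decompresses gzip transparently and guesses the encoding,
        # so resp.text is already decoded (chardet is a reasonable fallback).
        resp = requests.get(url, timeout=10)
        return resp.text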