There are a few things that I hope you just left out in your description...
- There is no mention of implementing a crawl-delay. You should always wait for several seconds (better yet, a minute) between requests to the same host (see the sketch after this comment).
- Do you follow redirects when requesting the robots.txt? You should! Some sites send you a redirect to a different URL even for robots.txt. In most cases it is just a slightly different hostname, like www.domain.com instead of domain.com. But it can redirect you to somewhere completely different in some cases.
- You probably don't want to crawl anything that ends with .jpg, .gif, and definitely not something like .avi, .wmv or .mkv. There are a LOT more file-extensions that you'll want to ignore.
I agree with cmiles74 that using a database is probably a bad idea. For a sizeable crawl (say a billion pages) this database will get pretty damn big. I doubt that you will be able to get decent performance out of anything with "SQL" in its name for such a use-case, unless you throw a ton of hardware at it. Building your own specialized solution for this would probably be a lot faster and less resource-intensive.
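To make those politeness points concrete, here is a minimal sketch in Python (no code from the article is shown in this thread, so treat the delay value and extension list as illustrative only); note that urllib's robots.txt fetch follows redirects for you:

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    CRAWL_DELAY = 10          # seconds between requests to the same host (illustrative)
    SKIP_EXTENSIONS = ('.jpg', '.gif', '.png', '.avi', '.wmv', '.mkv', '.zip', '.iso')

    last_hit = {}             # host -> timestamp of our last request to it
    robots_cache = {}         # host -> parsed robots.txt

    def allowed(url, user_agent='MyCrawler'):
        # Skip binary-looking extensions and anything robots.txt forbids.
        parts = urlparse(url)
        if parts.path.lower().endswith(SKIP_EXTENSIONS):
            return False
        if parts.netloc not in robots_cache:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url('http://%s/robots.txt' % parts.netloc)
            rp.read()         # urllib follows HTTP redirects here
            robots_cache[parts.netloc] = rp
        return robots_cache[parts.netloc].can_fetch(user_agent, url)

    def wait_for_host(url):
        # Enforce the per-host crawl delay before the next request.
        host = urlparse(url).netloc
        elapsed = time.time() - last_hit.get(host, 0)
        if elapsed < CRAWL_DELAY:
            time.sleep(CRAWL_DELAY - elapsed)
        last_hit[host] = time.time()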
>>>You should always wait for several seconds (better yet, a minute) between requests to the same host.
This a thousand times.
I learned this lesson AFTER getting several angry emails from Admins and getting outright banned from one site for not having any delay between the requests in the first crawler I built.
It's also missing some form of lock to prevent slow requests from cascading (to the same host or IP) when the next instance of your crawler comes along. I ran into this once when I had a crawler spin up from a cron job. No problem, unless the previous crawler was still running.
Speaking from experience, it's far better to lock your crawler down from the beginning than to do it after you kill someone's site. I did expose a few slow pages for some sites, but not before crashing the whole thing.
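For the cron-overlap case, even a simple advisory file lock is enough to stop two crawler runs from piling up. A sketch in Python (Unix only; the lock file path is just an example):

    import fcntl
    import sys

    LOCK_PATH = '/tmp/mycrawler.lock'   # example path, pick something sensible

    lock_file = open(LOCK_PATH, 'w')
    try:
        # Non-blocking exclusive lock: fails immediately if another run holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        print('Previous crawl is still running; exiting.')
        sys.exit(0)

    # ... run the crawl; the lock is released automatically when the process exits ...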
For http://www.samuru.com we store our crawl in Google's Datastore. It gets big. By not storing the whole page (we do content extraction first) we save quite a bit; typically the content has much less markup than the template does.
We extract other things, like the Open Graph image and the author's social media links, and store them independently of the content.
I want to mention Nutch here: http://nutch.apache.org/
It has been around for a while, and a lot of thought was put into its design. For instance, while people here are discussing data stores, Nutch uses Hadoop.
The web is probably bigger than you think. Google says: "when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!" (July 2008)
You might consider just crawling certain parts of the web, or using a search engine API (like Yahoo! BOSS) to gather relevant links and crawl from there, using a depth limit. Just an idea.
I used to be on the Google indexing team. Disregarding limits on the length of URLs, the size of the visible web is already infinite. For instance, there are many calendar pages out there that will happily give you month after month ad infinitum if you keep following the "next" link.
Now, depending on how you prune your crawl to get rid of "uninteresting" content (such as infinite calendars) and how you deduplicate the pages you find, you'll come up with vastly varying estimates of how big the visible web is.
Edit: on a side note, don't crawl the web using a naive depth-first search. You'll get stuck in some uninteresting infinitely deep branch of the web.
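A sketch of that advice in Python, assuming a hypothetical fetch_and_extract_links() that does the fetching and <a>-tag parsing: breadth-first with a depth cap, so no single infinitely deep branch can swallow the crawl.

    from collections import deque

    def crawl_bfs(seeds, max_depth=5):
        # Breadth-first frontier with a per-URL depth, instead of naive DFS.
        seen = set(seeds)
        queue = deque((url, 0) for url in seeds)
        while queue:
            url, depth = queue.popleft()
            for link in fetch_and_extract_links(url):   # placeholder function
                if link not in seen and depth + 1 <= max_depth:
                    seen.add(link)
                    queue.append((link, depth + 1))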
You're right. I forgot to write it explicitly in the article, but if someone follows the instructions (extract all <a> tags and add them to the index) that behavior is implied.
You definitely want to store raw text in flat files and store metadata in a database. By metadata I'm not referring to only the values found in the html page's meta tags but other things like page hash, word count, link count, etc. It all depends on what you are doing. If it's in a database you will have a very very big file for that archives table (mongodb and mysql+innodb come to mind).
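As a rough sketch of that split (SQLite and this schema are purely for illustration; the point is just that the big blobs stay out of the database):

    import hashlib
    import sqlite3
    from pathlib import Path

    ARCHIVE_DIR = Path('archive')              # flat-file store for raw page bodies
    db = sqlite3.connect('crawl_meta.db')      # metadata only
    db.execute('''CREATE TABLE IF NOT EXISTS pages
                  (url TEXT PRIMARY KEY, body_hash TEXT,
                   word_count INTEGER, link_count INTEGER)''')

    def store_page(url, body, word_count, link_count):
        # body is the raw page as bytes.
        body_hash = hashlib.sha1(body).hexdigest()
        # Shard the flat files by hash prefix so no single directory gets huge.
        path = ARCHIVE_DIR / body_hash[:2] / body_hash
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(body)
        db.execute('INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)',
                   (url, body_hash, word_count, link_count))
        db.commit()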
What would a filesystem do better than a DB in that case?
After all, a filesystem is just a database whose primary keys are mostly filenames.
Most of them use the same data structures (e.g. B+ trees), and especially the newer ones are copy-on-write systems, like CouchDB.
I guess if you want to have backups of that table, you probably would like to keep it at a manageable size, but why not just save the html blobs in a separate table?
I respectfully disagree. If ever there was a use case for a NoSQL storage solution, web crawling certainly seems to be it. I've used Elasticsearch for indexing and Cassandra for storage; performance was more than good enough for our use cases. It was easy to scale as well.
A NoSQL solution is good because it's a DBMS (allows you to order collections). :)
A filesystem is not good because you would need to order the link files by visit time in descending order (which most filesystems don't support), and to check whether a URL is already in the index you would have to store it with its MD5 hash as the file name.
A small DBMS like SQLite is not good for obvious reasons.
I would highly recommend storing the working set of links in RAM (with checkpointing to write it out to disk periodically). A Redis Set (for visited links) + Sorted Set (for unvisited links, ordered by priority) is perfect for this, since it lets you take up one full machine's RAM and does checkpointing automatically. If your crawl is too big to fit in RAM, get more machines and shard by URL hash. As others have pointed out, the file content itself should go in files, ideally ones that you can write to with straight appends.
The reason you don't want to hit the disk for each link (as both MySQL and Postgres usually do, barring caching) is that there can be hundreds to thousands of links on a page. A disk hit takes ~10ms; if you need to run hundreds of those, it's well over a second per page just to figure out which links on it are unvisited. Accessing main memory is about 100,000 times faster; even with sharding and RPC overhead for a distributed memory cache, you end up way ahead.
The reason to write the crawl text to an append-only log file is that disk seek times are bound by the rotation speed of the disk, which hasn't changed much recently, while disk bandwidth scales with how much data fits on a track divided by the rotation time, which has gone way up along with capacity. So appends are much more efficient on disk than seeks are.
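A minimal sketch of that Redis scheme using redis-py (key names and the priority scoring are arbitrary; ZPOPMAX needs Redis 5.0+). Redis's built-in RDB/AOF persistence gives you the periodic checkpointing mentioned above for free.

    import redis

    r = redis.Redis()

    def enqueue(url, priority):
        # Add a URL to the frontier (sorted set) unless it's already visited (set).
        if not r.sismember('visited', url):
            r.zadd('frontier', {url: priority})

    def next_url():
        # Pop the highest-priority unvisited URL and mark it visited.
        popped = r.zpopmax('frontier')
        if not popped:
            return None
        url = popped[0][0].decode()
        r.sadd('visited', url)
        return url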
Udacity CS101 [1] also goes through the basics of building a web crawler. It's a lot more lightweight (no backend, etc), but it's a fun overview and can be completed pretty quickly.
The only complicated part that I've run into when writing crawlers is accidental tarpits. It's very easy to run into a situation in which you're repeatedly requesting the same content via many different URLs.
For example, when a tracking parameter can be appended to any URL within the site, you can quickly get to billions of permutations for a single site. The canonical tag solves the problem when it's there, but I still haven't seen a simple solution to the problem when it's not.
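One partial mitigation (it doesn't solve the general case either) is to normalize URLs before they enter the frontier, stripping parameters you believe are tracking-only; the list below is just a guess at common offenders:

    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign', 'utm_term',
                       'utm_content', 'gclid', 'fbclid', 'ref'}

    def normalize(url):
        # Drop suspected tracking parameters and the fragment before deduplication.
        parts = urlparse(url)
        query = [(k, v) for k, v in parse_qsl(parts.query)
                 if k not in TRACKING_PARAMS]
        return urlunparse(parts._replace(query=urlencode(query), fragment=''))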
Store a hash of each file you already have. If too many items from a single website collide, throw up a warning or error, flag it for human review, or use some fuzzy method of identifying clashing URLs.
If that's too slow/space-intensive, try a bloom filter.
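A rough sketch of that idea in Python (the threshold and the flagging are made up):

    import hashlib

    seen_hashes = set()          # hashes of bodies we've already stored
    collisions_per_host = {}     # host -> how many duplicate bodies it served

    def is_duplicate(host, body, flag_threshold=1000):
        digest = hashlib.md5(body).hexdigest()
        if digest in seen_hashes:
            collisions_per_host[host] = collisions_per_host.get(host, 0) + 1
            if collisions_per_host[host] > flag_threshold:
                print('warning: %s looks like a tarpit, flagging for review' % host)
            return True
        seen_hashes.add(digest)
        return False

Swapping the in-memory set for a Bloom filter, as suggested, trades a small false-positive rate for a bounded memory footprint.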
If those URLs aren't excluded by robots.txt rules and don't have a canonical link, that's not really a problem you can solve: the bot can't know whether the pages are different or not, so it has to treat them as different pages.
This is the reason why crawlers don't try to fill in forms.
If you are really sure that the pages are the same, try checking the body content (if two or more pages have the same MD5 hash of their content, they are the same page), or look for a form that generates those URLs.
Crawlers are one of those projects that are honestly best left to someone else: fun as a hobby, but a nightmare to get right, and someone has already done the work for you. The exception is limited-use tools like Wget, which can give you practical results for small-domain retrieval but will kill you on CPU and memory and won't scale; use a better tool, or customize an existing one, if you need to support large-scale crawls.
Some of the "little things" matter much more than your content analyzer or HTTP parsing - DNS performance and multi-homing being just a few that can have drastic effects.
Just as an example of how complex it gets, here's a brief overview of some of the features all crawlers should take into account: http://en.wikipedia.org/wiki/Web_crawler
Also, because not every site has a sitemap or a well-linked structure, you may have to turn to social sources like Facebook and Twitter if you want to get everything.
A couple of additional points:
- Encoding: your input will come in different encodings and it's quite hard to guess the correct one, but you should at least try to convert everything into a single encoding (e.g. UTF-8).
- By setting CURLOPT_ENCODING to '' you don't have to worry about (un)gzipping, as curl will do this for you (or it should).
- It might be a good idea to use a hash of the URL as the URL id (e.g. CRC32).
- You should check Content-Length and Content-Type to avoid downloading huge files (see the sketch below).
Btw, your coding style is very disturbing: there shouldn't be spaces before or after ->.
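To illustrate the Content-Length/Content-Type check in Python (the article appears to use PHP's curl, but the idea is the same; the 5 MB cap is arbitrary, and not every server sends Content-Length):

    import requests

    MAX_BYTES = 5 * 1024 * 1024   # skip anything claiming to be bigger than ~5 MB

    def worth_downloading(url):
        # A HEAD request tells us the type and (sometimes) the size before we commit.
        head = requests.head(url, allow_redirects=True, timeout=10)
        ctype = head.headers.get('Content-Type', '')
        clength = int(head.headers.get('Content-Length', 0) or 0)
        return ctype.startswith('text/html') and clength <= MAX_BYTES

    def fetch(url):
        # requests decompresses gzip transparently and guesses the encoding,
        # so resp.text is already decoded (chardet is a reasonable fallback).
        resp = requests.get(url, timeout=10)
        return resp.text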