
blekko has been running a crawl+index of several billion pages for 2 years now, so perhaps I can talk about this a little.

If you want access to a big crawl so you can grep through it for interesting data, then Common Crawl is awesome and inexpensive, and I don't think you can get anything like it for the price. The exception is a query simple enough to run as a blekko webgrep (https://blekko.com/webgrep).

If you want to build a search engine, Common Crawl isn't so useful. Search engines want _directed_ crawling of the pages that they think are good. Crawling is only a small fraction of the total work done in a search engine. Search engines generally aren't on AWS, because Amazon doesn't rent the right machine configuration: serving queries needs SSDs or more RAM, and less CPU, than what Amazon offers. So what Common Crawl offers a search engine is higher costs and mostly bad data.
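To make "directed" concrete, here is a minimal sketch of a crawl frontier that always visits the highest-scoring known URL next, rather than taking pages in discovery order. It is not how blekko or Common Crawl actually works. The names `directed_crawl`, `fetch_links`, `score`, and the toy link graph are all hypothetical, and a real crawler would also need politeness limits, robots.txt handling, and a far better quality signal.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def directed_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],  # hypothetical: returns outlinks of a page
    score: Callable[[str], float],            # hypothetical: higher means "more worth crawling"
    budget: int,
) -> List[str]:
    """Visit up to `budget` URLs, always taking the best-scoring known URL next."""
    frontier: List[Tuple[float, str]] = []  # min-heap, so scores are stored negated
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-score(url), url))

    visited: List[str] = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Toy link graph and scores standing in for real fetching and ranking.
TOY_LINKS = {
    "a.example/": ["spam.example/1", "b.example/", "spam.example/2"],
    "b.example/": ["c.example/", "spam.example/3"],
    "c.example/": [],
}
TOY_SCORES = {"a.example/": 0.9, "b.example/": 0.8, "c.example/": 0.7}


def toy_fetch(url: str) -> List[str]:
    return TOY_LINKS.get(url, [])


def toy_score(url: str) -> float:
    # Anything without an explicit score is treated as low quality.
    return TOY_SCORES.get(url, 0.1)


crawled = directed_crawl(["a.example/"], toy_fetch, toy_score, budget=3)
print(crawled)
```

With a budget of three pages, the low-scoring links are discovered but never fetched. An undirected crawl spends the same budget on whatever it happens to find first, which is the "mostly bad data" problem from a search engine's point of view.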




I believe Common Crawl will do a directed crawl if you contact them.



