At the Internet Archive we've created Heritrix for 'archival quality' crawling -- especially when you want to get every media type, and sites to complete/arbitrary depth, in large but polite crawls. (It's possible, but not usual for us, to configure it to only collect textual content.)

The Nutch crawler is also reasonable for broad survey crawls; HTTrack is a good choice for 'mirroring' large groups of sites to a filesystem directory tree.




Could you outline how to configure it to collect only textual content?


Very roughly:

(1) Add a scope rule that throws out discovered URIs with popular non-textual extensions (.gif, .jpe?g, .mp3, etc.) before they are even queued.

(2) Add a 'mid-fetch' rule to the FetchHTTP module that early-cancels any fetches with unwanted MIME types. (These rules run after HTTP headers are available.)

(3) Add a processor rule to whatever is writing your content to disk (usually the ARCWriterProcessor) that skips writing results of unwanted MIME types, such as the early-cancelled non-textual results above. (See the sketch below.)
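
To make the idea concrete, here is a minimal Python sketch of the same three-stage filtering -- reject by extension before queueing, check Content-Type once headers arrive, and only pass wanted MIME types on to the writer. This is not Heritrix configuration or its API; the extension list, MIME regex, and example URLs are just illustrative assumptions.

    import re
    import urllib.request
    from typing import Optional

    # Illustrative lists only -- not Heritrix's own scope or MIME rules.
    NON_TEXTUAL_EXT = re.compile(r'\.(gif|jpe?g|png|mp3|mp4|avi|zip|exe)$', re.I)
    TEXTUAL_MIME = re.compile(r'^(text/|application/xhtml)', re.I)

    def in_scope(uri: str) -> bool:
        """Step 1: reject URIs with non-textual extensions before queueing."""
        return not NON_TEXTUAL_EXT.search(uri.split('?', 1)[0])

    def fetch_if_textual(uri: str) -> Optional[bytes]:
        """Step 2: inspect Content-Type once headers arrive and abort unwanted
        fetches; step 3: only hand textual bodies on to the writer."""
        with urllib.request.urlopen(uri) as resp:
            mime = resp.headers.get_content_type()
            if not TEXTUAL_MIME.match(mime):
                return None        # early cancel: headers read, body skipped
            return resp.read()     # textual content worth writing to disk

    if __name__ == '__main__':
        for candidate in ('http://example.com/', 'http://example.com/logo.png'):
            if not in_scope(candidate):
                print(candidate, '-> skipped by extension')
            elif fetch_if_textual(candidate) is None:
                print(candidate, '-> skipped by MIME type')
            else:
                print(candidate, '-> kept')

The point of the ordering is that the cheap checks come first: extension filtering avoids queueing the URI at all, the header check avoids downloading the body, and the write-time check just catches whatever slips through.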

Followup questions should go to the Heritrix project discussion list, http://groups.yahoo.com/group/archive-crawler/ .


Also, what do you mean by "broad survey crawls"?


'Broad' crawl means roughly, "I'm interested in the whole web, let the crawler follow links everywhere." Even starting with a small seed set, such crawls quickly expand to touch hundreds of thousands or millions of sites.

Even as a fairly large operation, you might be happy with a representative/well-linked set of 10 million, 100 million, 1 billion, etc. URLs -- which is only a subset of the whole web, hence a 'survey'.

A contrasting kind of crawl would be to focus on some smaller set of sites/domains you want to crawl as deeply and completely as possible. You might invest weeks or even months crawling these in a gradual, polite manner.



