Download the internet

HendrikR · on Feb 4, 2009

The number of pages crawled keeps counting up. One per second. A glorious glance into the source code shows, it's faked (the counting).

gojomo · on Feb 4, 2009

Their dump would be more useful if they...

(1) Used a preexisting aggregate web content format. Their ad hoc format is simple enough, but can't handle content with NULLs, and loses valuable information (such as time of capture -- you can't trust server 'Date' headers -- and resolved IP address at time of collection).

They could use the Internet Archive classic 'ARC' format (not to be confused with the older compression format of the same name):

http://www.archive.org/web/researcher/ArcFileFormat.php

Or the newer, more involved and chatty but still relatively straightforward 'WARC' format:

http://archive-access.sourceforge.net/warc/

(2) Explained how the 3.2 million pages in their initial dump were chosen. (That's only a tiny sliver of the web; where did they start and what did they decide to collect and put in this dataset?)

(FYI, I work at the Internet Archive.)

ericb · on Feb 4, 2009

I don't want to hijack the thread, but I'd love to see an "Ask someone who works at the Internet Archive" post.

wheels · on Feb 4, 2009

Also a list of indexed files would be interesting. I've recently looked for a spider of a couple of large sites (for us to use for demos for potential customers), but I'd like to know what's in there before sorting out space to decompress a 22 GB file.

I notice on the Internet Archive file that you link it mentions that those files are no longer accessible -- are there similar places that you can grab spidered content?

gojomo · on Feb 4, 2009

Bulk data access to the historic archive (other than via the public Wayback Machine) is currently only available by special arrangement with research projects. We don't really have a good system for enabling such access, so it happens rarely, on a case-by-case basis.

If you just need fresh web content, it's not hard to collect for yourself quite a bit of broad material in a short period on a small budget, with an open source crawler.

The data from Dotbot might be good, or potential data feeds from Wikia Search/Grub.

bravura · on Feb 4, 2009

If I'm not mistaken, wikia does not want you to download their crawls using a robot: http://soap.grub.org/arcs/robots.txt

gojomo · on Feb 4, 2009

They may not want automated downloads, but that robots.txt is in the wrong place. The standard only provides for robots.txt to be found and respected at the root (/robots.txt), not any subdirectory-lookalike path (/arcs/robots.txt).

ks · on Feb 4, 2009

I also find it strange that there are no 304 status codes. Does that mean that they blindly download each page when they update their index?

Unless you are Google or Yahoo with thousands of servers, you could save some time by only processing pages that have actually been modified.

bravura · on Feb 4, 2009

gojomo, I have looked at the file formats. Could you propose some off-the-shelf web spiders? I would like to accumulate a lot of text for NLP research.

brendano · on Feb 4, 2009

I don't work at Internet Archive, but I've found Heritrix is great. Crawling is all about crazy wonky edge cases and it deals with lots of them.

Use the version 1.x series, not the new version 2. 1.x is less convoluted and easier to use. There might be some http://en.wikipedia.org/wiki/Second-system_effect I think.

cnu · on Feb 4, 2009

Try out Nutch - http://lucene.apache.org/nutch/ or the one archive.org uses Heretix - http://crawler.archive.org/

gojomo · on Feb 4, 2009

At the Internet Archive we've created Heritrix for 'archival quality' crawling -- especially when you want to get every media type, and sites to complete/arbitrary depth, in large but polite crawls. (It's possible, but not usual for us, to configure it to only collect textual content.)

The Nutch crawler is also reasonable for broad survey crawls. HTTrack is also reasonable for 'mirroring' large groups of sites to a filesystem directory tree.

bravura · on Feb 4, 2009

Could you outline how to configure it to collect only textual content?

gojomo · on Feb 4, 2009

Very roughly:

(1) Add a scope rule that throws out discovered URIs with popular non-textual extensions (.gif, .jpe?g, .mp3, etc.) before they are even queued.

(2) Add a 'mid-fetch' rule to FetchHTTP module that early-cancels any fetches with unwanted MIME types. (These rules run after HTTP headers are available.)

(3) add a processor rule to whatever is writing your content to disk (usually ARCWriterProcessor) that skips writing results (such as the early-cancelled non-textual results above) of unwanted MIME types.

Followup questions should go to the Heritrix project discussion list, http://groups.yahoo.com/group/archive-crawler/ .

bravura · on Feb 4, 2009

Also, what do you mean by "broad survey crawls"?

gojomo · on Feb 4, 2009

'Broad' crawl means roughly, "I'm interested in the whole web, let the crawler follow links everywhere." Even starting with a small seed set, such crawls quickly expand to touch hundreds of thousands or millions of sites.

Even as a fairly large operation, you're might to be happy with a representative/well-linked set of 10 million, 100 million, 1 billion, etc. URLs -- which is only a subset of the whole web, hence a 'survey'.

A constrasting kind of crawl would be to focus on some smaller set of sites/domains you want to crawl as deeply and completely as possible. You might invest weeks or many months to get these deeply in a gradual, polite manner.

ojbyrne · on Feb 4, 2009

It's funny but just before I read this, somebody sent me some pictures of Barcelona (where I lived back in 2000-2001). I had a bunch of pictures on my (non-corporate) website from 2000 to about 2003. Rather than digging them up from somewhere on my various hard drives, I just turned to the internet archive. And there they were.

It also turned up this error message, forever preserved in amber, so to speak: http://web.archive.org/web/20030802202553/www.permafrost.net...

meatbag · on Feb 4, 2009

dotbot is one of the data sources that SEOmoz uses for their Linkscape crawler. Not sure why it was submitted here but I suspect promotional motives.

It is interesting data, but when people build crawlers to index the entire www, especially if the data is intended as an SEO intelligence tool, certain issues arise. Some background on this particular issue: http://incredibill.blogspot.com/2008/10/seomozs-new-linkscap...

daleharvey · on Feb 4, 2009

It was posted here because I thought it was interesting hacker news, I am in no way affiliated.

meatbag · on Feb 4, 2009

I very much agree that it is an interesting bit of news, I've been meaning to download the dataset and play with it for a while now. The only point I wanted to make was that there is another side to this story. An adequate summary can be found here: http://sphinn.com/story/80142 - I read the HN terms of use and didn't see anything that specifically forbids promotional stuff, but this tool is associated with a company that develops linkbait for a living. I don't mean to imply that this would make dotbot less noteworthy/interesting, but I am suspicious of linkbait. I lurk on HN because, frankly, much of the information sources in my industry are crapped up with "social media optimized" thin content. Again, sorry if I overreacted. This is my second post here and I don't mean to make a nuisance of myself.

inovica · on Feb 4, 2009

Are you sure that Linkscape use the submitted organisations bot? I agree that it could just be promotional - as they are only giving access to some of their content (around 10% at my calc) but I'm not sure we're talking about the same bot in this instance

meatbag · on Feb 4, 2009

from http://www.seomoz.org/linkscape/help/sources

    * Dotnetdotcom.org
    * Grub.org/Wikia
    * Page-Store.com
    * Amazon/Alexa’s crawl and internet archive resources
    * Exalead’s commercially available data
    * Gigablast’s commercially available data
    * Yahoo!’s BOSS API and other data sources
    * Microsoft’s Live API and other data sources
    * Google’s API and other data sources
    * Ask.com’s API and other data sources
    * Additional crawls from open source, commercial and academic projects

In my experience, the single most useful feature (main selling point) of the Linkscape tool is that it reports http status codes (for a price) so SEOs can detect 301 redirects, etc. AFAIK, dotnetdotcom.org has the only free, publicly available crawl data which also includes http status codes. Not sure about Exalead and Gigablast but I am pretty sure the other SEs don't release this information. To clarify: I don't have any proof, and things may have changed, but I've read some intelligent speculation (smarter than me) which claims that dotbot/dotnetdotcom.org provides the majority of the data (especially the unique info, like status codes) for the Linkscape tool.

Shamiq · on Feb 3, 2009

How could you use this?

Creative ideas please!

braindead_in · on Feb 3, 2009

Write a better page rank algorithm?

daleharvey · on Feb 4, 2009

I think an open source google could be a pretty great project, I would imagine its been tried before, but by seperating out the steps where these guys crawl, other people build indexes, and others handle lookups, it sounds more reasonable than one project taking on the whole thing

snprbob86 · on Feb 4, 2009

The biggest problem here is hosting the index... in RAM...

It takes A LOT of machines to power a modern search engine which serves any real amount of traffic. One key component of an open source search engine would be a sort-of peer-to-peer distributed infrastructure. When I suggested this in an earlier thread, people were quick to point out the liability concerns here... but maybe it could work somehow... but then how do you get people to sign up for it?

That said, I think this is incredibly interesting stuff. I would really love to see open source, peer-served web utilities. For example, I'd want access to many of the components of a web search engine, not just the search results themselves. Things like a language model for spell checking or word segmentation. Or a set analysis tool for detecting synonyms.

braindead_in · on Feb 4, 2009

Ya. Thats true. But purely from an data standpoint, having the same index as Google has would be valuable. Just because you can do so many things with it. Like, figuring out how to increase the page rank of your site!

known · on Feb 4, 2009

http://www..aspseek.org/ Open Source Implementation of PageRank

palish · on Feb 4, 2009

Why not use Google?

staticshock · on Feb 4, 2009

Why not use Windows?

palish · on Feb 4, 2009

I do. I also see nothing wrong with that.

coderrr · on Feb 4, 2009

Query for content inside of <script> or other similar tags which Google and all other search engines I have encountered ignore.

meatbag · on Feb 4, 2009

Draw a map of the WWW?

thorax · on Feb 4, 2009

search.wikia.com uses Grub: http://search.wikia.com/about/crawl.html

And that index is also open for download, though I haven't looked much into it.

parenthesis · on Feb 4, 2009

Do check out at the bottom in the "Dotbot Spider Statistics" table:

  # of Tubes Found Clogged 7188420

CalmQuiet · on Feb 4, 2009

Yes: they do have a sense of humor.

And: are thoughtful enough to include typos, because (as we all know) some people appreciate the opportunity to find errors). [ "...discussion of girlfriend/boyfriend/husband/wife issues are stickily prohibited." ]

juliend2 · on Feb 4, 2009

I love their navigation :

# Information on how to block our crawler. (Hint, it doesn't involve legal action)

# Our purpose and goal. (Yes we have one and no it doesn't involve spam)

# Our technology. (Thanks open source!)

[...]

YoavShapira · on Feb 4, 2009

It's good to see activity in this space, with more people and offerings (even if some of them are semi-dodgy). It's been too quiet for a while, with just the major search engines and other big players doing their own proprietary indices.

braindead_in · on Feb 3, 2009

cool idea. they should add a bit torrent link for downloading.

jonursenbach · on Feb 3, 2009

There's one there already.

KrisJordan · on Feb 4, 2009

Is their tracker working for anyone? Not so much here.

soult · on Feb 4, 2009

The tracker is down, but I found some by adding some open trackers and activating DHT:

* http://tracker.thepiratebay.org:80/announce

* http://denis.stalker.h3q.com:6969/announce

* http://tracker.soultcer.net:80/announce

_tggb · on Feb 4, 2009

You found the peers through DHT. Randomly adding trackers doesn't really do much to help the torrent. Not unless others have added the tracker as well. Which is unlikely if you added it on your own.

soult · on Feb 4, 2009

Since the first 2 are quite common "open trackers" (where open refers to the fact that they track any hash you submit), it is often quite likely to find sources there, because other people add them as well when they want more sources.

Furthermore my comment will lead to more people, who might not have a DHT-supporting client, adding those open trackers.

jonursenbach · on Feb 5, 2009

Which torrent clients out there don't support DHT these days?

braindead_in · on Feb 4, 2009

Cool! You guys have thought about everything.

Allocator2008 · on Feb 4, 2009

Awesome idea. I like the flat file based indexing management they mentioned. Gosh almost wish they had a link whereby one could send them one's C.V.! Keep up the good work.

ajkirwin · on Feb 4, 2009

Download the internet!

Or, at least, whatever doesn't block robots. :/

CalmQuiet · on Feb 4, 2009

Yes: doesn't seem like it's going to be very "deep web", does it?