Build your own web search engine

duggan · on Dec 21, 2011

Have you investigated http://www.elasticsearch.org/ ?

I'd really recommend reading up on the Solr (I notice you mention Solr in your first post) and ElasticSearch projects; these guys, along with Lucene, have collectively solved many of the problems you're investigating.

They're both open source, and (Solr at least) have extensive mailing lists so you can see the sorts of problems people face when building generalized search engines.

mattmiller · on Dec 21, 2011

Nutch will do the crawling and indexing for you. Solr has a web interface built in. You can build your own search engine in a couple of hours. Then you can do some clever machine learning based on usage data with Mahout.

http://nutch.apache.org/ http://mahout.apache.org/

sajid · on Dec 21, 2011

I recommend reading 'Managing Gigabytes' by Witten, Moffat and Bell:

http://www.amazon.com/Managing-Gigabytes-Compressing-Multime...

kenjackson · on Dec 21, 2011

Best of luck to the author.

It does remind me of a broader question -- why is there no popular open source search engine? This seems much more tractable than an open source social network. I wouldn't be surprised if it could attract money/resources from major players like Facebook, Apple, Oracle. Apache has a lot of the pieces, but no consumer-facing front end that ties it all together to search the web (AFAIK).

duggan · on Dec 21, 2011

Edit: there appears to be such a project, at least on the crawling side: http://www.commoncrawl.org/

I'd say there's a combination of factors, the first (and most important) being that Google is good enough for most people.

You'd need to coordinate crawling so as not to turn it into a giant DDoS machine; speed will be an issue due to geo distribution, variable hardware and result sets.
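
The coordination point can be illustrated with a minimal sketch of per-host rate limiting, assuming a single scheduler that every crawl worker consults before fetching. The class name `PoliteScheduler` and the 5-second default delay are hypothetical choices for illustration, not taken from Nutch or any other crawler; a real crawler would also honor robots.txt.

```python
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks the earliest time each host may be fetched again.

    Hypothetical helper for illustration; the delay value is an assumption.
    """

    def __init__(self, min_delay_seconds=5.0):
        self.min_delay_seconds = min_delay_seconds
        self._next_allowed = {}  # host -> earliest permitted fetch time

    @staticmethod
    def _host(url):
        # Group URLs by network location so all pages on one host share a limit.
        return urlparse(url).netloc.lower()

    def wait_time(self, url, now):
        """Return how many seconds to wait before fetching url (0.0 if allowed)."""
        return max(0.0, self._next_allowed.get(self._host(url), 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url's host was fetched at time `now`."""
        self._next_allowed[self._host(url)] = now + self.min_delay_seconds
```

A worker would call `wait_time(url, time.time())`, sleep for the returned interval, fetch, and then call `record_fetch`. Passing the clock in explicitly keeps the sketch easy to test; in a geographically distributed crawl the table would have to be shared or partitioned by host, which this sketch does not attempt.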

Validity and reliability of the data would also be issues, and would probably require several peers to "agree" to consistency, but in a way that does not allow easy gaming of results.
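
One simple way to picture peers "agreeing" is a quorum vote over what each peer reports for the same URL, for example a content hash. The sketch below assumes that model; the function name `agreed_value` and the quorum parameter are hypothetical, and a plain majority like this does nothing to stop someone from running many fake peers, which is exactly the gaming problem mentioned above.

```python
from collections import Counter


def agreed_value(reports, quorum):
    """Return the value reported by at least `quorum` peers, else None.

    `reports` is a list of values (e.g. content hashes), one per peer.
    Hypothetical helper for illustration only.
    """
    if not reports:
        return None
    # most_common(1) yields the single most frequently reported value.
    value, count = Counter(reports).most_common(1)[0]
    return value if count >= quorum else None
```

With three peers and a quorum of two, a page is accepted only when at least two peers report the same hash; otherwise the result is discarded or re-crawled.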

I suppose they're all solvable, though I think there would have to be a powerful incentive to do so. I imagine it'd be quite pricey too for the individual, though perhaps Gabriel Weinberg* could weigh in there.

[*] http://www.gabrielweinberg.com/

kenjackson · on Dec 21, 2011

> I'd say there's a combination of factors, the first (and most important) being that Google is good enough for most people.

This is actually why I think it's important to have a good open source alternative. Google is good today. And frankly, if Bing weren't around, Google could probably stop doing any work on search for the next five years with impunity.

fizx · on Dec 21, 2011

Gabe outsources his full-web index to Yahoo BOSS/Bing.

fizx · on Dec 21, 2011

Search engines are expensive to run. Roughly speaking, to do it well, you have to keep an up-to-date copy of the internet in RAM.
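
To get a feel for the scale involved, here is a back-of-envelope sketch of how many machines an in-RAM index would need. Every number passed in is a placeholder assumption chosen for illustration, not a measurement of the real web or of any real engine, and the function name `machines_needed` is hypothetical.

```python
def machines_needed(page_count, index_bytes_per_page, ram_bytes_per_machine):
    """Return how many machines are needed to hold the whole index in RAM.

    All inputs are caller-supplied assumptions; the function only does the
    arithmetic and rounds up to a whole machine.
    """
    total_bytes = page_count * index_bytes_per_page
    # Ceiling division with integers avoids floating-point rounding.
    return -(-total_bytes // ram_bytes_per_machine)


# Placeholder inputs (assumptions, not real figures):
# one billion pages, 10 kB of index data per page, 64 GB of RAM per machine.
estimate = machines_needed(10**9, 10_000, 64 * 10**9)
```

The estimate ignores replication, query-serving headroom and the cost of recrawling to stay up to date, all of which push the real figure higher.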

jacquesm · on Dec 21, 2011

Talk to the guy behind gigablast.com

nirvana · on Dec 21, 2011

I believe the solution to much of the problem lies in Riak. It is Erlang-based, has MapReduce, is document-oriented, has free-text search built in, is Solr-compatible (though sketchy on details there), is very scalable and, importantly, is operationally easy for a small team.

I too found an impedance mismatch with CouchDB for what I'm working on (which is much like a search engine, but not quite), and found Riak to be a good solution.