Gigablast (founded in 2000 by Matt Wells) announced [0] on July 30 that it is open-sourcing its engine under the Apache License, version 2. The engine is written in a mixture of C and C++ and counts more than 500,000 lines of code; see the GitHub page [1].
Some facts about the engine:
The code compiles into a single executable file which can scale to thousands of servers.
It is easily configurable and has nice documentation [2].
The code is very stable; it has been running in production since 2002.
Document processing is done using plugins, so you can write a plugin for any type of document.
---
I would like to see a search engine based on this on the darknets, particularly on I2P.
I love how the code is such a mess. You can really tell one guy just wrote this whole thing over the span of a decade... It's just one patch on top of another and the comments are pretty amusing. Also funny to see hardcoded algorithms for pre-defined site paths and whole domains such as facebook/myspace/vimeo. This is truly a makeshift search engine on a massive scale.
Just found a little gem myself. I am working on another open source search engine[0], and needed a way to make badly behaving document filters time out.
Unfortunately the document filter in question does spawn child processes, so the normal way of using fork() and a monitoring process was not working. However, using ulimit like this should work:
https://github.com/gigablast/open-source-search-engine/blob/... . Hadn't thought about spawning a new shell and letting it have control like that :)
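For anyone curious what the ulimit trick looks like in practice, here is a minimal sketch. It is not Gigablast's code (the link above is truncated, so I can't quote it). The function name killedByCpuLimit and the way the command string is built are my own assumptions. The idea is to let a freshly spawned shell set a CPU-time limit that the filter and all of its children inherit.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <sys/wait.h>

// Hypothetical helper: runs `command` through /bin/sh with a CPU-time limit.
// Returns true if the command appears to have been killed by a signal
// (for example SIGXCPU or SIGKILL once the limit is exceeded).
// NOTE: `command` is passed to the shell unescaped; only use trusted input.
bool killedByCpuLimit(const std::string& command, int cpuSeconds) {
    assert(cpuSeconds > 0);
    // The limit is set inside the new shell, so every process it starts
    // (including grandchildren spawned by the filter) inherits it.
    const std::string wrapped =
        "ulimit -t " + std::to_string(cpuSeconds) + "; " + command;
    const int status = std::system(wrapped.c_str());
    if (status == -1) return false;            // could not start the shell
    if (WIFSIGNALED(status)) return true;      // the shell itself was killed
    // If a child was killed, the shell usually reports 128 + signal number.
    return WIFEXITED(status) && WEXITSTATUS(status) > 128;
}
```

Calling killedByCpuLimit("while :; do :; done", 1) should report the busy loop as killed after about one second of CPU. Keep in mind that ulimit -t counts CPU time, not wall-clock time, so a filter that just sleeps or blocks on I/O would not be stopped by this.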
Old habits perhaps? When I look back at it I remember that my first books on C were full of problematic sprintf and strcpy use. It may then be easy to keep using what you first learned, even when you know better. It's basically the "Baby duck syndrome"[0] for C functions.
The great thing about this project is that it comes with good documentation for administrators and developers who want to extend it, since Gigablast has been sold to enterprise customers.
Two search engine features are currently disabled because of a code overhaul: Boolean query support and the spellchecker. As Google is removing more and more such advanced features from its search engine ("+", anyone?), it would be great if these features made a comeback, either from the original developer or with the help of the open source community.
I'm really glad to see this open sourced. It could easily lead to a boom of niche web search engines.
BTW, long ago I hoped Gigablast would become a popular Google competitor; no such luck. I remember asking Matt if I could provide an official IE toolbar (when they were all the rage); sadly, he declined.
My hope has shifted to DuckDuckGo.
Does anyone have any insight into what they (he?) plan to do now? Do they plan to continue development and operations, or are they open-sourcing it because they are shutting down and want their work to at least live on in some form?
The code does not seem to be neatly written: I randomly checked a few files and found that const methods and exceptions are not properly used. Here is a sample function:
    const char *CountryCode::getAbbr(int index) {
        if(index < 0 || index > s_numCountryCodes) index = 0;
        return(s_countryCode[index]);
    }
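To make the complaint concrete, here is a sketch of how a const-correct version could look. The table kCountryCodes and its contents are hypothetical stand-ins, since I don't know how s_countryCode is actually laid out. I'm also assuming s_numCountryCodes is the number of entries rather than the last valid index, in which case the upper bound check should be >= rather than >.

```cpp
#include <cassert>

namespace {
// Hypothetical stand-in data; entry 0 serves as the "unknown" fallback.
const char* const kCountryCodes[] = {"zz", "us", "de", "fr"};
const int kNumCountryCodes =
    static_cast<int>(sizeof(kCountryCodes) / sizeof(kCountryCodes[0]));
}  // namespace

class CountryCode {
public:
    // Marked const: looking up an abbreviation does not modify the object.
    const char* getAbbr(int index) const {
        assert(kNumCountryCodes > 0);
        // Out-of-range indexes (including index == count) fall back to 0.
        if (index < 0 || index >= kNumCountryCodes) index = 0;
        return kCountryCodes[index];
    }
};
```

With this version the lookup can be called through a const reference, and an index equal to the table size returns the fallback entry instead of reading one past the end.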
I couldn't get it to compile on my Ubuntu 13 machine without some errors and warnings, so I forked it and made some changes. I don't know git very well, so I don't know how to merge, etc.
I looked at your fork, and it looks like you've already committed your source code to GitHub. All you would have to do now is submit a pull request.
However, given the scale of the project and the fact that the code has been in production for more than 10 years, it's more likely the errors you faced were due to:
- your local environment not being configured ideally, or
- "configuration code" that you did not modify. :)
It says in html/admin.html to just type make to compile.
You will need the following packages installed
apt-get install make
apt-get install g++
apt-get install libssl-dev (for the includes, 32-bit libs are here)
1. Run 'make' to compile. (e.g. use 'make -j 4' to compile on four cores)
I don't know much about Gigablast, but this sounds pretty cool. If nothing else, it's another alternative to Lucene/Solr or Nutch for people working on search applications.
Right, so it's an alternative in the cases where someone might use Lucene/Solr for indexing and searching general Internet content. All I meant is that it's an alternative in certain very specific cases.
Does anyone know what the advantages are here of using async I/O via signals instead of epoll? Does Gigablast use this technique for historical reasons?
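For readers who haven't used it, this is roughly what the epoll side of that comparison looks like on Linux. It is a minimal sketch and not taken from Gigablast; the helper name readWhenReady is made up for illustration. Instead of the kernel interrupting the process with a signal when a descriptor becomes ready, the program asks for readiness explicitly and reads in normal control flow.

```cpp
#include <cassert>
#include <string>
#include <sys/epoll.h>
#include <unistd.h>

// Hypothetical helper: waits up to timeoutMs for fd to become readable,
// then reads whatever is available (up to 256 bytes).
// Returns the data read, or an empty string on timeout or error.
std::string readWhenReady(int fd, int timeoutMs) {
    assert(fd >= 0);
    const int ep = epoll_create1(0);
    if (ep < 0) return "";

    epoll_event ev{};
    ev.events = EPOLLIN;   // we only care about "data available to read"
    ev.data.fd = fd;

    std::string out;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0) {
        epoll_event ready{};
        // Blocks until fd is readable or the timeout expires.
        const int n = epoll_wait(ep, &ready, 1, timeoutMs);
        if (n == 1 && (ready.events & EPOLLIN)) {
            char buf[256];
            const ssize_t got = read(ready.data.fd, buf, sizeof buf);
            if (got > 0) out.assign(buf, static_cast<size_t>(got));
        }
    }
    close(ep);
    return out;
}
```

A real server would keep one epoll instance open and register many sockets with it rather than creating one per call; the per-call setup here is only to keep the example short.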
[0] - http://www.prnewswire.com/news-releases/gigablast-now-an-ope...
[1] - https://github.com/gigablast/open-source-search-engine
[2] - https://www.gigablast.com/admin.html