Gigablast Search Engine, Now Open Source (C/C++)

conductor · on Aug 3, 2013

Gigablast (founded in 2000 by Matt Wells) announced [0] on July 30 that it is open-sourcing its engine under the Apache License, version 2.0. The engine is written in a mixture of C and C++ and comprises more than 500,000 lines of code; see the GitHub page [1].

Some facts about the engine:

The code compiles into a single executable file which can scale to thousands of servers.

It is easily configurable and has nice documentation [2].

The code is very stable; it has been running in production since 2002.

Document processing is done using plugins, so you can write a plugin for any type of document.

---

I would like to see a search engine based on this in the dark-nets, particularly in I2P.

[0] - http://www.prnewswire.com/news-releases/gigablast-now-an-ope...

[1] - https://github.com/gigablast/open-source-search-engine

[2] - https://www.gigablast.com/admin.html

X4 · on Aug 3, 2013

@conductor THANK YOU! THANK YOU! Thank you sooo much for posting this!!

I really really needed that just right now! You helped me soo much =) Thank you sir!

X4 · on Aug 4, 2013

Why on earth does somebody downvote a thank you post? That's very rude.

AsymetricCom · on Aug 4, 2013

2/10 extra points for effort

throwawayyyz · on Aug 4, 2013

I love how the code is such a mess. You can really tell one guy just wrote this whole thing over the span of a decade... It's just one patch on top of another and the comments are pretty amusing. Also funny to see hardcoded algorithms for pre-defined site paths and whole domains such as facebook/myspace/vimeo. This is truly a makeshift search engine on a massive scale.

EDIT: Gotta say, this has some very useful pieces of code. I'm working on a niche-specific crawler and am battling the URL stripping/cleanup part of it. This is very useful: https://github.com/gigablast/open-source-search-engine/blob/...

runarb · on Aug 4, 2013

Just found a little gem myself. I am working on another open source search engine[0], and needed a way to make badly behaving document filters time out.

Unfortunately the document filter in question does spawn child processes, so the normal way of using fork() and a monitoring process was not working. However, using ulimit like this should work: https://github.com/gigablast/open-source-search-engine/blob/... . Hadn’t thought about spawning a new shell and letting it have control like that :)
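
For readers who have not seen the trick, the ulimit approach can be sketched roughly as follows. This is a minimal illustration and not the code behind the link: the helper names `buildLimitedCommand` and `runWithCpuLimit` are hypothetical, and it assumes a POSIX system whose `/bin/sh` supports `ulimit -t`.

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>

// Hypothetical helper: prefix `cmd` so that the shell first sets a CPU-time
// limit (in seconds) and then replaces itself with the filter via exec.
// Note: `cmd` is handed to the shell unescaped, so it must be trusted.
std::string buildLimitedCommand(const std::string &cmd, int cpuSeconds) {
    return "ulimit -t " + std::to_string(cpuSeconds) + "; exec " + cmd;
}

// Hypothetical helper: run the wrapped command through the shell with
// std::system(). Returns the exit status, or -1 if the command could not
// be run or did not exit normally (e.g. it was killed by a signal after
// exceeding the limit).
int runWithCpuLimit(const std::string &cmd, int cpuSeconds) {
    int status = std::system(buildLimitedCommand(cmd, cpuSeconds).c_str());
    if (status == -1 || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

A call might look like `runWithCpuLimit("./myfilter input.doc", 10)`, where `./myfilter` is a made-up filter binary. Resource limits are inherited by child processes, which is why this also covers filters that spawn children. One caveat: `ulimit -t` limits CPU seconds, not wall-clock time, so a filter that merely sleeps or blocks would not be stopped by it.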

0: https://github.com/searchdaimon/enterprise-search

conductor · on Aug 4, 2013

There is a possible buffer overflow right there (if the HOME directory path is long enough). Why don't people use snprintf?
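
To illustrate the snprintf point, here is a minimal sketch. The function name `buildHomePath`, the buffer size and the path suffix are made up for this example and are not taken from the Gigablast source.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: writes "<home><suffix>" into `out`.
// snprintf never writes more than `outSize` bytes (including the
// terminating NUL) and returns the length the full string would have had,
// so truncation can be detected instead of overflowing the buffer.
bool buildHomePath(char *out, std::size_t outSize,
                   const char *home, const char *suffix) {
    int needed = std::snprintf(out, outSize, "%s%s", home, suffix);
    return needed >= 0 && static_cast<std::size_t>(needed) < outSize;
}
```

With a 16-byte buffer, a short `home` such as `/home/a` fits and the function returns true; a long one is cut off at 15 characters plus the NUL and the function returns false, so the caller can bail out rather than run a truncated command.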

runarb · on Aug 4, 2013

>Why don't people use snprintf?

Old habits perhaps? When I look back at it I remember that my first books on C were full of problematic sprintf and strcpy use. It may then be easy to continue using what you first learned, even when you know better. It's basically the "baby duck syndrome"[0] for C functions.

0: http://en.wikipedia.org/wiki/Imprinting_(psychology)#Baby_du...

frik · on Aug 4, 2013

Features: http://www.gigablast.com/features.html

Interesting read, its history: http://www.gigablast.com/press.html

The great thing about this project is that it comes with good documentation for administrators and developers who want to extend it, as Gigablast has been sold to enterprise customers.

Admin docs - how to build the source, troubleshooting, etc.: http://www.gigablast.com/admin.html

Developer docs - even explain how to use Bash and Git, what to do on hardware failures, etc.: http://www.gigablast.com/developer.html

Two search engine features are currently disabled because of a code overhaul: Boolean query support and the spellchecker. As Google is removing more and more such advanced features from its search engine ("+", anyone?), it would be great if these features made a comeback, either from the original developer or with help from the open source community.

Thanks for open-sourcing it.

busterc · on Aug 4, 2013

I'm really glad to see this open sourced. It could easily lead to a boom of niche web search engines.

BTW, long ago I hoped Gigablast would become a popular Google competitor; no such luck. I remember asking Matt if I could provide an official IE toolbar (when they were all the rage); sadly, he declined. My hope has shifted to DuckDuckGo.

I look forward to forking!

runarb · on Aug 4, 2013

Does anyone have any insight into what they (he?) plan to do now? Do they plan to continue development and operations, or are they open-sourcing it because they are shutting down and want their work to at least live on in some form?

c001 · on Aug 4, 2013

The code does not seem to be neatly written: I randomly checked a few files and found that const methods and exceptions are not used properly. Here is a sample function:

    const char *CountryCode::getAbbr(int index) {
        if(index < 0 || index > s_numCountryCodes) index = 0;
        return(s_countryCode[index]);
    }

https://github.com/gigablast/open-source-search-engine/blob/...
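
One possible tidier version of the quoted function is sketched below. It rests on assumptions: that `s_numCountryCodes` is the number of entries in `s_countryCode` (in which case the quoted `index > s_numCountryCodes` check would let `index == s_numCountryCodes` read one past the end), and that a fallback to entry 0 is the intended behavior. The three-entry table and the class name `CountryCodeSketch` are stand-ins, not Gigablast's real data or class.

```cpp
// Stand-in table for illustration only; the real list lives in Gigablast.
static const char *const s_countryCode[] = { "zz", "ad", "ae" };
static const int s_numCountryCodes =
    static_cast<int>(sizeof(s_countryCode) / sizeof(s_countryCode[0]));

class CountryCodeSketch {
public:
    // const member function: looking up an abbreviation does not modify
    // the object. Out-of-range indexes, including index == count, fall
    // back to entry 0.
    const char *getAbbr(int index) const {
        if (index < 0 || index >= s_numCountryCodes) index = 0;
        return s_countryCode[index];
    }
};
```

Whether an out-of-range index should silently map to entry 0 or be reported as an error (for instance by throwing) is a design choice; the sketch keeps the fallback from the quoted code.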

ck2 · on Aug 4, 2013

Gigablast was like the old Google; it was really neat years ago, but sadly never kept up.

There are lots of details about its development on WebmasterWorld; it only uses a handful of servers.

emjaykay · on Aug 4, 2013

https://github.com/emmjaykay/open-source-search-engine

I couldn't get it to compile on my Ubuntu 13 machine without some errors and warnings, so I forked it and made some changes. I don't know git very well, so I don't know how to merge, etc.

theGimp · on Aug 4, 2013

I looked at your fork, and it looks like you've already committed your source code to GitHub. All you would have to do now is submit a pull request.

However, given the scale of the project and the fact that the code has been in production for more than 10 years, it's more likely the errors you faced were due to:

- your local environment not being configured ideally, or

- "configuration code" that you did not modify. :)

emjaykay · on Aug 4, 2013

Thanks for the tip about github.

It says in html/admin.html to just type make to compile.

    You will need the following packages installed
    apt-get install make
    apt-get install g++
    apt-get install libssl-dev (for the includes, 32-bit libs are here)
    1. Run 'make' to compile. (e.g. use 'make -j 4' to compile on four cores)

theGimp · on Aug 8, 2013

Indeed you are right :)

I have yet to try installing libssl-dev though, as I don't have root access on the machine I was testing on.

mindcrime · on Aug 3, 2013

I don't know much about Gigablast, but this sounds pretty cool. If nothing else, it's another alternative to Lucene/Solr or Nutch for people working on search applications.

gregwebs · on Aug 4, 2013

This isn't an alternative to general purpose text search engines: it is specialized for searching the internet.

mindcrime · on Aug 4, 2013

Right, so it's an alternative in the cases where someone might use Lucene/Solr for indexing and searching general Internet content. All I meant is that it's an alternative in certain very specific cases.

djinn · on Aug 4, 2013

Still an alternative to Nutch and friends

ithkuil · on Aug 5, 2013

Does anyone know what the advantages are here of using async I/O via signals instead of epoll? Does Gigablast use this technique for historical reasons?

throwawayyyz · on Aug 4, 2013

Does anyone know what ever happened to Matt Wells' EventGuru.com project?

frik · on Aug 4, 2013

I found the Event Guru Blog. The last post is from Apr 17, 2012: "New Site Design": http://www.gigablast.com/egblog.html

The page is gone and Archive.org has no copy (due to a robots.txt flag), but Google still has a cached copy of the blog:

http://webcache.googleusercontent.com/search?q=cache:9lS6Ngk...

throwawayyyz · on Aug 4, 2013

Pretty odd to see 1995-era web design for a service launched in 2012. Thanks for digging :)