Thrift, Scribe, Hive, and Cassandra: Open Source Data Management Software (cloudera.com)
21 points by prakash on Nov 8, 2008 | 3 comments



This is interesting stuff (especially Cassandra), but I have to confess I've had nothing but troublesome and negative experiences with Thrift, and right now I'd say it's the weakest link in this stack.

Thrift is fantastic when you're working in a C++-ish mode where CORBA-style IDL was as good as it got. But when it comes time to build something more flexible at the protocol level, or to interface with scripting languages like Python that depend on a more dynamic, loosely typed architecture, you're going to hit a lot of growing pains.
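
To make that concrete, here's roughly where the friction shows up. This is a sketch with hypothetical names (UserService, get_user, and the userservice stubs module are illustrations, not any real API); only the thrift transport/protocol classes are the actual library:

    # The interface below is frozen at code-generation time. Suppose the IDL says:
    #
    #   struct User { 1: i64 id, 2: string name }
    #   service UserService { User get_user(1: i64 id) }
    #
    # The generated Python client accepts exactly that signature and nothing else.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from userservice import UserService  # hypothetical generated stubs

    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = UserService.Client(protocol)

    transport.open()
    user = client.get_user(42)  # fine: matches the IDL
    # client.get_user(42, expand='friends')  # TypeError: no such parameter,
    #                                        # until you edit the IDL and regenerate
    transport.close()

In idiomatic Python you'd just pass an extra keyword argument or a dict; with Thrift, every change like that means touching the IDL and regenerating stubs for every language that talks to the service.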

For example, I work on a project (the code is public; check my GitHub for Fuzed) where we'd like to look into providing a generic Thrift interface. But so far the Thrift project, despite being entirely capable of it, doesn't seem to show much interest in embracing flexible or generic protocols. Everything needs to be big-banged out up front, and that just doesn't jibe with larger meta-projects building out the kind of fault-tolerant, flexible infrastructure whose needs Hadoop doesn't meet.

This post's intent is not to say, "Thrift is terrible." Indeed, it is awesome at what it currently does. But if you want to go beyond that, you're going to have to invest significant time and resources to bring the protocol that binds all these amazing services together up to snuff.


Interesting stuff. I also noticed another article that explains how to configure and use Scribe for Hadoop log collection:

http://www.cloudera.com/blog/2008/11/02/configuring-and-usin...

The e-mail address at the end of the article goes to a support address. That's a good way to attract customers while in startup mode.


Amazing stuff.

I am particularly interested in Scribe:

A Thrift service for distributed logfile collection. Scribe was designed to run as a daemon process on every node in your data center and to forward log files from any process running on that machine back to a central pool of aggregators. Because of its ubiquity, a major design point was to make Scribe consume as little CPU as possible.
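
For anyone curious, logging to a Scribe aggregator looks roughly like this from Python. A minimal sketch, assuming you've generated the scribe stubs from scribe.thrift with the Thrift compiler; the host, port (1463 is just the commonly used default), and category name are assumptions:

    # Send one log line to a Scribe aggregator over framed binary Thrift.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from scribe import scribe  # stubs generated from scribe.thrift

    socket = TSocket.TSocket('localhost', 1463)      # assumed aggregator address
    transport = TTransport.TFramedTransport(socket)  # Scribe uses framed transport
    protocol = TBinaryProtocol.TBinaryProtocol(transport,
                                               strictRead=False, strictWrite=False)
    client = scribe.Client(protocol)

    transport.open()
    entry = scribe.LogEntry()
    entry.category = 'app_events'          # routing key the aggregators file by
    entry.message = 'something happened\n'
    result = client.Log(messages=[entry])  # Log() takes a batch (list) of entries
    if result == scribe.ResultCode.TRY_LATER:
        pass  # aggregator is backed up; buffer locally and retry
    transport.close()

The category is how the central pool decides where each message gets filed, and the client does little more than serialize and forward, which is where the low CPU footprint mentioned above comes from.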



