4 Months with Cassandra at Cloudkick (YC W09)

petercooper · on March 3, 2010

I know Twitter's just done the dirty with Cassandra, but it seems Cassandra is getting a massive PR boost this week, despite being one of the older (where old is 2008!) NoSQL systems.

Can anyone explain/posit an idea as to why MongoDB and CouchDB have been stealing the thunder for the past year rather than Cassandra? It just seems odd.

fizx · on March 3, 2010

Cassandra is by most accounts a pain to set up and administer. Mongo has good docs.

ddispaltro · on March 3, 2010

It's actually very easy to set up. Install and run ./bin/cassandra -f; it's nothing like the troubles with HBase.

Administration is pretty hands-off for most tasks. For instance, you want three copies of each piece of data? Set the replication factor to 3. Other parts are difficult to "get" if you don't have an understanding of the code. My best advice is to make sure you approach Cassandra from a developer perspective. You can't treat it like a black box quite yet.
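To make the replication factor concrete, here is a minimal sketch of ring-based replica placement, assuming the simplest strategy where a key's copies go to the first node at or after its token and then to the following nodes on the ring. The node names, the `token` helper, and `replicas_for` are hypothetical and for illustration only; this is not Cassandra's actual partitioner or placement code.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (illustrative hash, not Cassandra's partitioner)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, ring, replication_factor):
    """Return the nodes that would hold `key`.

    `ring` is a list of (token, node_name) pairs sorted by token.
    The first replica is the first node whose token is >= the key's token
    (wrapping around to the start of the ring); the remaining replicas are
    the next nodes clockwise.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    # Never ask for more copies than there are nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node cluster.
ring = sorted((token(name), name) for name in ["node-a", "node-b", "node-c", "node-d"])
owners = replicas_for("user:42", ring, replication_factor=3)
print(owners)  # three distinct node names
```

With a replication factor of 3 on four nodes, every key lands on three distinct machines, so any single node can be lost without losing data.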

asenchi · on March 3, 2010

Not sure where you are getting your information from, but Cassandra is one of the easiest setups I've used. It couldn't be simpler. I do not have experience with MongoDB though.

mahmud · on March 3, 2010

This stood out, negatively:

Since Cassandra uses Apache Thrift as the default RPC mechanism, exposing the Thrift layer to any non-controlled data can be dangerous. We use firewalls on our nodes to make sure our Thrift ports are only exposed to a very small set of machines, because even just telneting into the port and typing "hello" can cause the JVM to OOM.

--

I use Redis and heavily guard its telnetable port, but it doesn't OOM. This issue should have been fixed before public release, imo. You wouldn't want something as simple and common as a port scan to shut down your data layer.

pquerna · on March 3, 2010

This is an issue specifically with Thrift -- not with Cassandra.

You can see the opinion of the Cassandra developers, which is pretty negative towards Thrift, in the linked ticket: https://issues.apache.org/jira/browse/THRIFT-601

This is one of the many reasons the Cassandra developers are looking at replacing Thrift with Avro: http://hadoop.apache.org/avro/
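For anyone wondering how typing "hello" could plausibly exhaust the heap: a rough sketch, assuming the server reads the first four bytes of a connection as a big-endian 32-bit length and allocates a buffer of that size before validating it. That is an assumption about the failure mode, not a reading of the Thrift source. The `MAX_FRAME_BYTES` cap and both helper names are hypothetical.

```python
import struct

# Hypothetical upper bound on what a server is willing to allocate per message.
MAX_FRAME_BYTES = 16 * 1024 * 1024


def declared_length(header: bytes) -> int:
    """Interpret the first four bytes as a signed big-endian 32-bit length."""
    (length,) = struct.unpack(">i", header[:4])
    return length


def check_length(header: bytes, limit: int = MAX_FRAME_BYTES) -> int:
    """Reject negative or oversized lengths before any buffer is allocated."""
    length = declared_length(header)
    if length < 0 or length > limit:
        raise ValueError(f"refusing to allocate {length} bytes (limit {limit})")
    return length


# The ASCII bytes "hell" decode to 0x68656C6C.
bogus = declared_length(b"hello")
print(bogus)  # 1751477356, about 1.6 GiB
```

A reader that trusts that number tries to allocate about 1.6 GiB for a five-byte message; a bounds check like `check_length` turns it into a rejected connection instead.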

rmanocha · on March 3, 2010

Can someone explain why Protocol Buffers (http://code.google.com/apis/protocolbuffers/) is not being used more widely? Is it 'cause of the limited language support (Java/C++/Python only) or some other reason?

jhammerb · on March 3, 2010

Google also kept the RPC side of Protocol Buffers, which is called Stubby, as an internal project.

evgen · on March 3, 2010

Protocol Buffers has limited language support, the implementation in one popular language (Python) blows goats, and even though it predates Thrift it was not released until after Thrift arrived so its community is smaller. The number of people who need this sort of product is relatively limited, so the early traction that Thrift has gained will take a bit of time or some compelling reasons to overcome.

wooster · on March 3, 2010

Myself, I've stopped using protobuf because it's super slow in Python.

ddispaltro · on March 3, 2010

It definitely has much better documentation.

viraptor · on March 2, 2010

I thought Cassandra used only keys for selects - but in this post I see you can also use slices of from..to values. Are there any other predicates that one can use? Like ones implementing 'LIKE string%' or 'LIKE %string%'?

It doesn't look like that from the API wiki, but maybe someone knows if that's possible, or planned.

jbellis · on March 2, 2010

all the from..to predicates can take a prefix in from. (i.e., LIKE foo%, but not LIKE %foo%).

('to' is optional, and technically, so is 'from'.)
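A rough illustration of why a prefix match falls out of a from..to range while a substring match does not, assuming names are kept in sorted order and both bounds are treated as inclusive. This is an in-memory sketch over a Python list, not the Cassandra API; `slice_range` and `prefix_scan` are hypothetical names.

```python
import bisect


def slice_range(sorted_names, start="", finish=""):
    """Return names in [start, finish]; an empty bound means 'unbounded'."""
    lo = bisect.bisect_left(sorted_names, start) if start else 0
    hi = bisect.bisect_right(sorted_names, finish) if finish else len(sorted_names)
    return sorted_names[lo:hi]


def prefix_scan(sorted_names, prefix):
    """Emulate LIKE 'prefix%': seek to the prefix, read until names stop matching."""
    matches = []
    for name in slice_range(sorted_names, start=prefix):
        if not name.startswith(prefix):
            break  # sorted order guarantees no later name can match
        matches.append(name)
    return matches


names = sorted(["bar", "foo", "foobar", "food", "fop", "xfoo"])
print(prefix_scan(names, "foo"))        # ['foo', 'foobar', 'food']
print(slice_range(names, "foo", "fop")) # ['foo', 'foobar', 'food', 'fop']
```

"xfoo" contains "foo" but sorts nowhere near it, so a LIKE %foo% query would have to scan every name rather than seek to one spot in the ordering.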

on March 2, 2010

[deleted]

ddispaltro · on March 2, 2010

We actually run all those calculations "offline"; basically, we are trading disk space for speed.

js2 · on March 3, 2010

The post doesn't mention the environment in which you're running Cassandra. Any chance you're running it in the cloud (EC2?), or are you running it on real h/w?

ddispaltro · on March 3, 2010

We are running in a virtualized environment, and we've done a pretty good job of adding capacity for performance and space reasons ahead of demand. Load balancing is pretty hands-off; it's just something that takes some time and a fundamental understanding, as there are still ways to shoot yourself in the foot with this system.

TimH · on March 3, 2010

The labels under the compaction graph show the four servers all running in the same Slicehost datacenter (St. Louis-B).

aliasaria · on March 3, 2010

Anyone know what tool is used to draw the graph in the post? Is that an open javascript library?

ddispaltro · on March 3, 2010

It's actually amCharts, a Flash graphing tool.

http://amcharts.com/