4 Months with Cassandra at Cloudkick (YC W09) (cloudkick.com)
78 points by pquerna on March 2, 2010 | 19 comments



I know Twitter's just done the dirty with Cassandra, but it seems Cassandra is getting a massive PR boost this week, despite being one of the older (where old is 2008!) NoSQL systems.

Can anyone explain/posit an idea as to why MongoDB and CouchDB have been stealing the thunder for the past year rather than Cassandra? It just seems odd.


Cassandra is by most accounts a pain to set up and administer. Mongo has good docs.


It's actually very easy to set up. Install it and run ./bin/cassandra -f; it's nothing like the troubles with HBase.

Administration is pretty hands off for most tasks. For instance, you want three copies of each piece of data? Set the replication factor to 3. Other parts are difficult to "get" if you don't have an understanding of the code. My best advice is to make sure you approach Cassandra from a developer perspective. You can't treat it like a black box quite yet.


Not sure where you are getting your information from, but Cassandra is one of the easiest setups I've used. It couldn't be simpler. I do not have experience with MongoDB though.


This stood out, negatively:

Since Cassandra uses Apache Thrift as the default RPC mechanism, exposing the Thrift layer to any non-controlled data can be dangerous. We use firewalls on our nodes to make sure our Thrift ports are only exposed to a very small set of machines, because even just telneting into the port and typing "hello" can cause the JVM to OOM.

--

I use Redis and heavily guard its telnetable port, but it doesn't OOM. This issue should have been fixed before public release, imo. You wouldn't want something as simple and common as a port scan to shut down your data layer.


This is an issue specifically with Thrift -- not with Cassandra.

You can see the opinion of the Cassandra developers, which is pretty negative towards Thrift, in the linked ticket: https://issues.apache.org/jira/browse/THRIFT-601

This is one of the many reasons the Cassandra developers are looking at replacing Thrift with Avro: http://hadoop.apache.org/avro/
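
To make the "hello" failure mode concrete, here is a minimal sketch. It assumes a length-prefixed binary protocol that trusts the first four bytes on the wire as a big-endian length and allocates that much; it does not use the Thrift library, and MAX_FRAME_BYTES and both function names are hypothetical.

```python
import struct

# Hypothetical cap for illustration only; not a Thrift or Cassandra default.
MAX_FRAME_BYTES = 16 * 1024 * 1024


def claimed_length(data: bytes) -> int:
    """Read the first four bytes as a big-endian signed 32-bit length."""
    if len(data) < 4:
        raise ValueError("need at least 4 bytes to read a length prefix")
    (length,) = struct.unpack(">i", data[:4])
    return length


def checked_length(data: bytes, limit: int = MAX_FRAME_BYTES) -> int:
    """Same read, but reject negative or oversized lengths before allocating."""
    length = claimed_length(data)
    if length < 0 or length > limit:
        raise ValueError(f"refusing to allocate {length} bytes")
    return length


if __name__ == "__main__":
    # What a telnet client sends after typing "hello" and pressing Enter.
    garbage = b"hello\r\n"
    # The bytes "hell" decode to 0x68656c6c.
    print(claimed_length(garbage))  # 1751477356, roughly 1.6 GiB
    try:
        checked_length(garbage)
    except ValueError as exc:
        print(exc)
```

A server that allocates a buffer of the claimed size before validating it will try to reserve about 1.6 GiB per connection, which is enough to exhaust a typical JVM heap. A bounds check like checked_length is one common mitigation; firewalling the port, as the post describes, is another.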


Can someone explain why Protocol Buffers (http://code.google.com/apis/protocolbuffers/) is not being used more widely? Is it because of the limited language support (Java/C++/Python only) or some other reason?


Google also kept the RPC side of Protocol Buffers, which is called Stubby, as an internal project.


Protocol Buffers has limited language support, the implementation in one popular language (Python) blows goats, and even though it predates Thrift it was not released until after Thrift arrived so its community is smaller. The number of people who need this sort of product is relatively limited, so the early traction that Thrift has gained will take a bit of time or some compelling reasons to overcome.


Myself, I've stopped using protobuf because it's super slow in Python.


It definitely has much better documentation.


I thought Cassandra used only keys for selects - but in this post I see you can also use slices of from..to values. Are there any other predicates that one can use? Like ones implementing 'LIKE string%' or 'LIKE %string%'?

It doesn't look like that from the API wiki, but maybe someone knows if that's possible, or planned.


All the from..to predicates can take a prefix in from (i.e., LIKE foo%, but not LIKE %foo%).

('to' is optional, and technically, so is 'from'.)
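
A rough sketch of why a from..to slice can answer LIKE foo% but not LIKE %foo%, assuming names are kept in sorted order. This is an in-memory model using Python's bisect module, not Cassandra's API; slice_range and prefix_slice are hypothetical names.

```python
from bisect import bisect_left, bisect_right

# Highest Unicode code point; appended to a prefix to form an upper bound.
HIGHEST_CHAR = "\U0010ffff"


def slice_range(sorted_names, start="", finish=""):
    """Return names with start <= name <= finish; an empty bound means open-ended."""
    lo = bisect_left(sorted_names, start) if start else 0
    hi = bisect_right(sorted_names, finish) if finish else len(sorted_names)
    return sorted_names[lo:hi]


def prefix_slice(sorted_names, prefix):
    """Emulate LIKE 'prefix%' as a contiguous range over sorted names."""
    return slice_range(sorted_names, prefix, prefix + HIGHEST_CHAR)


if __name__ == "__main__":
    names = sorted(["bar", "foo", "foo1", "foobar", "fop", "zoo"])
    print(prefix_slice(names, "foo"))       # ['foo', 'foo1', 'foobar']
    # With 'to' omitted, the slice runs to the end of the sorted names.
    print(slice_range(names, start="foo"))  # ['foo', 'foo1', 'foobar', 'fop', 'zoo']
```

Names sharing a prefix sit next to each other in sorted order, so one range covers them. Names that merely contain "foo" are scattered across the ordering, which is why LIKE %foo% would need a full scan.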


[deleted]


We actually run all those calculations "offline". Basically, we are trading disk space for speed.


The post doesn't mention the environment in which you're running Cassandra. Any chance you're running it in the cloud (EC2?), or are you running it on real h/w?


We are running in a virtualized environment, and we've done a pretty good job of adding capacity ahead of demand, for both performance and space reasons. Load balancing is pretty hands off. It's just something that takes some time and a fundamental understanding, as there are still ways to shoot yourself in the foot with this system.


The labels under the compaction graph show the four servers all running in the same Slicehost datacenter (St. Louis-B).


Anyone know what tool is used to draw the graph in the post? Is that an open source JavaScript library?


It's actually amCharts, a Flash graphing tool.

http://amcharts.com/



