Monitoring Cassandra at Scale (yelp.com)
97 points by jolynch on June 1, 2016 | 11 comments



A deep dive into how data is stored on Cassandra rings and how to use this knowledge to give your operators notice before cluster issues become site issues.

Any feedback is super welcome. If folks like this, we may consider working on upstreaming the approach into nodetool.
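
(Not code from the post, but for anyone who wants to poke at the ring-ownership side of this themselves: the DataStax Python driver exposes replica placement directly. A minimal sketch, where the contact point and keyspace name are placeholders for your own setup:)

    # Sketch: ask the driver which hosts replicate a given partition.
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])  # placeholder contact point
    cluster.connect()  # populates cluster.metadata from the ring

    # get_replicas takes the serialized partition key bytes and returns
    # the Host objects that own it under the keyspace's replication
    # strategy ("my_keyspace" is a placeholder).
    for host in cluster.metadata.get_replicas('my_keyspace', b'some-key'):
        print(host.address, host.is_up)

    cluster.shutdown()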


This is awesome. We're rolling out a bunch of production apps using C* and having some difficulty monitoring all the right things without OpsCenter. We're getting there, but this looks like a shortcut to some stuff we'd put off as too difficult to do immediately.


Have you seen RethinkDB, and what do you think about it (as a replacement for Cassandra)?

RethinkDB seems to have everything Cassandra has, without the complexity.


To be honest, most of my exposure to RethinkDB is from Aphyr's article on it. We already run basically every alternative he mentions as superior for particular use cases: for example, we run ZooKeeper / replicated SQL for inter-key consistent actions and Cassandra as an AP store. When we need document semantics we have a pretty robust Elasticsearch setup. It didn't seem like RethinkDB would do all of those things better than those special-purpose databases, so I didn't look into it too much.

Replacing Cassandra at Yelp would be a lot of effort, so we'd have to be sure that it's worth it. That being said, RethinkDB definitely looks interesting and I'll make sure it's on my list of datastores to evaluate.


Do you have a link to that article by Aphyr?



Have you seen ScyllaDB, and what do you think about it (as a replacement for Cassandra)?


We've had a few internal discussions about ScyllaDB. I think the performance numbers look attractive, but we are concerned about maintainability. In particular, we've invested fairly heavily in configuration management, tooling, monitoring, etc., and Apache Cassandra seems to work pretty well for us.

We might try it out someday, but for now we're fairly happy with stock Cassandra.


Thanks for posting this, it's an interesting approach. Monitoring the availability of replicas instead of individual nodes probably makes sense. However, I'm wondering how this information is actionable for your team. How would you act differently in case your monitoring reports certain consistency levels becoming unavailable, compared to just reporting 2 unavailable nodes in your cluster?


It mostly came down to flexibility. The original form of our monitoring did basically what you suggest ("one node failure is ok, two is not"), but we found that wasn't good enough. In our experience that approach was:

1. Noisy. We had a lot of large deployments with high replication factors (e.g. RF=5 or 7) whose owners very much didn't care if 2 nodes failed, or 3, or even 4 (the consistency math is sketched below). They chose the high replication factor for resilience to multiple rack failures and didn't want to get paged when a few racks failed.

2. Hard to generalize, especially with multi-tenant clusters. Cluster size != keyspace replication. For example, if we had a 50-node cluster with an RF=1 keyspace, a single node failure should be a pageable event. Why is there an RF=1 keyspace? Because devops means that developers sometimes do things like that.

3. Had poor attribution. If you have a large cluster with many keyspaces, one of which has a lower RF than the rest or a higher consistency level, then only the owners of those keyspaces care if we lose a node or two. When we're dealing with an incident we can rope in exactly the teams that own the under-replicated keyspaces so they can take appropriate action.

To be totally honest, mostly it just helps us find keyspaces that have low RF ... The number of times we found out the new Cassandra version we just deployed added another system keyspace using SimpleStrategy with a default replication factor of 2 ...
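
To make the arithmetic concrete, here's a rough sketch of the core check (a toy version, not our actual monitoring code): given a keyspace's replication factor and the number of live replicas in one replica set, which consistency levels can still be served?

    # Toy version of the availability check: which consistency levels
    # still succeed given RF and the count of live replicas?
    def achievable_levels(rf, live_replicas):
        quorum = rf // 2 + 1
        return {
            'ONE': live_replicas >= 1,
            'QUORUM': live_replicas >= quorum,
            'ALL': live_replicas >= rf,
        }

    # RF=5 keeps QUORUM with two replicas down, which is why a two-node
    # failure on a big RF=5 deployment shouldn't page anyone:
    print(achievable_levels(rf=5, live_replicas=3))
    # {'ONE': True, 'QUORUM': True, 'ALL': False}

    # ...while an RF=1 keyspace loses even ONE with a single node down:
    print(achievable_levels(rf=1, live_replicas=0))
    # {'ONE': False, 'QUORUM': False, 'ALL': False}

The real version has to run this per keyspace and per replica set (token range), since each keyspace can replicate the same nodes differently, but that's the shape of it.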


Thank you for posting this, and in particular for the pointer to Jolokia. Just what a Python shop needs to monitor Java infrastructure.
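
(For the curious, here's roughly what that looks like; a sketch rather than code from the post, assuming the Jolokia agent is attached to the Cassandra JVM on its default port, 8778:)

    # Sketch: read a Cassandra JMX metric over Jolokia's HTTP bridge.
    import requests

    MBEAN = ('org.apache.cassandra.metrics:'
             'type=ClientRequest,scope=Read,name=Latency')

    # "cassandra-host" is a placeholder for one of your nodes.
    resp = requests.get('http://cassandra-host:8778/jolokia/read/' + MBEAN)
    resp.raise_for_status()
    print(resp.json()['value'])  # attribute map for the read-latency metric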



