Scaling Riak at Kiip (basho.com)
91 points by oinksoft on May 24, 2012 | 39 comments



It's been a riak-y day for me, so I watched this avidly.

You're no doubt wondering what the issues were; they come 20 minutes in, and were (paraphrased):

* At scale in production, adding a new node took days to complete all the handoffs; they recommend adding new nodes as soon as it's looking like you need them, rather than waiting until you're redlining.

* 2i is slow, especially in EC2; a straight KV "get" is milliseconds-denominated; 2i index queries were taking multiple seconds. Use 2i, they say, but in background processes.

* Javascript MapReduce is slow; this is well known. They confirm Erlang MR was adequate.

* As the LevelDB keyspace grows, there's a step function in latency: 5ms, then 15ms, then 25ms; the solution is to add nodes. (LevelDB is Google's KV store, a new storage backend option for Riak, required if you're using secondary indexes.)

* Riak Control didn't work for them over high-latency connections.

* Once, a Chef misconfiguration left the whole cluster flapping on and off, which corrupted the cluster; they recovered with Basho support. Be careful about adding and removing nodes rapidly.

* Similarly, flapping a single node caused the cluster to get into a state where it wouldn't converge again; the cluster worked but no nodes could be added until they (presumably?) restarted it.
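The LevelDB latency steps are consistent with how its leveled layout works: the LevelDB docs describe level L as holding roughly 10^L MB, and a point read may have to consult one SSTable per populated level, so each level the dataset grows into can add another disk seek. A back-of-envelope sketch of that model (a simplification; level-0 behaves differently in real LevelDB):

```python
# Rough model of LevelDB's leveled layout: level L holds about 10**L MB
# (level 1 = 10 MB, level 2 = 100 MB, ...).  A point read may consult
# one SSTable per populated level, so worst-case seek count grows
# stepwise with total data size -- matching the 5ms/15ms/25ms steps.

def populated_levels(total_mb):
    """Number of levels needed to hold total_mb of data."""
    level, capacity = 1, 10
    while total_mb > capacity:
        level += 1
        capacity += 10 ** level
    return level

print(populated_levels(8))       # 1 level  -> one seek worst case
print(populated_levels(500))     # 3 levels -> latency has stepped up twice
print(populated_levels(50_000))  # 5 levels
```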


After watching the video, the takeaway for me was don't run $database on EC2 instances.

Last night I just finished replacing 6 maxed out medium instances with one $100 box from SoftLayer. :/


At that price point, it seems that you can get FAR better specs on Honelive. Kimsufi.ie is even cheaper, if you don't mind your server being hosted in the UK.

edit:

SoftLayer: Intel 4x 2.40 GHz, 2 GB ECC RAM, $160/mo

Honelive: Intel i7-2600, 16 GB RAM, $120/mo

Kimsufi.ie: Intel i7, 4x 2(HT)x 2.66+ GHz, 24 GB RAM, $60/mo


Have you heard any reviews of Honelive? I've been looking for a dedicated server provider in the US with comparable prices to Hetzner in Germany.


I get a lot of nice little extras from SoftLayer. For example I can get boxes in the 5 corners of the internet (west, central, east, euro, asia) with free bandwidth between them.


Has anyone used Kimsufi? Are they any good? I'm at Hetzner right now, but I never found any use for that box, so I'm just paying for it for nothing...


I just signed up for Kimsufi. They're the budget brand of OVH. WebHostingTalk gives them favorable reviews.


And if you do, for God's sake don't run them on EBS volumes. If you don't know by now that EBS performance is highly variable, you really haven't been paying attention.


Very useful data. Do you have anything more comprehensive about your experience?


I don't work at Kiip. :)


I thought you meant to say you had additional personal experience, but perhaps you've just been reading about Riak today.

Thanks anyway.


I've had a 3-node hardware Riak cluster lying around for about 4 months now, but only just today cut over from Postgres to it. I can talk to you about how shiny Riak is on day 1, but as this presentation shows, the day 1 behavior of a Riak cluster can be a bit of a trap.


Great to hear that you're cutting over to Riak. We highly recommend that your minimum cluster size be your N value (replication factor) + 2. In your case that likely means 5 nodes for your starting cluster, not 3. There are many reasons: http://basho.com/blog/technical/2012/04/27/Why-Your-Riak-Clu...


Obviously great to know. Thanks!


Thomas, I'm in the middle of evaluating document stores and am perhaps getting stars in my eyes from Riak's easy-to-scale story. I'd love to hear more about your experiences once your cluster has had time to marinate a bit.


Interestingly enough, this is a very similar list to what you'd see for Mongo, especially in terms of overall effort.

That you'd encounter all these things at a 25mm daily ops level is pretty odd, though.


The very same folks did write about their MongoDB experience: http://blog.engineering.kiip.me/post/20988881092/a-year-with...


I think the point he was making was that the company in question never tried to scale horizontally with MongoDB because, in their words, "we believe horizontally scaling shouldn’t be necessary for the relatively small amount of ops per second we were sending to MongoDB."

Yet, they went and scaled horizontally with Riak and experienced pain.

Their opinion that it did not make sense to horizontally scale the "relatively small" number of ops they were sending to MongoDB is certainly their own, but then they horizontally scaled with Riak anyway and boasted about their 25MM ops per day ... which, averaged out, is only about 290 ops per second.

In short, it was far from an apples to apples comparison.
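Checking the back-of-envelope number:

```python
# Averaging the quoted 25MM ops/day over a full day
ops_per_day = 25_000_000
seconds_per_day = 24 * 60 * 60            # 86,400
print(round(ops_per_day / seconds_per_day))  # 289
```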


It's probably significantly more than that daily average at peak times.


"At scale in production, adding a new node took days to complete all the handoffs..."

That's a bit of a headscratcher. What is happening during those 'days' and what is the primary limiting factor?

I keep meaning to get into Riak, but then stuff like this, where the system has crazy moments that are impossible to reason about coherently, keeps popping up.


It's not a crazy moment. They have a system running at real scale, and they found that, while keeping the system up constantly with immense amounts of data in it, they were able to dynamically add a new node to their cluster; it's just that balancing everything out took the system a lot of time.

The operational challenge I infer from this is that they had waited to add that node until they really needed it, because they expected adding the node to bring quick relief from their scaling issue. Instead, relief came a few days later, once the node was fully integrated.

Solution: don't wait to add nodes until the last minute.


Something they (understandably) didn't address in the video was fault tolerance on the Postgres side. What do people like these days for that?


I'll address that here: we use streaming replication with a hot standby for fault tolerance on the PostgreSQL side. For backups, in addition to the streaming replication, we ship WAL to S3 every 15 minutes or 16 MB (whichever comes first), and we take a full base backup to S3 daily during our low-traffic time of day. That uploads around 30 GB to S3 daily, LZO-compressed down to around 3 GB.
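For anyone unfamiliar with this setup, it maps onto roughly these 9.1-era primary-side settings. The archive_command is illustrative: WAL-E is one common tool for S3 WAL shipping with LZO compression, but the comment doesn't name the actual tooling used.

```ini
# postgresql.conf on the primary (PostgreSQL 9.1-era, illustrative)
wal_level = hot_standby          # WAL carries enough info for a hot standby
max_wal_senders = 3              # let the standby stream WAL
archive_mode = on
# Ship each 16 MB WAL segment to S3; WAL-E is one common choice
archive_command = 'wal-e wal-push %p'
archive_timeout = 900            # force a segment switch every 15 minutes
```

The standby then runs with standby_mode enabled and a primary_conninfo pointing back at the primary, and the daily base backup gives WAL replay a recent starting point.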


Thanks, Mitchell. Is your hot standby in a different zone (e.g. one in the West, one in the East) so you can handle one of their big outages?


Yes


In the last couple of months reading this site, I've seen more negative MongoDB reviews than for most other 'cool tech'. On one hand, MongoDB was pushed, by carrot or stick, into environments with high-write needs (OLTP-type systems). On the other hand, having true secondary indexes and semistructured data makes Mongo the 'closest to an RDBMS' choice. So things like the global write lock, indexes needing to fit in memory, auto-sharding questions, and single-threaded map-reduce are all pretty significant limitations for an OLTP data store. I hope MongoDB doesn't get discouraged, and instead steps back, reviews the academic foundations of the system, picks a couple of use cases, and optimizes builds for them (similar to data-warehouse vs. OLTP offerings).


The glossing over of Cassandra (because of the Digg history, etc.) was the point in the talk where I started raising my eyebrows. Engineers evaluating systems should let the systems speak for themselves.

We have a fairly small cassandra cluster in production, serving over 50x the volume they mentioned in their talk, with good latency (real-time bidding) and not-too-painful operational footprint.


Great talk! Looking at Riak vs MongoDB right now for a production system in fact. The data isn't K/V though and we need rich queries so I'm not sure what our solution will be unfortunately.


Why not just use postgresql?


Agreed, start with a known entity. Take the one special table that is killing you and move it to a special database (Riak/Cassandra etc) suited for that task only once you understand exactly what you need.


I think Kiip agrees with you completely too.


I find it a better course of action to work on properly understanding your data problem and then evaluate the options available instead of just sticking with the stand-by solution ("Well, we've always done x!").

Just like with anything in technology ... there is going to be pain as you escalate the level of complexity and what you are trying to accomplish.


I disagree, based on experience. On the surface, MongoDB did solve our data problem, and we thought we had researched the technology sufficiently to know it would work, even attending multiple MongoDB conferences and talking to 10gen employees.

The issue was that underlying architectural decisions in MongoDB ended up biting us and limiting us rather significantly. It could be argued that this is because MongoDB is a rather new piece of tech (I disagree, I think MongoDB is fundamentally flawed, but it doesn't matter in this argument).

Because of our experience, going with the standby IS the best choice, until you _need_ something else. Riak was a change necessitated by Kiip's fast growth, which made horizontal scaling necessary in an environment such as EC2. Ignoring MongoDB specifically, the IDEA of MongoDB simply isn't correct here; a Dynamo-style K/V store is the correct option, and Riak happens to be a fantastic one.


Because PostgreSQL is not a NoSQL database, despite the recent lipstick additions.

I love the database but it is not optimal if you have a fluid schema like many use cases have today.


I recently asked if anyone could point me to an (open source) app on github or wherever that has a 'fluid schema' on a NoSQL system and nobody was able to show me one. Do you have a concrete example I can look at please?


Nothing? For something as common as you say I'd have thought there would be many examples out there. I don't get how this can be touted so often when there's no concrete use cases for this 'feature'.


I think that you would find MongoDB to work nicely for you ... especially with the need for rich queries (which assumes, on my part, that you have more complex data needs, documents, etc.).

With all due respect to the Kiip engineering team, this wasn't a strong case for using Riak over MongoDB ... but rather for the general pain that an engineering team feels when horizontally scaling in the cloud.


Anything concrete about why cassandra is bad?..


One key difference is that Riak will automatically rebalance data across nodes as they are added or removed; Cassandra will not. You have to manually adjust the partitioning of data, balancing it by hand.
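For context on the "by hand" part: with the RandomPartitioner in the Cassandra of this era (before virtual nodes), operators typically computed each node's initial_token themselves by evenly dividing the 0..2^127 token ring, then set one token per node in cassandra.yaml; adding a node meant recomputing tokens and rebalancing with `nodetool move`. The standard formula is just:

```python
# Evenly spaced initial_token values for Cassandra's RandomPartitioner,
# whose token ring spans 0 .. 2**127.  Pre-vnodes, operators computed
# these by hand and assigned one per node in cassandra.yaml.
def initial_tokens(node_count):
    step = 2 ** 127 // node_count
    return [i * step for i in range(node_count)]

for token in initial_tokens(4):
    print(token)
```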




