Raft: Understandable Distributed Consensus (thesecretlivesofdata.com)
161 points by otoolep on Sept 5, 2014 | 79 comments



In the example with 5 nodes and a split, it is my understanding that the two nodes can't elect a leader.

Even if the candidate in the smaller split receives votes from a majority of that split, there is no true majority of the whole cluster, so no leader. The cluster is configured with the total number of nodes.

What could happen is that an already elected leader continues to think it's the leader for a while, while the rest of the cluster elects a new leader. The split leader will however fail to commit its log entries, and will throw them away once it rejoins.

Another important detail that's missing is that a node only votes once per term, and only for a node that has an equal or higher term than itself. It will never vote twice or vote for an outdated node.
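
Roughly, that voting rule looks like this in code (a minimal Go sketch with invented names, ignoring the log completeness check that real Raft also applies before granting a vote):

    package main

    import "fmt"

    type node struct {
        currentTerm int
        votedFor    string // "" means no vote cast in the current term
    }

    // handleRequestVote applies the two rules above: never vote for a candidate
    // with an older term, and grant at most one vote per term.
    func (n *node) handleRequestVote(candidateTerm int, candidateID string) bool {
        if candidateTerm < n.currentTerm {
            return false // outdated candidate
        }
        if candidateTerm > n.currentTerm {
            n.currentTerm = candidateTerm // newer term: adopt it...
            n.votedFor = ""               // ...and forget the old term's vote
        }
        if n.votedFor == "" || n.votedFor == candidateID {
            n.votedFor = candidateID
            return true
        }
        return false // already voted for someone else this term
    }

    func main() {
        n := &node{currentTerm: 1}
        fmt.Println(n.handleRequestVote(2, "A")) // true
        fmt.Println(n.handleRequestVote(2, "B")) // false: one vote per term
    }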

Changing the configuration is in fact handled in a special way at the end of the Raft paper, in a way that avoids split-brain.

[edit] Oh, the 2-node split in fact already contained the leader, so it does exactly what I described. Dur...


It's important to note that this consensus model only works if all nodes are honest.


Given that nodes are trusted and communicate securely, why would a node be dishonest? (I'm thinking about this in a standard server setting: is your point that you can't use it for distributed peer-to-peer stuff on random user machines?)


A "dishonest" node doesn't necessarily have to be explicitly malicious, it can be simply faulty.


Lots of research has gone into this, under the name Byzantine fault tolerance.

http://en.wikipedia.org/wiki/Byzantine_fault_tolerance


For one thing, you can't use it for cryptocurrency stuff.


Yep.

It seems to me that homogeneity amongst nodes in a distributed consensus mechanism is of utmost importance; once you implement a hierarchical power structure, dishonesty becomes difficult to deal with... especially when "voting" for a "leader" is involved in mapping the node landscape.

edit:spelling


Can someone explain to me how Raft differs from Viewstamped Replication? From reading both papers (VR revisited) it looks like Raft just renamed all of VR's nomenclature without changing anything significant. Paxos is fairly different since it only relies on a distinguished leader for guaranteed progress; it "works" without one. Under the hood the mechanism is still similar though, as opposed to something like chain replication.


See section 10 of the Raft paper.

> VR uses a leader-based approach with many similarities to Raft.

> However, Raft has less mechanism than VR or ZooKeeper because it minimizes the functionality in non-leaders. For example, log entries in Raft flow in only one direction: outward from the leader in AppendEntries RPCs. In VR log entries flow in both directions (leaders can receive log entries during the election process); this results in additional mechanism and complexity.

> Raft has fewer message types than any other algorithm for consensus-based log replication that we are aware of. For example, we counted the message types VR and ZooKeeper use for basic consensus and membership changes (excluding log compaction and client interaction, as these are nearly independent of the algorithms). VR and ZooKeeper each define 10 different message types, while Raft has only 4 message types (two RPC requests and their responses).


Raft itself is very nice, and the paper https://ramcloud.stanford.edu/wiki/download/attachments/1137... does an excellent job of explaining it, but I have some problems with this claim: "A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos".

Honestly, what I think happened is this: they first explained Paxos to the poor students, then asked questions, and in a later session explained Raft and asked questions. Couldn't it be that the students started processing the problem of distributed consensus between the sessions, so they got a better grasp of the topic? This would mean Paxos helped them understand Raft better. Anyway I'm nitpicking etc etc.


From section 9.1 in the paper:

> Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first in order to account for both individual differences in performance and experience gained from the first portion of the study. We compared participants’ scores on each quiz to determine whether participants showed a better understanding of Raft.


This seems to be a good strategy but it is difficult to factor out the quality of the explanations.

The reason I'm so picky about this claim is that before you know it, you have mythical pseudo-statistical claims like "some programmers are more than 10 times as good as others" that will live a life of their own. CS has way too many of those.


For those who are encountering distributed systems for the first time, it might also do you some good to look up (or use as search fodder) "Paxos".

If you want to dive even deeper, "FLP" and "Leslie Lamport" should also open up a can of interesting worms.


Can't this lose commits? It looks like the message to commit the log entry to the followers happens after the message to the client to say that the commit is confirmed. The client can't actually tell when their commit has been replicated. If the leader dies before sending that confirm to the followers, the client will end up thinking a commit happened that the new leader is going to have to roll back.

Is this something the full algorithm handles differently to the way the diagrams would indicate?


The response to the client is only sent after a majority of followers have the log entry. That's described in the text in the "Protocol Overview", and nicely animated in "Log Replication".
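
In other words, the leader only acknowledges the client once the entry's index is known to be on a majority of nodes. A minimal Go sketch of that check, with invented names (not taken from the presentation or any particular implementation):

    package main

    import (
        "fmt"
        "sort"
    )

    // committedIndex returns the highest log index stored on a majority of the
    // cluster. matchIndex[i] is the highest entry known to be replicated on
    // node i; the leader counts itself as one element of the slice.
    func committedIndex(matchIndex []int) int {
        sorted := append([]int(nil), matchIndex...)
        sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
        majority := len(matchIndex)/2 + 1
        return sorted[majority-1] // the majority-th highest value
    }

    func main() {
        // 5 nodes: the leader and two followers have entry 7, two are behind.
        fmt.Println(committedIndex([]int{7, 7, 7, 5, 4})) // 7: safe to answer the client
        // Only the leader and one follower have entry 7: not yet committed.
        fmt.Println(committedIndex([]int{7, 7, 5, 5, 4})) // 5
    }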


Yes, they have the log entry, but it can still be rolled back, can't it? Here's how I understand the process:

1. Client sends log entry to leader

2. Leader appends log entry, forwards it to followers

3. Majority of followers confirm

4. Leader commits the log entry

5. Leader confirms the commit to the client

6. Followers commit on the next heartbeat

What happens if the leader goes away between 5 and 6? To my eyes, it looks like the followers will time out, elect a new leader, and have to roll back the last log entry.


If an entry has been replicated to a majority of followers, then the new leader is guaranteed to have that entry and therefore it won't be rolled back.


How does the new leader know it went to a majority? The only entity which could confirm that is the old leader.


That is correct. The solution to this is given in section 5.4.1 (election restriction), section 5.4.2 (Committing entries from previous terms) and section 8 (Client interaction) of the Raft paper.

Roughly, a newly elected leader will have all committed entries (guaranteed by the "election restriction", 5.4.1) but it does not know precisely which entries are committed. The new leader will commit a no-op log entry (section 8), and after it has received replies from a majority of the cluster it will know which entries have already been committed.
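
A rough Go sketch of that no-op trick, with invented names (just to illustrate the mechanism, not the paper's reference implementation):

    package main

    import "fmt"

    type entry struct {
        term    int
        command string // "" marks a no-op
    }

    type leader struct {
        term        int
        log         []entry // index 1 is log[0], as in the paper's numbering
        commitIndex int
    }

    // onElected: a freshly elected leader appends a no-op entry in its own term.
    func (l *leader) onElected() {
        l.log = append(l.log, entry{term: l.term})
    }

    // onMajorityAck: once an entry from the leader's own term is replicated to
    // a majority, the leader may advance commitIndex to it, which also marks
    // every earlier entry inherited from previous terms as committed.
    func (l *leader) onMajorityAck(index int) {
        if index > l.commitIndex && l.log[index-1].term == l.term {
            l.commitIndex = index
        }
    }

    func main() {
        // A new leader in term 3 inherits two entries whose commit status it
        // cannot yet be sure of.
        l := &leader{term: 3, log: []entry{{1, "x=1"}, {2, "y=2"}}}
        l.onElected()
        l.onMajorityAck(len(l.log)) // the no-op reaches a majority
        fmt.Println(l.commitIndex)  // 3: the inherited entries are now known committed
    }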


Ooh, thanks. The no-op commit is particularly interesting.


This is the whole idea of distributed storage. You add many nodes (that are only confirmed to have the transaction in memory) to reduce the probability of all nodes crashing during one commit, and this is considered more reliable than locking the whole data store to wait for a confirmed disc write. I'm assuming you mean commit == write to disc?

i.e. P(client queuing up too many requests and crashing because the db is too slow waiting for disc) > P(5 distributed db nodes with the transaction in memory crashing simultaneously before it was written to disc)


I'm assuming "commit" means whatever "commit" means in the linked document.

EDIT: and it's not P(5...), it's P(Leader...) in the case I'm worrying about.


At most one side of a split can gain a majority (possibly neither can). Therefore you won't get a success message unless you can be sure that the data won't get overwritten after the split is reconnected.


Horrible presentation - the enforced pauses (even hiding the continue button during them!) cause frustration after every single sentence.

Why can't I read at a normal pace instead of being interrupted all the time and having to wait while the next sentence is shown?

[edit] This behavior would be very suitable if I was making a presentation to an audience with this content - but it's quite contrary to what's needed for the audience to view the content themselves at their own pace.


Well, I like it - a lot.

It would be even nicer if the "Continue" button had a permanent position (and if I could use enter/space/pagedown/... instead of mouse), but I didn't notice that it was hidden between animations. I guess I am slower than you are. :)

I am not sure if the concept is valid (some other comments have issue with that), but it was well presented. Good job, OP - keep it up!


You can use the arrow keys to navigate, which is a lot better than clicking "Continue" each time.


Arrow keys don't help much on a tablet or a phone.


Author here. Thanks for the feedback.

The presentation originally moved forward on its own but as you get into later sections there's a lot going on visually so it's easy to miss key points in the explanation. There's also an issue that the presentation is attached to the wall clock in later sections so you need the visual in a certain state before moving forward on the explanation.

In hindsight, I agree that there are better ways to present this information. Part of this project is to help me learn how to communicate complicated topics like distributed consensus in the most effective way. Most existing resources are 20-page PhD dissertations, which are not very accessible to beginners, so I'm just figuring this out as I go.

I'm changing the format in future presentations. I'm working on infrastructure now to allow D3.js visualizations to be embedded into Medium blog posts. That'll give the best of both worlds -- read at your own pace and interact with mini visualizations as you go along.

I'm always looking for CS topics to visualize and help explain better. If you (or anyone) has any suggestions, please let me know. I'm @benbjohnson on Twitter.


+1 It's infuriatingly slow. This is almost worse than presenting text content as videos.


I thought it was great. The concepts are not simple and, for me at least, took time to process.


I disagree. This presentation was fantastic. I was going to post how much I loved it, so instead I'll come to its defense.


I liked it very much. This works well for individuals whose reading speed exceeds their speed of comprehension. For example, I tend to read way too fast for my own good and need something to pace me (only for technical texts, not fiction). This works very well. It allows me to focus on one thing at a time.


It's insulting. Especially the first few spoon-fed sentences. Why not at the very least have the continue button come up instantly?


It's intended as a gentler introduction to Raft[1]. I thought it was great for someone with no real knowledge of distributed systems like myself.

[1] According to http://raftconsensus.github.io.


I only managed to click continue two or three times before I gave up in frustration.


You guys have a serious patience problem. Sure, it was a bit frustrating, but come on.


Much of this feedback misses the point -- the presentation is meant to supplement other forms of study. Read the paper a couple of times, and then come back to this. You'll appreciate its dynamic nature then.


This may be a bit off-topic, but I fail to understand why the top 2 textbooks on Distributed Computing - Tanenbaum and Coulouris - don't have a dedicated section on consensus algorithms. I learned distributed computing from Tanenbaum and can't recall encountering it.

Contrary to some of the folks here, I found the presentation very cool. But that may be because I'm a slow learner.


You may want to check out "Reliable and Secure Distributed Programming", it has a few chapters on Consensus.


So this site is using a custom-built library called "playback.js":

https://github.com/benbjohnson/playback.js

Looks interesting.


I accidentally clicked twice on "Continue" and there wasn't any way to go back to read the content I missed. This slide-show player wants to be clean and simple and ends up with holes in its functionality.


While this will not help for touch screen devices, left arrow goes back.


In the network partition example, you say that in the smaller partition, changes cannot be committed because they cannot be replicated to the majority of nodes (as the smaller partition is... smaller). How is the partition to know this? The system can't tell the difference between a node leaving the network and a node undergoing a (temporary) partition.

To give an example, say I have n machines in datacenter A, and n*.99 in datacenter B. datacenter A gets destroyed, permanently. Does datacenter B now reject all (EDIT: where reject = not commit) requests until a human comes along to tell it that datacenter A isn't coming back?


> To give an example, say I have n machines in datacenter A, and n*.99 in datacenter B. datacenter A gets destroyed, permanently. Does datacenter B now reject all (EDIT: where reject = not commit) requests until a human comes along to tell it that datacenter A isn't coming back?

Of CAP, you are now choosing CP with Raft. So yes, the system is unavailable until an external agent fixes it. In other words, the system needs to have a majority of nodes online to be "available".


What would happen if nodes were to be added to each side of a network partition (unknown to the other side), so that each side believed they had a majority? Or is the "writing" side of the partition determined at partition time, and not changed until they are restored?


To add a new node to the network, it

* needs to have the same data as the other nodes

* needs a round of Raft to notify its presence to other nodes

So you can only add new nodes (automatically) when you have a 'live' system.

majority = ceil((2n + 1)/2) = n + 1: so by counting the number of reachable nodes in its partition, a node can figure out whether it is in the majority or the minority cluster.

See section 6 in the paper for details of its implementation.


In the goraft implementation (https://github.com/goraft) they use an explicit leave command that gets added to the log. This way, if a leave command was not received, a partition can be assumed.


Very cool intro website.

Is this used in production somewhere already? Would love to hear more of the details about use cases and deployment.


Raft?

It's used in etcd, consul, serf and probably more.


Just a nitpick here, with the qualification that I'm one of the authors of Serf: Serf doesn't use Raft. Serf is masterless and the distributed messaging protocol used is SWIM (a gossip protocol).


Apologies, thanks for the correction.


Username "stonith" commenting on a post about Raft? I laughed out loud.

http://en.wikipedia.org/wiki/STONITH


I used it to build distributed replication of a SQLite database. Quite good fun.

https://github.com/otoolep/rqlite


InfluxDB too, a time-series database written entirely in Go.

http://influxdb.com


They've written a lot on the subject. For anyone who prefers the arrow key over the 'continue' button, https://speakerdeck.com/search?q=raft


The arrow key works for me (in Chrome at least).


So what happens if a network partition occurs where both sides can elect a new leader?


This can't happen. You can't divide a cluster in half and still have a majority. See my other post.


Ah, ok, so all nodes need to be known beforehand.


Yeah, and any changes to the set of nodes (adding or removing a node from the cluster) must be agreed upon by a majority of the existing nodes.


At most one side can elect a new leader, since there are an odd number of nodes.


Holy cow: social media is a means for distributing consensus.


How do you do these animations?


Sex.


Awesome


Where's the skip intro button on this horror show?



If I've understood the presentation correctly, Raft is a master-slave protocol that determines how to choose a master.

It basically relies on random chance (i.e. who receives the message first) to elect a master, has basically no real way of resolving a conflict in an election (i.e. if two nodes receive the same number of votes, we do a re-election ad infinitum), and does not address the situation of two nodes having conflicting sets of data (for instance from a network partition).

Considering all that, this protocol doesn't seem very interesting (from a use-case point of view).


For the split vote see https://github.com/goraft/raft/issues/13

The conflicting data under partition is covered, I think - the side that doesn't meet quorum won't accept writes, and it will see that it has a lower election term than the other partition when it rejoins.

EDIT:

From a use case point of view, it simplifies the construction of CP systems. This has directly led to etcd and consul, which would be many times more complex if their authors had had to implement Paxos.

Both etcd and consul are still young software, but if you take a look at the 'Call Me Maybe' series of blog posts it's pretty apparent that there's a massive deficiency in how current systems handle network partitions.

http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-an...


    > If I've understood the presentation correctly, Raft is a 
    > master-slave protocol that determines how to choose a 
    > master.
You haven't understood the presentation correctly.


> and does not address the situation of two nodes having conflicting sets of data (for instance from network partition).

I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)
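
Concretely, "less complete" is decided by comparing the term of the last log entry first and the log length second. A minimal Go sketch, with invented names:

    package main

    import "fmt"

    // candidateLogIsUpToDate is the election restriction: a voter grants its vote
    // only if the candidate's log is at least as up-to-date as its own, comparing
    // last-entry terms first and last log indexes second.
    func candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
        if candLastTerm != myLastTerm {
            return candLastTerm > myLastTerm
        }
        return candLastIndex >= myLastIndex
    }

    func main() {
        fmt.Println(candidateLogIsUpToDate(3, 10, 3, 12)) // false: same term, shorter log
        fmt.Println(candidateLogIsUpToDate(4, 8, 3, 12))  // true: higher last term wins
    }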

> Considering it basically relies on random chance (I.E. who receives the message first) to elect a master, has basically no real way of resolving a conflict in election

This is mostly true. The PDF slides I link to below recommend that the election timeout be much greater than the broadcast time, the idea being that things should work out in the long run.
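
The other half of the trick is that every node draws a fresh random timeout for each election, so a repeated collision is unlikely. A small Go sketch, assuming the 150-300ms range the paper suggests (everything else is invented):

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // electionTimeout picks a new randomized timeout for each election cycle.
    // The spread makes it unlikely that two candidates time out together and
    // split the vote; if they do, they simply retry with new random timeouts.
    func electionTimeout() time.Duration {
        lo, hi := 150*time.Millisecond, 300*time.Millisecond
        return lo + time.Duration(rand.Int63n(int64(hi-lo)))
    }

    func main() {
        for i := 0; i < 3; i++ {
            fmt.Println(electionTimeout())
        }
    }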

Highly recommend the PDF slides here, as they explain it better than I can: https://ramcloud.stanford.edu/~ongaro/userstudy/ — there's also a YouTube talk here: https://www.youtube.com/watch?v=YbZ3zDzDnrw

> Considering all that, this protocol doesn't seem very interesting (from a use-case point of view).

I'd love to hear of alternatives.


> I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)

What happens if (for instance) a 4 node cluster splits into 2 node clusters (I.E. a network fault between two data centers)- does each cluster choose a leader? how is "majority" calculated? is the raft protocol unable to handle half of its nodes being taken down? What happens if two clusters break off, both choose a leader (if it's possible), both get new writes and then both clusters come back together?

> I'd love to hear of alternatives.

I know of no protocols per se, but as for implementations of a master-slave protocol, there's Mongo's replica-set algorithm (one notable difference is that each node can have a priority).

There are also master-master implementations (such as cassandra's) that require no election, and serve IMO more interesting use-cases.


> What happens if (for instance) a 4 node cluster splits into 2 node clusters (I.E. a network fault between two data centers)- does each cluster choose a leader?

A Raft cluster must have an odd number of nodes.

> how is "majority" calculated?

ceil(nodes/2).

> is the raft protocol unable to handle half of its nodes being taken down? What happens if two clusters break off, both choose a leader (if it's possible), both get new writes and then both clusters come back together?

They cannot each choose a leader, see above.


> A Raft cluster must have an odd number of nodes.

Why must a Raft cluster have an odd number of nodes?

> > how is "majority" calculated?

> ceil(nodes/2).

A majority is defined as having greater than half the votes. I.e., you need floor(nodes / 2) + 1 votes, or more simply votes > nodes / 2. Even in an even-sized cluster that can only hold true for one node, so it cannot cause splits.
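
Or, as the trivial check it usually boils down to in an implementation (a sketch; the function name is made up):

    package main

    import "fmt"

    // quorum is the smallest number of votes strictly greater than half the
    // cluster, for any cluster size, odd or even.
    func quorum(nodes int) int {
        return nodes/2 + 1
    }

    func main() {
        for _, n := range []int{3, 4, 5} {
            fmt.Printf("%d nodes -> %d votes needed\n", n, quorum(n))
        }
        // 3 nodes -> 2 votes needed
        // 4 nodes -> 3 votes needed
        // 5 nodes -> 3 votes needed
    }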


> A Raft cluster must have an odd number of nodes.

what about 7 nodes splitting into 3 / 3 / 1?


Not sure I understand. A node in each split cluster would need at least 4 votes to be elected leader. Hence no node can be elected leader since all split clusters have strictly fewer than 4 nodes.

Theorem. With 2n + 1 nodes, there can not be two separate majorities after a net split.

Proof. By way of contradiction, assume there are two separate majorities. Each separate majority would contain at least ceil((2n + 1)/2) = n + 1 nodes. This implies that there are in total at least 2(n + 1) = 2n + 2 nodes in the system, contradiction.


> What happens if (for instance) a 4 node cluster splits into 2 node clusters (I.E. a network fault between two data centers)- does each cluster choose a leader?

No majority is possible here.

> how is "majority" calculated?

The definition of majority is greater than half the set (that is the meaning of the word). If you have four members, 3 is the lowest such number that is greater than 4 / 2.


It does seem like the election timeout needs to be way longer than 150ms. The (default) TCP min_rto parameter on Linux is 200ms. A single lost packet could cause an election with the timeout from the paper. You'd either need to use longer timeouts or reduce min_rto to something sensible. And of course the equation changes for geographically distant servers.


You really need to give it another read. Maybe you've seen an earlier draft than I did, but for me all of what you say is addressed.



