Raft: Understandable Distributed Consensus (thesecretlivesofdata.com)
161 points by otoolep on Sept 5, 2014 | 79 comments



In the example with 5 nodes and a split, it is my understanding that the two nodes can't elect a leader.

Even if the candidate in the smaller split receives votes from a majority of that split, there is no true majority of the whole cluster, so no leader. The cluster is configured with the total number of nodes.

What could happen is that an already elected leader continues to think it's the leader for a while, while the rest of the cluster elects a new leader. The split leader will however fail to commit its log entries, and will throw them away once it rejoins.

Another important detail that's missing is that a node only votes once per term, and only for a node that has an equal or higher term than itself. It will never vote twice or vote for an outdated node.
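
Roughly, that voting rule looks like this in code (a minimal Go sketch with invented names, ignoring the log completeness check that real Raft also applies before granting a vote):

    package main

    import "fmt"

    type node struct {
        currentTerm int
        votedFor    string // "" means no vote cast in the current term
    }

    // handleRequestVote applies the two rules above: never vote for a candidate
    // with an older term, and grant at most one vote per term.
    func (n *node) handleRequestVote(candidateTerm int, candidateID string) bool {
        if candidateTerm < n.currentTerm {
            return false // outdated candidate
        }
        if candidateTerm > n.currentTerm {
            n.currentTerm = candidateTerm // newer term: adopt it...
            n.votedFor = ""               // ...and forget the old term's vote
        }
        if n.votedFor == "" || n.votedFor == candidateID {
            n.votedFor = candidateID
            return true
        }
        return false // already voted for someone else this term
    }

    func main() {
        n := &node{currentTerm: 1}
        fmt.Println(n.handleRequestVote(2, "A")) // true
        fmt.Println(n.handleRequestVote(2, "B")) // false: one vote per term
    }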

Changing the configuration is in fact handled in a special way at the end of the Raft paper, in a way that avoids split-brain.

[edit] Oh, the 2-node split in fact already contained the leader, so it does exactly what I described. Dur...


It's important to note that this consensus model only works if all nodes are honest.


Given that nodes are trusted and communicate securely, why would a node be dishonest? (I'm thinking about this in a standard server setting: is your point that you can't use it for distributed peer-to-peer stuff on random user machines?)


A "dishonest" node doesn't necessarily have to be explicitly malicious, it can be simply faulty.


Lots of research has gone into this, under the name Byzantine fault tolerance.

http://en.wikipedia.org/wiki/Byzantine_fault_tolerance


For one thing, you can't use it for cryptocurrency stuff.


Yep.

It seems to me that homogeneity amongst nodes in a distributed consensus mechanism is of utmost importance; once you implement a hierarchical power structure, dishonesty becomes difficult to deal with... especially when "voting" for a "leader" is involved in mapping the node landscape.

edit:spelling


Can someone explain to me how Raft differs from Viewstamped Replication? From reading both papers (VR revisited) it looks like Raft just renamed all of VR's nomenclature without changing anything significant. Paxos is fairly different since it only relies on a distinguished leader for guaranteed progress; it "works" without one. Under the hood the mechanism is still similar though, as opposed to something like chain replication.


See section 10 of the Raft paper.

> VR uses a leader-based approach with many similarities to Raft.

> However, Raft has less mechanism than VR or ZooKeeper because it minimizes the functionality in non-leaders. For example, log entries in Raft flow in only one direction: outward from the leader in AppendEntries RPCs. In VR log entries flow in both directions (leaders can receive log entries during the election process); this results in additional mechanism and complexity.

> Raft has fewer message types than any other algorithm for consensus-based log replication that we are aware of. For example, we counted the message types VR and ZooKeeper use for basic consensus and membership changes (excluding log compaction and client interaction, as these are nearly independent of the algorithms). VR and ZooKeeper each define 10 different message types, while Raft has only 4 message types (two RPC requests and their responses).


Raft itself is very nice, and the paper https://ramcloud.stanford.edu/wiki/download/attachments/1137... does an excellent job of explaining it, but I have some problems with this claim: "A user study with 43 students at two universities shows that Raft is significantly easier to understand than Paxos: after learning both algorithms, 33 of these students were able to answer questions about Raft better than questions about Paxos".

Honestly, what I think happened is this: they first explained Paxos to the poor students, then asked questions, and in a later session explained Raft and asked questions. Couldn't it be that the students started processing the problem of distributed consensus between the sessions, so they got a better grasp of the topic? This would mean Paxos helped them understand Raft better. Anyway I'm nitpicking etc etc.


From section 9.1 in the paper:

> Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first in order to account for both individual differences in performance and experience gained from the first portion of the study. We compared participants’ scores on each quiz to determine whether participants showed a better understanding of Raft.


This seems to be a good strategy but it is difficult to factor out the quality of the explanations.

The reason I'm so picky about this claim is that before you know it, you have mythical pseudo-statistical claims like "some programmers are more than 10 times as good as others" that will live a life of their own. CS has way too many of those.


For those who are encountering distributed systems for the first time, it might also do you some good to look up (or use as search fodder) "Paxos".

If you want to dive even deeper, "FLP" and "Leslie Lamport" should also open up a can of interesting worms.


Can't this lose commits? It looks like the message to commit the log entry to the followers happens after the message to the client to say that the commit is confirmed. The client can't actually tell when their commit has been replicated. If the leader dies before sending that confirm to the followers, the client will end up thinking a commit happened that the new leader is going to have to roll back.

Is this something the full algorithm handles differently to the way the diagrams would indicate?


The response to the client is only sent after a majority of followers have the log entry. That's described in the text in the "Protocol Overview", and nicely animated in "Log Replication".
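
In other words, the leader only acknowledges the client once the entry's index is known to be on a majority of nodes. A minimal Go sketch of that check, with invented names (not taken from the presentation or any particular implementation):

    package main

    import (
        "fmt"
        "sort"
    )

    // committedIndex returns the highest log index stored on a majority of the
    // cluster. matchIndex[i] is the highest entry known to be replicated on
    // node i; the leader counts itself as one element of the slice.
    func committedIndex(matchIndex []int) int {
        sorted := append([]int(nil), matchIndex...)
        sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
        majority := len(matchIndex)/2 + 1
        return sorted[majority-1] // the majority-th highest value
    }

    func main() {
        // 5 nodes: the leader and two followers have entry 7, two are behind.
        fmt.Println(committedIndex([]int{7, 7, 7, 5, 4})) // 7: safe to answer the client
        // Only the leader and one follower have entry 7: not yet committed.
        fmt.Println(committedIndex([]int{7, 7, 5, 5, 4})) // 5
    }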


Yes, they have the log entry, but it can still be rolled back, can't it? Here's how I understand the process:

1. Client sends log entry to leader

2. Leader appends log entry, forwards it to followers

3. Majority of followers confirm

4. Leader commits the log entry

5. Leader confirms the commit to the client

6. Followers commit on the next heartbeat

What happens if the leader goes away between 5 and 6? To my eyes, it looks like the followers will time out, elect a new leader, and have to roll back the last log entry.


If an entry has been replicated to a majority of followers, then the new leader is guaranteed to have that entry and therefore it won't be rolled back.


How does the new leader know it went to a majority? The only entity which could confirm that is the old leader.


That is correct. The solution to this is given in section 5.4.1 (election restriction), section 5.4.2 (Committing entries from previous terms) and section 8 (Client interaction) of the Raft paper.

Roughly, a newly elected leader will have all committed entries (guaranteed by the "election restriction", 5.4.1) but it does not know precisely which entries are committed. The new leader will commit a no-op log entry (section 8), and after it has received replies from a majority of the cluster it will know which entries have already been committed.
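
A rough Go sketch of that no-op trick, with invented names (just to illustrate the mechanism, not the paper's reference implementation):

    package main

    import "fmt"

    type entry struct {
        term    int
        command string // "" marks a no-op
    }

    type leader struct {
        term        int
        log         []entry // index 1 is log[0], as in the paper's numbering
        commitIndex int
    }

    // onElected: a freshly elected leader appends a no-op entry in its own term.
    func (l *leader) onElected() {
        l.log = append(l.log, entry{term: l.term})
    }

    // onMajorityAck: once an entry from the leader's own term is replicated to
    // a majority, the leader may advance commitIndex to it, which also marks
    // every earlier entry inherited from previous terms as committed.
    func (l *leader) onMajorityAck(index int) {
        if index > l.commitIndex && l.log[index-1].term == l.term {
            l.commitIndex = index
        }
    }

    func main() {
        // A new leader in term 3 inherits two entries whose commit status it
        // cannot yet be sure of.
        l := &leader{term: 3, log: []entry{{1, "x=1"}, {2, "y=2"}}}
        l.onElected()
        l.onMajorityAck(len(l.log)) // the no-op reaches a majority
        fmt.Println(l.commitIndex)  // 3: the inherited entries are now known committed
    }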


Ooh, thanks. The no-op commit is particularly interesting.


This is the whole idea of distributed storage. You add many nodes (that are only confirmed to have the transaction in memory) to reduce the probability of all nodes crashing during one commit, and this is considered more reliable than locking the whole data store to wait for a confirmed disc write. I'm assuming you mean commit == write to disc?

i.e. P(client queuing up too many requests and crashing because the db is too slow waiting for disc) > P(5 distributed db nodes with the transaction in memory crashing simultaneously before it was written to disc)


I'm assuming "commit" means whatever "commit" means in the linked document.

EDIT: and it's not P(5...), it's P(Leader...) in the case I'm worrying about.


At most one side of a split can gain a majority (possibly neither can). Therefore you won't get a success message unless you can be sure that the data won't get overwritten after the split is reconnected.


Horrible presentation - the enforced pauses (even hiding the continue button during them!) cause frustration after every single sentence.

Why can't I read at a normal pace instead of being interrupted all the time and having to wait while the next sentence is shown?

[edit] This behavior would be very suitable if I was making a presentation to an audience with this content - but it's quite contrary to what's needed for the audience to view the content themselves at their own pace.


Well, I like it - a lot.

It would be even nicer if the "Continue" button had a permanent position (and if I could use enter/space/pagedown/... instead of mouse), but I didn't notice that it was hidden between animations. I guess I am slower than you are. :)

I am not sure if the concept is valid (some other comments have issue with that), but it was well presented. Good job, OP - keep it up!


You can use the arrow keys to navigate, which is a lot better than clicking "Continue" each time.


Arrow keys don't help much on a tablet or a phone.


Author here. Thanks for the feedback.

The presentation originally moved forward on its own but as you get into later sections there's a lot going on visually so it's easy to miss key points in the explanation. There's also an issue that the presentation is attached to the wall clock in later sections so you need the visual in a certain state before moving forward on the explanation.

In hindsight, I agree that there are better ways to present this information. Part of this project is to help me learn how to communicate complicated topics like distributed consensus in the most effective way. Most existing resources are 20-page PhD dissertations, which are not very accessible to beginners, so I'm just figuring this out as I go.

I'm changing the format in future presentations. I'm working on infrastructure now to allow D3.js visualizations to be embedded into Medium blog posts. That'll give the best of both worlds -- read at your own pace and interact with mini visualizations as you go along.

I'm always looking for CS topics to visualize and help explain better. If you (or anyone) has any suggestions, please let me know. I'm @benbjohnson on Twitter.


+1 It's infuriatingly slow. This is almost worse than presenting text content as videos.


I thought it was great. The concepts are not simple and, for me at least, took time to process.


I disagree. This presentation was fantastic. I was going to post how much I loved it, so instead I'll come to its defense.


I liked it very much. This works well for individuals whose reading speed exceeds their speed of comprehension. For example, I tend to read way too fast for my own good and need something to pace me (only for technical texts, not fiction). This works very well. It allows me to focus on one thing at a time.


It's insulting. Especially the first few spoon-fed sentences. Why not at the very least have the continue button come up instantly?


It's intended as a gentler introduction to Raft[1]. I thought it was great for someone with no real knowledge of distributed systems like myself.

[1] According to http://raftconsensus.github.io.


I only managed to click continue two or three times before I gave up in frustration.


You guys have a serious patience problem. Sure, it was a bit frustrating, but come on.


Much of this feedback misses the point -- the presentation is meant to supplement other forms of study. Read the paper a couple of times, and then come back to this. You'll appreciate its dynamic nature then.


This may be a bit off-topic, but I fail to understand why the top 2 textbooks on Distributed Computing - Tanenbaum and Coulouris - don't have a dedicated section on consensus algorithms. I learned distributed computing from Tanenbaum and can't recall encountering it.

Contrary to some of the folks here, I found the presentation very cool. But that may be because I'm a slow learner.


You may want to check out "Reliable and Secure Distributed Programming", it has a few chapters on Consensus.


So this site is using a custom-built library called "playback.js":

https://github.com/benbjohnson/playback.js

Looks interesting.


I accidentally clicked twice on "Continue" and there wasn't any way to go back to read the content I missed. This slide-show player wants to be clean and simple and ends up with holes in its functionality.


While this will not help for touch screen devices, left arrow goes back.


In the network partition example, you say that in the smaller partition, changes cannot be committed because they cannot be replicated to the majority of nodes (as the smaller partition is... smaller). How is the partition to know this? The system can't tell the difference between a node leaving the network and a node undergoing a (temporary) partition.

To give an example, say I have n machines in datacenter A, and n*.99 in datacenter B. datacenter A gets destroyed, permanently. Does datacenter B now reject all (EDIT: where reject = not commit) requests until a human comes along to tell it that datacenter A isn't coming back?


> To give an example, say I have n machines in datacenter A, and n*.99 in datacenter B. datacenter A gets destroyed, permanently. Does datacenter B now reject all (EDIT: where reject = not commit) requests until a human comes along to tell it that datacenter A isn't coming back?

Of CAP, you are now choosing CP with Raft. So yes, the system is unavailable until an external agent fixes it. In other words, the system needs to have a majority of nodes online to be "available".


What would happen if nodes were to be added to each side of a network partition (unknown to the other side), so that each side believed they had a majority? Or is the "writing" side of the partition determined at partition time, and not changed until they are restored?


To add a new node to the network, it

* needs to have the same data as the other nodes

* needs a round of Raft to notify its presence to other nodes

So you can only add new nodes (automatically) when you have a 'live' system.

majority = ceil((2n + 1)/2) = n + 1: so by counting the number of reachable nodes in its partition, a node can figure out whether it is in the majority or the minority cluster.

See section 6 in the paper for details of its implementation.


In the goraft implementation (https://github.com/goraft) they use an explicit leave command that gets added to the log. This way, if a leave command was not received, a partition can be assumed.


Very cool intro website.

Is this used in production somewhere already? Would love to hear more of the details about use cases and deployment.


Raft?

It's used in etcd, consul, serf and probably more.


Just a nitpick here, with the qualification that I'm one of the authors of Serf: Serf doesn't use Raft. Serf is masterless and the distributed messaging protocol used is SWIM (a gossip protocol).


Apologies, thanks for the correction.


Username "stonith" commenting on a post about Raft? I laughed out loud.

http://en.wikipedia.org/wiki/STONITH


I used it to build distributed replication of a SQLite database. Quite good fun.

https://github.com/otoolep/rqlite


InfluxDB too, a time-series database written entirely in Go.

http://influxdb.com


They've written a lot on the subject. For anyone who prefers the arrow key over the 'continue' button, https://speakerdeck.com/search?q=raft


The arrow key works for me (in Chrome at least).


So what happens if a network partition occurs where both sides can elect a new leader?


This can't happen. You can't divide a cluster in half and still have a majority. See my other post.


Ah, ok, so all nodes need to be known beforehand.


Yeah, and any changes to the set of nodes (adding or removing a node from the cluster) must be agreed upon by a majority of the existing nodes.


At most one side can elect a new leader, since there are an odd number of nodes.


Holy cow: social media is a means for distributing consensus.


How do you do these animations?


Sex.


Awesome


Where's the skip intro button on this horror show?



If I've understood the presentation correctly, Raft is a master-slave protocol that determines how to choose a master.

It basically relies on random chance (i.e. who receives the message first) to elect a master, has basically no real way of resolving a conflict in an election (i.e. if two nodes receive the same number of votes, we do a re-election ad infinitum), and does not address the situation of two nodes having conflicting sets of data (for instance from a network partition).

Considering all that, this protocol doesn't seem very interesting (from a use-case point of view).


For the split vote see https://github.com/goraft/raft/issues/13

The conflicting data under partition is covered, I think - the side that doesn't meet quorum won't accept writes, and it will see that it has a lower election term than the other partition when it rejoins.

EDIT:

From a use case point of view, it simplifies the construction of CP systems. This has directly led to etcd and consul, which would be many times more complex if their authors had had to implement Paxos.

Both etcd and consul are still young software, but if you take a look at the 'Call Me Maybe' series of blog posts it's pretty apparent that there's a massive deficiency in how current systems handle network partitions.

http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-an...


    > If I've understood the presentation correctly, Raft is a 
    > master-slave protocol that determines how to choose a 
    > master.
You haven't understood the presentation correctly.


> and does not address the situation of two nodes having conflicting sets of data (for instance from network partition).

I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)
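
Concretely, "less complete" is decided by comparing the term of the last log entry first and the log length second. A minimal Go sketch, with invented names:

    package main

    import "fmt"

    // candidateLogIsUpToDate is the election restriction: a voter grants its vote
    // only if the candidate's log is at least as up-to-date as its own, comparing
    // last-entry terms first and last log indexes second.
    func candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
        if candLastTerm != myLastTerm {
            return candLastTerm > myLastTerm
        }
        return candLastIndex >= myLastIndex
    }

    func main() {
        fmt.Println(candidateLogIsUpToDate(3, 10, 3, 12)) // false: same term, shorter log
        fmt.Println(candidateLogIsUpToDate(4, 8, 3, 12))  // true: higher last term wins
    }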

> Considering it basically relies on random chance (I.E. who receives the message first) to elect a master, has basically no real way of resolving a conflict in election

This is mostly true. The PDF slides I link to below recommend that the election timeout be much greater than the broadcast time, the idea being that things should work out in the long run.
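
The other half of the trick is that every node draws a fresh random timeout for each election, so a repeated collision is unlikely. A small Go sketch, assuming the 150-300ms range the paper suggests (everything else is invented):

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // electionTimeout picks a new randomized timeout for each election cycle.
    // The spread makes it unlikely that two candidates time out together and
    // split the vote; if they do, they simply retry with new random timeouts.
    func electionTimeout() time.Duration {
        lo, hi := 150*time.Millisecond, 300*time.Millisecond
        return lo + time.Duration(rand.Int63n(int64(hi-lo)))
    }

    func main() {
        for i := 0; i < 3; i++ {
            fmt.Println(electionTimeout())
        }
    }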

Highly recommend the PDF slides here, as they explain it better than I can: https://ramcloud.stanford.edu/~ongaro/userstudy/ — there's also a YouTube talk here: https://www.youtube.com/watch?v=YbZ3zDzDnrw

> Considering all that, this protocol doesn't seem very interesting (from a use-case point of view).

I'd love to hear of alternatives.


> I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)

What happens if (for instance) a 4 node cluster splits into 2 node clusters (I.E. a network fault between two data centers)- does each cluster choose a leader? how is "majority" calculated? is the raft protocol unable to handle half of its nodes being taken down? What happens if two clusters break off, both choose a leader (if it's possible), both get new writes and then both clusters come back together?

> I'd love to hear of alternatives.

I know of no protocols per se, but as for implementations of a master-slave protocol, there's Mongo's replica-set algorithm (one notable difference is that each node can have a priority).

There are also master-master implementations (such as cassandra's) that require no election, and serve IMO more interesting use-cases.


> What happens if (for instance) a 4 node cluster splits into 2 node clusters (I.E. a network fault between two data centers)- does each cluster choose a leader?

A Raft cluster must have an odd number of nodes.

> how is "majority" calculated?

ceil(nodes/2).

> is the raft protocol unable to handle half of its nodes being taken down? What happens if two clusters break off, both choose a leader (if it's possible), both get new writes and then both clusters come back together?

They cannot each choose a leader, see above.


> A Raft cluster must have an odd number of nodes.

Why must a Raft cluster have an odd number of nodes?

> > how is "majority" calculated?

> ceil(nodes/2).

A majority is defined as having greater than half the votes. I.e., you need floor(nodes / 2) + 1 votes, or more simply votes > nodes / 2. Even in an even-sized cluster that can only hold true for one node, so it cannot cause splits.
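
Or, as the trivial check it usually boils down to in an implementation (a sketch; the function name is made up):

    package main

    import "fmt"

    // quorum is the smallest number of votes strictly greater than half the
    // cluster, for any cluster size, odd or even.
    func quorum(nodes int) int {
        return nodes/2 + 1
    }

    func main() {
        for _, n := range []int{3, 4, 5} {
            fmt.Printf("%d nodes -> %d votes needed\n", n, quorum(n))
        }
        // 3 nodes -> 2 votes needed
        // 4 nodes -> 3 votes needed
        // 5 nodes -> 3 votes needed
    }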


> A Raft cluster must have an odd number of nodes.

what about 7 nodes splitting into 3 / 3 / 1?


Not sure I understand. A node in each split cluster would need at least 4 votes to be elected leader. Hence no node can be elected leader since all split clusters have strictly fewer than 4 nodes.

Theorem. With 2n + 1 nodes, there can not be two separate majorities after a net split.

Proof. By way of contradiction, assume there are two separate majorities. Each separate majority would contain at least ceil((2n + 1)/2) = n + 1 nodes. This implies that there are in total at least 2(n + 1) = 2n + 2 nodes in the system, contradiction.


> What happens if (for instance) a 4 node cluster splits into 2 node clusters (I.E. a network fault between two data centers)- does each cluster choose a leader?

No majority is possible here.

> how is "majority" calculated?

The definition of majority is greater than half the set (that is the meaning of the word). If you have four members, 3 is the lowest such number that is greater than 4 / 2.


It does seem like the election timeout needs to be way longer than 150ms. The (default) TCP min_rto parameter on Linux is 200ms. A single lost packet could cause an election with the timeout from the paper. You'd either need to use longer timeouts or reduce min_rto to something sensible. And of course the equation changes for geographically distant servers.


You really need to give it another read. Maybe you've seen an earlier draft than I did, but for me all of what you say is addressed.



