> and does not address the situation of two nodes having conflicting sets of data (for instance from network partition).
I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)
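To make the election rule concrete, here is a minimal sketch (Python, with hypothetical field names, not taken from any reference implementation) of the check a voter performs before granting its vote. The "at least as up-to-date" comparison is what prevents a node missing committed entries from collecting a majority:

    def log_is_up_to_date(candidate_last_term, candidate_last_index,
                          my_last_term, my_last_index):
        # Raft's election restriction: grant a vote only if the candidate's
        # log is at least as complete as the voter's own. Logs are compared
        # by the term of their last entry first, then by length.
        if candidate_last_term != my_last_term:
            return candidate_last_term > my_last_term
        return candidate_last_index >= my_last_index

    def grant_vote(request, state):
        # `request` carries the candidate's last log position; `state` is
        # this node's persistent state (both hypothetical structures).
        if request.term < state.current_term:
            return False  # stale candidate from an old term
        if state.voted_for not in (None, request.candidate_id):
            return False  # at most one vote per term
        return log_is_up_to_date(request.last_log_term, request.last_log_index,
                                 state.last_log_term, state.last_log_index)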
> Considering it basically relies on random chance (i.e., who receives the message first) to elect a master, has basically no real way of resolving a conflict in election
This is mostly true. The PDF slides I link to below recommend that the election timeout be much greater than the broadcast time, the idea being that things should work out in the long run.
> I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)
What happens if (for instance) a 4-node cluster splits into two 2-node clusters (e.g., a network fault between two data centers)? Does each cluster choose a leader? How is "majority" calculated? Is the Raft protocol unable to handle half of its nodes being taken down?
What happens if two clusters break off, both choose a leader (if it's possible), both get new writes, and then both clusters come back together?
> I'd love to hear of alternatives.
I know of no protocols per se, but as for implementations of a master-slave protocol, there's MongoDB's replica-set algorithm (one notable difference is that each node can have a priority; see the sketch below).
There are also master-master implementations (such as Cassandra's) that require no election and, IMO, serve more interesting use cases.
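For reference, a minimal sketch of the priority feature mentioned above, configured through pymongo (the host names and priority values are made up for illustration). A higher priority biases elections toward that member; priority 0 means it can never become primary:

    from pymongo import MongoClient

    # Connect directly to one of the would-be members (placeholder host).
    client = MongoClient("mongo1.example.com", 27017, directConnection=True)

    # replSetInitiate takes the replica-set config; `priority` biases
    # elections toward particular members.
    client.admin.command("replSetInitiate", {
        "_id": "rs0",
        "members": [
            {"_id": 0, "host": "mongo1.example.com:27017", "priority": 2},
            {"_id": 1, "host": "mongo2.example.com:27017", "priority": 1},
            {"_id": 2, "host": "mongo3.example.com:27017", "priority": 0},
        ],
    })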
> What happens if (for instance) a 4-node cluster splits into two 2-node clusters (e.g., a network fault between two data centers)? Does each cluster choose a leader?
A Raft cluster must have an odd number of nodes.
> How is "majority" calculated?
ceil(nodes/2).
> Is the Raft protocol unable to handle half of its nodes being taken down? What happens if two clusters break off, both choose a leader (if it's possible), both get new writes, and then both clusters come back together?
> A Raft cluster must have an odd number of nodes.
Why must a Raft cluster have an odd number of nodes?
> > How is "majority" calculated?
> ceil(nodes/2).
A majority is defined as having greater than half the votes. I.e., you need floor(nodes / 2) + 1 votes, or more simply votes > nodes / 2. Even in an even-sized cluster that can hold true for at most one node, so it cannot cause splits.
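In code form (a trivial Python sketch), the threshold for any cluster size is:

    def majority(nodes: int) -> int:
        # Smallest number of votes strictly greater than half the cluster.
        return nodes // 2 + 1

    assert majority(3) == 2
    assert majority(4) == 3  # an even cluster just raises the bar
    assert majority(5) == 3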
Not sure I understand. A node in each split cluster would need at least 3 votes to be elected leader. Hence no node can be elected leader, since both split clusters have strictly fewer than 3 nodes.
Theorem. With 2n + 1 nodes, there cannot be two separate majorities after a net split.
Proof. By way of contradiction, assume there are two separate majorities. Each would contain at least ceil((2n + 1)/2) = n + 1 nodes, implying at least 2(n + 1) = 2n + 2 nodes in the system in total, a contradiction.
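A quick brute-force check of the same fact (a Python sketch; this also covers even-sized clusters):

    def majority(n: int) -> int:
        return n // 2 + 1

    # For every cluster size and every way it can split into two parts,
    # verify both sides can never hold a majority simultaneously.
    for n in range(1, 100):
        for left in range(n + 1):
            right = n - left
            assert not (left >= majority(n) and right >= majority(n))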
> What happens if (for instance) a 4-node cluster splits into two 2-node clusters (e.g., a network fault between two data centers)? Does each cluster choose a leader?
No majority is possible here.
> How is "majority" calculated?
The definition of majority is more than half of the set (that is the meaning of the word). If you have four members, 3 is the smallest number greater than 4 / 2.
It does seem like the election timeout needs to be way longer than 150ms. The (default) TCP min_rto parameter on Linux is 200ms. A single lost packet could cause an election with the timeout from the paper. You'd either need to use longer timeouts or reduce min_rto to something sensible. And of course the equation changes for geographically distant servers.
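As a sketch of that tuning (Python; the constants here are assumptions for illustration, not recommendations), the election timeout is drawn at random from a range whose floor sits well above both the broadcast time and the transport's retransmission timeout:

    import random

    BROADCAST_TIME_MS = 20   # assumed round-trip time to reach a majority
    TCP_MIN_RTO_MS = 200     # Linux default minimum retransmission timeout

    # Keep the floor above anything a single lost packet can cause, and
    # randomize so simultaneous candidates rarely tie repeatedly.
    ELECTION_TIMEOUT_MIN_MS = max(10 * BROADCAST_TIME_MS, 2 * TCP_MIN_RTO_MS)
    ELECTION_TIMEOUT_MAX_MS = 2 * ELECTION_TIMEOUT_MIN_MS

    def next_election_timeout_ms() -> float:
        return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)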
Highly recommend the PDF slides here, as they explain it better than I can: https://ramcloud.stanford.edu/~ongaro/userstudy/ (there's also a YouTube talk here: https://www.youtube.com/watch?v=YbZ3zDzDnrw)
> Considering all that, this protocol doesn't seem very interesting (from a use-case point of view).
I'd love to hear of alternatives.