> and does not address the situation of two nodes having conflicting sets of data (for instance from network partition).
I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)
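To make the election rule concrete, here is a minimal sketch (Python, with hypothetical field names, not taken from any reference implementation) of the check a voter performs before granting its vote. The "at least as up-to-date" comparison is what prevents a node missing committed entries from collecting a majority:

    def log_is_up_to_date(candidate_last_term, candidate_last_index,
                          my_last_term, my_last_index):
        # Raft's election restriction: grant a vote only if the candidate's
        # log is at least as complete as the voter's own. Logs are compared
        # by the term of their last entry first, then by length.
        if candidate_last_term != my_last_term:
            return candidate_last_term > my_last_term
        return candidate_last_index >= my_last_index

    def grant_vote(request, state):
        # `request` carries the candidate's last log position; `state` is
        # this node's persistent state (both hypothetical structures).
        if request.term < state.current_term:
            return False  # stale candidate from an old term
        if state.voted_for not in (None, request.candidate_id):
            return False  # at most one vote per term
        return log_is_up_to_date(request.last_log_term, request.last_log_index,
                                 state.last_log_term, state.last_log_index)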
> Considering it basically relies on random chance (i.e., who receives the message first) to elect a master, has basically no real way of resolving a conflict in election
This is mostly true. The PDF slides I link to below recommend that the election timeout be much greater than the broadcast time, the idea being that things should work out in the long run.
> I believe it does address this. Each log entry is either committed or not; an entry can only be committed if it has been replicated to a majority of nodes. Any node that lacks a committed entry cannot be elected master because of the election rules: a node will not vote for another node less complete than itself. Since a committed entry has been replicated to a majority, a node lacking that entry cannot receive a majority of the votes. (Thus the committed log entries will always be the same on all nodes (though some may be behind, and may only have a subset), which is the purpose of the protocol.)
What happens if (for instance) a 4-node cluster splits into two 2-node clusters (e.g., a network fault between two data centers)? Does each cluster choose a leader? How is "majority" calculated? Is the Raft protocol unable to handle half of its nodes being taken down?
What happens if two clusters break off, both choose a leader (if it's possible), both get new writes, and then both clusters come back together?
> I'd love to hear of alternatives.
I know of no protocols per se, but as for implementations of a master-slave protocol, there's MongoDB's replica-set algorithm (one notable difference is that each node can have a priority; see the sketch below).
There are also master-master implementations (such as Cassandra's) that require no election and, IMO, serve more interesting use cases.
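For reference, a minimal sketch of the priority feature mentioned above, configured through pymongo (the host names and priority values are made up for illustration). A higher priority biases elections toward that member; priority 0 means it can never become primary:

    from pymongo import MongoClient

    # Connect directly to one of the would-be members (placeholder host).
    client = MongoClient("mongo1.example.com", 27017, directConnection=True)

    # replSetInitiate takes the replica-set config; `priority` biases
    # elections toward particular members.
    client.admin.command("replSetInitiate", {
        "_id": "rs0",
        "members": [
            {"_id": 0, "host": "mongo1.example.com:27017", "priority": 2},
            {"_id": 1, "host": "mongo2.example.com:27017", "priority": 1},
            {"_id": 2, "host": "mongo3.example.com:27017", "priority": 0},
        ],
    })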
> What happens if (for instance) a 4-node cluster splits into two 2-node clusters (e.g., a network fault between two data centers)? Does each cluster choose a leader?
A Raft cluster must have an odd number of nodes.
> How is "majority" calculated?
ceil(nodes/2).
> Is the Raft protocol unable to handle half of its nodes being taken down? What happens if two clusters break off, both choose a leader (if it's possible), both get new writes, and then both clusters come back together?
> A Raft cluster must have an odd number of nodes.
Why must a Raft cluster have an odd number of nodes?
> > How is "majority" calculated?
> ceil(nodes/2).
A majority is defined as having greater than half the votes. I.e., you need floor(nodes / 2) + 1 votes, or more simply votes > nodes / 2. Even in an even-sized cluster that can hold true for at most one node, so it cannot cause splits.
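In code form (a trivial Python sketch), the threshold for any cluster size is:

    def majority(nodes: int) -> int:
        # Smallest number of votes strictly greater than half the cluster.
        return nodes // 2 + 1

    assert majority(3) == 2
    assert majority(4) == 3  # an even cluster just raises the bar
    assert majority(5) == 3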
Not sure I understand. A node in each split cluster would need at least 3 votes to be elected leader. Hence no node can be elected leader, since both split clusters have strictly fewer than 3 nodes.
Theorem. With 2n + 1 nodes, there cannot be two separate majorities after a net split.
Proof. By way of contradiction, assume there are two separate majorities. Each would contain at least ceil((2n + 1)/2) = n + 1 nodes, implying at least 2(n + 1) = 2n + 2 nodes in the system in total, a contradiction.
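A quick brute-force check of the same fact (a Python sketch; this also covers even-sized clusters):

    def majority(n: int) -> int:
        return n // 2 + 1

    # For every cluster size and every way it can split into two parts,
    # verify both sides can never hold a majority simultaneously.
    for n in range(1, 100):
        for left in range(n + 1):
            right = n - left
            assert not (left >= majority(n) and right >= majority(n))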
> What happens if (for instance) a 4-node cluster splits into two 2-node clusters (e.g., a network fault between two data centers)? Does each cluster choose a leader?
No majority is possible here.
> How is "majority" calculated?
The definition of majority is more than half of the set (that is the meaning of the word). If you have four members, 3 is the smallest number greater than 4 / 2.
It does seem like the election timeout needs to be way longer than 150ms. The (default) TCP min_rto parameter on Linux is 200ms. A single lost packet could cause an election with the timeout from the paper. You'd either need to use longer timeouts or reduce min_rto to something sensible. And of course the equation changes for geographically distant servers.
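As a sketch of that tuning (Python; the constants here are assumptions for illustration, not recommendations), the election timeout is drawn at random from a range whose floor sits well above both the broadcast time and the transport's retransmission timeout:

    import random

    BROADCAST_TIME_MS = 20   # assumed round-trip time to reach a majority
    TCP_MIN_RTO_MS = 200     # Linux default minimum retransmission timeout

    # Keep the floor above anything a single lost packet can cause, and
    # randomize so simultaneous candidates rarely tie repeatedly.
    ELECTION_TIMEOUT_MIN_MS = max(10 * BROADCAST_TIME_MS, 2 * TCP_MIN_RTO_MS)
    ELECTION_TIMEOUT_MAX_MS = 2 * ELECTION_TIMEOUT_MIN_MS

    def next_election_timeout_ms() -> float:
        return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)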
Highly recommend the PDF slides here, as they explain it better than I can: https://ramcloud.stanford.edu/~ongaro/userstudy/ (there's also a YouTube talk here: https://www.youtube.com/watch?v=YbZ3zDzDnrw)
> Considering all that, this protocol doesn't seem very interesting (from a use-case point of view).
I'd love to hear of alternatives.