Distributed Election

Distributed Election Algorithm#

Content from

Geek Time Column: 《Principles and Algorithm Analysis of Distributed Technology》

Theoretical Foundations of Distributed Systems - Elections, Majorities, and Leases

Election is a common problem in distributed system practice. By breaking the peer relationship between nodes, the elected leader (also called master or coordinator) helps achieve transaction atomicity and improve decision-making efficiency. The concept of Quorum helps us reach decision consistency in the case of network partitioning and assists in selecting a unique leader in leader election scenarios. The central idea of Lease is that only one node can obtain the lease and become the leader during each lease duration, and the lease must be reissued upon expiration, ensuring that at most one leader exists at any given time, thus avoiding the dual-master problem that can arise from network congestion or momentary disconnections when only using a heartbeat mechanism.

Bully Algorithm (Election)#

The Bully algorithm is the most common election algorithm, requiring each node to have a corresponding ID, with the node having the highest ID becoming the leader. If the leader crashes, the node with the next highest ID is re-elected as the leader.

The Bully algorithm uses the following 3 types of messages during the election process:

Election message, used to initiate an election;
Alive message, a response to the Election message;
Victory message, a declaration of sovereignty sent by the successfully elected master node to other nodes.

Election Process#

Each node in the cluster determines whether its ID is the largest among the currently alive nodes. If it is, it directly sends a Victory message to other nodes, declaring its sovereignty;
If its ID is not the largest, it sends Election messages to all nodes with a larger ID and waits for responses from other nodes;
If this node does not receive Alive messages from other nodes within a given time frame, it considers itself the master node and sends a Victory message to other nodes, declaring itself the master;
If it receives an Alive message from a node with a larger ID, it waits for other nodes to send Victory messages; if this node receives an Election message from a node with a smaller ID, it replies with an Alive message, informing other nodes that it has a larger ID and that a re-election is needed.

Advantages and Disadvantages, and Application Scenarios#

Advantages:
The Bully algorithm's selection is particularly domineering and simple; whoever is alive and has the largest ID is the master node, and other nodes must unconditionally obey. The advantages of this algorithm are its fast election speed, low algorithm complexity, and ease of implementation.

Disadvantages:

It requires each node to have global node information, resulting in a significant amount of additional information storage;
Additionally, any new node with an ID larger than the current master node or a node that recovers from a failure may trigger a re-election, becoming the new master node. If this node frequently joins and leaves the cluster, it can lead to frequent master changes.

Application Scenario:

MongoDB's replica set failover feature

In MongoDB's distributed election, the last operation timestamp of a node is used to represent its ID. The node with the most recent timestamp has the largest ID, meaning that the most recently updated, alive node is the master node.

Raft Algorithm (Majority)#

The Raft algorithm is a typical majority voting election algorithm, with the core idea being "the minority follows the majority," where the node with the most votes becomes the master.

In the Raft algorithm election, there are 3 roles for cluster nodes:

Leader, the master node, with only one Leader at any given time, responsible for coordinating and managing other nodes;
Candidate, the candidate, where any node can become a Candidate, and nodes in this role can be elected as the new Leader;
Follower, the followers of the Leader, who cannot initiate elections.

Election Process#

At initialization, all nodes are in the Follower state.
When starting an election, all nodes transition from Follower to Candidate state and send election requests to other nodes.
Other nodes respond to whether they agree to become the master based on the order of received election requests. It is important to note that in each round of elections, a node can only cast one vote.
If the node that initiated the election request receives more than half of the votes, it becomes the master node, transitioning its state to Leader, while the states of other nodes change from Candidate to Follower. The Leader node and Follower nodes periodically send heartbeat messages to check if the master node is alive.
When the Leader node's term ends, meaning other servers begin the next round of election cycles, the Leader's state transitions from Leader to Follower, entering a new round of elections.

Advantages and Disadvantages, and Application Scenarios#

Advantages:

Fast election speed, low algorithm complexity, and easy implementation.
Stability is better than the Bully algorithm because when new nodes join or nodes recover from failure, it triggers an election but does not necessarily lead to a master change unless the new or recovered node receives more than half of the votes.

Disadvantages:
It requires that every node in the system can communicate with each other and that more than half of the votes are needed to successfully elect a master, resulting in a large communication volume.

Application Scenario:

Google’s open-source Kubernetes uses the etcd component, which employs the Raft algorithm for leader election and consistency.

ZAB Algorithm (Majority)#

The core of the ZAB election algorithm is "the minority follows the majority, with nodes having larger IDs prioritized to become the master." Therefore, during the election process, (vote_id, vote_zxID) is used to indicate which node is being voted for, where vote_id represents the ID of the voted node and vote_zxID represents the server zxID of the voted node. The principle for electing a master in the ZAB algorithm is that the node with the largest server_zxID becomes the Leader; if server_zxID is the same, the node with the largest server_id becomes the Leader.

When using the ZAB algorithm for elections, each node in the cluster has 3 roles:

Leader, the master node;
Follower, follower nodes;
Observer, observers with no voting rights.

During the election process, nodes in the cluster have 4 states:

Looking state, the election state. When a node is in this state, it believes there is currently no Leader in the cluster, so it enters the election state.
Leading state, the leader state, indicating that a master has been elected and the current node is the Leader.
Following state, the follower state, where after a master has been elected, the states of other non-master nodes update to Following, indicating their following of the Leader.
Observing state, the observer state, indicating that the current node is an Observer, taking a wait-and-see attitude, with no voting or election rights.

Election Process#

When the system starts, all 3 servers are currently voting in the first round, i.e., epoch=1, and zxID is 0. At this point, each server nominates itself and broadcasts the voting information.
Based on the judgment rules, since the epochs and zxIDs of the 3 servers are the same, the server_ids are compared, and the larger one is the nominated object. Therefore, Server 1 and Server 2 change their vote_id to 3, update their voting boxes, and re-broadcast their votes.
At this point, all servers in the system have nominated Server 3, so Server 3 is elected as the Leader, in the Leading state, sending heartbeat messages to other servers and maintaining connections; Server 1 and Server 2 are in the Following state.

Advantages and Disadvantages, and Application Scenarios#

Advantages:

High algorithm performance, with no special requirements on the system.
The stability of the election algorithm is relatively good; when new nodes join or nodes recover from failure, it triggers an election but does not necessarily lead to a master change.

Disadvantages:

Using a broadcast method to send information, if there are n nodes, each node broadcasting simultaneously results in n*(n-1) messages in the cluster, which can easily lead to a broadcast storm (similar to the "signaling storm" in distributed mutual exclusion).
In addition to voting, it also adds comparisons of node IDs and data IDs, meaning that it is necessary to know the IDs and data IDs of all nodes, thus increasing the election time.

Application Scenario:

Designed for implementing distributed coordination functions for ZooKeeper.

Summary Analysis#

Comparison Analysis of Three Algorithms#

Mind Map Summary#

Thought Expansion#

1. Why do "majority" election algorithms typically use an odd number of nodes instead of an even number?#

There may be cases where two nodes receive half of the votes each, and using an even number of nodes would prevent a master from being elected, necessitating a re-vote. However, even with a re-vote, the probability of two nodes having the same number of votes remains high.

2. What is the relationship between distributed elections and consistency?#

Consistency refers to whether multiple replicas can have the same value at the same time. Distributed elections are usually the basis for achieving consistency by electing a master node to coordinate and manage other nodes, thereby ensuring the orderly operation of other nodes and consistency among all nodes.

3. Is there a scenario where a cluster has dual masters?#

The dual-master situation generally arises due to network failures, such as network partitioning (where the cluster forms two separate network clusters due to network disconnection at a certain moment). During the period of dual masters, if both provide services, it may lead to data inconsistency within the cluster. Therefore, it is necessary to decide whether to allow services under dual-master conditions based on the business's tolerance for data inconsistency.

Associative Memory: The phenomenon of brain split in clusters, reference: