Distributed Mutual Exclusion

Distributed Mutual Exclusion#

Content from Geek Time column: “Principles and Algorithm Analysis of Distributed Technology”

Mutual exclusion refers to a condition where a resource can only be occupied by one unit at the same time. In distributed systems, this exclusive access to resources is called Distributed Mutual Exclusion, and the shared resource that is accessed exclusively is called Critical Resource.

Several implementations of distributed mutual exclusion:

Centralized Algorithm#

Related to Yarn's resource scheduling mechanism

The system introduces a coordinator role. Each program sends a request to the coordinator when requesting resources. The coordinator then determines whether the resource is free. If it is free, it is directly allocated to the program; if it is occupied, the request is queued, similar to a request queue for resource allocation. However, there are two problems:

Single point of failure: If the coordinator fails, all programs will be unable to access the critical resource, rendering the entire system unavailable. Therefore, additional mechanisms are needed to ensure the normal operation of the coordinator (it feels similar to zookeeper).
The coordinator often becomes a performance bottleneck for the system, as the number of messages processed by the coordinator increases linearly with the number of programs needing access to the critical resource.

The diagram of the centralized algorithm is shown below:

Distributed Algorithm#

The file modification mechanism in HDFS uses this algorithm.

Unlike the centralized algorithm, the principle of the distributed algorithm is that when a program requests a resource, it first sends requests to all other programs in the system. If the other programs do not occupy the resource, they return an agreement. Only after receiving agreement messages from all other programs can the program access the critical resource.

This algorithm has low availability for two main reasons:

As the number of programs in the system increases, the number of messages grows exponentially with the number of programs needing access to the critical resource (2n * (n-1)), which can lead to a “signaling storm,” where the requests received by a program exceed its processing capacity, preventing normal business operations.
We know that ensuring message reliability in computer networks is also difficult, with issues such as message loss and delayed arrival. In distributed algorithms, similar issues arise. For example, if a program fails and cannot send a response message agreeing to request resources or delays its response, other programs may end up waiting, causing system blockage and unavailability.

The distributed algorithm allows each program to access resources fairly in chronological order based on the “first come, first served” and “unanimous voting” mechanisms. It is simple, straightforward, and easy to implement, suitable for scenarios where critical resource usage is infrequent and the system scale is small.

Token Ring Algorithm#

The diagram of the algorithm is shown below:

As shown, programs are arranged in a circular structure, and tokens are passed in a loop. The program that receives the token has the right to access the critical resource. After completing the access, it passes the token to the next program; if the program does not need to access the critical resource, it directly passes the token to the next program.

ps: Initially, I thought of the rate-limiting mechanism's token bucket algorithm, but they are actually different.

Summary & Thoughts#

Thoughts:#

The following thoughts may have issues; corrections are welcome.

Centralized Algorithm#

Improvement idea: Refer to the Redis cluster communication model, distributing a large number of requests to different masters through hash keys to handle a large volume of requests, with each master supported by a small cluster of master-slave nodes to ensure single point failure.

Distributed Algorithm#

Set a timeout; if a node fails and does not respond within the timeout limit, remove it from the system or ignore it directly.

Token Ring Algorithm#

Improvement idea: List weights based on participant usage frequency and select the next participant using a smooth weighted polling algorithm. It is also necessary to ensure that if a node in the ring fails, it is promptly removed to ensure the token is passed among normal nodes.

Why can't traditional mutual exclusion methods on a single machine be used in a distributed environment?#

Mutual exclusion on a single machine can use semaphores, lock mechanisms, thread synchronization, etc., which can achieve mutual exclusion at the JVM level or single machine hardware level. In a distributed environment, since each node runs on different servers and is unaware of each other, additional means are needed to ensure mutual exclusion access among nodes.