Distributed Computing#
Content from:
- Geek Time Column: “Principles and Algorithm Analysis of Distributed Technology”
MapReduce Computing Model (Divide and Conquer)#
Divide and Conquer Concept#
Divide and conquer refers to breaking a complex, hard-to-solve problem down into smaller, simpler, or directly solvable subproblems. These subproblems are independent of one another and have the same form as the original problem; they are solved recursively, and their solutions are then combined to obtain the solution to the original problem.
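A minimal sketch of the idea in Java, recursively summing an array (the class name, method, and base-case threshold are illustrative, not from the source):

```java
public class DivideAndConquerSum {
    static long sum(long[] a, int lo, int hi) {
        if (hi - lo <= 1) {                       // small enough: solve directly
            return hi > lo ? a[lo] : 0;
        }
        int mid = (lo + hi) / 2;                  // divide into two independent halves
        return sum(a, lo, mid) + sum(a, mid, hi); // solve recursively, then combine
    }

    public static void main(String[] args) {
        long[] a = {3, 1, 4, 1, 5, 9, 2, 6};
        System.out.println(sum(a, 0, a.length)); // 31
    }
}
```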
Computing Process#
The entire MapReduce workflow can be summarized in six stages: Input, Splitting, Mapping, Shuffling, Reducing, and Final Result, where Shuffling routes the intermediate key-value pairs with the same key to the same reducer.
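As a rough single-machine illustration of these stages, here is a word-count sketch in Java, the classic MapReduce example. The input string and all names are illustrative, and `groupingBy` merely stands in for the distributed Shuffling stage:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {
    public static void main(String[] args) {
        String input = "deer bear river car car river deer car bear"; // Input
        Map<String, Long> counts = Arrays.stream(input.split(" "))    // Splitting
                .map(word -> Map.entry(word, 1L))                     // Mapping: emit (word, 1)
                // groupingBy plays the role of Shuffling here: pairs with the
                // same key are routed to the same reducer...
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        // ...and summingLong is the Reducing step
                        Collectors.summingLong(Map.Entry::getValue)));
        System.out.println(counts); // Final Result, e.g. {bear=2, car=3, deer=2, river=2}
    }
}
```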
Fork-Join Computing Model#
Fork-Join is the native multithreaded parallel processing framework provided by Java (in java.util.concurrent), and it adopts a thread-level divide-and-conquer computing model. It exploits multi-core CPUs by recursively splitting a task into multiple small tasks that execute in parallel on multiple processor cores; this splitting is the Fork operation. Once the small tasks complete, their results are merged to obtain the result of the original task; this merging is the Join operation.
Fork-Join does not scale massively and is only suitable for running inside a single Java Virtual Machine. Although the small tasks run on different processor cores, they can communicate with one another, and an idle thread can "steal" subtasks from another thread's queue (work stealing).
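A minimal sketch using Java's actual Fork-Join API (`ForkJoinPool` and `RecursiveTask` from `java.util.concurrent`); the task, threshold, and data are illustrative:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // illustrative cutoff for "small task"
    private final long[] data;
    private final int lo, hi;

    public SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {            // small task: compute directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                           // Fork: schedule the left half asynchronously
        long rightResult = right.compute();    // compute the right half in the current thread
        return left.join() + rightResult;      // Join: merge the subtask results
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1);
        long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(total); // 1000000
    }
}
```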
Stream Computing Model#
In the MapReduce model, the task process exits once the task completes, so it is a short-task model. Starting and stopping task processes is time-consuming, however, which is why stream data processing, with its high real-time requirements, usually relies on the Stream computing model instead.
Stream Working Principle#
Stream computing emphasizes real-time processing: data is processed the moment it is generated. After a piece of data has been processed, it is serialized into a cache and immediately transmitted over the network to the next node for further processing, rather than waiting for the cache to fill up as in MapReduce. To preserve real-time behavior, stream computing does not persist data; it flows forward like water.
Computing Steps#
- Submit Stream Computing Job: the computing logic of a stream job must be defined in advance and cannot be changed while the job is running; changing it requires resubmitting the job.
- Load Stream Data for Stream Computing: once started, a stream computing job stays in a waiting state until events trigger it. When a small batch of data arrives, the system immediately executes the computing logic and quickly produces results.
- Continuously Output Computing Results: as soon as the results for a small batch are available, they can be written to online/batch systems without waiting for the overall computation to finish, so results are displayed in real time (a minimal sketch follows this list).
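A minimal single-process sketch of this working mode in Java, with a `BlockingQueue` standing in for the network source; all names and the computing logic are illustrative:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StreamJob {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> source = new LinkedBlockingQueue<>();

        // The job is submitted once with fixed logic, then waits for event triggers.
        Thread job = new Thread(() -> {
            try {
                while (true) {
                    String event = source.take();            // blocks until data arrives
                    String result = event.toUpperCase();     // the predefined computing logic
                    System.out.println("result: " + result); // emit downstream immediately
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        job.setDaemon(true);
        job.start();

        source.put("click");  // events arrive one at a time...
        source.put("scroll"); // ...and each is processed the moment it lands
        Thread.sleep(100);    // give the job time to process before the JVM exits
    }
}
```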
Actor Computing Model#
The Actor model is a model of distributed parallel computation. It defines its own set of rules for the internal computing logic of an Actor and for the communication between multiple Actors. In the Actor model, each Actor is equivalent to a component in the system and is a basic unit of computation.
The three elements of the Actor model are state, behavior, and message, with a popular equation: Actor model = (state + behavior) + message.
Actor Workflow#
When Actor A and Actor B need the Function logic inside Actor C to run, they send messages to Actor C. Actor C's message queue (its Mailbox) stores the messages from Actor A and Actor B, and Actor C then executes the Function in the order in which the messages arrived.
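A minimal sketch of this workflow in Java, with a `BlockingQueue` as the Mailbox and a single thread draining it one message at a time; the actor's state and the message format are illustrative:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CounterActor {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private int state = 0; // private state: only this actor's own thread touches it

    public CounterActor() {
        Thread loop = new Thread(() -> {
            try {
                while (true) {
                    String msg = mailbox.take(); // one message at a time: a natural mutex
                    state++;                     // behavior: update state per message
                    System.out.println("from " + msg + ", state=" + state);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loop.setDaemon(true);
        loop.start();
    }

    // Senders never block waiting for a reply; they enqueue and move on.
    public void send(String message) {
        mailbox.offer(message);
    }

    public static void main(String[] args) throws InterruptedException {
        CounterActor actorC = new CounterActor();
        actorC.send("actorA"); // Actor A's message
        actorC.send("actorB"); // Actor B's message; processing order = queue order
        Thread.sleep(100);     // give the actor time to drain its mailbox
    }
}
```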
Actor Advantages and Disadvantages Analysis#
Advantages#
- Achieves a Higher Level of Abstraction. Actors are similar to OOP objects in that they encapsulate state and behavior. However, Actors communicate asynchronously, so multiple Actors can run independently without interfering with one another, avoiding the contention over shared state found in OOP.
- Non-blocking. In the Actor model, Actors communicate asynchronously, so when one Actor sends a message to another, it does not need to wait for a response and can continue running other tasks locally after sending the message. In other words, the Actor model avoids blocking by introducing a message-passing mechanism.
- No Need for Locks. An Actor can only read one message from the Mailbox at a time, meaning that an Actor can only process one message at a time, acting as a natural mutex, eliminating the need for additional code locking.
- High Concurrency. Each Actor only needs to process messages from its local Mailbox, allowing multiple Actors to work in parallel, thereby enhancing the overall parallel processing capability of the distributed system.
- Easy to Scale. Each Actor can create multiple Actors, reducing the workload of a single Actor. When a local Actor cannot handle the load, it can start an Actor on a remote node and forward messages there.
Disadvantages#
- The Actor model provides modularity and encapsulation but lacks inheritance and layering. Even when multiple Actors share common logic or code, that code must be duplicated in each Actor, so reusability is low, and a change in business logic can force a rewrite of large amounts of code.
- The Actor model allows Actors to be created dynamically, so the behavior of the overall system changes constantly, which makes it hard to engineer and reason about. In addition, increasing the number of Actors increases system overhead.
- The Actor model is not suitable for systems with strict requirements on message-processing order. Because messages in the Actor model are passed asynchronously, the execution order of messages cannot be guaranteed. Blocking could be introduced to enforce ordering, but it would severely hurt the Actor model's task-processing efficiency.
Applications of the Actor Model#
- Akka
- Quasar (Java)
- Erlang/OTP
Summary of the Actor model: Actors encapsulate state and behavior in independent units that interact only through asynchronous messages, which yields lock-free, non-blocking concurrency at the cost of code reuse and message-ordering guarantees.
Pipeline Computing Model#
Pipelining in computers is a technique that splits each instruction into multiple steps so that the steps of different instructions can overlap, achieving parallel processing of several instructions. Modern CPUs adopt a pipelined design, dividing an instruction into five stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). The pipeline computing model in the distributed field is similar: it splits a large task into multiple steps, with different steps executed by different processes.
Computing Process#
Take data preprocessing in machine learning as an example. Suppose there are 5 samples, and preprocessing each sample involves three steps that must run in order: data deduplication → handling missing values → data normalization. The preprocessing task can therefore be divided into three subtasks. If there are three nodes, Node 1 performs data deduplication, Node 2 handles missing values, and Node 3 performs data normalization. When Node 1 finishes processing Sample 1's data and hands it to Node 2, Node 1 can immediately start on Sample 2 while Node 2 works on Sample 1, and so on, so the steps overlap and multiple samples are processed in parallel (see the sketch below).
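A minimal sketch of this pipeline in Java, with threads standing in for the three nodes and `BlockingQueue`s carrying each sample from one step to the next; the step transformations are placeholders, not real preprocessing logic:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

public class Pipeline {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> toDedup = new LinkedBlockingQueue<>();
        BlockingQueue<String> toFill = new LinkedBlockingQueue<>();
        BlockingQueue<String> toNorm = new LinkedBlockingQueue<>();

        stage("node1-dedup", toDedup, toFill, s -> s + "+dedup");       // data deduplication
        stage("node2-missing", toFill, toNorm, s -> s + "+filled");     // handle missing values
        stage("node3-normalize", toNorm, null, s -> s + "+normalized"); // data normalization

        for (int i = 1; i <= 5; i++) toDedup.put("sample" + i); // 5 samples enter the pipe
        Thread.sleep(200); // let the pipeline drain before the JVM exits
    }

    // Each stage loops forever: take from the upstream queue, apply its step,
    // and pass the result downstream, so the three steps overlap across samples.
    static void stage(String name, BlockingQueue<String> in,
                      BlockingQueue<String> out, UnaryOperator<String> step) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    String item = in.take(); // previous step's output is this step's input
                    String result = step.apply(item);
                    if (out != null) out.put(result);
                    else System.out.println(name + " done: " + result);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
    }
}
```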
Applications of the Pipeline Computing Model#
- Machine learning pipeline tasks, for example in TensorFlow
- Apache Beam (not examined in depth, but judging from its introduction, it builds on pipeline-processing ideas)
Summary and Expansion#
Differences Between Stream Computing and Batch Computing#
Batch computing, of which MapReduce is the representative, is a short-task model: it loads a bounded data set that has already been stored, processes it as a whole, and produces results only after the entire computation finishes. Stream computing instead processes unbounded data the moment it is generated: the job runs continuously, nothing is persisted along the way, and results are output immediately, which is why it suits scenarios with high real-time requirements.
What is the difference between the pipeline model and the MapReduce model, both of which involve breaking down large tasks into multiple subtasks?#
- MapReduce is task-granular: a large task is divided into multiple small tasks, each of which executes the same complete set of steps, and the small tasks run in parallel. It is a mode of task parallelism.
- The pipeline computing model is step-granular: a task is broken into multiple steps, each executing different logic, and multiple tasks of the same type are parallelized by overlapping their steps. It is a mode of data parallelism.
Additionally, the relationships between their subtasks (steps) differ:
- In MapReduce, each subtask executes independently without interfering with the others; once all subtasks complete, their results are merged to obtain the overall result. There are no dependencies between subtasks.
- In the pipeline model, multiple subtasks have dependencies, where the output of the previous subtask serves as the input for the next subtask.
In summary, the MapReduce computing model is suitable for task parallel scenarios, while the pipeline computing model is suitable for data parallel processing of tasks of the same type.