Distributed Resource Management and Scheduling#
Distributed Architecture#
Centralized Structure#
One or more servers form a central server: all data within the system is stored on the central server, and all work in the system is first processed by the central server, which performs resource and task scheduling uniformly.
- Google Borg
- Kubernetes
- Mesos
Decentralized Structure#
The execution of services and storage of data are distributed across different server clusters, with communication and coordination between server clusters occurring through message passing. Compared to centralized structures, decentralized structures reduce the pressure on a single or a cluster of computer systems, addressing single point bottlenecks and failures while enhancing system concurrency, making it more suitable for managing large-scale clusters.
- Akka Cluster
- Redis Cluster
- Cassandra Cluster
Scheduling Architecture: Monolithic Scheduler#
The characteristic of a monolithic scheduler is that resource scheduling and job management are both performed within a single process. A typical representative in the open-source community is the Hadoop JobTracker implementation.
Disadvantages
Poor scalability: first, the supported cluster scale is limited; second, new scheduling strategies are difficult to integrate into the existing code. For example, if the scheduler originally supported only MapReduce jobs and now needs to support streaming jobs as well, embedding the streaming scheduling strategy into the monolithic scheduler is a challenging task.
Optimization Plan
Place each scheduling strategy in a separate path (module), with different jobs scheduled by different scheduling strategies. This approach can significantly shorten job response times when the job volume and cluster scale are relatively small, but since all scheduling strategies are still within a centralized component, the overall system scalability does not improve.
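As an illustration of the "separate path per strategy" idea, here is a minimal sketch in Python. The names (MonolithicScheduler, schedule_batch, schedule_streaming) and the memory-only resource model are invented for illustration and are not taken from JobTracker or any real system; note that the cluster state and every strategy still live in one process, which is why overall scalability does not improve.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    name: str
    job_type: str      # e.g. "mapreduce" or "streaming"
    memory_gb: int


@dataclass
class Node:
    name: str
    free_memory_gb: int


def schedule_batch(job: Job, nodes: List[Node]) -> str:
    # Batch strategy: place the job on the node with the most free memory.
    node = max(nodes, key=lambda n: n.free_memory_gb)
    node.free_memory_gb -= job.memory_gb
    return node.name


def schedule_streaming(job: Job, nodes: List[Node]) -> str:
    # Streaming strategy: take the first node that fits the long-running job.
    for node in nodes:
        if node.free_memory_gb >= job.memory_gb:
            node.free_memory_gb -= job.memory_gb
            return node.name
    raise RuntimeError("no node has enough free memory")


class MonolithicScheduler:
    """Single process: cluster state and every scheduling strategy live here."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes
        # One "path" (module) per job type; supporting a new job type still
        # means changing this centralized component.
        self.strategies: Dict[str, Callable[[Job, List[Node]], str]] = {
            "mapreduce": schedule_batch,
            "streaming": schedule_streaming,
        }

    def submit(self, job: Job) -> str:
        return self.strategies[job.job_type](job, self.nodes)


cluster = [Node("n1", 8), Node("n2", 4)]
scheduler = MonolithicScheduler(cluster)
print(scheduler.submit(Job("wordcount", "mapreduce", 2)))    # n1
print(scheduler.submit(Job("clickstream", "streaming", 4)))  # n1 (6GB still free)
```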
Scheduling Architecture: Two-Tier Scheduler#
The two-tier scheduler retains a simplified centralized scheduler, but the scheduling strategies are delegated to various application schedulers. Typical representatives of this type of scheduler are Mesos and YARN.
The responsibilities are split as follows: the first-tier scheduler manages resources and allocates them to frameworks, while the second-tier schedulers, running within the distributed cluster management system, receive the resources allocated by the first tier and match them to their own tasks.
Disadvantages
- Each framework cannot know the real-time resource usage of the entire cluster.
- Uses pessimistic locking, so the granularity of concurrency is small.
Taking Mesos as an example, the Mesos resource scheduler pushes all free resources to only one framework at a time, and it can offer resources to another framework only after that framework has returned its resource usage. There is therefore effectively a global lock in the Mesos resource scheduler, which greatly limits the system's concurrency.
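Below is a rough sketch of this two-tier offer flow, Mesos-style. The class names, the CPU-only resource model, and the single threading.Lock standing in for the global lock are illustrative assumptions, not Mesos's actual API; the point is that offers are serialized, so only one framework at a time sees the free resources.

```python
import threading
from typing import Dict, List, Tuple


class FirstTierScheduler:
    """First tier: owns cluster resources and offers them to frameworks one at a time."""

    def __init__(self, free_cpus: Dict[str, int]) -> None:
        self.free_cpus = free_cpus           # node name -> free CPUs
        self._offer_lock = threading.Lock()  # effectively a global lock: offers are serialized

    def offer_to(self, framework: "Framework") -> None:
        # Push *all* free resources to a single framework; no other framework
        # can receive an offer until this one replies with its resource usage.
        with self._offer_lock:
            accepted = framework.on_offer(dict(self.free_cpus))
            for node, cpus in accepted.items():
                self.free_cpus[node] -= cpus


class Framework:
    """Second tier: matches its own tasks against whatever it was offered."""

    def __init__(self, name: str, tasks: List[Tuple[str, int]]) -> None:
        self.name = name
        self.tasks = tasks                   # (task name, CPUs needed)

    def on_offer(self, offered: Dict[str, int]) -> Dict[str, int]:
        accepted: Dict[str, int] = {}
        for task, need in list(self.tasks):
            for node, free in offered.items():
                if free - accepted.get(node, 0) >= need:
                    accepted[node] = accepted.get(node, 0) + need
                    self.tasks.remove((task, need))
                    print(f"{self.name}: {task} -> {node}")
                    break
        return accepted


rm = FirstTierScheduler({"n1": 4, "n2": 2})
rm.offer_to(Framework("spark", [("stage-0", 2)]))
rm.offer_to(Framework("flink", [("source", 2), ("sink", 2)]))
```

Because the second framework only sees whatever remains after the first one replies, each framework never knows the real-time usage of the whole cluster, matching the two disadvantages listed above.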
Scheduling Architecture: Shared State Scheduler#
This scheduler simplifies the centralized resource scheduling module in the two-tier scheduler into some persistent shared data (state) and verification code for this data. The "shared data" here is essentially the real-time resource usage information of the entire cluster, with typical representatives including Google’s Omega, Microsoft’s Apollo, and Hashicorp’s Nomad container scheduler.
After introducing shared data, the method of concurrent access to that data becomes the core of the system design. For example, Omega adopts the multi-version concurrency control (MVCC) method found in traditional databases, which greatly enhances Omega's concurrency.
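The following sketch illustrates the shared-state idea with a simple version counter standing in for full MVCC; the names (SharedClusterState, try_commit) and the memory-only model are assumptions for illustration, not Omega's actual design. Each scheduler reads the whole cluster state, picks a node globally, and commits optimistically, retrying if another scheduler committed first.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SharedClusterState:
    """Persistent shared data: real-time resource usage plus a version counter."""
    free_memory_gb: Dict[str, int]
    version: int = 0

    def try_commit(self, expected_version: int, node: str, amount: int) -> bool:
        # Verification code: the commit succeeds only if no other scheduler has
        # changed the state since this scheduler read it (optimistic concurrency).
        if expected_version != self.version or self.free_memory_gb[node] < amount:
            return False
        self.free_memory_gb[node] -= amount
        self.version += 1
        return True


def schedule(state: SharedClusterState, scheduler_name: str, need_gb: int) -> None:
    while True:
        snapshot_version = state.version
        # Each scheduler sees the full cluster state and picks a node globally.
        node = max(state.free_memory_gb, key=state.free_memory_gb.get)
        if state.free_memory_gb[node] < need_gb:
            print(f"{scheduler_name}: no node can fit {need_gb}GB")
            return
        if state.try_commit(snapshot_version, node, need_gb):
            print(f"{scheduler_name}: placed {need_gb}GB on {node}")
            return
        # Conflict: another scheduler committed first; re-read the state and retry.


state = SharedClusterState({"n1": 8, "n2": 6})
schedule(state, "batch-scheduler", 4)     # lands on n1
schedule(state, "service-scheduler", 6)   # re-reads the updated state, lands on n2
```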
Resource Allocation Methods#
Two resource allocation methods:
- incremental placement
- all-or-nothing
For example: a task requires 2GB of memory, and a node has only 1GB remaining. If this 1GB is allocated to the task, the task must wait for the node to release another 1GB before it can run. This method is called “incremental placement”, and Hadoop YARN adopts it. If, instead, only nodes with more than 2GB of remaining memory are considered for the task, the method is called “all-or-nothing”, which is used by both Mesos and Omega.
Both methods have their pros and cons. “All-or-nothing” may cause job starvation (tasks with large resource demands may never get the resources they need), while “incremental placement” can lead to resources being idle for long periods, which can also cause job starvation. For example, if a service requires 10GB of memory and a node currently has 8GB remaining, the scheduler allocates these resources to it and waits for other tasks to release 2GB. However, if other tasks run for a very long time and may not release resources in the short term, this service will be unable to run for an extended period.
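A toy comparison of the two allocation methods, under an invented memory-only model; the function names all_or_nothing and incremental_placement are illustrative, not from YARN or Mesos. The comments point out where each method's starvation or idle-resource risk appears.

```python
from typing import Dict, Optional


def all_or_nothing(free_gb: Dict[str, int], need_gb: int) -> Optional[str]:
    # Only consider nodes that can satisfy the full request right now
    # (Mesos/Omega style); large requests may never find such a node.
    for node, free in free_gb.items():
        if free >= need_gb:
            free_gb[node] -= need_gb
            return node
    return None


def incremental_placement(reserved_gb: Dict[str, int],
                          free_gb: Dict[str, int],
                          node: str, need_gb: int) -> bool:
    # Reserve whatever the node has now and wait for the rest (YARN style);
    # the reserved memory sits idle until the remaining resources are freed.
    grant = min(free_gb[node], need_gb - reserved_gb.get(node, 0))
    free_gb[node] -= grant
    reserved_gb[node] = reserved_gb.get(node, 0) + grant
    return reserved_gb[node] >= need_gb   # True once the task can actually start


free = {"n1": 1, "n2": 3}
print(all_or_nothing(dict(free), 2))                   # n2 (n1's 1GB is skipped)

reserved: Dict[str, int] = {}
print(incremental_placement(reserved, free, "n1", 2))  # False: 1GB reserved, still waiting
free["n1"] += 1                                        # another task on n1 releases memory
print(incremental_placement(reserved, free, "n1", 2))  # True: the task can now run
```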
Summary#
Monolithic Scheduling
Managed by a central scheduler that holds the resource information of the entire cluster and handles all task scheduling; every task can only be scheduled through the central scheduler. The advantage of this architecture is that the central scheduler has a view of all node resources in the cluster, so it can make globally optimal scheduling decisions. Its disadvantages are the lack of scheduling concurrency and the central server being a single point of bottleneck, which limits the supported cluster scale and service types and restricts the cluster's scheduling efficiency. It is suitable for small-scale clusters.
Two-Tier Scheduling
Divides resource management and task scheduling into two layers. The first-tier scheduler manages cluster resources and sends available resources to the second-tier schedulers; the second-tier schedulers receive those resources and perform task scheduling. The advantage of this architecture is that it avoids the single point bottleneck of monolithic scheduling, so it can support larger service scales and more service types. Its disadvantage is that a second-tier scheduler usually sees only part of the global resource information, so the task matching algorithm cannot achieve global optimality. It is suitable for medium-scale clusters.
Shared State Scheduling
Multiple schedulers, each of which can see the global resource information of the cluster and perform task scheduling based on this information. Compared to the other two scheduling architectures, the shared state scheduling architecture is suitable for the largest cluster scale.
The advantage of this scheduling architecture is that each scheduler can access the global resource information in the cluster, allowing the task matching algorithm to achieve global optimality. However, because each scheduler can perform task matching globally, multiple schedulers may match the same node simultaneously, leading to resource competition and conflicts.
Although Omega's paper claims to avoid conflicts through an optimistic locking mechanism, in engineering practice, if resource competition issues are not properly handled, resource conflicts may occur, leading to task scheduling failures. In such cases, users need to handle the failed tasks, such as rescheduling and maintaining task scheduling states, further increasing the complexity of task scheduling operations.
The table below summarizes the three scheduling architectures:

| Architecture | Resource visibility | Concurrency | Scheduling result | Suitable scale |
| --- | --- | --- | --- | --- |
| Monolithic | Central scheduler sees all cluster resources | None; central scheduler is a single point of bottleneck | Globally optimal | Small clusters |
| Two-tier | Second-tier schedulers see only the resources offered to them | Avoids the single-point bottleneck, but offers are serialized (e.g. Mesos's global lock) | Not globally optimal | Medium clusters |
| Shared state | Every scheduler sees the global cluster state | Multiple schedulers schedule concurrently; conflicts handled optimistically | Globally optimal matching, though conflicts can cause scheduling failures | Large clusters |