Yes, you are on the right path.

I suggest you read the paper Apache Hadoop YARN: Yet Another Resource Negotiator; its architecture is very similar to your use case.

(figure: YARN architecture diagram, from the paper)

Major components

  • The per-cluster ResourceManager (RM) tracks resource usage and node liveness, enforces allocation invariants, and arbitrates contention among tenants.
  • The ApplicationMaster (AM) is the process that coordinates the application’s execution in the cluster. The AM periodically heartbeats to the RM to affirm its liveness and to update the record of its demand.
  • The NodeManager (NM) is the “worker” daemon in YARN. After registering with the RM, the NM heartbeats its status and receives instructions (including killing unresponsive containers as directed by the RM or the AM). The NM also periodically monitors the health of the physical node. A minimal sketch of this register-and-heartbeat loop follows the list.

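To make the heartbeat mechanics concrete, here is a minimal sketch of a worker daemon that registers with a central manager once and then heartbeats its status on a fixed interval, acting on any instruction piggybacked on the response. This is an assumption-laden illustration of the pattern, not the real YARN API; every class and method name here is hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical worker daemon, in the spirit of YARN's NodeManager.
public class WorkerDaemon {

    // Stand-in for the manager's RPC interface (not YARN's real protocol).
    interface ManagerClient {
        void register(String nodeId, int memMb, int vcores);
        Instruction heartbeat(String nodeId, String health);
    }

    // Instructions the manager can piggyback on a heartbeat response,
    // e.g. "kill the containers I told you about".
    enum Instruction { NONE, KILL_CONTAINERS }

    private final ManagerClient rm;
    private final String nodeId;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    WorkerDaemon(ManagerClient rm, String nodeId) {
        this.rm = rm;
        this.nodeId = nodeId;
    }

    void start() {
        // Register once, advertising the node's configured capacity.
        rm.register(nodeId, 8192, 4);
        // Then heartbeat periodically and act on the response.
        scheduler.scheduleAtFixedRate(() -> {
            Instruction i = rm.heartbeat(nodeId, checkNodeHealth());
            if (i == Instruction.KILL_CONTAINERS) {
                killContainers();
            }
        }, 0, 3, TimeUnit.SECONDS);
    }

    private String checkNodeHealth() { return "HEALTHY"; }   // periodic node health check
    private void killContainers()    { /* terminate local containers */ }
}
```
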
Resource allocation

Excerpt from the paper:

A recent extension of the protocol allows the RM to symmetrically request resources back from an application. This typically happens when cluster resources become scarce and the scheduler decides to revoke some of the resources that were given to an application to maintain scheduling invariants. We use structures similar to ResourceRequests to capture the locality preferences (which could be strict or negotiable). AMs have some flexibility when fulfilling such ’preemption’ requests, e.g., by yielding containers that are less crucial for its work (for e.g. tasks that made only little progress till now), by checkpointing the state of a task, or by migrating the computation to other running containers. Overall, this allows applications to preserve work, in contrast to platforms that forcefully kill containers to satisfy resource constraints. If the application is noncollaborative, the RM can, after waiting for a certain amount of time, obtain the needed resources by instructing the NMs to forcibly terminate containers.

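The AM’s side of this can be pictured as follows: a minimal sketch that, when asked to give back some containers, yields the least crucial ones first (here approximated as the ones with the least progress), checkpointing each before releasing it. The selection policy and all types are assumptions for illustration, not YARN’s actual implementation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical preemption handling in the spirit of a YARN ApplicationMaster.
public class PreemptionHandler {

    // A running container and the fraction of its work already done.
    record Container(String id, double progress) {}

    // Pick which containers to give back when asked for `count` of them:
    // least progress first, so the least work is lost.
    static List<Container> selectForPreemption(List<Container> running, int count) {
        return running.stream()
                .sorted(Comparator.comparingDouble(Container::progress))
                .limit(count)
                .collect(Collectors.toList());
    }

    static void onPreemptionRequest(List<Container> running, int count) {
        for (Container c : selectForPreemption(running, count)) {
            checkpoint(c);   // preserve the task's state so work is not lost
            release(c);      // hand the container back to the RM voluntarily
        }
    }

    static void checkpoint(Container c) { /* persist task state */ }
    static void release(Container c)    { /* return container to the RM */ }
}
```
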
Resource usage

(in the NM) Operators configure it to report memory, CPU, and other resources available at this node and allocated for YARN.

This configuration should be part of the status carried in the heartbeats between the NM and the RM; a sketch of such a status record follows.

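In Hadoop itself, the relevant yarn-site.xml properties are yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores. Below is a sketch of the kind of status record such a heartbeat could carry, pairing the configured totals with what is currently allocated so the RM can compute free capacity per node; the field names are hypothetical, not YARN’s wire format.

```java
// Hypothetical per-heartbeat node status: configured capacity plus
// current allocation, so the manager can track free resources per node.
public record NodeStatus(
        String nodeId,
        int totalMemMb,      // from config, e.g. yarn.nodemanager.resource.memory-mb
        int totalVcores,     // from config, e.g. yarn.nodemanager.resource.cpu-vcores
        int allocatedMemMb,  // sum over containers currently running on this node
        int allocatedVcores,
        boolean healthy) {   // result of the periodic node health check

    int freeMemMb()  { return totalMemMb - allocatedMemMb; }
    int freeVcores() { return totalVcores - allocatedVcores; }
}
```
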
Fault tolerance and availability

  • The RM remains a single point of failure in YARN’s architecture. It recovers from its own failures by restoring its state from a persistent store on initialization. Once the recovery process is complete, it kills all the containers running in the cluster, including live AMs, and then launches new instances of each AM.
  • An AM failure does not affect the availability of the cluster.
  • When an NM fails, the RM detects it by timing out its heartbeat response, marks all the containers running on that node as killed, and reports the failure to all running AMs. A sketch of this timeout-based detection follows the list.
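
A minimal sketch of that detection on the manager side: keep a last-heartbeat timestamp per node and periodically sweep the table against a timeout, treating expired nodes as failed. The timeout value and every name here are illustrative, not YARN’s internals.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical liveness monitor in the spirit of the RM's NM tracking.
public class LivenessMonitor {
    private static final long TIMEOUT_MS = 10 * 60 * 1000;  // illustrative timeout
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a node's heartbeat arrives.
    void onHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    // Run periodically (e.g. from a scheduled task): expire silent nodes.
    void expireDeadNodes() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((nodeId, last) -> {
            if (now - last > TIMEOUT_MS) {
                lastHeartbeat.remove(nodeId);
                markContainersKilled(nodeId);     // containers on that node are gone
                notifyApplicationMasters(nodeId); // report the failure to running AMs
            }
        });
    }

    void markContainersKilled(String nodeId)      { /* update container state */ }
    void notifyApplicationMasters(String nodeId)  { /* inform AMs on their next heartbeat */ }
}
```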
