Untangling Apache Hadoop YARN, Part 1: Cluster and YARN Basics

In this multipart series, fully explore the tangled ball of thread that is YARN.

YARN (Yet Another Resource Negotiator) is the resource management layer for the Apache Hadoop ecosystem. YARN has been available for several releases, but many users still have fundamental questions about what YARN is, what it’s for, and how it works. This new series of blog posts is designed with the following goals in mind:

  • Provide a basic understanding of the components that make up YARN
  • Illustrate how a MapReduce job fits into the YARN model of computation. (Note: although Apache Spark integrates with YARN as well, this series will focus on MapReduce specifically. For information about Spark on YARN, see this post.)
  • Present an overview of how the YARN scheduler works and provide building-block examples for scheduler configuration

The series comprises the following parts:

  • Part 1: Cluster and YARN basics
  • Part 2: Global configuration basics
  • Part 3: Scheduler concepts
  • Part 4: FairScheduler queue basics
  • Part 5: Using FairScheduler queue properties

In this initial post, we’ll cover the fundamentals of YARN, which runs processes on a cluster similarly to the way an operating system runs processes on a standalone computer. Subsequent parts will be released every few weeks.

Cluster Basics (Master/Worker)

A host is the Hadoop term for a computer (also called a node in YARN terminology). A cluster is two or more hosts connected by a high-speed local network. From the standpoint of Hadoop, there can be several thousand hosts in a cluster.

In Hadoop, there are two types of hosts in the cluster.

Figure 1: Master host and Worker hosts

Conceptually, a master host is the communication point for a client program. A master host sends the work to the rest of the cluster, which consists of worker hosts. (In Hadoop, a cluster can technically be a single host. Such a setup is typically used for debugging or simple testing, and is not recommended for a typical Hadoop workload.)

YARN Cluster Basics (Master/ResourceManager, Worker/NodeManager)

In a YARN cluster, each of the two types of hosts runs its own daemon:

  • The ResourceManager is the master daemon that communicates with the client, tracks resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
  • The NodeManager is the worker daemon that launches and tracks processes spawned on worker hosts.

Figure 2: Master host with ResourceManager and Worker hosts with NodeManager

YARN Configuration File

The YARN configuration file is an XML file that contains properties. This file is placed in a well-known location on each host in the cluster and is used to configure the ResourceManager and NodeManager. By default, this file is named yarn-site.xml. The basic properties in this file used to configure YARN are covered in the later sections.
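To make the file format concrete, here is a minimal sketch that embeds a small sample yarn-site.xml in a Python string and reads its properties. The three property names are standard YARN settings, but the values (the host name, 8 vcores, 16384 MB) are invented for this example, and load_properties is a hypothetical helper of ours, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Sample yarn-site.xml content. The values are illustrative assumptions,
# not recommendations for a real cluster.
SAMPLE_YARN_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master01.example.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>16384</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return the <name>/<value> pairs of a Hadoop-style config as a dict."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


props = load_properties(SAMPLE_YARN_SITE)
print(props["yarn.nodemanager.resource.memory-mb"])
```

Note that every value comes back as a string, so anything numeric has to be converted before it is used in arithmetic.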

YARN Requires a Global View

YARN currently defines two resources, vcores and memory. Each NodeManager tracks its own local resources and communicates its resource configuration to the ResourceManager, which keeps a running total of the cluster’s available resources. By keeping track of the total, the ResourceManager knows how to allocate resources as they are requested. (Vcore has a special meaning in YARN. You can think of it simply as a “usage share of a CPU core.” If you expect your tasks to be less CPU-intensive (sometimes called I/O-intensive), you can set the ratio of vcores to physical cores higher than 1 to maximize your use of hardware resources.)
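The bookkeeping described above can be sketched in a few lines of Python. This is a toy model, not YARN code: NodeReport, cluster_totals, and vcores_for are hypothetical names, and the host sizes are made up. It shows each host advertising its vcores and memory, with one I/O-heavy host using a vcore-to-core ratio of 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeReport:
    """What one NodeManager reports about its host (hypothetical type)."""
    host: str
    vcores: int
    memory_mb: int


def vcores_for(physical_cores, ratio):
    """Vcores to advertise for a host, given a vcore-to-physical-core ratio."""
    return int(physical_cores * ratio)


def cluster_totals(reports):
    """Sum the per-node resources into cluster-wide totals."""
    total_vcores = sum(r.vcores for r in reports)
    total_memory_mb = sum(r.memory_mb for r in reports)
    return total_vcores, total_memory_mb


reports = [
    NodeReport("worker01", vcores_for(8, 1.0), 16384),
    NodeReport("worker02", vcores_for(8, 1.0), 16384),
    # Assumed I/O-intensive workload: advertise two vcores per physical core.
    NodeReport("worker03", vcores_for(8, 2.0), 32768),
]
print(cluster_totals(reports))  # (32, 65536)
```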

Figure 3: ResourceManager global view of the cluster

Containers

Containers are an important YARN concept. You can think of a container as a request to hold resources on the YARN cluster. Currently, a container hold request consists of vcores and memory, as shown in Figure 4 (left).

Figure 4: Container as a hold (left), and container as a running process (right)

Once a hold has been granted on a host, the NodeManager launches a process called a task. The right side of Figure 4 shows the task running as a process inside a container. (Part 3 will cover, in more detail, how YARN schedules a container on a particular host.)
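One way to picture a hold is as a subtraction from a host's free resources. The sketch below is a simplified model under our own assumptions: Node, Container, and grant_hold are hypothetical names, and the first-fit placement rule is chosen only for brevity; it is not how the YARN scheduler picks a host.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Free resources on one worker host (hypothetical type)."""
    host: str
    free_vcores: int
    free_memory_mb: int


@dataclass(frozen=True)
class Container:
    """A granted hold: a fixed amount of vcores and memory on one host."""
    host: str
    vcores: int
    memory_mb: int


def grant_hold(nodes, vcores, memory_mb):
    """Place a hold on the first node with enough room; None if none fits."""
    for node in nodes:
        if node.free_vcores >= vcores and node.free_memory_mb >= memory_mb:
            # Reserve the resources so later requests cannot claim them.
            node.free_vcores -= vcores
            node.free_memory_mb -= memory_mb
            return Container(node.host, vcores, memory_mb)
    return None


nodes = [Node("worker01", 1, 2048), Node("worker02", 4, 8192)]
first = grant_hold(nodes, vcores=2, memory_mb=4096)
second = grant_hold(nodes, vcores=4, memory_mb=4096)
print(first)   # hold granted on worker02
print(second)  # None: no host has 4 free vcores left
```

Only after a hold like first has been granted would a task process be launched inside it.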

YARN Cluster Basics (Running Process/ApplicationMaster)

For the next section, two new YARN terms need to be defined:

  • An application is a YARN client program that is made up of one or more tasks (see Figure 5).
  • For each running application, a special piece of code called an ApplicationMaster helps coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run after the application starts.

An application running tasks on a YARN cluster consists of the following steps:

  1. The application starts and talks to the ResourceManager for the cluster:

    Figure 5: Application starting up before tasks are assigned to the cluster

  2. The ResourceManager makes a single container request on behalf of the application:

    Figure 6: Application + allocated container on a cluster

  3. The ApplicationMaster starts running within that container:

    Figure 7: Application + ApplicationMaster running in the container on the cluster

  4. The ApplicationMaster requests subsequent containers from the ResourceManager, which are allocated to run tasks for the application. Those tasks do most of their status communication with the ApplicationMaster started in Step 3:

    Figure 8: Application + ApplicationMaster + task running in multiple containers running on the cluster

  5. Once all tasks are finished, the ApplicationMaster exits. The last container is de-allocated from the cluster.
  6. The application client exits. (The ApplicationMaster launched in a container is more specifically called a managed AM. Unmanaged ApplicationMasters run outside of YARN’s control. Llama is an example of an unmanaged AM.)
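The six steps above can be traced with a toy simulation. Nothing here talks to a real cluster: run_application is a hypothetical function that only records, after each step, how many containers the application holds (one for the ApplicationMaster plus one per task).

```python
def run_application(num_tasks):
    """Return one log line per lifecycle step for a single application."""
    events = []
    live = set()  # identifiers of containers currently held

    def log(message):
        events.append(f"{message} (containers held: {len(live)})")

    log("step 1: client contacts the ResourceManager")
    live.add("am")
    log("step 2: ResourceManager allocates one container")
    log("step 3: ApplicationMaster starts in that container")
    for i in range(num_tasks):
        live.add(f"task-{i}")
    log(f"step 4: ApplicationMaster obtains {num_tasks} task containers")
    for i in range(num_tasks):
        live.discard(f"task-{i}")
    live.discard("am")  # the ApplicationMaster's container is released last
    log("step 5: tasks finish and the ApplicationMaster exits")
    log("step 6: client exits")
    return events


for line in run_application(num_tasks=3):
    print(line)
```

With three tasks, the peak is four containers at step 4, and the count returns to zero once the ApplicationMaster exits.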

MapReduce Basics

In the MapReduce paradigm, an application consists of Map tasks and Reduce tasks. Map tasks and Reduce tasks align very cleanly with YARN tasks.

 

Figure 9: Application + Map tasks + Reduce tasks
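For readers new to the paradigm, here is a minimal in-process sketch of map tasks and reduce tasks using the classic word-count example. The example is ours rather than something taken from the figures, and map_task, shuffle, and reduce_task are hypothetical names. On a real cluster each call to map_task or reduce_task would run as a separate task in its own container.

```python
from collections import defaultdict


def map_task(split):
    """Map task: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs, num_reducers):
    """Group values by key and route each key to one reduce partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            # Simple deterministic partitioner (an assumption of this sketch).
            index = sum(key.encode()) % num_reducers
            partitions[index][key].append(value)
    return partitions


def reduce_task(partition):
    """Reduce task: sum the counts for every key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


splits = ["yarn runs tasks", "tasks run in containers", "yarn tracks containers"]
mapped = [map_task(s) for s in splits]                     # three map tasks
reduced = [reduce_task(p) for p in shuffle(mapped, 2)]     # two reduce tasks
counts = {k: v for part in reduced for k, v in part.items()}
print(counts["yarn"], counts["tasks"], counts["containers"])  # 2 2 2
```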

Putting it Together: MapReduce and YARN

Figure 10 illustrates how the map tasks and the reduce tasks map cleanly to the YARN concept of tasks running in a cluster.

Figure 10: Merged MapReduce/YARN Application Running on a Cluster

In a MapReduce application, there are multiple map tasks, each running in a container on a worker host somewhere in the cluster. Similarly, there are multiple reduce tasks, also each running in a container on a worker host.

Simultaneously on the YARN side, the ResourceManager, NodeManager, and ApplicationMaster work together to manage the cluster’s resources and ensure that the tasks, as well as the corresponding application, finish cleanly.

Conclusion

Summarizing the important concepts presented in this section:

  1. A cluster is made up of two or more hosts connected by an internal high-speed network. Master hosts are a small number of hosts reserved to control the rest of the cluster. Worker hosts are the non-master hosts in the cluster.
  2. In a cluster with YARN running, the master process is called the ResourceManager and the worker processes are called NodeManagers.
  3. The configuration file for YARN is named yarn-site.xml. There is a copy on each host in the cluster. It is required by the ResourceManager and NodeManager to run properly. YARN keeps track of two resources on the cluster, vcores and memory. The NodeManager on each host keeps track of the local host’s resources, and the ResourceManager keeps track of the cluster’s total.
  4. A container in YARN holds resources on the cluster. YARN determines where there is room on a host in the cluster for the size of the hold for the container. Once the container is allocated, those resources are usable by the container.
  5. An application in YARN comprises three parts:
    1. The application client, which is how a program is run on the cluster.
    2. An ApplicationMaster, which provides YARN with the ability to perform allocation on behalf of the application.
    3. One or more tasks that do the actual work, each running as a process in a container allocated by YARN.
  6. A MapReduce application consists of map tasks and reduce tasks.
  7. A MapReduce application running in a YARN cluster looks very much like the MapReduce application paradigm, but with the addition of an ApplicationMaster as a YARN requirement.

Next Time…

Part 2 will cover calculating YARN properties for cluster configuration.

Ray Chiang is a Software Engineer at Cloudera.

Dennis Dawson is a Senior Technical Writer at Cloudera.
