# Apache Hadoop Yarn: Yet Another Resource Negotiator论文解读

纯属云平台管理学习菜鸟的笔记,参照许多大牛的博客,如有侵权,请联系,立刻删除。

Abstract

1) tight coupling of a specific programming model with the re- source management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs’ control flow, which resulted in endless scalability concerns for the scheduler.
个人理解:Yarn之前的Hadoop资源管理调度和编程模型紧密耦合在一起,使成程序员的编程思维方式局限于MapReduce,不利于其他编程模型的加入。资源管理和作业控制集中在Jobtracker身上,扩展性较差。
Yarn 架构致力于编程模型和资源管理解耦,并把编程模型托付给其他一些调度部分。

1 Introduction
We present the next generation of Hadoop compute platform known as YARN, which departs from its familiar, monolithic architecture. By separating resource management functions from the programming model, YARN delegates many scheduling-related functions to per-job components.

个人理解:资源调度架构方的转变——Yarn把Hadoop 1.0的中央式调度器(Monolithic scheduler)方式过渡到双层调度器(Two-level scheduler)方式。RM 作为一个轻量级的中央调度器,仅负责资源分配管理,位于顶层。底层则为各种应用程序的调度器。不同的编程模型组成不同的应用程序调度器AM。这些负责向RM 申请资源和控制各自作业的独立运行。

2 History and rationale

从Yahoo! 的工程实践引出Hadoop 1.0的不足之处,从而产生需要改进的10大需求。

需求个人理解
1 Scalability单一Namenode,单一Jobtracker的设计严重制约了整个Hadoop 1.0可扩展 性和可靠性。首先,Namenode和Jobtracker是整个系统中明显的单点故障源(SPOF)。再次单一Namenode的内存容量有限,使得Hadoop集群的节点数量被限制到2000个左右,能支持的文件系统大小被限制在10-50PB, 最多能支持的文件数量大约为1.5亿 左右(注,实际数量取决于 Namenode的内存大小)。
2 Multi-tenancy
3 Serviceability升级依赖需要解耦
4 Locality awareness资源备份需要更加合理
5 High Cluster Utilization集群资源分配的延迟太高,需要提高集群资源利用率
6 Reliability/Availability可靠性
7 Secure and auditable operation安全性
8 Support for Programming Model Diversity支持多种编程模型
9 Flexible Resource Model资源分配需要更加灵活 The number of map and reduce slots are fixed by the cluster operator, so fallow map capacity can’t be used to spawn reduce tasks and vice versa.4
10 Backward compatibility版本兼容性

3 Architecture

The RM runs as a daemon on a dedicated machine, and acts as the central authority arbitrating resources among various competing applications in the cluster.
Depending on the application demand, scheduling priori ties, and resource availability, the RM dynamically allocates leases– called containersto applications to run on particular nodes.5 The container is a logical bundle of re-sources (e.g., ⟨2GB RAM, 1 CPU⟩) bound to a particular node [R4,R9]. In order to enforce and track such assignments, the RM interacts with a special system daemon running on each node called the NodeManager (NM).


Jobs are submitted to the RM via a public submission protocol and go through an admission control phase during which security credentials are validated and various operational and administrative checks are performed [R7].
The ApplicationMaster is the “head” of a job, managing all lifecycle aspects including dynamically increasing and decreasing resources consumption, managing the flow of execution (e.g., running reducers against the output of maps), handling faults and computation skew, and performing other local optimizations.
Typically, an AM will need to harness the resources (cpus, RAM, disks etc.) available on multiple nodes to complete a job. To obtain containers, AM issues resource requests to the RM.
Overall, a YARN deployment provides a basic, yet robust infrastructure for lifecycle management and monitoring of containers, while application-specific semantics are managed by each framework [R3,R8].

RM 作为一个全局的资源管理器,负责整个系统的资源管理和分配。根据应用程序的资源申请分配容器资源给特定的节点使用。NM是每个节点上的资源和任务管理器,它与RM之间使用心跳包的方式进行通信,向RM通告资源的使用状况。另一方面,它接收并处理来自AM的Container 启动或停止请求。Container是Yarn中资源逻辑单位,封装了某个节点的资源,如内存和cpu等。AM负责作业整个生命周期的控制,包括给作业申请资源、管理作业运行的整个流程等。RM给AM分配一个带租期的的容器,一个基于token的安全机制确保AM拥有某个节点的容器。不同编程模型可开发出不同的AM,实现了对多种编程模型的支持。

Resource Manager (RM)
The ResourceManager exposes two public interfaces towards: 1) clients submitting applications, and 2) ApplicationMaster(s) dynamically negotiating access to resources, and one internal interface towards NodeManagers for cluster monitoring and resource access management.

两个向外提供的接口: 客户端提交应用程序交互接口及与AM动态协调分配调度资源接口
一个内部接口: 与NM通信,管理NM的资源使用。

As discussed, it is not responsible for coordinating application execution or task fault-tolerance, but neither is is charged with 1) providing status or metrics for running applications (now part of the ApplicationMaster), nor 2) serving frame- work specific reports of completed jobs (now delegated to a per-framework daemon).10 This is consistent with the view that the ResourceManager should only handle live resource scheduling, and helps central components in YARN scale beyond the Hadoop 1.0 JobTracker.

简而言之,RM中的scheduler 与jobtracker不同,它不再参与应用程序的执行监控和跟踪,也不负责重新启动因应用执行失败或者硬件故障而产生的失败任务,这些均交给应用程序相关的AM完成。

Application Master (AM)

An application may be a static set of processes, a logical description of work, or even a long-running service. The ApplicationMaster is the process that coordinates the application’s execution in the cluster, but it itself is run in the cluster just like any other container.

The AM periodically heartbeats to the RM to affirm its liveness and to update the record of its demand.

In response to subsequent heartbeats, the AM will receive a container lease on bundles of resources bound to a particular node in the cluster. Based on the containers it receives from the RM, the AM may update its execution plan to accommodate perceived abundance or scarcity. In contrast to some resource models, the allocations to an application are late binding: the process spawned is not bound to the request, but to the lease. The conditions that caused the AM to issue the request may not remain true when it receives its resources, but the semantics of the container are fungible and framework-specific [R3,R8,R10].

AM会周期性地向RM发送心跳以表明其存在,并更新它的资源要求。在收到由RM分配的containers后,AM会根据资源的多寡来调整其任务的执行计划。分配给应用程序的资源是推迟绑定的。在考虑资源可用性与调度策略的情况下,RM会试图满足每一个app提出的resource request。当一个资源被调度给了某个AM后,RM会为该资源生成一个租约(lease),随后的AM heartbeat会获取到该lease。当AM向NM递呈container lease时,一个基于令牌的安全机制会保证该container lease的真实性。

Node Manager (NM)
The NodeManager is the “worker” daemon in YARN. It authenticates container leases, manages containers’ dependencies, monitors their execution, and provides a set of services to containers.

The NM will also kill containers as directed by the RM or the AM.

NM also periodically monitors the health of the physical node.

NM在Yarn中作为一个工作节点存在,它负责验证容器租约、管理容器依赖关系和监视应用程序的执行。同时,它会周期性通过心跳信息向RM报告它的资源可用性。

4 Yarn in the real-world

While one of the initial goals of YARN was to improve scalability, Yahoo! reported that they are not running clusters any bigger than 4000 nodes which used be the largest cluster’s size before YARN.
This has simply removed the need to scale further for the moment, and has even allowed the operational team to delay the re-provisioning of over 7000 nodes that have been decommissioned.

Essentially, moving to YARN, the CPU utilization almost doubled to the equivalent of 6 continuously pegged cores per box and with peaks up to 10 fully utilized cores. In aggregate, this indicates that YARN was capable of keeping about 2.8*2500 = 7000 more cores completely busy running user code. This is consistent with the increase in number of jobs and tasks running on the cluster we discussed above.

This is well summarized in the following quote: “upgrading to YARN was equivalent to adding 1000 machines [to this 2500 ma- chines cluster]”.

5 experiments
Beating the sort record

7 Conclusion

Thanks to the decoupling of resource management and program- ming framework, YARN provides: 1) greater scalability, 2) higher efficiency, and 3) enables a large number of different frameworks to efficiently share a cluster. These claims are substantiated both experimentally (via benchmarks)

优势:

  • 高扩展性
  • 集群利用效率高
  • 许多不同的计算框架可以共享集群资源

    缺点:

  • 1各个框架无法知道整个集群的实时资源使用情况,只是被动地接受资源,等待顶层的调度推送信息。

  • 2采用悲观锁,并发粒度小,响应稍慢,缺乏一种有效的竞争机制。
一个client请求响应过程。

•   Step 1: client首先通知RM自己希望提交app
•   Step 2: 随后RM会响应一个ApplicationID及关于当前系统资源容量的信息(供client发起资源请求时作为参考)
•   Step 3: client回应“Application Submission Context” 及 “Container Launch Context (CLC)”。app submission context 包含Application ID, user, queue, 以及启动AM所需的其他信息;CLC中包含了resource requirements, job files, security tokens,以及在一个node上启动AM所需的其他信息
•   Step 4: 当RM收到app submission context后,将会试图为AM调度分配一个可用的container(该container被称为“container 0”,它就是AM,并且它后续将继续请求更多的containers)。如果没有可用的container,该请求将会等待。如果有可用的container,RM会选择并联系一个node,然后在该node上启动AM。并且,用于监控app状态的AM RPC port与tracking URL将被建立起来
•   Step 5: RM向AM回送关于集群中maximum and minimum capabilities的信息。此时,AM必须决定怎样来使用这些可用资源。从这里可以看出,YARN允许app适应当前的集群环境
•   Step 6: 基于RM在step 5中回送的关于当前集群可用资源的信息,AM将请求若干containers
•   Step 7: 随后,RM将根据调度策略对此请求进行回应,并将containers分配给AM

当作业开始运行后,AM将向RM发送心跳/进度信息。在这些心跳信息中,AM可以请求更多的containers,也可以释放containers。当作业运行完毕后,AM向RM发送Finish消息后退出。

参考文献:
Apache Hadoop Yarn: Yet Another Resource Negotiator
http://www.cnblogs.com/zwCHAN/p/4240539.html
spark 笔记 4:Apache Hadoop YARN: Yet Another Resource Negotiator
https://www.zybuluo.com/xtccc/note/248181
YARN Architecture
http://geek.csdn.net/news/detail/74234
十大主流集群调度系统大盘点

《Hadoop 技术内幕 深入解析YARN架构设计与实现原理》董西成

  • 2
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值