Troubleshooting a YARN ResourceManager HA Failover Problem

1. Problem Description

After a YARN configuration change, the YARN ResourceManagers had to be restarted. Once the restart completed, both ResourceManagers were stuck in standby state, users could not submit jobs to the YARN cluster, and the YARN service was effectively down.
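The HA state of each ResourceManager can be confirmed with the standard yarn rmadmin CLI; a minimal sketch, assuming the RM ids are rm1 and rm2 (the real ids come from yarn.resourcemanager.ha.rm-ids):

```
# In this incident both commands returned "standby"
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
```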

The relevant ResourceManager exception log is reproduced in the troubleshooting section below.

2. Troubleshooting

  1. From YARN's HA mechanism we know that a standby RM, on becoming active, attempts to recover the applications that were running. The process is as follows:
    When a NodeManager re-syncs with the restarted RM, the NM does not kill its containers; it continues to manage them and reports their status to the RM when it re-registers.

  2. The RM rebuilds the container instances and the scheduling state of the associated applications from this reported container information. Meanwhile, each ApplicationMaster must re-send its outstanding resource requests to the RM, because the RM may have lost them while it was down.

  3. Application writers who use the AMRMClient library to talk to the RM do not need to handle the AM re-sending resource requests on re-sync; the library does this automatically. (The yarn-site.xml settings behind this recovery behavior are sketched right below.)
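For reference, a minimal yarn-site.xml sketch of the work-preserving recovery settings (these are standard Hadoop property names; the values shown are illustrative, not necessarily this cluster's):

```
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
```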

  4. Checked the jobs running on YARN: only one application was running, with ID application_1606183701564_9494 (a CLI sketch follows).
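Listing the running applications is a standard YARN CLI call; a sketch:

```
# Shows application_1606183701564_9494 as the only RUNNING application
yarn application -list -appStates RUNNING
```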

  5. Searched the standby ResourceManager's log for that application ID, as sketched below, and found the entries that follow.
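A sketch of the log search, assuming a typical CDH ResourceManager log location (the actual path varies by deployment):

```
grep application_1606183701564_9494 /var/log/hadoop-yarn/*RESOURCEMANAGER*.log*
```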

```
2020-11-26 20:05:02,369 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1606183701564_9494 with 1 attempts and final state = NONE

2020-11-26 20:05:23,123 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Cannot submit application application_1606183701564_9494 to queue root.default because it has zero amount of resource for a requested resource! Invalid requested AM resources: [MaxResourceValidationResult{resourceRequest={AllocationRequestId: -1, Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1, Location: *, Relax Locality: true, Execution Type Request: {Execution Type: GUARANTEED, Enforce Execution Type: false}, Node Label Expression: }, invalidResources=[name: memory-mb, units: Mi, type: COUNTABLE, value: 2048, minimum allocation: 0, maximum allocation: 9223372036854775807, name: vcores, units: , type: COUNTABLE, value: 1, minimum allocation: 0, maximum allocation: 9223372036854775807]}], maximum queue resources: <memory:0, vCores:0>

2020-11-26 20:05:23,126 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:526)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1257)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1266)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1207)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:908)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:116)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:1078)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2300(RMAppImpl.java:118)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1142)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1083)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:891)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:358)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:552)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1406)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:769)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1159)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1199)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1195)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1195)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:320)
    at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:894)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:651)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:526)
```
  6. The log shows that application_1606183701564_9494 was recovered with final state NONE, and that the AM's resource request on queue root.default could not be satisfied because the queue's maximum resources were <memory:0, vCores:0>.

  7. Searched for the key error messages online.

A CDH fixed-issues entry matches our problem:

YARN Resource Managers will stay in standby state after failover or startup https://docs.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_634_fixed_issues.html

The CDH entry points to the upstream YARN community JIRA:

Improve error handling when application recovery fails with exception https://issues.apache.org/jira/browse/YARN-7913

The JIRA describes roughly the following scenario:

a. YARN HA is enabled and the FairScheduler is in use.

b. The standby RM's fair-scheduler.xml is modified so that, on failover, the recovered application either has no permission on its queue or cannot obtain resources from it.

c. A NullPointerException is thrown and the standby RM fails to transition to active.

Since we had not modified the standby RM's fair-scheduler.xml ourselves, the next step was to inspect the file's contents to pin down the problem.

  8. Inspected the root.default queue configuration: cat /run/cloudera-scm-agent/process/2804-yarn-RESOURCEMANAGER/fair-scheduler.xml

  9. Suspected that the maxResources value memory-mb=100.0%, vcores=100.0% could not be parsed (see the reconstruction below).
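Inside fair-scheduler.xml the queue looked roughly like the following; this is a reconstruction rather than the verbatim file, with only the maxResources value being the one actually observed:

```
<allocations>
  <queue name="default">
    <!-- Suspect line: this percentage syntax was not recognized by this
         FairScheduler version, so the queue's maximum resources resolved
         to <memory:0, vCores:0> -->
    <maxResources>memory-mb=100.0%, vcores=100.0%</maxResources>
  </queue>
</allocations>
```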

  10. After deleting the maxResources tag from the root.default queue, the YARN RM HA failover succeeded (a sketch of the cleaned-up queue follows).
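After the fix the queue definition simply carries no maxResources element, so the queue falls back to the scheduler's default maximum instead of a mis-parsed explicit cap; a sketch:

```
<allocations>
  <queue name="default">
    <!-- maxResources removed: the queue defaults to the scheduler's
         maximum rather than a cap parsed as <memory:0, vCores:0> -->
  </queue>
</allocations>
```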

3. Summary

  • The maxResources value on queue root.default in fair-scheduler.xml could not be parsed, so the queue's maximum available resources came out as 0.

  • After the standby RM started its transition to active, the AMs of the cluster's in-flight jobs kept requesting resources from it.

  • The new RM could not grant any resources on the root.default queue.

  • Reading the source code shows that at that point the application's state was still NEW, because the APP_REJECTED event had not yet finished processing (once processed, the state would be FAILED). The application therefore could not be found in the scheduler, and a NullPointerException was thrown, as sketched below.
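A simplified Java sketch of this failure pattern; it is not the actual FairScheduler code, and the class and method names are illustrative stand-ins. The point is that the rejected application is never added to the scheduler's map, so the recovery path's lookup returns null and the subsequent dereference throws:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the NPE pattern behind YARN-7913.
class SchedulerSketch {
    // Applications known to the scheduler, keyed by application id.
    private final Map<String, Object> applications = new ConcurrentHashMap<>();

    void addApplication(String appId, int queueMaxMemoryMb) {
        if (queueMaxMemoryMb == 0) {
            // Queue has zero max resources (the mis-parsed maxResources):
            // the app is rejected and never added to the map, and the
            // APP_REJECTED event is processed asynchronously.
            return;
        }
        applications.put(appId, new Object());
    }

    void addApplicationAttempt(String appId) {
        // During recovery, the attempt arrives before APP_REJECTED has
        // moved the app to FAILED, so the lookup returns null.
        Object app = applications.get(appId);
        app.toString(); // NullPointerException, as in the stack trace above
    }

    public static void main(String[] args) {
        SchedulerSketch s = new SchedulerSketch();
        s.addApplication("application_1606183701564_9494", 0); // rejected
        s.addApplicationAttempt("application_1606183701564_9494"); // throws NPE
    }
}
```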

4. Alternative Solution

  1. Stop all ResourceManager processes.

  2. Open a ZooKeeper client: sh /opt/cloudera/parcels/CDH/lib/zookeeper/bin/zkCli.sh

  3. Run ls /rmstore/ZKRMStateRoot/RMAppRoot to inspect that directory.

  4. If it is not empty, remove its contents with deleteall /rmstore/ZKRMStateRoot/RMAppRoot.

  5. Restart the ResourceManagers and the service recovers (a consolidated session is sketched below).
    This approach works by wiping from ZooKeeper the application state the RM would otherwise recover, i.e. the RM does not recover in-flight applications, so jobs currently running on the cluster are terminated.
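A consolidated sketch of the whole procedure, using the CDH paths quoted above:

```
# 1. Stop every ResourceManager process first (e.g. via Cloudera Manager), then:
sh /opt/cloudera/parcels/CDH/lib/zookeeper/bin/zkCli.sh

# 2. Inside the zkCli shell:
ls /rmstore/ZKRMStateRoot/RMAppRoot          # inspect the recoverable app state
deleteall /rmstore/ZKRMStateRoot/RMAppRoot   # wipe it; running jobs are lost

# 3. Quit zkCli and restart the ResourceManagers.
```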
