Hadoop: jobs cannot be submitted while the cluster is idle

I. Problem description

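When submitting an MR job through Hive, I found that even though the queue was idle, the submitted application never entered the RUNNING state and stayed in ACCEPTED. The log showed that the same error had already been reported on June 8 (see below):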
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1573631365527_158284,name=select count(*) from ...ult.hms_per5min_dual(Stage-1),user=root,queue=default,state=FINISHED,trackingUrl=http://stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc:8088/proxy/application_1573631365527_158284/,appMasterHost=stanlee-171-21-hzqsh.node.hzqsh.wacai.sdc,submitTime=1591575015466,startTime=1591575015494,finishTime=1591575030220,finalStatus=SUCCEEDED,memorySeconds=40709,vcoreSeconds=27,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=MAPREDUCE,resourceSeconds=40709 MB-seconds\, 27 vcore-seconds,preemptedResourceSeconds=0 MB-seconds\, 0 vcore-seconds
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000, removing app application_1573631365527_157284 from state store.
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1573631365527_157284 from memory:
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1573631365527_157284
2020-06-08 08:10:37,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
java.lang.NullPointerException

II. Cause

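The YARN cluster had plenty of resources, so insufficient memory was ruled out as the reason the application stayed pending.
The latest YARN logs showed that the state of expired applications and the related RM information had not been cleaned up in time. (This data is kept in memory or under a ZK directory; it also supports RM high availability and guards against split-brain.)
The likely cause was that the number of per-application state and RM records kept in memory or in ZooKeeper had exceeded the configured limit.
To verify this, I checked the ZooKeeper-related settings in yarn-site.xml and found that the limit set there had indeed been exceeded, so the state of the expired applications needed to be cleaned up.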

1. Check the log (10.1.171.20)

  • vim /data/program/hadoop-3.0.0-cdh6.3.1/logs/hadoop-appweb-resourcemanager-stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc.log
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1573631365527_158284,name=select count(*) from ...ult.hms_per5min_dual(Stage-1),user=root,queue=default,state=FINISHED,trackingUrl=http://stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc:8088/proxy/application_1573631365527_158284/,appMasterHost=stanlee-171-21-hzqsh.node.hzqsh.wacai.sdc,submitTime=1591575015466,startTime=1591575015494,finishTime=1591575030220,finalStatus=SUCCEEDED,memorySeconds=40709,vcoreSeconds=27,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=MAPREDUCE,resourceSeconds=40709 MB-seconds\, 27 vcore-seconds,preemptedResourceSeconds=0 MB-seconds\, 0 vcore-seconds
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000, removing app application_1573631365527_157284 from state store.
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1573631365527_157284 from memory:
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1573631365527_157284
2020-06-08 08:10:37,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
java.lang.NullPointerException
2020-06-08 08:10:40,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker
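
Rather than reading the whole file in vim, the two signals in the excerpt above can be pulled out with grep. The following is a minimal sketch, not the author's procedure: the function name `summarize_rm_log` is hypothetical, and it only counts the preemption-checker errors and lists the distinct `maxCompletedApps...` limits that appear in the log.

```shell
#!/usr/bin/env bash
# Summarize the ResourceManager log lines relevant to this incident.
# summarize_rm_log is a hypothetical helper name, not a Hadoop tool.
# Usage: summarize_rm_log /path/to/resourcemanager.log
summarize_rm_log() {
  local log_file="$1"
  local failures
  # Count how many times the preemption checker reported an exception.
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  failures=$(grep -c 'Exception raised while executing preemption checker' "$log_file" || true)
  echo "preemption_checker_failures=${failures}"
  # List the distinct completed-application limits that RMAppManager logged.
  grep -o 'maxCompletedApps[A-Za-z]* = [0-9]*' "$log_file" | sort -u | sed 's/ = /=/'
}
```

Called with the ResourceManager log path from the step above, it prints one `name=value` line per signal, which makes it easy to compare the logged limits against the number of nodes actually stored in ZooKeeper.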

2. Check the relevant configuration in yarn-site.xml; it shows that the application-related RM state is stored in ZK

 <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>

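3. Check the limits reported in the log at the time of the error,

then check the number of applications stored in ZooKeeper. It had already exceeded the limit of 1000, so I concluded that this was why jobs could not be submitted and set about clearing that state.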
Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000
Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000
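
The two limits in these log lines are read by the ResourceManager from `yarn.resourcemanager.max-completed-applications` and `yarn.resourcemanager.state-store.max-completed-applications`. Below is a rough sketch for reading them, together with the store class, straight from yarn-site.xml. The helper name `get_hadoop_property` and the config path in the usage comment are assumptions; the function relies on the usual layout in which `<value>` follows `<name>` within the same `<property>` element.

```shell
#!/usr/bin/env bash
# Print the <value> that follows <name>PROP</name> in a Hadoop XML config file.
# get_hadoop_property is a hypothetical helper; it prints nothing if the
# property is not set in the file (the built-in default then applies).
get_hadoop_property() {
  local prop="$1" file="$2"
  awk -v prop="$prop" '
    index($0, "<name>" prop "</name>") { found = 1 }
    found && match($0, /<value>[^<]*<\/value>/) {
      # Drop the 7-character "<value>" prefix and 8-character "</value>" suffix.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$file"
}

# Example usage (the path is an assumption; adjust to your installation):
#   conf=/etc/hadoop/conf/yarn-site.xml
#   for p in yarn.resourcemanager.store.class \
#            yarn.resourcemanager.max-completed-applications \
#            yarn.resourcemanager.state-store.max-completed-applications; do
#     echo "$p=$(get_hadoop_property "$p" "$conf")"
#   done
```

An empty result for either limit only means the property is not overridden in this file, not that no limit is in effect.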

III. Solution

1. Run zkCli.sh and check the number of applications under the ZKRMStateStore directory

$  echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh | grep application_ | awk -F , '{print NF}'

1002
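
The count above depends on zkCli.sh printing the children as a single comma-separated list. As an alternative sketch that does not depend on the separator, the application IDs can be extracted by pattern and counted. The function name `count_app_nodes` is hypothetical; it reads the zkCli.sh output on stdin.

```shell
#!/usr/bin/env bash
# Count distinct application IDs found in zkCli.sh "ls" output read from stdin.
# count_app_nodes is a hypothetical helper name.
count_app_nodes() {
  grep -o 'application_[0-9]*_[0-9]*' | sort -u | wc -l | tr -d ' '
}

# Example usage (same zkCli.sh invocation as in the step above):
#   echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /path/to/zkCli.sh | count_app_nodes
```

Because it matches IDs rather than splitting on commas, banner and prompt lines printed by the client do not affect the result.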

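2. Clean up the invalid, expired applications with a script: splice each application's state node into a ZK delete command, then delete them in bulk (the command below writes one rmr line for every node listed under RMAppRoot)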

echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" |  /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh | grep application_ | while read item; do echo ${item#*[}; done | while read item; do echo ${item%*]}; done | awk -F ', ' '{ for (i=1;i<=NF;i++) printf "rmr /rmstore/ZKRMStateRoot/RMAppRoot/%s\n",$i}' > /tmp/deleteNode.txt

3. Run the bulk delete

cat /tmp/deleteNode.txt | /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh 
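
Since the generated file covers every node under RMAppRoot, a more conservative variant would skip applications that YARN still reports as active. The sketch below is one possible way to do that, not the procedure used in this post: `build_delete_list` and the `/tmp` file names in the usage comment are hypothetical, and the list of active IDs is assumed to come from `yarn application -list -appStates ...`.

```shell
#!/usr/bin/env bash
# Print one "rmr" command per stored application that is NOT in the keep list.
#   $1: file containing zkCli.sh "ls" output for RMAppRoot
#   $2: file with one application ID per line that must be kept
# build_delete_list is a hypothetical helper name.
build_delete_list() {
  local zk_listing="$1" keep_file="$2"
  grep -o 'application_[0-9]*_[0-9]*' "$zk_listing" | sort -u |
    grep -v -x -F -f "$keep_file" |
    sed 's|^|rmr /rmstore/ZKRMStateRoot/RMAppRoot/|'
}

# Example usage (file names are assumptions):
#   yarn application -list -appStates NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING 2>/dev/null \
#     | grep -o 'application_[0-9]*_[0-9]*' | sort -u > /tmp/activeApps.txt
#   echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /path/to/zkCli.sh > /tmp/zkListing.txt
#   build_delete_list /tmp/zkListing.txt /tmp/activeApps.txt > /tmp/deleteNode.txt
```

Reviewing /tmp/deleteNode.txt before piping it into zkCli.sh keeps the bulk delete auditable.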

IV. After the expired application state was cleaned up, the Hive SQL was submitted again and succeeded

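V. Summary: when a problem occurs, use Hadoop's thorough logging to locate it, or combine past experience with the relevant log messages, and from there decide how to fix it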
