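I. Problem description
When submitting MR jobs through Hive, we found that even though the queue was idle, the submitted applications never entered RUNNING and stayed in ACCEPTED. The logs show the same error was already being reported on June 8 (see below).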
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1573631365527_158284,name=select count(*) from ...ult.hms_per5min_dual(Stage-1),user=root,queue=default,state=FINISHED,trackingUrl=http://stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc:8088/proxy/application_1573631365527_158284/,appMasterHost=stanlee-171-21-hzqsh.node.hzqsh.wacai.sdc,submitTime=1591575015466,startTime=1591575015494,finishTime=1591575030220,finalStatus=SUCCEEDED,memorySeconds=40709,vcoreSeconds=27,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=MAPREDUCE,resourceSeconds=40709 MB-seconds\, 27 vcore-seconds,preemptedResourceSeconds=0 MB-seconds\, 0 vcore-seconds
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000, removing app application_1573631365527_157284 from state store.
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1573631365527_157284 from memory:
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1573631365527_157284
2020-06-08 08:10:37,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
java.lang.NullPointerException
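II. Cause
The YARN cluster had plenty of free resources, so insufficient memory was ruled out as the reason for applications pending.
The latest YARN logs showed that the state of expired applications and the related RM information (kept in memory or under a ZK directory, and also used to guarantee RM high availability and prevent split-brain) had not been cleaned up in time.
The likely cause was that the number of application state entries and RM records kept in memory or in ZooKeeper had exceeded the configured threshold.
We then checked the ZooKeeper-related settings in yarn-site.xml to verify this and found that the threshold had indeed been exceeded, so the state of expired applications needed to be cleaned up.
1. Check the logs (10.1.171.20)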
- vim /data/program/hadoop-3.0.0-cdh6.3.1/logs/hadoop-appweb-resourcemanager-stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc.log
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1573631365527_158284,name=select count(*) from ...ult.hms_per5min_dual(Stage-1),user=root,queue=default,state=FINISHED,trackingUrl=http://stanlee-171-20-hzqsh.node.hzqsh.wacai.sdc:8088/proxy/application_1573631365527_158284/,appMasterHost=stanlee-171-21-hzqsh.node.hzqsh.wacai.sdc,submitTime=1591575015466,startTime=1591575015494,finishTime=1591575030220,finalStatus=SUCCEEDED,memorySeconds=40709,vcoreSeconds=27,preemptedMemorySeconds=0,preemptedVcoreSeconds=0,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=MAPREDUCE,resourceSeconds=40709 MB-seconds\, 27 vcore-seconds,preemptedResourceSeconds=0 MB-seconds\, 0 vcore-seconds
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000, removing app application_1573631365527_157284 from state store.
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000, removing app application_1573631365527_157284 from memory:
2020-06-08 08:10:36,714 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing info for app: application_1573631365527_157284
2020-06-08 08:10:37,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
java.lang.NullPointerException
2020-06-08 08:10:40,188 ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker
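To see how long the preemption checker has been failing, it can help to count these ERROR lines per day. The following is only a minimal sketch: the function name count_preemption_errors is hypothetical, and it assumes every log line starts with a date in the format shown above.

```shell
# Hypothetical helper: count SchedulingMonitor ERROR lines per day.
# Reads a ResourceManager log on stdin and prints "<date> <count>" per day.
count_preemption_errors() {
  grep 'ERROR org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor' \
    | awk '{ count[$1]++ } END { for (d in count) print d, count[d] }' \
    | sort
}
```

It could be run as count_preemption_errors < (the ResourceManager log opened in the vim command above).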
2. Check the relevant settings in yarn-site.xml; they show that the application-related RM state is stored in ZK
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
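3. Check the thresholds reported in the logs at the time of the error.
We also checked the number of applications stored in ZooKeeper and found that it had exceeded the threshold of 1000. We concluded that this was what prevented jobs from being submitted successfully, and set about clearing that state.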
Application should be expired, max number of completed apps kept in memory met: maxCompletedAppsInMemory = 1000
Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 1000
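The limits currently in effect can be read straight out of the ResourceManager log. The snippet below is a rough sketch; extract_limits is a hypothetical name, and it relies only on the message format shown in the two lines above. As far as I know, in stock Hadoop these values come from yarn.resourcemanager.max-completed-applications and yarn.resourcemanager.state-store.max-completed-applications, but that mapping is an assumption to confirm against yarn-default.xml for the version in use.

```shell
# Hypothetical helper: print the distinct completed-app limits found in an RM log.
# Reads the log on stdin and prints lines such as "maxCompletedAppsInMemory = 1000".
extract_limits() {
  grep -oE 'maxCompletedAppsIn(Memory|StateStore) = [0-9]+' | sort -u
}
```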
III. Solution
1. Connect with zkCli.sh and check how many applications are stored under the ZKRMStateStore directory
$ echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh | grep application_ | awk -F , '{print NF}'
1002
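The count above is the number of comma-separated fields on the listing line, so any non-application child under RMAppRoot would be counted too. A slightly stricter variant, sketched below under the assumption that application IDs always look like application_<digits>_<digits>, counts only distinct application IDs. The function name count_app_znodes is hypothetical.

```shell
# Hypothetical helper: count distinct application IDs in zkCli "ls" output.
# Reads the zkCli output on stdin and prints a single number.
count_app_znodes() {
  grep -oE 'application_[0-9]+_[0-9]+' | sort -u | wc -l | tr -d ' '
}
```

It would replace the grep and awk stages of the pipeline above, with the zkCli.sh output piped into it.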
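2. Use a script to clean up invalid and expired applications: combine each application entry with the zk delete command so they can be removed in one batch. Note that the command below emits an rmr line for every entry listed under RMAppRoot.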
echo "ls /rmstore/ZKRMStateRoot/RMAppRoot" | /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh | grep application_ | while read item; do echo ${item#*[}; done | while read item; do echo ${item%*]}; done | awk -F ', ' '{ for (i=1;i<=NF;i++) printf "rmr /rmstore/ZKRMStateRoot/RMAppRoot/%s\n",$i}' > /tmp/deleteNode.txt
3. Run the batch delete
cat /tmp/deleteNode.txt | /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/zookeeper/bin/zkCli.sh
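Because the generated file covers every entry under RMAppRoot, a more cautious variant could skip applications that are still active. The following is a sketch only, not the method used above: build_delete_commands is a hypothetical function, and the two input files (a saved zkCli listing and a list of active application IDs, one per line) are assumptions.

```shell
# Hypothetical helper: emit rmr commands only for applications that are not active.
# $1: file containing the zkCli "ls /rmstore/ZKRMStateRoot/RMAppRoot" output
# $2: file containing one active application ID per line (may be empty)
build_delete_commands() {
  grep -oE 'application_[0-9]+_[0-9]+' "$1" \
    | sort -u \
    | grep -vxFf "$2" \
    | awk '{ printf "rmr /rmstore/ZKRMStateRoot/RMAppRoot/%s\n", $1 }'
}
```

The active list could be produced, for example, by piping yarn application -list -appStates ACCEPTED,RUNNING through the same grep -oE pattern. The output of build_delete_commands can then be reviewed by hand before it is fed to zkCli.sh as in step 3.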
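IV. After the expired application state was cleaned up, the Hive SQL was submitted again and the submission succeeded
V. Summary: when a problem occurs, use Hadoop's thorough logging to locate it, or combine past experience with the relevant log messages, and from there decide on the direction of the fix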