CDH 6.3.2: YARN ResourceManager fails to restart in an active/standby (HA) setup

1. Logs

2023-02-17 19:07:51,483 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1676509052500_0391,name=com.sea.servicerunning.scheduler.Data2Mysql,user=root,queue=root.users.root,state=FINISHED,trackingUrl=http://hadoop067.nari.com:28088/proxy/application_1676509052500_0391/,appMasterHost=N/A,submitTime=1676523646022,startTime=1676523646022,finishTime=1676523688554,finalStatus=SUCCEEDED,memorySeconds=599377,vcoreSeconds=258,preemptedMemorySeconds=599377,preemptedVcoreSeconds=258,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK,resourceSeconds=599377 MB-seconds\, 258 vcore-seconds,preemptedResourceSeconds=599377 MB-seconds\, 258 vcore-seconds
2023-02-17 19:07:51,483 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1676509052500_0605_000001
2023-02-17 19:07:51,484 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    OPERATION=Application Finished - Succeeded    TARGET=RMAppManager    RESULT=SUCCESS    APPID=application_1676509052500_0392
2023-02-17 19:07:51,484 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: application_1676509052500_0605 final state (FAILED) was recorded, but appattempt_1676509052500_0605_000001 final state (null) was not recorded.
2023-02-17 19:07:51,484 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1676509052500_0392,name=com.sgcc.ami.dev.DevStat.EqStatScheduler,user=root,queue=root.users.root,state=FINISHED,trackingUrl=http://hadoop067.nari.com:28088/proxy/application_1676509052500_0392/,appMasterHost=N/A,submitTime=1676523662675,startTime=1676523662675,finishTime=1676523861916,finalStatus=SUCCEEDED,memorySeconds=14498412,vcoreSeconds=6351,preemptedMemorySeconds=14498412,preemptedVcoreSeconds=6351,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK,resourceSeconds=14498412 MB-seconds\, 6351 vcore-seconds,preemptedResourceSeconds=14498412 MB-seconds\, 6351 vcore-seconds
2023-02-17 19:07:51,484 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1676509052500_0605_000001 State change from NEW to FAILED on event = RECOVER
2023-02-17 19:07:51,484 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    OPERATION=Application Finished - Succeeded    TARGET=RMAppManager    RESULT=SUCCESS    APPID=application_1676509052500_0393
2023-02-17 19:07:51,484 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1676509052500_0393,name=com.sgcc.collquality.job.day.ACollOrgReadInteDayJob,user=root,queue=root.users.spark,state=FINISHED,trackingUrl=http://hadoop067.nari.com:28088/proxy/application_1676509052500_0393/,appMasterHost=N/A,submitTime=1676523704652,startTime=1676523704652,finishTime=1676523774850,finalStatus=SUCCEEDED,memorySeconds=5269952,vcoreSeconds=3966,preemptedMemorySeconds=5269952,preemptedVcoreSeconds=3966,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK,resourceSeconds=5269952 MB-seconds\, 3966 vcore-seconds,preemptedResourceSeconds=5269952 MB-seconds\, 3966 vcore-seconds
2023-02-17 19:07:51,484 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1676509052500_0606 with 1 attempts and final state = NONE
2023-02-17 19:07:51,484 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    OPERATION=Application Finished - Succeeded    TARGET=RMAppManager    RESULT=SUCCESS    APPID=application_1676509052500_0394
2023-02-17 19:07:51,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1676509052500_0606_000001 with final state = NONE
2023-02-17 19:07:51,485 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1676509052500_0606_000001
2023-02-17 19:07:51,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1676509052500_0394,name=com.sgcc.collquality.job.day.ACollDevFailDetJob,user=root,queue=root.users.spark,state=FINISHED,trackingUrl=http://hadoop067.nari.com:28088/proxy/application_1676509052500_0394/,appMasterHost=N/A,submitTime=1676523704494,startTime=1676523704494,finishTime=1676524318402,finalStatus=SUCCEEDED,memorySeconds=182107000,vcoreSeconds=118526,preemptedMemorySeconds=182107000,preemptedVcoreSeconds=118526,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK,resourceSeconds=182107000 MB-seconds\, 118526 vcore-seconds,preemptedResourceSeconds=182107000 MB-seconds\, 118526 vcore-seconds
2023-02-17 19:07:51,485 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1676509052500_0606_000001
2023-02-17 19:07:51,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    OPERATION=Application Finished - Succeeded    TARGET=RMAppManager    RESULT=SUCCESS    APPID=application_1676509052500_0395
2023-02-17 19:07:51,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1676509052500_0395,name=com.sgcc.collquality.job.day.ACollDevDetHiveToOracleDay,user=root,queue=root.users.spark,state=FINISHED,trackingUrl=http://hadoop067.nari.com:28088/proxy/application_1676509052500_0395/,appMasterHost=N/A,submitTime=1676523704539,startTime=1676523704539,finishTime=1676528761139,finalStatus=SUCCEEDED,memorySeconds=431362562,vcoreSeconds=327881,preemptedMemorySeconds=431362562,preemptedVcoreSeconds=327881,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK,resourceSeconds=431362562 MB-seconds\, 327881 vcore-seconds,preemptedResourceSeconds=431362562 MB-seconds\, 327881 vcore-seconds
2023-02-17 19:07:51,485 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Queue root.users.root cannot handle resource requestbecause it has zero available amount of resource for a requested resource type, so the resource request is ignored! Requested resources: <memory:2048, vCores:1>, maximum queue resources: <memory:0, vCores:0>
2023-02-17 19:07:51,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    OPERATION=Application Finished - Succeeded    TARGET=RMAppManager    RESULT=SUCCESS    APPID=application_1676509052500_0396
2023-02-17 19:07:51,486 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Cannot submit application application_1676509052500_0606 to queue root.users.root because it has zero amount of resource for a requested resource! Invalid requested AM resources: [MaxResourceValidationResult{resourceRequest={AllocationRequestId: -1, Priority: 0, Capability: <memory:2048, vCores:1>, # Containers: 1, Location: *, Relax Locality: true, Execution Type Request: {Execution Type: GUARANTEED, Enforce Execution Type: false}, Node Label Expression: }, invalidResources=[name: memory-mb, units: Mi, type: COUNTABLE, value: 2048, minimum allocation: 0, maximum allocation: 9223372036854775807, name: vcores, units: , type: COUNTABLE, value: 1, minimum allocation: 0, maximum allocation: 9223372036854775807]}], maximum queue resources: <memory:0, vCores:0>
2023-02-17 19:07:51,486 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary: appId=application_1676509052500_0396,name=com.sgcc.collquality.job.day.ACollTmnlPfInteDayJob,user=root,queue=root.users.spark,state=FINISHED,trackingUrl=http://hadoop067.nari.com:28088/proxy/application_1676509052500_0396/,appMasterHost=N/A,submitTime=1676523704385,startTime=1676523704385,finishTime=1676524075647,finalStatus=SUCCEEDED,memorySeconds=31048377,vcoreSeconds=23586,preemptedMemorySeconds=31048377,preemptedVcoreSeconds=23586,preemptedAMContainers=0,preemptedNonAMContainers=0,preemptedResources=<memory:0\, vCores:0>,applicationType=SPARK,resourceSeconds=31048377 MB-seconds\, 23586 vcore-seconds,preemptedResourceSeconds=31048377 MB-seconds\, 23586 vcore-seconds
2023-02-17 19:07:51,486 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1676509052500_0606_000001
2023-02-17 19:07:51,486 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    OPERATION=Application Finished - Succeeded    TARGET=RMAppManager    RESULT=SUCCESS    APPID=application_1676509052500_0397
2023-02-17 19:07:51,486 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 605 out of 1605 applications
2023-02-17 19:07:51,487 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:526)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:651)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:526)

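The log repeats in a loop, so these are the lines to watch. In short: an application cannot obtain resources from its YARN queue, it is recovered with final state NONE, and the ResourceManager then fails to start (explained in section 2):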

2023-02-17 19:07:51,484 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1676509052500_0606 with 1 attempts and final state = NONE

2023-02-17 19:07:51,486 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 605 out of 1605 applications
2023-02-17 19:07:51,487 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to load/recover state
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:526)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1257)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
    at org.apac
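
When the log is long and repetitive, the offending application can be picked out mechanically. The following is a minimal sketch, not part of the original troubleshooting: it assumes the two message formats shown in the excerpt above ("Recovering app ... final state = NONE" and "Cannot submit application ... to queue ...") and pairs them up. The function name `find_suspect_apps` is made up for illustration.

```python
import re

# Patterns modelled on the ResourceManager log lines quoted above.
RECOVERING = re.compile(
    r"Recovering app: (application_\d+_\d+) with \d+ attempts and final state = (\w+)"
)
REJECTED = re.compile(
    r"Cannot submit application (application_\d+_\d+) to queue (\S+) because"
)


def find_suspect_apps(log_lines):
    """Return {app_id: queue} for apps recovered with final state NONE
    that the FairScheduler then refused to place in their queue."""
    unfinished = set()
    suspects = {}
    for line in log_lines:
        m = RECOVERING.search(line)
        if m:
            if m.group(2) == "NONE":
                unfinished.add(m.group(1))
            continue
        m = REJECTED.search(line)
        if m and m.group(1) in unfinished:
            suspects[m.group(1)] = m.group(2)
    return suspects


if __name__ == "__main__":
    sample = [
        "INFO RMAppImpl: Recovering app: application_1676509052500_0606 "
        "with 1 attempts and final state = NONE",
        "INFO FairScheduler: Cannot submit application "
        "application_1676509052500_0606 to queue root.users.root because "
        "it has zero amount of resource for a requested resource!",
    ]
    print(find_suspect_apps(sample))
```

The application IDs it reports are the candidates for the cleanup described in section 2.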

2. Solutions

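1. Stop all ResourceManager processes

        Start the ZooKeeper client and run:

        rmr /rmstore/ZKRMStateRoot/RMAppRoot/<application id>

        Note that this approach stops the jobs currently running on the YARN cluster.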
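
Because `rmr` deletes recursively, a mistyped or empty application ID could remove far more than intended. The sketch below is an illustration rather than part of the original procedure: it builds the `rmr` commands from validated application IDs so they can be reviewed before being pasted into the ZooKeeper client. The root path is the one from the command above; the helper names `app_znode` and `rmr_commands` are hypothetical.

```python
import re

# State-store path as it appears in the command above; adjust it if your
# cluster stores ResourceManager state under a different root.
RM_APP_ROOT = "/rmstore/ZKRMStateRoot/RMAppRoot"
APP_ID = re.compile(r"application_\d+_\d+")


def app_znode(app_id, root=RM_APP_ROOT):
    """Return the znode path for one application, refusing malformed IDs."""
    if not APP_ID.fullmatch(app_id):
        raise ValueError(f"not a YARN application id: {app_id!r}")
    return f"{root}/{app_id}"


def rmr_commands(app_ids):
    """Build one ZooKeeper client command per application id."""
    return [f"rmr {app_znode(app_id)}" for app_id in app_ids]


if __name__ == "__main__":
    for cmd in rmr_commands(["application_1676509052500_0606"]):
        print(cmd)
```

An empty string or a bare path fragment raises `ValueError` instead of producing a command that targets `RMAppRoot` itself.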

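2. A bug in CDH itself

        Check whether YARN has a Dynamic Resource Pool Configuration.

        If a queue defined there cannot satisfy the recovery of all applications under it, the affected application is recovered with state NONE, that is:

application_1676509052500_0606 with 1 attempts and final state = NONE

Conversely, if the queue has enough resources, the ResourceManager restarts normally. Our cluster was split into several queues and some of them did not have enough resources, which is why the ResourceManager would not start. After we moved everything from the multiple queues to the default queue, resources were sufficient and the ResourceManager started normally. Cloudera's official documentation later showed that this bug is in 6.3.2 itself and is only fixed by upgrading to 6.3.4 or later (see section 3).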
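
The log above reports `maximum queue resources: <memory:0, vCores:0>` for `root.users.root`, so one quick check is to look for queues whose configured maximum is zero. The sketch below assumes the pool configuration is available as a Fair Scheduler allocation file using `<queue>` and `<maxResources>` elements in the "X mb, Y vcores" form. The sample XML, its values, and the function name `zero_capacity_queues` are all made up for illustration; other `maxResources` formats (for example percentages) are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical allocation file; the queue names mirror those in the log,
# the numbers are invented.
SAMPLE = """<allocations>
  <queue name="root">
    <queue name="users">
      <queue name="root">
        <maxResources>0 mb, 0 vcores</maxResources>
      </queue>
      <queue name="spark">
        <maxResources>102400 mb, 64 vcores</maxResources>
      </queue>
    </queue>
  </queue>
</allocations>"""

MAX_RES = re.compile(r"(\d+)\s*mb\s*,\s*(\d+)\s*vcores", re.IGNORECASE)


def zero_capacity_queues(xml_text):
    """Return full names of queues whose maxResources has 0 mb or 0 vcores."""
    found = []

    def walk(node, prefix):
        for queue in node.findall("queue"):  # direct children only
            name = queue.get("name", "?")
            path = f"{prefix}.{name}" if prefix else name
            text = queue.findtext("maxResources")
            match = MAX_RES.fullmatch(text.strip()) if text else None
            if match and (int(match.group(1)) == 0 or int(match.group(2)) == 0):
                found.append(path)
            walk(queue, path)

    walk(ET.fromstring(xml_text), "")
    return found


if __name__ == "__main__":
    print(zero_capacity_queues(SAMPLE))
```

Any queue it reports cannot accept a 2048 MB, 1 vcore ApplicationMaster request like the one rejected in the log.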

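3. The bug

The 6.3.2 release notes list only one entry, and it does not mention this YARN failure.

The 6.3.4 release notes, under Fixed Issues, include an entry that reads, in translation:

YARN ResourceManager stays in standby state after a failover or startup

So 6.3.4 fixes a YARN failure that the 6.3.2 notes never mentioned. Draw your own conclusions.

Reference: CDH 6 Release Notes | 6.x | Cloudera Documentation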
