A Record of an Outage Caused by the YARN Resource Reservation Mechanism

leerelisten

Last modified 2024-07-05 16:58:32


Tags: big data, hadoop, yarn

First published 2024-07-05 16:57:45

Article link: https://blog.csdn.net/leerelisten/article/details/140213582


1. Symptoms

NM logs

NM log on node 001

2023-12-04 00:24:37,750 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0xff8c2e55cb370042, likely server has closed socket, closing socket connection and attempting reconnect
2023-12-04 00:24:37,850 INFO org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED
2023-12-04 00:24:38,600 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server bigdata-srv002/192.168.24.167:2181. Will not attempt to authenticate using SASL (unknown error)
2023-12-04 00:24:38,603 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.22.25:49430, server: bigdata-srv002/192.168.24.167:2181
2023-12-04 00:24:38,608 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server bigdata-srv002/192.168.24.167:2181, sessionid = 0xff8c2e55cb370042, negotiated timeout = 60000
2023-12-04 00:24:38,608 INFO org.apache.curator.framework.state.ConnectionStateManager: State change: RECONNECTED
2023-12-04 00:30:59,038 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0xff8c2e55cb370042, likely server has closed socket, closing socket connection and attempting reconnect
2023-12-04 00:30:59,039 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0xff8c2e4a598d0050, likely server has closed socket, closing socket connection and attempting reconnect
2023-12-04 00:30:59,140 INFO org.apache.curator.framework.state.ConnectionStateManager: State change: SUSPENDED
2023-12-04 00:30:59,144 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2023-12-04 00:30:59,144 WARN org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService: Lost contact with Zookeeper. Transitioning to standby in 60000 ms if connection is not reestablished.
2023-12-04 00:30:59,263 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server bigdata-srv003/192.168.22.26:2181. Will not attempt to authenticate using SASL (unknown error)
2023-12-04 00:30:59,264 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.22.25:38312, server: bigdata-srv003/192.168.22.26:2181
2023-12-04 00:30:59,269 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server bigdata-srv003/192.168.22.26:2181, sessionid = 0xff8c2e4a598d0050, negotiated timeout = 60000
2023-12-04 00:30:59,270 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2023-12-04 00:30:59,274 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/run/cloudera-scm-agent/process/1368-yarn-RESOURCEMANAGER/yarn-site.xml
2023-12-04 00:30:59,278 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn     OPERATION=refreshAdminAcls      TARGET=AdminService     RESULT=SUCCESS
2023-12-04 00:30:59,278 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in standby state
2023-12-04 00:30:59,278 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn     OPERATION=transitionToStandby   TARGET=RM       RESULT=SUCCESS
2023-12-04 00:30:59,522 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server bigdata-srv001/192.168.22.25:2181. Will not attempt to authenticate using SASL (unknown error)
2023-12-04 00:30:59,523 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.22.25:35044, server: bigdata-srv001/192.168.22.25:2181
2023-12-04 00:30:59,523 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server bigdata-srv001/192.168.22.25:2181, sessionid = 0xff8c2e55cb370042, negotiated timeout = 60000
2023-12-04 00:30:59,524 INFO org.apache.curator.framework.state.ConnectionStateManager: State change: RECONNECTED
2023-12-04 05:11:19,266 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0xff8c2e4a598d0050, likely server has closed socket, closing socket connection and attempting reconnect
2023-12-04 05:11:19,367 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2023-12-04 05:11:19,367 WARN org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService: Lost contact with Zookeeper. Transitioning to standby in 60000 ms if connection is not reestablished.
2023-12-04 05:11:19,677 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server bigdata-srv001/192.168.22.25:2181. Will not attempt to authenticate using SASL (unknown error)
2023-12-04 05:11:19,678 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.22.25:56986, server: bigdata-srv001/192.168.22.25:2181
2023-12-04 05:11:19,679 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server bigdata-srv001/192.168.22.25:2181, sessionid = 0xff8c2e4a598d0050, negotiated timeout = 60000

ZKFC logs

2023-12-04 00:23:47,678 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0xff8c2e4a5a6e0000, closing socket connection and attempting reconnect
2023-12-04 00:23:47,785 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2023-12-04 00:23:48,387 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server bigdata-srv001/192.168.22.25:2181. Will not attempt to authenticate using SASL (unknown error)
2023-12-04 00:23:48,388 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.22.25:55668, server: bigdata-srv001/192.168.22.25:2181
2023-12-04 00:23:48,393 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server bigdata-srv001/192.168.22.25:2181, sessionid = 0xff8c2e4a5a6e0000, negotiated timeout = 10000
2023-12-04 00:23:48,393 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2023-12-04 00:23:48,394 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2023-12-04 00:23:48,396 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a036e7331120a6e616d656e6f646533391a0e626967646174612d73727630303120d63e28d33e
2023-12-04 00:23:48,396 INFO org.apache.hadoop.ha.ActiveStandbyElector: But old node has our own data, so don't need to fence it.
2023-12-04 00:23:48,396 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/ns1/ActiveBreadCrumb to indicate that the local node is the most recent active...
2023-12-04 00:23:48,600 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at bigdata-srv001/192.168.22.25:8022 active...
2023-12-04 00:23:48,678 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at bigdata-srv001/192.168.22.25:8022 to active state
2023-12-04 05:11:19,268 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 8714ms for sessionid 0xff8c2e4a5a6e0000, closing socket connection and attempting reconnect
2023-12-04 05:11:19,467 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2023-12-04 05:11:20,239 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server bigdata-srv003/192.168.22.26:2181. Will not attempt to authenticate using SASL (unknown error)
2023-12-04 05:11:20,243 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.22.25:60442, server: bigdata-srv003/192.168.22.26:2181
2023-12-04 05:11:20,246 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server bigdata-srv003/192.168.22.26:2181, sessionid = 0xff8c2e4a5a6e0000, negotiated timeout = 10000
2023-12-04 05:11:20,247 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2023-12-04 05:11:20,248 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2023-12-04 05:11:20,257 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a036e7331120a6e616d656e6f646533391a0e626967646174612d73727630303120d63e28d33e
2023-12-04 05:11:20,257 INFO org.apache.hadoop.ha.ActiveStandbyElector: But old node has our own data, so don't need to fence it.
2023-12-04 05:11:20,257 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/ns1/ActiveBreadCrumb to indicate that the local node is the most recent active...
2023-12-04 05:11:20,264 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at bigdata-srv001/192.168.22.25:8022 active...
2023-12-04 05:11:20,272 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at bigdata-srv001/192.168.22.25:8022 to active state
2023-12-04 07:56:34,332 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0xff8c2e4a5a6e0000, closing socket connection and attempting reconnect
2023-12-04 07:56:34,435 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2023-12-04 07:56:35,208 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server bigdata-srv002/192.168.24.167:2181. Will not attempt to authenticate using SASL (unknown error)
2023-12-04 07:56:35,211 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.22.25:43000, server: bigdata-srv002/192.168.24.167:2181
2023-12-04 07:56:35,214 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server bigdata-srv002/192.168.24.167:2181, sessionid = 0xff8c2e4a5a6e0000, negotiated timeout = 10000
2023-12-04 07:56:35,214 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
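
The repeated "Client session timed out" entries above are the quickest signal that the host was too starved to keep its ZooKeeper session alive. A minimal sketch of pulling those events out of a ZKFC log is shown below. The function name `find_session_timeouts` and the regular expression are illustrative assumptions, written against the line format shown in the excerpt rather than any official log schema.

```python
import re
from datetime import datetime

# Matches the ZooKeeper client line that reports a missed heartbeat window.
# The pattern is an assumption based on the log excerpt above.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO "
    r"org\.apache\.zookeeper\.ClientCnxn: Client session timed out, "
    r"have not heard from server in (?P<ms>\d+)ms "
    r"for sessionid (?P<sid>0x[0-9a-f]+)"
)


def find_session_timeouts(lines):
    """Return (timestamp, silent_ms, session_id) for every timeout line."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, int(match.group("ms")), match.group("sid")))
    return events


# Two lines copied from the ZKFC excerpt: only the first is a timeout.
sample = [
    "2023-12-04 00:23:47,678 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6669ms for sessionid 0xff8c2e4a5a6e0000, closing socket connection and attempting reconnect",
    "2023-12-04 00:23:47,785 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...",
]
for ts, silent_ms, session_id in find_session_timeouts(sample):
    print(ts.isoformat(), silent_ms, session_id)
```

Lining up the extracted timestamps with host memory graphs is one way to check whether the disconnects coincide with memory exhaustion.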

NN logs

/user/bd/trTrack/2023-12-04/05/2023-12-04-05.1701639210861.log.tmp
2023-12-04 05:33:30,883 INFO org.apache.hadoop.ipc.Server: IPC Server handler 10 on 8020, call Call#698766 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 192.168.22.31:39550
java.io.IOException: File /user/bd/trTrack/2023-12-04/05/2023-12-04-05.1701639210861.log.tmp could only be written to 0 of the 1 minReplication nodes. There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2102)
        at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2673)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:872)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:550)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
2023-12-04 05:33:31,017 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.  Holder: DFSClient_NONMAPREDUCE_1236272352_27, pending creates: 1], src=/user/bd/adminTrack/2023-12-04/05/2023-12-04-05.1701639202987.log.tmp from client DFSClient_NONMAPREDUCE_1236272352_27
  
  
   org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: StorageInfo TreeSet fill ratio

2. Root Cause Analysis

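On December 2, DolphinScheduler had just been upgraded and was running in parallel with the Azkaban scheduler, so a large number of tasks were submitted within a short period. Because DolphinScheduler had been upgraded from a single node to multiple nodes and no task concurrency limit had been configured, the YARN queue filled up. The queue's scheduler was the Fair Scheduler, which has a Memory Reserved mechanism, i.e. resource reservation.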

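When the tasks hung, host memory was exhausted at the same time. Investigation showed that, because of resource reservation, the YARN queue was fully occupied yet nothing was actually running: applications sat in a hung state, waiting for resources. Meanwhile, host memory was gradually consumed by the sqoop tasks launched by DolphinScheduler. Each sqoop task waits for its YARN job to finish before releasing host resources, and YARN, we suspect, was in turn waiting for the sqoop tasks to release host resources before it could run those jobs. The result was a circular dependency: host memory usage eventually reached 100%, Hadoop components began to fail, and the cluster went down.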
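
To make the "occupied but not running" state concrete, here is a deliberately simplified toy model of reservation. It is not the Fair Scheduler's actual algorithm: the `Node` class, the `schedule` function, the node sizes, and the request ids are all made up for illustration. The only behavior it borrows is the general idea that a request which does not fit can place a hold on a node, and memory under such a hold is not handed to other requests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_gb: int
    reserved_for: Optional[str] = None  # id of the request holding this node


def schedule(nodes, requests, reservations_enabled):
    """Toy scheduler: requests are (request_id, mem_gb) in arrival order.

    Returns the list of (request_id, node_name) placements that started.
    """
    placed = []
    for req_id, mem_gb in requests:
        # Try the first node that is not reserved and has enough free memory.
        target = next(
            (n for n in nodes if n.reserved_for is None and n.free_gb >= mem_gb),
            None,
        )
        if target is not None:
            target.free_gb -= mem_gb
            placed.append((req_id, target.name))
        elif reservations_enabled:
            # Nothing fits: hold the first unreserved node for this request.
            # Its free memory is now idle but unavailable to anyone else.
            holder = next((n for n in nodes if n.reserved_for is None), None)
            if holder is not None:
                holder.reserved_for = req_id
    return placed


def make_nodes():
    # Hypothetical cluster: three nodes with 3 GB free each.
    return [Node("n1", 3), Node("n2", 3), Node("n3", 3)]


# Three 4 GB requests that fit nowhere, followed by two 2 GB requests.
requests = [("big-1", 4), ("big-2", 4), ("big-3", 4), ("small-1", 2), ("small-2", 2)]

with_reservation = schedule(make_nodes(), requests, reservations_enabled=True)
without_reservation = schedule(make_nodes(), requests, reservations_enabled=False)
print("with reservation:", with_reservation)
print("without reservation:", without_reservation)
```

In this toy run the large requests reserve every node, so the small requests that would otherwise fit never start, even though 9 GB is nominally free. That mirrors the symptom described above, where the queue looked full while nothing made progress.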

3. Solution

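Nodes 2 and 3 were configured as master-server nodes with exec-threads set to 10 on each, so up to 20 workflows can run concurrently.
Nodes 1 and 2 were configured as worker-server nodes with exec-threads set to 15 on each, so up to 30 tasks can run concurrently.
In addition, the YARN queue was limited to a maximum of 5 applications, which resolved the problem.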
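
The limits above can be sanity-checked with a few lines of Python. The sketch below sums the per-node exec-threads values and renders a Fair Scheduler allocation fragment that caps running applications. Two things here are assumptions rather than facts from this incident: the queue name `etl` is hypothetical, and `maxRunningApps` is assumed to be the element used to enforce the limit of 5 (it is a standard element of the Fair Scheduler allocation file, but the article does not say which setting was changed).

```python
import xml.etree.ElementTree as ET

# exec-threads per node, as described in the solution above.
master_servers = {"node2": 10, "node3": 10}
worker_servers = {"node1": 15, "node2": 15}


def total_slots(exec_threads_by_node):
    """Total concurrent slots across all nodes of one role."""
    return sum(exec_threads_by_node.values())


def fair_scheduler_queue_xml(queue_name, max_running_apps):
    """Render a minimal allocation-file fragment capping running apps."""
    allocations = ET.Element("allocations")
    queue = ET.SubElement(allocations, "queue", name=queue_name)
    ET.SubElement(queue, "maxRunningApps").text = str(max_running_apps)
    return ET.tostring(allocations, encoding="unicode")


print("concurrent workflows:", total_slots(master_servers))
print("concurrent tasks:", total_slots(worker_servers))
# "etl" is a hypothetical queue name used only for illustration.
print(fair_scheduler_queue_xml("etl", 5))
```

Capping both the scheduler side (exec-threads) and the YARN side (running applications per queue) means a burst of submissions waits in line instead of piling up client processes on the hosts.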

4. Summary

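This outage was caused mainly by the YARN queue using fair-scheduler, whose resource reservation mechanism left the queue occupied but idle. Going forward, switching to capacity-scheduler could be considered as a way to avoid this kind of problem.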
