故障发现经过
当时有业务方反馈,新的flink任务无法提交,而我们的flink以on yarn 的模式跑在hadoop yarn上,根据客户端日志发现任务没有提交成功,然后去查看RM日志
RM日志
2019-08-05 16:11:02,101 WARN org.apache.zookeeper.ClientCnxn: Session 0x269ab678e96f411 for server 10.60.96.130/10.60.96.130:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:117)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2019-08-05 16:11:02,302 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /bi-rmstore/ZKRMStateRoot/RMAppRoot/application_1551528254601_0725/appattempt_1551528254601_0725_000001
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$9.run(ZKRMStateStore.java:1058)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$9.run(ZKRMStateStore.java:1055)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1128)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1161)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:1055)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:712)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:243)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:226)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:812)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:872)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:867)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
at java.lang.Thread.run(Thread.java:745)
2019-08-05 16:11:02,302 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying opertion on ZK. Retry on. 63
从日志可以很明显的看出来是与zk断链导致的问题,然后首先怀疑是不是zk出了什么问题
很自然的我们会通过客户端去链接zk,然后发现zk是好的,没啥大问题,WHAT? zk好的为什么RM会跟zk断链,带着这个疑问,我们去查看了zk日志
通过zk日志我们也发现了很多 Closed socket connection 的关键字,很明显链接被断开了
ZK日志
2019-08-05 16:12:02,103 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@380] - Exception causing close of session 0x2790d078e58be8b due to java.io.IOException: Len error 1899056
2019-08-05 16:12:02,103 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1091] - Closed socket connection for client /10.216.2.25:41586 which had sessionid 0x2790d078e58be8b
日志上有一个很明显的 Len error 1899056,看来这里心里就大概有数了
zk数据通信默认有一个限制,一次请求不能超过1M,这个值一看就超过了1M,正式因为这个,导致链接被断开,通信失败
那么为什么请求会那么大呢?居然有1M多,原因是这个yarn集群的配置中将RMSTATE存在了zk中,而当任务比较多的时候,zk中的数据也会变得越来越多,解决这个问题也很简单,就是把1M改大一些,不过这个需要客户端和服务端一起改,而且必须保证一致
-
ResouceManager (ZK客户端)
hadoop集群修改 yarn-env.sh 对参数YARN_OPTS 追加 - Djute.maxbuffer=41943040 -
flink客户端 (ZK客户端)
修改flink客户端中的 flink-conf.yaml 新加一行
env.java.opts: -Djute.maxbuffer=41943040 -
ZK服务端
zk集群修改 zkEnv.sh 对参数 JVMFLAGS 追加 -Djute.maxbuffer=41943040
然后重启zk, 重启rm后,恢复正常
后续
本次故障处理过程中,并未修改yarn配置,RMSTATE仍然存放在zk中,虽然这次的问题解决了,但也给未来挖了一个坑,半年之后由于这个又产生一个新的故障,这个后面我会再写一篇来描述,有兴趣的也可以看看。