【概述】
对于配置了HA模式的RM或者NN,客户端如果向standby的节点发送请求,会因为不可连接或standby拒绝提供服务导致请求失败,转而向Active的节点发送请求,这个转换是hadoop客户端内部自动完成的,无须上层业务感知(本质上是向其中一个节点发送请求,如果失败则继续向另外一个节点发送请求)。
上周排查了一个相关的问题,在集群正常的情况下,向两个节点发送请求都失败,并且是持续失败,从而陷入死循环。最后发现是hadoop内部RPC机制的问题,并且在2.X版本中,该问题都是存在的。本文就来聊聊这个问题。
【问题现象】
某天,上层业务部分的兄弟反馈了一个问题,其现象是yarn client请求某个应用(application)的状态失败。
了解到问题现象后,首先查看了两个RM的日志,并未发现有什么错误的日志信息;接着通过命令行与yarn client分别尝试获取了"有问题"application的状态,发现也都是可以正确获取到的。
再次与该兄弟沟通后发现只有该application有问题,其他application都能正确获取到。同时给出了该application获取时的报错信息
22/06/20 20:48:06 DEBUG ipc.Client: IPC Client (1291113768) connection to resourcemanager-0/172.16.55.7:8032 from 28573: closed
22/06/20 20:48:06 TRACE ipc.ProtobufRpcEngine: 1: Exception <- resourcemanager-0/172.16.55.7:8032: getApplicationReport {java.net.ConnectException: Call From pvs285731713/10.33.72.132 to resourcemanager-0:8032 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused}
22/06/20 20:48:06 TRACE retry.RetryInvocationHandler: Call#-2: ApplicationBaseProtocol.getApplicationReport([application_id { id: 1 cluster_timestamp: 1655720645233 }])
java.net.ConnectException: Call From pvs285731713/10.33.72.132 to resourcemanager-0:8032 failed on connection exception: java.net.ConnectException: Connection refused: no further information; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:827)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1553)
at org.apache.hadoop.ipc.Client.call(Client.java:1495)
at org.apache.hadoop.ipc.Client.call(Client.java:1394)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy7.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getAp