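A record of a production incident: the Spark driver failed to connect to the ResourceManager (RM) and kept retrying the connection. The errors were as follows: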
21/04/16 17:00:05 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "prod-hadoop01/172.19.51.11"; destination host is: "prod-hadoop03":8032; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking $Proxy16.getApplicationReport over Failover proxy for [rm1, rm2]. Trying to failover immediately.
21/04/16 17:00:05 INFO RequestHedgingRMFailoverProxyProvider: Connection lost with rm2, trying to fail over.
21/04/16 17:00:05 INFO RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
21/04/16 17:00:05 INFO RetryInvocationHandler: java.net.ConnectException: Call From prod-hadoop01/172.19.51.11 to prod-hadoop03:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getApplicationReport over null. Retrying after sleeping for 15000ms.
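Because this was production and customers were actively using the feature, the problem had to be located quickly. The key clue was "prod-hadoop03":8032: according to the log, the calls went from prod-hadoop01 to prod-hadoop03 on port 8032, which is the default ResourceManager client RPC port (yarn.resourcemanager.address). The first call failed with an EOFException and the retry with "Connection refused", so the likely cause was a ResourceManager HA switchover. I went into Ambari, stopped the RM, and started it again, which moved the active ResourceManager to prod-hadoop01 and resolved the production issue for the time being.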
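Before restarting anything, it can help to confirm which ResourceManager is actually active. The following is a minimal Python sketch, not the procedure used during this incident. It wraps the `yarn rmadmin -getServiceState <rm-id>` command and assumes the RM ids are `rm1` and `rm2`, as they appear in the log above. The function names `get_rm_state` and `find_active_rm`, the injectable `runner` parameter, and the choice to report any non-zero exit code as "unreachable" are illustrative assumptions.

```python
import subprocess


def get_rm_state(rm_id, runner=subprocess.run):
    """Return the HA state reported for one RM id, e.g. 'active' or 'standby'.

    Assumption: a non-zero exit code (or a missing `yarn` binary) is
    reported as 'unreachable'; the real CLI may fail in other ways.
    """
    try:
        result = runner(
            ["yarn", "rmadmin", "-getServiceState", rm_id],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # The yarn CLI is not on PATH on this machine.
        return "unreachable"
    if result.returncode != 0:
        return "unreachable"
    lines = result.stdout.strip().splitlines()
    # The state is expected on the last non-empty line of stdout.
    return lines[-1].strip() if lines else "unknown"


def find_active_rm(rm_ids, runner=subprocess.run):
    """Query every RM id and return (all states, list of active ids)."""
    states = {rm_id: get_rm_state(rm_id, runner) for rm_id in rm_ids}
    active = [rm_id for rm_id, state in states.items() if state == "active"]
    return states, active
```

On a cluster node with the YARN client configured, `find_active_rm(["rm1", "rm2"])` would return the state of each RM and the ids currently reporting `active`. An empty active list, or an RM reported as unreachable, would be consistent with the failover loop seen in the log.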