Resolving the Two-Standby NameNode Problem

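Symptoms

Today the NameNodes in the test environment exited after an excessively long GC pause. After they were restarted one after the other, no active node could be elected and both NameNodes stayed in standby.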
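
To confirm the symptom from the command line, each NameNode's HA state can be queried with `hdfs haadmin -getServiceState`. The following is a minimal sketch, not the procedure used in the original incident. The helper names `ha_states` and `has_active` are made up for illustration, and the service IDs passed to them (for example `nn1` and `nn2`) are assumptions that must match `dfs.ha.namenodes.<nameservice>` in your hdfs-site.xml.

```shell
# Print the HA state of every NameNode service ID given as an argument.
# A NameNode that cannot be reached is reported as "unreachable".
ha_states() {
  for nn in "$@"; do
    state=$(hdfs haadmin -getServiceState "$nn" 2>/dev/null) || state="unreachable"
    printf '%s: %s\n' "$nn" "$state"
  done
}

# Exit status 0 only when at least one of the given NameNodes reports "active".
has_active() {
  ha_states "$@" | grep -q ': active$'
}
```

On a cluster in the state described here, `ha_states nn1 nn2` would be expected to report standby for both IDs, and `has_active nn1 nn2` would return a non-zero exit status.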

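Investigation

  1. The logs contained no entries related to ZooKeeper (zk) election.
  2. The zkfc process log had stopped several hours before the problem occurred.
    1. Log excerpt: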
2019-08-26 10:47:57,925 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at namenodetest02.bi.10101111.com/10.104.104.128:9001: Call From namenodetest02.bi/10.104.104.128 to namenodetest02.bi.10101111.com:9001 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-08-26 10:47:59,947 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodetest02.bi.10101111.com/10.104.104.128:9001. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2019-08-26 10:47:59,964 WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception trying to monitor health of NameNode at namenodetest02.bi.10101111.com/10.104.104.128:9001: Call From namenodetest02.bi/10.104.104.128 to namenodetest02.bi.10101111.com:9001 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2019-08-26 10:48:04,951 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2019-08-26 10:48:04,965 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at namenodetest02.bi.10101111.com/10.104.104.128:9001 entered state: SERVICE_HEALTHY
2019-08-26 10:48:05,174 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.101.22.31:5181,10.104.108.87:5181,10.104.108.88:5181 sessionTimeout=60000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@726d9b25
2019-08-26 10:48:05,963 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.101.22.31/10.101.22.31:5181. Will not attempt to authenticate using SASL (unknown error)
2019-08-26 10:48:09,270 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection timed out
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2019-08-26 10:48:09,838 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.104.108.88/10.104.108.88:5181. Will not attempt to authenticate using SASL (unknown error)
2019-08-26 10:48:09,839 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.104.108.88/10.104.108.88:5181, initiating session
2019-08-26 10:48:10,281 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.104.108.88/10.104.108.88:5181, sessionid = 0x26c9ac7569130e4, negotiated timeout = 60000
2019-08-26 10:48:10,593 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2019-08-26 10:48:10,670 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2019-08-26 10:48:10,681 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0e6861646f6f7032636c757374657212036e6e311a1e6e616d656e6f64657465737430312e62692e31303130313131312e636f6d20a94628d33e
2019-08-26 10:48:11,508 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at namenodetest01.bi.10101111.com/10.104.104.127:9001
2019-08-26 10:48:12,941 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodetest01.bi.10101111.com/10.104.104.127:9001. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2019-08-26 10:48:12,957 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at namenodetest01.bi.10101111.com/10.104.104.127:9001 standby (unable to connect)
java.net.ConnectException: Call From namenodetest02.bi/10.104.104.128 to namenodetest01.bi.10101111.com:9001 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
	at org.apache.hadoop.ipc.Client.call(Client.java:1473)
	at org.apache.hadoop.ipc.Client.call(Client.java:1400)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
	at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
	at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
	at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
	at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
	at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
	at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
	at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902)
	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801)
	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
	at org.apache.hadoop.ipc.Client.call(Client.java:1439)
	... 14 more
2019-08-26 10:48:13,056 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2019-08-26 10:48:13,062 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2019-08-26 10:48:13,428 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to namenodetest01.bi.10101111.com...
2019-08-26 10:48:13,483 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to namenodetest01.bi.10101111.com port 22
2019-08-26 10:48:13,904 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connection established
2019-08-26 10:48:14,170 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Remote version string: SSH-2.0-OpenSSH_5.3
2019-08-26 10:48:14,170 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local version string: SSH-2.0-JSCH-0.1.42
2019-08-26 10:48:14,170 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: CheckCiphers: aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
2019-08-26 10:48:14,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-ctr is not available.
2019-08-26 10:48:14,700 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-ctr is not available.
2019-08-26 10:48:14,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-cbc is not available.
2019-08-26 10:48:14,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-cbc is not available.
2019-08-26 10:48:14,723 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: arcfour256 is not available.
2019-08-26 10:48:14,911 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT sent
2019-08-26 10:48:14,911 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT received
2019-08-26 10:48:14,919 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: server->client aes128-ctr hmac-md5 none
2019-08-26 10:48:14,919 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: client->server aes128-ctr hmac-md5 none
2019-08-26 10:48:15,344 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXDH_INIT sent
2019-08-26 10:48:15,344 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: expecting SSH_MSG_KEXDH_REPLY
2019-08-26 10:48:15,407 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: ssh_rsa_verify: signature true
2019-08-26 10:48:15,465 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Permanently added 'namenodetest01.bi.10101111.com' (RSA) to the list of known hosts.
2019-08-26 10:48:15,465 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS sent
2019-08-26 10:48:15,465 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS received
2019-08-26 10:48:15,496 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_REQUEST sent
2019-08-26 10:48:15,496 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_ACCEPT received
2019-08-26 10:48:15,497 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: publickey,keyboard-interactive,password
2019-08-26 10:48:15,497 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: publickey
2019-08-26 10:48:15,931 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentication succeeded (publickey).
2019-08-26 10:48:15,967 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to namenodetest01.bi.10101111.com
2019-08-26 10:48:15,967 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 9001

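  3. This fits with earlier incidents: the NameNode had previously exited for lack of memory, and after the NN heap size was increased, other processes on the host would exit.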
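
The `Old node exists:` entry in the zkfc log above carries a hex dump of the data stored in the ZooKeeper lock node. As a rough aid for reading it, the sketch below decodes the hex and blanks out every byte that is not printable ASCII. `decode_znode_hex` is a hypothetical helper, and the payload appears to be a serialized record rather than plain text, so only the embedded strings are meaningful.

```shell
# Decode a hex string and replace non-printable bytes with spaces.
decode_znode_hex() {
  # One byte (two hex digits) per line, each emitted through an octal escape.
  printf '%s\n' "$1" | fold -w 2 | while read -r byte; do
    printf '%b' "\\0$(printf '%o' "0x$byte")"
  done | LC_ALL=C tr -c '[:print:]' ' '
  echo
}

# Usage with the payload copied from the log line above:
decode_znode_hex 0a0e6861646f6f7032636c757374657212036e6e311a1e6e616d656e6f64657465737430312e62692e31303130313131312e636f6d20a94628d33e
```

The readable fragments include `hadoop2cluster`, `nn1`, and `namenodetest01.bi.10101111.com`, which is consistent with the following `Should fence` line naming namenodetest01 as the previous active node.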

Analysis:

The zkfc process probably hung because the NameNode host ran short of memory, so it could no longer elect an active node.
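
One hint of a stalled zkfc, as seen during the investigation, is a log file that stopped being written even though NameNode restarts should have produced output. A minimal sketch for checking that follows, assuming a `find` that supports `-mmin` (the GNU and BSD versions do). `log_is_stale` is a hypothetical helper, and a healthy but idle zkfc can also be quiet, so treat the result as a hint only.

```shell
# Exit status 0 when FILE was last modified more than MINUTES minutes ago.
# A missing or unreadable file is reported as "not stale" (non-zero status).
log_is_stale() {
  [ -n "$(find "$1" -mmin +"$2" 2>/dev/null)" ]
}
```

For example, `log_is_stale "$HADOOP_LOG_DIR/hadoop-hadoop-zkfc-namenodetest02.bi.log" 60 && echo "zkfc log idle for over an hour"`. The path here is an assumption based on the usual `hadoop-<user>-zkfc-<host>.log` naming and will differ per installation.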

Fix:

Restart the zkfc process.

Commands:

hadoop-daemon.sh stop zkfc

hadoop-daemon.sh start zkfc

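Follow-up: the same symptom again

The test environment again ended up with both NameNodes in standby. This time, restarting zkfc on each of the two nodes did not get either NameNode to switch to active.

Fix 2:

Stopping the NN on one node and then restarting zkfc on the other node resolved it.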
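
The sequence for this second fix can be sketched as follows. This is an illustration under assumptions, not the exact commands that were run: `recover_dual_standby` is a hypothetical helper, it assumes passwordless ssh between the hosts, and it assumes `hadoop-daemon.sh` is on the PATH of a non-interactive ssh session. Setting `RUN=echo` prints the commands instead of executing them.

```shell
# $1: host whose NameNode is stopped; $2: host whose zkfc is restarted.
# Each step runs only if the previous one succeeded.
recover_dual_standby() {
  stop_host=$1
  zkfc_host=$2
  ${RUN:-} ssh "$stop_host" 'hadoop-daemon.sh stop namenode' &&
  ${RUN:-} ssh "$zkfc_host" 'hadoop-daemon.sh stop zkfc' &&
  ${RUN:-} ssh "$zkfc_host" 'hadoop-daemon.sh start zkfc'
}
```

A dry run such as `RUN=echo; recover_dual_standby namenodetest01.bi.10101111.com namenodetest02.bi.10101111.com` lists the three commands without touching the cluster. Which of the two hosts had its NameNode stopped is not recorded here, so the argument order in that example is arbitrary.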

Related error output:

pported in state standby
2019-10-17 17:06:02,866 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log roll on remote NameNode namenodetest01.bi.10101111.com/10.
2019-10-17 17:06:02,871 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable to trigger a roll of the active NN
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category JOURNAL is not supported in state standby
	at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
	at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1722)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1352)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:6369)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.rollEditLog(NameNodeRpcServer.java:989)
	at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolServerSideTranslatorPB.rollEditLog(NamenodeProtocolServerSideTranslatorPB.java:142)
	at org.apache.hadoop.hdfs.protocol.proto.NamenodeProtocolProtos$NamenodeProtocolService$2.callBlockingMethod(NamenodeProtocolProtos.java:12025)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
--
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)

	at org.apache.hadoop.ipc.Client.call(Client.java:1469)
	at org.apache.hadoop.ipc.Client.call(Client.java:1400)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
	at com.sun.proxy.$Proxy21.rollEditLog(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.NamenodeProtocolTranslatorPB.rollEditLog(NamenodeProtocolTranslatorPB.java:148)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:271)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$600(EditLogTailer.java:61)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:313)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:412)
	at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
2019-10-17 17:06:03,881 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user hadoop

The root cause has not been identified yet.
