Scenario
In a three-node cluster with two NameNodes, one NameNode (nn1) goes down. Manually promoting the other NameNode to active fails: hdfs haadmin -transitionToActive nn2
21/02/27 00:24:16 INFO ipc.Client: Retrying connect to server: hadoop01/192.168.26.10:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
Unexpected error occurred Call From hadoop02/192.168.26.20 to hadoop01:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Usage: haadmin [-transitionToActive [--forceactive] <serviceId>]
Cause analysis:
Before promoting a node to active, the command connects to every NameNode in the nameservice to confirm that none of them is currently active (to prevent split-brain). Because nn1 is unreachable, that check cannot complete and the transition fails. In this situation the promotion can be forced.
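Before forcing anything, you can confirm the situation by querying each NameNode's state. A minimal check, assuming the NameNodes are registered as nn1 and nn2 as above:

hdfs haadmin -getServiceState nn1    # fails with "Connection refused" because nn1 is down
hdfs haadmin -getServiceState nn2    # prints "standby"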
Solution 1:
Force the transition: hdfs haadmin -transitionToActive --forceactive nn2
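After forcing the transition, a quick sanity check (same assumptions as above) confirms the promotion took effect:

hdfs haadmin -getServiceState nn2    # should now print "active"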
Solution 2:
Configure automatic failover. A ZKFC (ZKFailoverController) daemon runs on each NameNode host. Each ZKFC races to create an ephemeral znode under a designated path in ZooKeeper and holds that lock; the NameNode whose ZKFC wins becomes active. Each ZKFC keeps monitoring the health of its local NameNode. If that NameNode dies or hangs, its ZKFC releases the lock (the ephemeral znode is deleted), the remaining ZKFCs race to create the znode again, and the winner's NameNode becomes active. Before taking over, the new active fences the old NameNode (for example by killing its process), so that a NameNode that was merely hung cannot come back and leave two active NameNodes, i.e. split-brain.
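With automatic failover enabled, start-dfs.sh brings up the ZKFC daemons along with the NameNodes, but they can also be started by hand on each NameNode host. The exact command depends on the Hadoop major version; both forms below are sketches for their respective versions:

hadoop-daemon.sh start zkfc    # Hadoop 2.x
hdfs --daemon start zkfc       # Hadoop 3.x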
Configuration in hdfs-site.xml:
<!-- Enable automatic failover -->
<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<!-- Fencing method, so that only one NameNode serves clients at any given time -->
<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<!-- sshfence requires passwordless SSH to the other NameNode host -->
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
</property>
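sshfence only works if the ZKFC can SSH from one NameNode host to the other with the configured key. A quick manual check, assuming the HDFS daemons run as the hadoop user (as the key path above suggests):

# run on hadoop02; should print "ok" without prompting for a password
ssh -i /home/hadoop/.ssh/id_rsa hadoop@hadoop01 'echo ok'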
Configuration in core-site.xml:
<property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>
Restart the cluster:
stop-dfs.sh -- stop HDFS
hdfs zkfc -formatZK -- initialize the ZKFC state in ZooKeeper; this creates a hadoop-ha znode under the ZooKeeper root
start-dfs.sh -- start the cluster; the znodes in ZooKeeper now show that the active NameNode is hadoop01 (see the check below)
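One way to see which NameNode currently holds the active lock is to inspect the znodes with the ZooKeeper CLI. A sketch, assuming the HDFS nameservice is called mycluster (substitute your own dfs.nameservices value):

zkCli.sh -server hadoop01:2181
# inside the zkCli shell:
ls /hadoop-ha                                        # lists one child per nameservice, e.g. [mycluster]
get /hadoop-ha/mycluster/ActiveStandbyElectorLock    # the data (partly binary) names the NameNode holding the lock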
Problem encountered:
2021-02-27 11:44:32,775 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to hadoop01
2021-02-27 11:44:32,775 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 9000
2021-02-27 11:44:32,968 WARN org.apache.hadoop.ha.SshFenceByTcpPort: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 9000 via ssh: bash: fuser: command not found
2021-02-27 11:44:32,968 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2021-02-27 11:44:32,968 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from hadoop01 port 22
2021-02-27 11:44:32,968 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2021-02-27 11:44:32,968 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2021-02-27 11:44:32,968 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Caught an exception, leaving main loop due to Socket closed
2021-02-27 11:44:32,968 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoop01/192.168.26.10:9000
at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:921)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:820)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
The fencing failed because the fuser command is missing. If fuser is not installed, install it with:
yum install psmisc
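Note that the fencing command is executed over SSH on the old active's host, so psmisc must be installed on every NameNode host, not only the one performing the failover. A quick verification afterwards, assuming the NameNode RPC port is 9000 as in the logs above:

# after installing psmisc, confirm fuser can see the process listening on the NameNode RPC port
fuser -v -n tcp 9000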