1 Cluster Configuration and Planning
Reference for HDFS HA automatic failover configuration: https://blog.csdn.net/weixin_38023225/article/details/101346493
node        | roles
master-node | NameNode, JournalNode, DataNode, ZK, NodeManager
slave-node1 | NameNode, JournalNode, DataNode, ZK, ResourceManager, NodeManager
slave-node2 | JournalNode, DataNode, ZK, NodeManager
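For reference, the failover-related pieces of the HA configuration look roughly like the sketch below. This is a minimal fragment, not the full nameservice setup (rpc addresses, shared edits, etc. are omitted); the private-key path is an assumption based on the `caimh` user seen in the sessions later in this note.

```xml
<!-- hdfs-site.xml: enable ZKFC-driven automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<!-- fencing via SSH: this is what makes passwordless login between the
     NameNode hosts mandatory in BOTH directions -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/caimh/.ssh/id_rsa</value>  <!-- assumed path -->
</property>

<!-- core-site.xml: ZooKeeper quorum used by the ZKFCs -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>master-node:2181,slave-node1:2181,slave-node2:2181</value>
</property>
```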
2 Problem Description
I set up a 3-node cluster locally; the cluster plan is shown in the table above.
After starting the cluster, the web UI looked normal (master-node was active, slave-node1 was standby).
Verification: kill the active NameNode process.
[caimh@master-node hadoop-2.7.4]$ jps
4401 NameNode
4913 Jps
3797 JournalNode
4505 DataNode
4218 QuorumPeerMain
4847 DFSZKFailoverController
[caimh@master-node hadoop-2.7.4]$ kill -9 4401
[caimh@master-node hadoop-2.7.4]$ jps
3797 JournalNode
4505 DataNode
4218 QuorumPeerMain
4955 Jps
4847 DFSZKFailoverController
Checking the web UI again: master-node (nn1) was dead, but slave-node1 (nn2) was still standby and had not become active.
3 Cause Analysis
Inspecting the logs showed that passwordless SSH login had not been configured completely: only master-node to slave-node1 had been set up, not slave-node1 to master-node, so the nodes need mutual passwordless login. This matters because with the sshfence fencing method, the ZKFC on the standby must SSH into the failed active node to fence it before promoting its own NameNode.
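The missing direction can be checked directly. A small sketch, run on slave-node1, that probes whether key-based login to master-node works without a password prompt:

```shell
# BatchMode=yes forbids password prompts, so ssh exits non-zero immediately
# when key-based login is not set up (instead of hanging on a prompt).
result=$(ssh -o BatchMode=yes -o ConnectTimeout=5 master-node true 2>/dev/null \
           && echo OK || echo MISSING)
echo "passwordless login slave-node1 -> master-node: $result"
```

In the broken state described above, this prints MISSING on slave-node1 while the same probe toward slave-node1 from master-node prints OK.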
4 Fix
Configure passwordless login from slave-node1 to master-node:
[caimh@slave-node1 ~]$ ssh-keygen -t rsa --press Enter 4 times
[caimh@slave-node1 ~]$ ssh-copy-id master-node
[caimh@slave-node1 ~]$ ssh master-node
Last login: Wed Sep 25 06:47:36 2019 from slave-node1
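Since the ZKFC on either side may need to fence the other node, it is worth confirming passwordless login in every direction, not just the one that was broken here. A sketch (host names taken from the cluster plan above) that can be run on each node in turn:

```shell
hosts="master-node slave-node1 slave-node2"
failed=""
for h in $hosts; do
  # A non-zero exit means key-based login to $h still prompts for a password.
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true 2>/dev/null; then
    echo "passwordless login to $h: OK"
  else
    echo "passwordless login to $h: MISSING"
    failed="$failed $h"
  fi
done
# $failed now lists every host that still needs ssh-copy-id from this node.
```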
5 Verification
Stop HDFS.
Restart HDFS.
Check that nn1 is active and nn2 is standby.
Kill nn1 (active) and check that nn2 automatically becomes active: verification passed.
[caimh@master-node hadoop-2.7.4]$ sbin/stop-dfs.sh --stop hdfs
[caimh@master-node hadoop-2.7.4]$ jps
8790 Jps
6937 QuorumPeerMain
[caimh@master-node hadoop-2.7.4]$ sbin/start-dfs.sh --restart hdfs
[caimh@master-node hadoop-2.7.4]$ jps --nn1
9285 JournalNode
9045 DataNode
9527 Jps
8936 NameNode
6937 QuorumPeerMain
9453 DFSZKFailoverController
[caimh@slave-node1 hadoop-2.7.4]$ jps --nn2
4594 QuorumPeerMain
7139 DataNode
7509 Jps
7369 DFSZKFailoverController
7066 NameNode
7229 JournalNode
[caimh@master-node hadoop-2.7.4]$ kill -9 8936 --kill nn1
[caimh@master-node hadoop-2.7.4]$ bin/hdfs haadmin -getServiceState nn2 --check nn2 state; it has switched to active
active
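One follow-up worth noting: after the failover, the killed nn1 stays down until it is restarted by hand. A sketch, assuming $HADOOP_HOME points at the hadoop-2.7.4 directory used above (the default path below is an assumption, and the commands are guarded so they only run where that install exists):

```shell
HADOOP_HOME="${HADOOP_HOME:-/home/caimh/hadoop-2.7.4}"   # assumed install path
if [ -x "$HADOOP_HOME/sbin/hadoop-daemon.sh" ]; then
  # Restart the killed NameNode; it rejoins the cluster as the new standby.
  "$HADOOP_HOME/sbin/hadoop-daemon.sh" start namenode
  "$HADOOP_HOME/bin/hdfs" haadmin -getServiceState nn1   # expect: standby
fi
```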