After a long pause, I can finally continue studying big data. Today's topic is automatic failover for an HDFS cluster. Before working through this material, you should already be familiar with ZooKeeper and with HDFS HA using the Quorum Journal Manager (QJM).
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:
Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.
Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.
The ZKFailoverController (ZKFC) is a new component: a ZooKeeper client that also monitors and manages the state of the NameNode. Each machine that runs a NameNode also runs a ZKFC, which is responsible for:
Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.
ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.
ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.
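The lock znode described above can be inspected directly with the ZooKeeper CLI. As a sketch (assuming the nameservice is named mycluster, as later in this article, and ZooKeeper's default client port 2181), the election znodes live under /hadoop-ha/&lt;nameservice&gt;:

```
[zookeeper@hadoop01 ~]$ zkCli.sh -server hadoop01:2181
# List the HA coordination znodes for the nameservice; the ephemeral
# ActiveStandbyElectorLock is the lock held by the active NameNode's ZKFC
[zk: hadoop01:2181(CONNECTED) 0] ls /hadoop-ha/mycluster
[ActiveBreadCrumb, ActiveStandbyElectorLock]
[zk: hadoop01:2181(CONNECTED) 1] get /hadoop-ha/mycluster/ActiveStandbyElectorLock
```

When the active NameNode's session expires, the ephemeral lock znode disappears and the other ZKFC wins the next election.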
To configure automatic HDFS failover, the HDFS cluster must be stopped first. The detailed steps follow:
1. Edit the hdfs-site.xml file on the hadoop01 server. The complete hdfs-site.xml was pasted below; in the original post, blue text marked the content added or modified in this step.
[hadoop@hadoop01 hadoop]$ pwd
/home/hadoop/hadoop-2.7.2/etc/hadoop

<?xml version="1.0" encoding="UTF-8"?>
<!-- http://www.apache.org/licenses/LICENSE-2.0 -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!-- add start 20160712 -->
  ...
  <!-- add end 20160712 -->
  <!-- add start 20160713 -->
  ...
  <!-- add end 20160713 -->
  <!-- add start 20160623 -->
  ...
  <!-- add start 20160627 -->
  ...
</configuration>
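The property values above did not survive the repost, but per the Hadoop HA documentation, the essential hdfs-site.xml addition for this step is enabling automatic failover for the nameservice. A sketch of that fragment (the surrounding HA properties such as the nameservice and NameNode IDs are assumed to be already configured from the earlier QJM setup):

```xml
<!-- Enable automatic failover via ZKFC (standard Hadoop HA property) -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```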
2. Edit the core-site.xml file on the hadoop01 server. The complete core-site.xml was pasted below; in the original post, blue text marked the content added or modified in this step.
[hadoop@hadoop01 hadoop]$ pwd
/home/hadoop/hadoop-2.7.2/etc/hadoop

<!-- http://www.apache.org/licenses/LICENSE-2.0 -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!-- add start 20160712 -->
  ...
  <!-- add start 20160627 -->
  ...
</configuration>
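Here too the property values were lost; per the Hadoop HA documentation, the core-site.xml addition for automatic failover is the list of ZooKeeper servers the ZKFCs should use. A sketch, with the hosts taken from this article's three-node cluster and ZooKeeper's default client port 2181 assumed:

```xml
<!-- ZooKeeper quorum used by the ZKFCs for failure detection and election -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
</property>
```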
3. Copy the configured hdfs-site.xml and core-site.xml files from hadoop01 to the hadoop02 and hadoop03 servers.
[hadoop@hadoop01 hadoop]$ pwd
/home/hadoop/hadoop-2.7.2/etc/hadoop
[hadoop@hadoop01 hadoop]$ scp hdfs-site.xml core-site.xml hadoop02:$PWD
hdfs-site.xml                  100% 2973     2.9KB/s   00:00
core-site.xml                  100% 1906     1.9KB/s   00:00
[hadoop@hadoop01 hadoop]$ scp hdfs-site.xml core-site.xml hadoop03:$PWD
hdfs-site.xml                  100% 2973     2.9KB/s   00:00
core-site.xml                  100% 1906     1.9KB/s   00:00
[hadoop@hadoop01 hadoop]$
4. Start the ZooKeeper service.
# Log in to hadoop01 as the zookeeper user
[zookeeper@hadoop01 ~]$ zkServer.sh start
# Log in to hadoop02 as the zookeeper user
[zookeeper@hadoop02 ~]$ zkServer.sh start
# Log in to hadoop03 as the zookeeper user
[zookeeper@hadoop03 ~]$ zkServer.sh start
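Before moving on, it is worth confirming that each server actually joined the ensemble. `zkServer.sh status` reports the node's role; in a healthy three-node ensemble one node reports leader and the other two follower:

```
[zookeeper@hadoop01 ~]$ zkServer.sh status
Mode: follower
```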
5. Initialize the HA state in ZooKeeper by running the following command:
[hadoop@hadoop01 ~]$ hdfs zkfc -formatZK
16/07/03 10:25:13 INFO tools.DFSZKFailoverController: Failover controller configured for NameNode NameNode at hadoop01/192.168.0.201:8020
......
Proceed formatting /hadoop-ha/mycluster? (Y or N) Y
16/07/03 10:25:21 INFO ha.ActiveStandbyElector: Recursively deleting /hadoop-ha/mycluster from ZK...
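`hdfs zkfc -formatZK` creates the base znode that the failover controllers coordinate under. A quick check with the ZooKeeper CLI (run here as a one-off command; host and port assumed as above) confirms the nameservice znode now exists:

```
[zookeeper@hadoop01 ~]$ zkCli.sh -server hadoop01:2181 ls /hadoop-ha
[mycluster]
```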
6. Run the start-all.sh command to start the HDFS HA cluster; the two NameNode servers come up in Active and Standby states respectively.
[hadoop@hadoop01 ~]$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [hadoop01 hadoop02]
hadoop02: starting namenode, ......
hadoop01: starting namenode, ......
hadoop02: starting datanode, ......
hadoop01: starting datanode, ......
hadoop03: starting datanode, ......
Starting journal nodes [hadoop01 hadoop02 hadoop03]
hadoop01: starting journalnode, ......
hadoop02: starting journalnode, ......
hadoop03: starting journalnode, ......
Starting ZK Failover Controllers on NN hosts [hadoop01 hadoop02]
hadoop02: starting zkfc, ......
hadoop01: starting zkfc, ......
starting yarn daemons
starting resourcemanager, ......
hadoop01: starting nodemanager, ......
hadoop02: starting nodemanager, ......
hadoop03: starting nodemanager, ......
7. Confirm which processes are running on each of the three servers:
# Run jps on hadoop01
[hadoop@hadoop01 ~]$ jps
# Run jps on hadoop02
[hadoop@hadoop02 ~]$ jps
# Run jps on hadoop03
[hadoop@hadoop03 ~]$ jps
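Based on the daemons started in step 6, the jps output should list roughly the following JVMs per host (PIDs omitted; jps only shows the hadoop user's own processes, so the ZooKeeper QuorumPeerMain running under the zookeeper user will not appear here):

```
# hadoop01: NameNode, DataNode, JournalNode, DFSZKFailoverController,
#           ResourceManager, NodeManager, Jps
# hadoop02: NameNode, DataNode, JournalNode, DFSZKFailoverController,
#           NodeManager, Jps
# hadoop03: DataNode, JournalNode, NodeManager, Jps
```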
8. Open the web UI and check the NameNode state on each of the two servers:
The NameNode on hadoop01 is in the Active state.
The NameNode on hadoop02 is in the Standby state.
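The same states can also be checked from the command line with `hdfs haadmin`. The NameNode IDs nn1 and nn2 below are assumptions; use the values configured under dfs.ha.namenodes.mycluster in your hdfs-site.xml:

```
[hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState nn1
active
[hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState nn2
standby
```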
9. Kill the NameNode process on hadoop01 and confirm that failover is triggered automatically.
[hadoop@hadoop01 ~]$ jps | grep NameNode
nnnn NameNode
[hadoop@hadoop01 ~]$ kill -9 nnnn
10. After simulating the failure, check the NameNode state on both servers:
The hadoop01 web UI is no longer reachable.
On hadoop02, the NameNode has automatically switched to the Active state.
11. Command to start a NameNode on its own:
[hadoop@hadoop01 ~]$ hadoop-daemon.sh start namenode
starting namenode, logging to /home/hadoop/hadoop-2.7.2//logs/hadoop-hadoop-namenode-hadoop01.out
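Once the restarted NameNode on hadoop01 comes back up, it should rejoin the cluster as Standby, since hadoop02's ZKFC still holds the active lock in ZooKeeper. This can be confirmed with haadmin as before (nn1 is assumed to be hadoop01's NameNode ID):

```
[hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState nn1
standby
```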
Reposted from: https://blog.51cto.com/sjinqun/1862267