After a long pause, I can finally get back to studying big data. Today's topic is automatic failover for an HDFS cluster. Before working through this section, you should already be familiar with ZooKeeper and with HDFS HA using QJM (Quorum Journal Manager).

       

        Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS failover relies on ZooKeeper for the following things:

  • Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover should be triggered.

  • Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active. If the current active NameNode crashes, another node may take a special exclusive lock in ZooKeeper indicating that it should become the next active.

    

        The ZKFailoverController (ZKFC) is a new component which is a ZooKeeper client which also monitors and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is responsible for:

  • Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command. So long as the NameNode responds in a timely fashion with a healthy status, the ZKFC considers the node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as unhealthy.

  • ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in ZooKeeper. If the local NameNode is active, it also holds a special “lock” znode. This lock uses ZooKeeper’s support for “ephemeral” nodes; if the session expires, the lock node will be automatically deleted.

  • ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has “won the election”, and is responsible for running a failover to make its local NameNode active. The failover process is similar to the manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode transitions to active state.
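
        In ZooKeeper this state lives under /hadoop-ha/<nameservice>. As a rough illustration (the child znode names shown are what is typically seen on Hadoop 2.x, not output captured from this cluster), the lock can be inspected with the ZooKeeper CLI once the cluster configured below is up and running:

#connect to any one of the ZooKeeper servers
[zookeeper@hadoop01 ~]$ zkCli.sh -server hadoop01:2181
[zk: hadoop01:2181(CONNECTED) 0] ls /hadoop-ha/mycluster
[ActiveBreadCrumb, ActiveStandbyElectorLock]

        ActiveStandbyElectorLock is the ephemeral lock znode held by the ZKFC of the active NameNode, and ActiveBreadCrumb records the last known active NameNode so it can be fenced after an unclean failover.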


        To enable automatic HDFS failover, the HDFS cluster must be stopped first. The detailed steps are given below.
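
        If the cluster is still running from the earlier manual-HA setup, stop it before editing the configuration; a minimal sketch (stop-all.sh also stops YARN, stop-dfs.sh stops only the HDFS daemons):

#stop all HDFS and YARN daemons that were started with start-all.sh
[hadoop@hadoop01 ~]$ stop-all.sh
#or stop only the HDFS daemons
[hadoop@hadoop01 ~]$ stop-dfs.sh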


1. Edit the hdfs-site.xml file on the hadoop01 server. The complete file is pasted below; the property added for this step is the one wrapped in the <!-- add start 20160713 --> / <!-- add end 20160713 --> comments (the earlier dated comments mark changes made in previous articles).

[hadoop@hadoop01 hadoop]$ pwd
/home/hadoop/hadoop-2.7.2/etc/hadoop
[hadoop@hadoop01 hadoop]$ vi hdfs-site.xml


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <!-- add start 20160712 -->
   <property>
     <name>dfs.nameservices</name>
     <value>mycluster</value>
   </property>
   <property>
     <name>dfs.ha.namenodes.mycluster</name>
     <value>nn1,nn2</value>
   </property>

   <property>
     <name>dfs.namenode.rpc-address.mycluster.nn1</name>
     <value>hadoop01:8020</value>
   </property>
   <property>
     <name>dfs.namenode.rpc-address.mycluster.nn2</name>
     <value>hadoop02:8020</value>
   </property>

   <property>
     <name>dfs.namenode.http-address.mycluster.nn1</name>
     <value>hadoop01:50070</value>
   </property>
   <property>
     <name>dfs.namenode.http-address.mycluster.nn2</name>
     <value>hadoop02:50070</value>
   </property>

  <property>
     <name>dfs.namenode.shared.edits.dir</name>
     <value>qjournal://hadoop01:8485;hadoop02:8485;hadoop03:8485/mycluster</value>
  </property>

  <property>
     <name>dfs.client.failover.proxy.provider.mycluster</name>
     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>

  <property>
      <name>dfs.ha.fencing.methods</name>
      <value>sshfence</value>
  </property>

  <property>
      <name>dfs.ha.fencing.ssh.private-key-files</name>
      <value>/home/hadoop/.ssh/id_dsa</value>
  </property>

  <!-- add end 20160712 -->

  <!-- add start 20160713 -->
  <property>
     <name>dfs.ha.automatic-failover.enabled</name>
     <value>true</value>

  </property>

  <!-- add end 20160713 -->

    <!-- add start 20160623 -->
    <property>
            <name>dfs.replication</name>
            <!-- modify start 20160627
            <value>1</value>  -->
            <!-- modify start 20170712
            <value>2</value>-->
            <value>3</value>
            <!-- modify end 20170712-->
            <!-- modify end 20160627 -->
    </property>
    <!-- add end  20160623 -->

    <!-- add start 20160627 -->
    <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/home/hadoop/dfs/name</value>
    </property>
    <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/home/hadoop/dfs/data</value>
    </property>
    <!-- add end by  20160627 -->
</configuration>
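
        The sshfence method configured above assumes that the hadoop user can SSH from each NameNode host to the other one without a password, using the private key listed in dfs.ha.fencing.ssh.private-key-files. A quick sanity check (hostnames follow this cluster's layout; adjust as needed):

#the ZKFC on hadoop01 must be able to reach hadoop02 over SSH to fence it
[hadoop@hadoop01 ~]$ ssh -i /home/hadoop/.ssh/id_dsa hadoop02 hostname
hadoop02
#and the ZKFC on hadoop02 must be able to reach hadoop01
[hadoop@hadoop02 ~]$ ssh -i /home/hadoop/.ssh/id_dsa hadoop01 hostname
hadoop01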


2. Edit the core-site.xml file on the hadoop01 server. The complete file is pasted below; the property added for this step is the one wrapped in the <!-- add start 20160713 --> / <!-- add end 20160713 --> comments.

[hadoop@hadoop01 hadoop]$ pwd
/home/hadoop/hadoop-2.7.2/etc/hadoop
[hadoop@hadoop01 hadoop]$ vi core-site.xml


<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!--add start 20160623 -->
    <property>
            <name>fs.defaultFS</name>
            <!-- modify start 20160627
            <value>hdfs://localhost:9000</value>   -->
            <!--modify start 20160712
            <value>hdfs://hadoop01:9000</value> -->
            <value>hdfs://mycluster</value>
            <!-- modify end 20160712 -->
            <!-- modify end -->
    </property>
    <!--add end 20160623 -->

    <!-- add start 20160712 -->
    <property>
       <name>dfs.journalnode.edits.dir</name>
       <value>/home/hadoop/dfs/journaldata</value>
    </property>
    <!-- add end 20160712 -->


    <!-- add start 20160713 -->
    <property>
       <name>ha.zookeeper.quorum</name>
       <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
    </property>

    <!-- add end 20160713 -->

    <!--add start 20160627 -->
    <property>
            <name>io.file.buffer.size</name>
            <value>131072</value>
    </property>
    <property>
            <name>hadoop.tmp.dir</name>
            <value>file:/home/hadoop/tmp</value>
    </property>
    <!--add end by 20160627 -->
</configuration>
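
        Once both files are saved, the effective values can be double-checked from the command line with hdfs getconf; a minimal sketch (run as the hadoop user):

[hadoop@hadoop01 ~]$ hdfs getconf -confKey dfs.ha.automatic-failover.enabled
true
[hadoop@hadoop01 ~]$ hdfs getconf -confKey ha.zookeeper.quorum
hadoop01:2181,hadoop02:2181,hadoop03:2181
[hadoop@hadoop01 ~]$ hdfs getconf -namenodes
hadoop01 hadoop02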

3. Copy the configured hdfs-site.xml and core-site.xml files from hadoop01 to the hadoop02 and hadoop03 servers.

[hadoop@hadoop01 hadoop]$ pwd
/home/hadoop/hadoop-2.7.2/etc/hadoop
[hadoop@hadoop01 hadoop]$ scp hdfs-site.xml core-site.xml hadoop02:$PWD
hdfs-site.xml                                                             100% 2973     2.9KB/s   00:00   
core-site.xml                                                             100% 1906     1.9KB/s   00:00   
[hadoop@hadoop01 hadoop]$ scp hdfs-site.xml core-site.xml hadoop03:$PWD
hdfs-site.xml                                                             100% 2973     2.9KB/s   00:00   
core-site.xml                                                             100% 1906     1.9KB/s   00:00   
[hadoop@hadoop01 hadoop]$

4. Start the ZooKeeper service on all three servers.

#Log in to the hadoop01 server as the zookeeper user

[zookeeper@hadoop01 ~]$ zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED


#Log in to the hadoop02 server as the zookeeper user

[zookeeper@hadoop02 ~]$ zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED


#Log in to the hadoop03 server as the zookeeper user

[zookeeper@hadoop03 ~]$ zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /home/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
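
        Before moving on, it is worth confirming that the three ZooKeeper servers have formed a quorum. zkServer.sh status reports each server's role; exactly one server should report "Mode: leader" and the other two "Mode: follower" (which one becomes the leader varies):

#run on each of the three servers
[zookeeper@hadoop01 ~]$ zkServer.sh status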


5. Initialize the HA state in ZooKeeper (create the znode used by the failover controllers) by running the following command on hadoop01 as the hadoop user:

[hadoop@hadoop01 ~]$ hdfs zkfc -formatZK

16/07/03 10:25:13 INFO tools.DFSZKFailoverController: Failover controller configured for NameNode NameNode at hadoop01/192.168.0.201:8020
16/07/03 10:25:14 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1569965, built on 02/20/2014 09:09 GMT
16/07/03 10:25:14 INFO zookeeper.ZooKeeper: Client environment:host.name=hadoop01
16/07/03 10:25:14 INFO zookeeper.ZooKeeper: Client environment:java.version=1.8.0_92
16/07/03 10:25:14 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
16/07/03 10:25:14 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.8.0_92/jre

... (log output omitted) ...

Proceed formatting /hadoop-ha/mycluster? (Y or N) Y

16/07/03 10:25:21 INFO ha.ActiveStandbyElector: Recursively deleting /hadoop-ha/mycluster from ZK...
16/07/03 10:25:21 INFO ha.ActiveStandbyElector: Successfully deleted /hadoop-ha/mycluster from ZK.
16/07/03 10:25:22 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/mycluster in ZK.
16/07/03 10:25:22 INFO zookeeper.ZooKeeper: Session: 0x155ae929cb60000 closed
16/07/03 10:25:22 INFO zookeeper.ClientCnxn: EventThread shut down
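
        To confirm that the znode was created, connect with the ZooKeeper CLI; a minimal sketch:

[zookeeper@hadoop01 ~]$ zkCli.sh -server hadoop01:2181
[zk: hadoop01:2181(CONNECTED) 0] ls /hadoop-ha
[mycluster]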

6. Run start-all.sh to start the HDFS HA cluster. The two NameNodes will come up with one in the Active state and the other in Standby.

[hadoop@hadoop01 ~]$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [hadoop01 hadoop02]
hadoop02: starting namenode, ......
hadoop01: starting namenode, ......
hadoop02: starting datanode, ......
hadoop01: starting datanode, ......
hadoop03: starting datanode, ......
Starting journal nodes [hadoop01 hadoop02 hadoop03]
hadoop01: starting journalnode, ......
hadoop02: starting journalnode, ......
hadoop03: starting journalnode, ......
Starting ZK Failover Controllers on NN hosts [hadoop01 hadoop02]
hadoop02: starting zkfc, ......
hadoop01: starting zkfc, ......
starting yarn daemons
starting resourcemanager, ......
hadoop01: starting nodemanager, ......
hadoop02: starting nodemanager, ......
hadoop03: starting nodemanager, ......

7. Confirm which processes are now running on each of the three servers:

#Run the jps command on the hadoop01 server

[hadoop@hadoop01 ~]$ jps
2882 NameNode
4034 Jps
2995 DataNode
3466 ResourceManager
3578 NodeManager
3371 DFSZKFailoverController
3213 JournalNode


#Run the jps command on the hadoop02 server

[hadoop@hadoop02 ~]$ jps
2791 NameNode
2955 JournalNode
3068 DFSZKFailoverController
3421 Jps
3166 NodeManager
2862 DataNode


#Run the jps command on the hadoop03 server

[hadoop@hadoop03 ~]$ jps
2946 NodeManager
3159 Jps
2792 DataNode
2863 JournalNode
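
        Besides the web UI check in the next step, the state of each NameNode can also be queried from the command line with hdfs haadmin (nn1 and nn2 are the IDs defined in dfs.ha.namenodes.mycluster; which one ends up Active depends on which ZKFC wins the election):

[hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState nn1
active
[hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState nn2
standby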

8. Open the NameNode web UI (port 50070) and check the NameNode state on both servers:

     • The NameNode on hadoop01 is in the Active state.


     • The NameNode on hadoop02 is in the Standby state.


9. Kill the NameNode process on hadoop01 and confirm that failover is triggered automatically.

[hadoop@hadoop01 ~]$ jps|grep NameNode
nnnn  NameNode
[hadoop@hadoop01 ~]$ kill -9 nnnn
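
        Here nnnn stands for the PID that jps printed for the NameNode process. As a convenience, the lookup and the kill can also be combined into a single line; a sketch:

#find the NameNode PID via jps and kill it in one step
[hadoop@hadoop01 ~]$ kill -9 $(jps | awk '$2=="NameNode" {print $1}')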

10. After simulating the failure, check the NameNode state on both servers:

     • The hadoop01 web UI can no longer be reached.



        • On hadoop02, the NameNode has automatically switched to Active.



11. To bring the killed NameNode on hadoop01 back, start the daemon on its own; it will rejoin the cluster in the Standby state:

[hadoop@hadoop01 ~]$ hadoop-daemon.sh start namenode

starting namenode, logging to /home/hadoop/hadoop-2.7.2//logs/hadoop-hadoop-namenode-hadoop01.out
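
        Once it is back up, the roles can be confirmed the same way as before: hadoop01 (nn1) should now report standby while hadoop02 (nn2) stays active:

[hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState nn1
standby
[hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState nn2
active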