Hadoop 2.8.5 HA Cluster Installation Notes

1 Purpose

These notes record how to install and configure a highly available (HA) Hadoop cluster.

2 Preparation

wget -O /mnt/softs/hadoop-2.8.5.tar.gz http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz

How it works

Yarn monitors and manages server resources for Hadoop's distributed computation. If Hadoop is only used as a storage service, the Yarn-related services do not need to be installed or started.

Zookeeper is a distributed coordination framework; a Zookeeper cluster is used to guarantee the high availability of the Hadoop cluster. As a coordination service, ZooKeeper helps the ZKFCs elect the active NameNode.

An HA Hadoop cluster has two NameNodes: one in the active state, which serves client requests, and one in the standby state, which does not serve requests and only synchronizes the active NameNode's data in real time. Both NameNodes regularly send heartbeats to the Zookeeper cluster to report their state. Once Zookeeper stops receiving heartbeats from the active NameNode, it treats that NameNode as unavailable and switches the standby NameNode to the active state. If one NameNode becomes unavailable, the other takes over, which gives the Hadoop cluster a highly available NameNode.

DataNodes store the data blocks of each file and periodically report their block storage status to the NameNode.

The standby NameNode synchronizes the active NameNode's data through Hadoop's JournalNode service.

Yarn HA: before Hadoop 2.4, the ResourceManager was a single point of failure. The ResourceManager records the cluster's current resource allocation and job states; Yarn HA stores this information on shared storage to achieve high availability, and uses Zookeeper for automatic failover.

ZKFC is a service paired one-to-one with the NameNodes, i.e. a ZKFC must be deployed alongside each NameNode. It monitors the state of its NameNode and promptly writes that state to Zookeeper. The ZKFCs also take part in electing which NameNode becomes active.


2.1 Yarn Components

Yarn is the resource management system of a Hadoop cluster. Hadoop 2.0 completely redesigned the MapReduce framework.
The basic idea is to split the JobTracker of MRv1 into two independent services: a global resource manager, the ResourceManager, and a per-application ApplicationMaster.

The ResourceManager is responsible for resource management and allocation across the whole system, while each ApplicationMaster manages a single application.

Basic structure of Yarn

  • Overall, Yarn uses a Master/Slave structure: the ResourceManager is the master and the NodeManagers are the slaves.
  • The ResourceManager performs unified management and scheduling of the resources on all NodeManagers. When a user submits an application, an ApplicationMaster is provided to track and manage it; the ApplicationMaster requests resources from the ResourceManager and asks NodeManagers to start tasks that occupy a certain amount of resources. Because different ApplicationMasters run on different nodes, they do not interfere with each other.

Yarn consists mainly of the following components:

1. ResourceManager: the global resource manager, responsible for resource management and allocation for the whole system. It consists of two components: the Scheduler and the Applications Manager.
2. NodeManager: the resource and task manager on each node. It periodically reports the node's resource usage and the state of each Container to the ResourceManager, and it receives and handles requests from ApplicationMasters, such as starting and stopping Containers.
3. ApplicationMaster: each application submitted by a user contains an ApplicationMaster, whose main duties are:
   1) negotiate with the ResourceManager Scheduler to obtain resources (Containers);
   2) assign the obtained resources to its internal tasks;
   3) communicate with NodeManagers to start or stop tasks;
   4) monitor the state of all tasks and, when a task fails, request resources again to restart it.
4. Container: the resource abstraction in Yarn. A Container encapsulates multi-dimensional resources of a node such as memory, CPU, disk, and network; the ResourceManager returns the resources requested by an ApplicationMaster in units of Containers. Yarn assigns each task a Container, and the task may only use the resources described by that Container. A Container is a dynamic resource partition generated on demand for the application; currently Yarn only supports CPU and memory, using Cgroups as a lightweight isolation mechanism.

The two components of the ResourceManager:

1) Scheduler

The Scheduler allocates the system's resources to the running applications according to constraints such as capacities and queues (for example, each queue receives a certain share of resources and can run at most a certain number of jobs).

Note that this scheduler is a "pure scheduler": it does no work specific to any application, such as monitoring or tracking application execution, and it is not responsible for restarting tasks that fail because of application errors or hardware faults; all of that is left to the application's ApplicationMaster. The Scheduler allocates resources purely according to each application's resource demands, and the unit of allocation is an abstract "resource container" (Container), a dynamic allocation unit that bundles memory, CPU, disk, network, and other resources, thereby limiting the amount of resources each task may use. The scheduler is also a pluggable component: users can implement their own, and YARN ships with several ready-to-use schedulers such as the Fair Scheduler and the Capacity Scheduler.
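
For example, the scheduler can be pinned explicitly with the yarn.resourcemanager.scheduler.class property in yarn-site.xml. This is only a hedged sketch: the Capacity Scheduler shown below is already the default in Hadoop 2.8, the Fair Scheduler class can be substituted the same way, and this property is not part of the configuration used later in this guide.

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <description>Scheduler implementation used by the ResourceManager</description>
</property>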

2) Applications Manager

The Applications Manager manages all applications in the system, including accepting application submissions, negotiating with the Scheduler for the resources needed to start each ApplicationMaster, and monitoring the ApplicationMaster and restarting it on failure.

3 Installation

In a typical deployment, Zookeeper is configured with 3 to 5 nodes. Because Zookeeper has modest hardware requirements, the Zookeeper nodes can be deployed on the same hardware as the HDFS NameNode and standby NameNode. Many users choose to deploy the third Zookeeper process on the same node as the Yarn ResourceManager. For the best performance and data isolation, it is recommended to keep the Zookeeper data store on a different disk from the HDFS metadata.

Since the JournalNode is a lightweight daemon, it can share machines with other Hadoop services. It is recommended to deploy the JournalNodes on the control (master) nodes, to avoid JournalNode write failures when the data nodes are transferring large volumes of data.

The NameNode keeps a reference to every data block of every file in memory, so it can easily exhaust the memory allocated to it. 1000 MB of memory (the default configuration) is usually enough to manage several million files, but as a rule of thumb, conservatively plan for 1000 MB of memory per million data blocks.
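
As a rough worked example of that rule of thumb: a cluster expected to hold about 4 million blocks would need roughly 4 x 1000 MB, i.e. about 4 GB, of NameNode heap. A minimal sketch of raising the heap, assuming a 4 GB target (this figure is an assumption, not part of this deployment), is to set HADOOP_NAMENODE_OPTS in etc/hadoop/hadoop-env.sh:

# hadoop-env.sh (sketch; 4 GB is an assumed target, size it to your expected block count)
export HADOOP_NAMENODE_OPTS="-Xms4g -Xmx4g ${HADOOP_NAMENODE_OPTS}"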

3.1 Cluster Layout

  • 192.168.3.198 (im-test-hadoop01): installed software: Hadoop, ZooKeeper; processes: NameNode (active), DataNode, JournalNode, NodeManager, ZKFC, QuorumPeerMain (ZooKeeper)
  • 192.168.3.199 (im-test-hadoop02): installed software: Hadoop, ZooKeeper; processes: NameNode (standby), DataNode, JournalNode, ResourceManager, NodeManager, ZKFC, QuorumPeerMain (ZooKeeper)
  • 192.168.3.200 (im-test-hadoop03): installed software: Hadoop, ZooKeeper; processes: DataNode, JournalNode, ResourceManager, NodeManager, QuorumPeerMain (ZooKeeper)

JournalNode is the storage daemon of QJM (Quorum Journal Manager); it provides edit-log read, write, storage, and repair services.
An odd number of JournalNodes is deployed to synchronize metadata between the active and standby NameNodes.
The basic idea of QJM is to store the edit log on 2N+1 JournalNodes; each write is considered successful once a majority (>= N+1) of them return success, and the data is then not lost. The algorithm tolerates at most N failed machines; if more than N fail, it no longer works.
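
Applied to the three-node layout used in this guide: 3 JournalNodes means 2N+1 = 3, so N = 1; a write is committed once at least N+1 = 2 JournalNodes acknowledge it, and the quorum tolerates 1 JournalNode failure. A 5-node quorum (N = 2) would commit with 3 acknowledgements and tolerate 2 failures.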

3.2 Installation Steps

Extract the Hadoop archive

tar zxf /mnt/softs/hadoop-2.8.5.tar.gz -C /mnt/softwares/

Configure core-site.xml

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://cluster1</value>
  <description>Default path prefix used by Hadoop FS clients: hdfs:// followed by the nameservice ID</description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop/tmp</value>
  <description>Hadoop temporary directory. By default the NameNode and DataNode data files are stored in subdirectories of this directory, and Hadoop MapReduce working directories are placed under it as well. The default is the system temp directory, whose contents are lost on reboot, so a custom directory is recommended. Some of these locations are overridden in hdfs-site.xml and yarn-site.xml.</description>
</property>

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <description>File stream buffer size in bytes. Default 4096; usually set to a multiple of the system page size.</description>
</property>

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Default 0, which disables the trash feature. A value greater than 0 is the number of minutes a deleted file is kept in the trash.</description>
</property>

<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>1440</value>
  <description>Interval between trash checkpoints, in minutes; the value should be less than or equal to fs.trash.interval. Default 0, in which case the value of fs.trash.interval is used.</description>
</property>

<property>
  <name>ha.zookeeper.quorum</name>
  <value>im-test-hadoop01:2181,im-test-hadoop02:2181,im-test-hadoop03:2181</value>
  <description>List of ZooKeeper server addresses, used by the ZKFC during automatic failover.</description>
</property>

Configure hdfs-site.xml

<property>
  <name>dfs.nameservices</name>
  <value>cluster1</value>
  <description>HDFS nameservice ID; must match the value used in core-site.xml. Separate multiple nameservices with commas.</description>
</property>

<property>
  <name>dfs.ha.namenodes.cluster1</name>
  <value>nn1,nn2</value>
  <description>NameNode IDs within this nameservice.</description>
</property>

<property>
  <name>dfs.namenode.rpc-address.cluster1.nn1</name>
  <value>im-test-hadoop01:9000</value>
  <description>RPC address of nn1, used to communicate with the DataNodes.</description>
</property>

<property>
  <name>dfs.namenode.http-address.cluster1.nn1</name>
  <value>im-test-hadoop01:50070</value>
  <description>HTTP address of nn1, i.e. the address of the HDFS NameNode web UI.</description>
</property>

<property>
  <name>dfs.namenode.rpc-address.cluster1.nn2</name>
  <value>im-test-hadoop02:9000</value>
  <description>RPC address of nn2, used to communicate with the DataNodes.</description>
</property>

<property>
  <name>dfs.namenode.http-address.cluster1.nn2</name>
  <value>im-test-hadoop02:50070</value>
  <description>HTTP address of nn2, i.e. the address of the HDFS NameNode web UI.</description>
</property>

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://im-test-hadoop01:8485;im-test-hadoop02:8485;im-test-hadoop03:8485/cluster1</value>
  <description>URI of the JournalNodes through which the NameNodes read and write the shared edit log; the active NameNode writes and the standby NameNode reads. cluster1 is the journal ID, which is recommended to match the nameservice ID.</description>
</property>

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/mnt/hadoop/dfs/name</value>
  <description>Local directory where the NameNode stores its data (fsimage); default file://${hadoop.tmp.dir}/dfs/name</description>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/mnt/hadoop/dfs/data</value>
  <description>Local directory where the DataNode stores its data (HDFS blocks); default file://${hadoop.tmp.dir}/dfs/data</description>
</property>

<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/mnt/hadoop/journal/data</value>
  <description>Directory where the JournalNode stores edits and other state.</description>
</property>

<property>
  <name>dfs.namenode.handler.count</name>
  <value>100</value>
  <description>Number of NameNode RPC server threads that handle requests from DataNodes and clients; default 10.</description>
</property>

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>Default block size for new files; the default value is 134217728 (128 MB).</description>
</property>

<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
  <description>Whether to enable automatic NameNode failover; default false.</description>
</property>

<property>
  <name>dfs.client.failover.proxy.provider.cluster1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  <description>Java class used by HDFS clients to contact the active NameNode. Two implementations are provided by default (the other is RequestHedgingProxyProvider), and a custom implementation can be supplied. The class is used to determine which NameNode is currently active.</description>
</property>

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
  <description>List of scripts or Java classes used to fence the active NameNode during a failover.</description>
</property>

<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
  <description>Path to the private key for passwordless SSH login; only needed when the sshfence fencing method is used.</description>
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>HDFS replication factor.</description>
</property>

Although the JournalNodes ensure that only one active NameNode can write edits, which is essential for keeping the edit log consistent, during a failover the previously active NameNode may still be alive, and clients connected to it may still be served stale data. With this setting we can specify a shell script or Java program that SSHes to the old active NameNode and kills its process. There are two built-in options:

  • sshfence: SSH to the active NameNode and kill the process. The current machine must be able to SSH to the remote host, which requires that key-based (RSA) login has already been authorized.
  • shell: run a shell command to fence the active NameNode (a sketch is shown below).
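
A hedged sketch of combining the two methods (the script path and argument below are placeholders, not part of this deployment); methods listed in the value are tried in order, one per line:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/path/to/fence-script.sh $target_host)</value>
  <description>Fencing methods tried in order; the shell script path is a placeholder</description>
</property>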

Configure mapred-site.xml
Copy mapred-site.xml.template to mapred-site.xml and edit it:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>Runtime framework for executing MapReduce jobs; default local. Can be local, classic, or yarn.</description>
</property>

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>im-test-hadoop:10020</value>
  <description>MapReduce JobHistory server address.</description>
</property>

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>im-test-hadoop:19888</value>
  <description>MapReduce JobHistory web UI address.</description>
</property>

About the JobHistory service
With the history service enabled, the details of jobs executed on Yarn can be viewed on a web page. The history server shows the records of completed MapReduce jobs, such as how many map and reduce tasks were used, the submission time, the start time, and the completion time.
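
The history server is not started by start-dfs.sh or start-yarn.sh. A minimal sketch of starting and checking it, assuming the installation path used in this guide:

/mnt/softwares/hadoop-2.8.5/sbin/mr-jobhistory-daemon.sh start historyserver
jps    # a JobHistoryServer process should appear

The web UI is then reachable at the address configured in mapreduce.jobhistory.webapp.address above.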

Configure yarn-site.xml

Minimal configuration example for ResourceManager high availability:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>Auxiliary service running on the NodeManager. It must be set to mapreduce_shuffle for MapReduce programs to run.</description>
</property>

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
  <description>Whether to enable ResourceManager high availability; default false.</description>
</property>

<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>cluster1</value>
  <description>Name of the ResourceManager cluster.</description>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>im-test-hadoop02</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>im-test-hadoop03</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>im-test-hadoop02:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>im-test-hadoop03:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>im-test-hadoop01:2181,im-test-hadoop02:2181,im-test-hadoop03:2181</value>
</property>

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
  <description>Whether to enable log aggregation; default false. After an application finishes, log aggregation collects every Container's logs onto HDFS.</description>
</property>

<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>25200</value>
  <description>Maximum time, in seconds, to retain aggregated logs.</description>
</property>
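
With log aggregation enabled, the aggregated logs of a finished application can be fetched from HDFS with the yarn CLI; a sketch, where the application ID is a placeholder:

/mnt/softwares/hadoop-2.8.5/bin/yarn logs -applicationId <applicationId>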

Configure slaves
Edit $HADOOP_HOME/etc/hadoop/slaves; the slaves file lists the DataNode hosts of the cluster:

im-test-hadoop01
im-test-hadoop02
im-test-hadoop03

If you need to reformat the NameNode, first delete everything under the existing NameNode and DataNode directories, otherwise errors will occur. These directories are set by hadoop.tmp.dir in core-site.xml and by dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml. Each format operation by default creates a new cluster ID and writes it into the VERSION files of the NameNode and DataNodes (located under dfs/name/current and dfs/data/current). Because reformatting generates a new cluster ID, keeping the old directories leaves the NameNode's VERSION file with the new cluster ID while the DataNodes still hold the old one, and the mismatch causes errors.
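
A quick way to check whether the IDs match, using the directories configured in this guide, is to compare the clusterID line of the two VERSION files:

grep clusterID /mnt/hadoop/dfs/name/current/VERSION
grep clusterID /mnt/hadoop/dfs/data/current/VERSION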

Another approach is to pass the old cluster ID as a parameter when formatting, for example:
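
A hedged sketch of that option (the cluster ID value is a placeholder for whatever the old VERSION files contain):

/mnt/softwares/hadoop-2.8.5/bin/hdfs namenode -format -clusterId <old cluster ID>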

Distribute Hadoop to the other servers in the cluster:

scp -rp /mnt/softwares/hadoop-2.8.5 im-test-hadoop02:/mnt/softwares/
scp -rp /mnt/softwares/hadoop-2.8.5 im-test-hadoop03:/mnt/softwares/

3.3 Startup

Step 1: start the Zookeeper service.
See the Zookeeper installation notes.

Check the processes:

[hadoop@im-test-hadoop01 ~]$ jps
9024 Jps
21699 QuorumPeerMain
[hadoop@im-test-hadoop02 ~]$ jps
4866 Jps
16348 QuorumPeerMain
[hadoop@im-test-hadoop03 ~]$ jps
12835 Jps
24584 QuorumPeerMain

Step 2: create a znode in Zookeeper that stores the data for NameNode HA

First check the existing Zookeeper nodes:

$ZOOKEEPER_HOME/bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper]

Only the zookeeper node exists. Run the following command:

$HADOOP_PREFIX/bin/hdfs zkfc -formatZK

Partial output:

18/11/01 11:22:57 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=im-test-hadoop01:2181,im-test-hadoop02:2181,im-test-hadoop03:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a78afa0
18/11/01 11:22:57 INFO zookeeper.ClientCnxn: Opening socket connection to server im-test-hadoop02/192.168.3.199:2181. Will not attempt to authenticate using SASL (unknown error)
18/11/01 11:22:57 INFO zookeeper.ClientCnxn: Socket connection established to im-test-hadoop02/192.168.3.199:2181, initiating session
18/11/01 11:22:57 INFO zookeeper.ClientCnxn: Session establishment complete on server im-test-hadoop02/192.168.3.199:2181, sessionid = 0x200ba6364c30002, negotiated timeout = 10000
18/11/01 11:22:57 INFO ha.ActiveStandbyElector: Session connected.
18/11/01 11:22:57 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/cluster1 in ZK.
18/11/01 11:22:57 INFO zookeeper.ZooKeeper: Session: 0x200ba6364c30002 closed
18/11/01 11:22:57 INFO zookeeper.ClientCnxn: EventThread shut down
18/11/01 11:22:57 INFO tools.DFSZKFailoverController: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at im-test-hadoop01/192.168.3.198
************************************************************/

The command above creates a znode in Zookeeper that stores the data of the automatic failover system.
Log in to Zookeeper again on one of the Zookeeper servers and check:

[zk: localhost:2181(CONNECTED) 1] ls /
[zookeeper, hadoop-ha]

A hadoop-ha node has appeared, which means the command succeeded.
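
Based on the formatZK log above ("Successfully created /hadoop-ha/cluster1 in ZK"), listing the new node should show a child named after the nameservice:

[zk: localhost:2181(CONNECTED) 2] ls /hadoop-ha
[cluster1]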

If the Hadoop configuration files are out of sync, copy them to the other nodes:
scp -pqr /mnt/softwares/hadoop-2.8.5/etc/hadoop im-test-hadoop02:/mnt/softwares/hadoop-2.8.5/etc/
scp -pqr /mnt/softwares/hadoop-2.8.5/etc/hadoop im-test-hadoop03:/mnt/softwares/hadoop-2.8.5/etc/

Step 3: start the HDFS cluster services

Start the JournalNode process on all nodes:

/mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh start journalnode

Result after startup:

[hadoop@im-test-hadoop01 ~]$ jps
21699 QuorumPeerMain
12633 Jps
12559 JournalNode
[hadoop@im-test-hadoop02 ~]$ jps
7879 Jps
16348 QuorumPeerMain
7805 JournalNode
[hadoop@im-test-hadoop03 ~]$ jps
15973 Jps
24584 QuorumPeerMain
15899 JournalNode

Format the active NameNode:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs namenode -format
18/11/01 14:30:27 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   user = hadoop
STARTUP_MSG:   host = im-test-hadoop01/192.168.3.198
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.8.5
STARTUP_MSG:   classpath = /mnt/softwares/hadoop-2.8.5/etc/hadoop ...  // rest of line omitted
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8; compiled by 'jdu' on 2018-09-10T03:32Z
STARTUP_MSG:   java = 1.8.0_181
************************************************************/
18/11/01 14:30:27 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
18/11/01 14:30:27 INFO namenode.NameNode: createNameNode [-format]
18/11/01 14:30:28 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/11/01 14:30:28 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-fd4c6a4f-ce09-4cb3-b064-c70a2212a2bb
18/11/01 14:30:28 INFO namenode.FSEditLog: Edit logging is async:true
18/11/01 14:30:28 INFO namenode.FSNamesystem: KeyProvider: null
18/11/01 14:30:28 INFO namenode.FSNamesystem: fsLock is fair: true
18/11/01 14:30:28 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
18/11/01 14:30:28 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
18/11/01 14:30:28 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
18/11/01 14:30:28 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
18/11/01 14:30:28 INFO blockmanagement.BlockManager: The block deletion will start around 2018 十一月 01 14:30:28
18/11/01 14:30:28 INFO util.GSet: Computing capacity for map BlocksMap
18/11/01 14:30:28 INFO util.GSet: VM type       = 64-bit
18/11/01 14:30:28 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
18/11/01 14:30:28 INFO util.GSet: capacity      = 2^21 = 2097152 entries
18/11/01 14:30:28 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
18/11/01 14:30:28 INFO blockmanagement.BlockManager: defaultReplication         = 3
18/11/01 14:30:28 INFO blockmanagement.BlockManager: maxReplication             = 512
18/11/01 14:30:28 INFO blockmanagement.BlockManager: minReplication             = 1
18/11/01 14:30:28 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
18/11/01 14:30:28 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
18/11/01 14:30:28 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
18/11/01 14:30:28 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
18/11/01 14:30:28 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
18/11/01 14:30:28 INFO namenode.FSNamesystem: supergroup          = supergroup
18/11/01 14:30:28 INFO namenode.FSNamesystem: isPermissionEnabled = true
18/11/01 14:30:28 INFO namenode.FSNamesystem: Determined nameservice ID: cluster1
18/11/01 14:30:28 INFO namenode.FSNamesystem: HA Enabled: true
18/11/01 14:30:28 INFO namenode.FSNamesystem: Append Enabled: true
18/11/01 14:30:28 INFO util.GSet: Computing capacity for map INodeMap
18/11/01 14:30:28 INFO util.GSet: VM type       = 64-bit
18/11/01 14:30:28 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
18/11/01 14:30:28 INFO util.GSet: capacity      = 2^20 = 1048576 entries
18/11/01 14:30:28 INFO namenode.FSDirectory: ACLs enabled? false
18/11/01 14:30:28 INFO namenode.FSDirectory: XAttrs enabled? true
18/11/01 14:30:28 INFO namenode.NameNode: Caching file names occurring more than 10 times
18/11/01 14:30:28 INFO util.GSet: Computing capacity for map cachedBlocks
18/11/01 14:30:28 INFO util.GSet: VM type       = 64-bit
18/11/01 14:30:28 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
18/11/01 14:30:28 INFO util.GSet: capacity      = 2^18 = 262144 entries
18/11/01 14:30:28 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
18/11/01 14:30:28 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
18/11/01 14:30:28 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
18/11/01 14:30:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
18/11/01 14:30:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
18/11/01 14:30:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
18/11/01 14:30:28 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
18/11/01 14:30:28 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
18/11/01 14:30:28 INFO util.GSet: Computing capacity for map NameNodeRetryCache
18/11/01 14:30:28 INFO util.GSet: VM type       = 64-bit
18/11/01 14:30:28 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
18/11/01 14:30:28 INFO util.GSet: capacity      = 2^15 = 32768 entries
18/11/01 14:30:29 INFO namenode.FSImage: Allocated new BlockPoolId: BP-206245647-192.168.3.198-1541053829034
18/11/01 14:30:29 INFO common.Storage: Storage directory /mnt/hadoop/dfs/name has been successfully formatted.
18/11/01 14:30:29 INFO namenode.FSImageFormatProtobuf: Saving image file /mnt/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
18/11/01 14:30:29 INFO namenode.FSImageFormatProtobuf: Image file /mnt/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
18/11/01 14:30:29 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/11/01 14:30:29 INFO util.ExitUtil: Exiting with status 0
18/11/01 14:30:29 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at im-test-hadoop01/192.168.3.198
************************************************************/

Start the active NameNode:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh start namenode
starting namenode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-namenode-im-test-hadoop01.out
[hadoop@im-test-hadoop01 ~]$ jps
21699 QuorumPeerMain
13060 Jps
12870 NameNode
12559 JournalNode

Sync the standby NameNode (bootstrapStandby):

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs namenode -bootstrapStandby
18/11/01 14:38:44 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   user = hadoop
STARTUP_MSG:   host = im-test-hadoop02/192.168.3.199
STARTUP_MSG:   args = [-bootstrapStandby]
STARTUP_MSG:   version = 2.8.5
STARTUP_MSG:   classpath = /mnt/softwares/hadoop-2.8.5/etc/hadoop ...  // rest of line omitted
STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8; compiled by 'jdu' on 2018-09-10T03:32Z
STARTUP_MSG:   java = 1.8.0_181
************************************************************/
18/11/01 14:38:44 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
18/11/01 14:38:44 INFO namenode.NameNode: createNameNode [-bootstrapStandby]
18/11/01 14:38:45 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/11/01 14:38:45 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
=====================================================
About to bootstrap Standby ID nn2 from:
           Nameservice ID: cluster1
        Other Namenode ID: nn1
  Other NN's HTTP address: http://im-test-hadoop01:50070
  Other NN's IPC  address: im-test-hadoop01/192.168.3.198:9000
             Namespace ID: 448464711
            Block pool ID: BP-206245647-192.168.3.198-1541053829034
               Cluster ID: CID-fd4c6a4f-ce09-4cb3-b064-c70a2212a2bb
           Layout version: -63
       isUpgradeFinalized: true
=====================================================
18/11/01 14:38:45 INFO common.Storage: Storage directory /mnt/hadoop/dfs/name has been successfully formatted.
18/11/01 14:38:45 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/11/01 14:38:45 WARN common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
18/11/01 14:38:45 INFO namenode.FSEditLog: Edit logging is async:true
18/11/01 14:38:45 INFO namenode.TransferFsImage: Opening connection to http://im-test-hadoop01:50070/imagetransfer?getimage=1&txid=0&storageInfo=-63:448464711:1541053829034:CID-fd4c6a4f-ce09-4cb3-b064-c70a2212a2bb&bootstrapstandby=true
18/11/01 14:38:45 INFO namenode.TransferFsImage: Image Transfer timeout configured to 60000 milliseconds
18/11/01 14:38:45 INFO namenode.TransferFsImage: Transfer took 0.01s at 0.00 KB/s
18/11/01 14:38:45 INFO namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000000000 size 323 bytes.
18/11/01 14:38:46 INFO util.ExitUtil: Exiting with status 0
18/11/01 14:38:46 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at im-test-hadoop02/192.168.3.199
************************************************************/

Start the standby NameNode:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh start namenode
starting namenode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-namenode-im-test-hadoop02.out

At this point the web pages of both NameNodes show the standby state:

http://192.168.3.198:50070 and http://192.168.3.199:50070

Switch the first NameNode to the active state:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -transitionToActive nn1
Automatic failover is enabled for NameNode at im-test-hadoop02/192.168.3.199:9000
Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please 
specify the --forcemanual flag.
[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -transitionToActive -forcemanual nn1
You have specified the --forcemanual flag. This flag is dangerous, as it can induce a split-brain scenario that WILL CORRUPT your HDFS namespace, possibly irrecoverably.

It is recommended not to use this flag, but instead to shut down the cluster and disable automatic failover if you prefer to manually manage your HA state.

You may abort safely by answering 'n' or hitting ^C now.

Are you sure you want to continue? (Y or N) Y
18/11/01 14:56:49 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at im-test-hadoop02/192.168.3.199:9000
18/11/01 14:56:49 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at im-test-hadoop01/192.168.3.198:9000

Check the web page again; the NameNode is now in the active state.

Stop the HDFS-related processes:

[hadoop@im-test-hadoop01 ~]$ jps
13936 Jps
21699 QuorumPeerMain
12870 NameNode
12559 JournalNode
[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop namenode
stopping namenode
[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop journalnode
stopping journalnode
[hadoop@im-test-hadoop01 ~]$ jps
14033 Jps
21699 QuorumPeerMain
[hadoop@im-test-hadoop01 ~]$
[hadoop@im-test-hadoop02 ~]$ jps
8202 NameNode
9098 Jps
16348 QuorumPeerMain
7805 JournalNode
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop namenode
stopping namenode
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop journalnode
stopping journalnode
[hadoop@im-test-hadoop02 ~]$ jps
9179 Jps
16348 QuorumPeerMain
[hadoop@im-test-hadoop02 ~]$
[hadoop@im-test-hadoop03 ~]$ jps
24584 QuorumPeerMain
16505 Jps
15899 JournalNode
[hadoop@im-test-hadoop03 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop journalnode
stopping journalnode
[hadoop@im-test-hadoop03 ~]$ jps
16567 Jps
24584 QuorumPeerMain
[hadoop@im-test-hadoop03 ~]$ 

Start the whole cluster by running sbin/start-dfs.sh on the server that hosts the active NameNode:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/start-dfs.sh
Starting namenodes on [im-test-hadoop01 im-test-hadoop02]
im-test-hadoop01: starting namenode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-namenode-im-test-hadoop01.out
im-test-hadoop02: starting namenode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-namenode-im-test-hadoop02.out
im-test-hadoop01: starting datanode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-datanode-im-test-hadoop01.out
im-test-hadoop02: starting datanode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-datanode-im-test-hadoop02.out
im-test-hadoop03: starting datanode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-datanode-im-test-hadoop03.out
Starting journal nodes [im-test-hadoop01 im-test-hadoop02 im-test-hadoop03]
im-test-hadoop01: starting journalnode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-journalnode-im-test-hadoop01.out
im-test-hadoop02: starting journalnode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-journalnode-im-test-hadoop02.out
im-test-hadoop03: starting journalnode, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-journalnode-im-test-hadoop03.out
Starting ZK Failover Controllers on NN hosts [im-test-hadoop01 im-test-hadoop02]
im-test-hadoop01: starting zkfc, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-zkfc-im-test-hadoop01.out
im-test-hadoop02: starting zkfc, logging to /mnt/softwares/hadoop-2.8.5/logs/hadoop-hadoop-zkfc-im-test-hadoop02.out
[hadoop@im-test-hadoop01 ~]$ jps
17904 JournalNode
21699 QuorumPeerMain
17496 NameNode
18330 Jps
18236 DFSZKFailoverController
17631 DataNode
[hadoop@im-test-hadoop01 ~]$

Check the processes on the other two servers:

[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
12277 Jps
11739 NameNode
16348 QuorumPeerMain
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$
[hadoop@im-test-hadoop03 ~]$ jps
18099 JournalNode
18195 Jps
17972 DataNode
24584 QuorumPeerMain
[hadoop@im-test-hadoop03 ~]$

The services started by the start-dfs.sh script include the NameNode, DataNode, JournalNode, and ZKFC.

Because automatic failover is enabled, start-dfs.sh automatically starts a ZKFC process on every server that runs a NameNode. Once all ZKFCs are up, they elect one of the NameNodes as active.

If you manage the cluster services manually, start the ZKFC process by hand on each server that runs a NameNode:
$HADOOP_PREFIX/sbin/hadoop-daemon.sh --script $HADOOP_PREFIX/bin/hdfs start zkfc

Test HDFS HA
Check the NameNode states; nn1 is active and nn2 is standby:

[hadoop@im-test-hadoop03 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -getServiceState nn1
active
[hadoop@im-test-hadoop03 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -getServiceState nn2
standby
[hadoop@im-test-hadoop03 ~]$

Upload a file to HDFS:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hadoop fs -put /mnt/softwares/hadoop-2.8.5/NOTICE.txt /
[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hadoop fs -ls /
Found 1 items
-rw-r--r--   3 hadoop supergroup      15915 2018-11-01 17:25 /NOTICE.txt
[hadoop@im-test-hadoop01 ~]$

Now kill the nn1 process, then check nn2's state and the file uploaded above:

[hadoop@im-test-hadoop01 ~]$ jps
17904 JournalNode
21699 QuorumPeerMain
18933 Jps
17496 NameNode
18236 DFSZKFailoverController
17631 DataNode
[hadoop@im-test-hadoop01 ~]$ kill -9 17496
[hadoop@im-test-hadoop01 ~]$ jps
17904 JournalNode
21699 QuorumPeerMain
18983 Jps
18236 DFSZKFailoverController
17631 DataNode
[hadoop@im-test-hadoop01 ~]$
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/hadoop fs -ls /
18/11/01 17:37:23 WARN ipc.Client: Failed to connect to server: im-test-hadoop01/192.168.3.198:9000: try once and fail.
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
	at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
	at org.apache.hadoop.ipc.Client.call(Client.java:1381)
	at org.apache.hadoop.ipc.Client.call(Client.java:1345)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:796)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
	at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1649)
	at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1440)
	at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1437)
	at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)
	at org.apache.hadoop.fs.Globber.doGlob(Globber.java:282)
	at org.apache.hadoop.fs.Globber.glob(Globber.java:148)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1686)
	at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
	at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
	at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
	at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
	at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
	at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
	at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
Found 1 items
-rw-r--r--   3 hadoop supergroup      15915 2018-11-01 17:25 /NOTICE.txt

This verifies data synchronization between the NameNodes and automatic failover.

Then start nn1 again; it comes up in the standby state. Stop nn2, and nn1 becomes active again, so the NameNode roles can be switched back and forth.
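
A sketch of that switch-back using the same commands already used above (expected states shown as comments rather than captured output):

# on im-test-hadoop01: start nn1 again; it should come up as standby
/mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh start namenode
/mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -getServiceState nn1    # expected: standby
# on im-test-hadoop02: stop nn2; nn1 should then take over as active
/mnt/softwares/hadoop-2.8.5/sbin/hadoop-daemon.sh stop namenode
/mnt/softwares/hadoop-2.8.5/bin/hdfs haadmin -getServiceState nn1    # expected: active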

Step 4: start Yarn
Run on im-test-hadoop01:

[hadoop@im-test-hadoop01 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-im-test-hadoop01.out
im-test-hadoop01: starting nodemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-nodemanager-im-test-hadoop01.out
im-test-hadoop03: starting nodemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-nodemanager-im-test-hadoop03.out
im-test-hadoop02: starting nodemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-nodemanager-im-test-hadoop02.out
[hadoop@im-test-hadoop01 ~]$ jps
17904 JournalNode
21699 QuorumPeerMain
21971 NodeManager
19428 NameNode
22133 Jps
18236 DFSZKFailoverController
17631 DataNode

Start the ResourceManager on im-test-hadoop02 and im-test-hadoop03:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-im-test-hadoop02.out
[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
17125 NodeManager
13990 NameNode
17623 Jps
16348 QuorumPeerMain
17357 ResourceManager
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$
[hadoop@im-test-hadoop03 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-im-test-hadoop03.out
[hadoop@im-test-hadoop03 ~]$ jps
18099 JournalNode
17972 DataNode
20966 NodeManager
24584 QuorumPeerMain
21308 Jps
21231 ResourceManager
[hadoop@im-test-hadoop03 ~]$

Check the ResourceManager states:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm1
active
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm2
standby
[hadoop@im-test-hadoop02 ~]$

Open the Yarn web UI at http://im-test-hadoop02:8088/cluster; the cluster information is displayed normally.
Accessing http://im-test-hadoop03:8088/cluster redirects automatically to http://im-test-hadoop02:8088/cluster, because rm2 is in the standby state.

Test Yarn HA
Now kill the active ResourceManager, rm1, and check rm2's state:

[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
18452 Jps
17125 NodeManager
13990 NameNode
16348 QuorumPeerMain
17357 ResourceManager
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$ kill -9 17357
[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
17125 NodeManager
18485 Jps
13990 NameNode
16348 QuorumPeerMain
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm2
active
[hadoop@im-test-hadoop02 ~]$

rm2 has become active, so the Yarn HA configuration works.

Start rm1 again:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/sbin/yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /mnt/softwares/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-im-test-hadoop02.out
[hadoop@im-test-hadoop02 ~]$ jps
11841 DataNode
17125 NodeManager
13990 NameNode
18634 ResourceManager
16348 QuorumPeerMain
18701 Jps
12206 DFSZKFailoverController
11967 JournalNode
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm2
active
[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm1
standby
[hadoop@im-test-hadoop02 ~]$

Kill rm2:

[hadoop@im-test-hadoop03 ~]$ jps
18099 JournalNode
17972 DataNode
21813 Jps
20966 NodeManager
24584 QuorumPeerMain
21231 ResourceManager
[hadoop@im-test-hadoop03 ~]$ kill -9 21231
[hadoop@im-test-hadoop03 ~]$ jps
21888 Jps
18099 JournalNode
17972 DataNode
20966 NodeManager
24584 QuorumPeerMain
[hadoop@im-test-hadoop03 ~]$

rm1 becomes active again:

[hadoop@im-test-hadoop02 ~]$ /mnt/softwares/hadoop-2.8.5/bin/yarn rmadmin -getServiceState rm1
active
[hadoop@im-test-hadoop02 ~]$