I. System Initialization
1. Disable the firewall
systemctl disable firewalld
systemctl stop firewalld
2. Disable SELinux
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
setenforce 0
3. Install dependency packages
yum install -y psmisc
Fencing relies on fuser, so psmisc must be installed.
4. Install the JDK
Hadoop has strict requirements on the JDK version. Check the official documentation: the installation pages link to a page describing the supported Java versions. Download a stable Hadoop release together with a matching JDK.
This installation uses Hadoop 3.2.3.
Oracle JDK builds newer than 8u202 require a paid license, so we do not use anything newer than 8u202.
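For reference, a minimal JDK install sketch on one node, assuming the JDK ships as a tarball unpacked under /usr/local (the file and directory names below are examples; use the release you actually downloaded):
tar zxvf jdk-8u201-linux-x64.tar.gz -C /usr/local
cat >> /etc/profile <<'EOF'
export JAVA_HOME=/usr/local/jdk1.8.0_201
export PATH=$PATH:$JAVA_HOME/bin
EOF
source /etc/profile
java -version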
5. Edit the hosts file and configure time synchronization
Edit the hosts file and set up time synchronization. In production you should run a dedicated time server; in a lab environment it is enough to sync against Aliyun once so that all machines agree on the time. Do this on every machine (an example of the hosts entries follows at the end of this step).
DNS can be used instead of a hosts file, but then both forward and reverse resolution must be configured.
[root@lab ~]# yum install ntp
[root@lab ~]# ntpdate ntp1.aliyun.com
19 Nov 10:35:38 ntpdate[1234]: step time server 120.25.115.20 offset 7.951098 sec
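For example, the hosts entries on every machine might look like this (the lab1/lab2 addresses are taken from later in this document; the lab3 address is assumed from the scp targets, so adjust to your own IPs):
cat >> /etc/hosts <<'EOF'
10.10.10.5 lab1
10.10.10.6 lab2
10.10.10.7 lab3
EOF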
6. Create a regular user
Hadoop should run as a non-root user.
useradd admin && echo "passtest" | passwd --stdin admin
7. Set up passwordless SSH
Every server needs passwordless SSH to every other server.
On the primary server, set up passwordless login with the following commands:
[zk@lab hadoop]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/zk/.ssh/id_rsa):
Created directory '/home/zk/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/zk/.ssh/id_rsa.
Your public key has been saved in /home/zk/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:44TdgI++a+zXQx+gV3gBHy0WXneeju2jp0hOHy1kr2U zk@lab1
The key's randomart image is:
+---[RSA 2048]----+
| …oo. o|
| . o+o.oo|
| . . ooo …|
| = oo o + |
| o S…+ + o |
| . o…o + + |
| … .+ + + E |
| o… * + B…|
| o+o + =o |
+----[SHA256]-----+
[zk@lab hadoop]$ ssh-copy-id -i hdp@10.10.10.6
[zk@lab hadoop]$ ssh-copy-id -i hdp@10.10.10.7
II. Install ZooKeeper
Install a ZooKeeper cluster on all three machines; see the separate ZooKeeper installation document for details.
III. Install and Configure Hadoop
Upload the Hadoop tarball to the machines and extract it to /opt (or any directory of your choosing):
tar zxvf hadoop-3.2.3.tar.gz -C /opt
1. Create the required directories
sudo mkdir -p /hadoop/tmp
sudo mkdir -p /hadoop/data
sudo chown -R hdp:hdp /hadoop
2. Edit etc/hadoop/core-site.xml
<configuration>
<!-- Set the HDFS nameservice to nscluster -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://nscluster</value>
</property>
<!-- Hadoop temporary directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/tmp</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>hdp</value>
</property>
<!-- ZooKeeper quorum addresses -->
<property>
<name>ha.zookeeper.quorum</name>
<value>lab1:2181,lab2:2181,lab3:2181</value>
</property>
</configuration>
3. Edit etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<!-- HDFS nameservice, nscluster; must match core-site.xml -->
<property>
<name>dfs.nameservices</name>
<value>nscluster</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
<!-- nscluster has two NameNodes, nn1 and nn2; these logical IDs must not be the same as the hostnames lab1 and lab2 -->
<property>
<name>dfs.ha.namenodes.nscluster</name>
<value>nn1,nn2</value>
</property>
<!-- RPC address of nn1 (the host where nn1 runs) -->
<property>
<name>dfs.namenode.rpc-address.nscluster.nn1</name>
<value>lab1:8020</value>
</property>
<!-- HTTP address of nn1 (web UI, accessed externally) -->
<property>
<name>dfs.namenode.http-address.nscluster.nn1</name>
<value>lab1:50070</value>
</property>
<!-- RPC address of nn2 (the host where nn2 runs) -->
<property>
<name>dfs.namenode.rpc-address.nscluster.nn2</name>
<value>lab2:8020</value>
</property>
<!-- HTTP address of nn2 (web UI, accessed externally) -->
<property>
<name>dfs.namenode.http-address.nscluster.nn2</name>
<value>lab2:50070</value>
</property>
<!-- Where the NameNode shared edits are stored on the JournalNodes (usually co-located with ZooKeeper) -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://lab1:8485;lab2:8485;lab3:8485/nscluster</value>
</property>
<!-- Local directory where each JournalNode stores its data -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/hadoop/journal</value>
</property>
<!-- Enable automatic failover; if you do not use automatic failover, this can be left out for now -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- Proxy provider class that HDFS clients use to contact the NameNodes and determine which one is active -->
<property>
<name>dfs.client.failover.proxy.provider.nscluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing methods for automatic failover; several are available (see the official docs). Here sshfence (log in over SSH and kill the old active NameNode) is tried first, falling back to shell(/bin/true) if it fails -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>
sshfence
shell(/bin/true)
</value>
</property>
<!-- SSH private key; needed only when using the sshfence method -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/admin/.ssh/id_rsa</value>
</property>
<!-- Timeout for the sshfence method; like the key above, not needed if you only use the shell method -->
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
</configuration>
4. Edit etc/hadoop/mapred-site.xml
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/middleware/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/middleware/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/middleware/hadoop</value>
</property>
</configuration>
5. Edit etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Enable ResourceManager HA -->
<!-- Whether ResourceManager HA is enabled; the default is false -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- Cluster ID shared by the two ResourceManagers -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>rmcluster</value>
</property>
<!-- Logical IDs for the ResourceManagers -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<!-- Hostname of each ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>lab1</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>lab2</value>
</property>
<!-- ZooKeeper cluster addresses -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>lab1:2181,lab2:2181,lab3:2181</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Enable recovery so that running applications can resume if an RM fails mid-job; the default is false -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<!-- Store ResourceManager state in the ZooKeeper cluster; the default is the FileSystem store -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<!-- Without the following entries, MapReduce jobs can fail. The suspected cause is AM/RM communication: with one active and one standby RM, requests that reach the active RM succeed, but requests that land on the standby RM fail unless these explicit addresses are configured -->
<property>
<name>yarn.resourcemanager.address.rm1</name>
<value>lab1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm1</name>
<value>lab1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>lab1:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm1</name>
<value>lab1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm1</name>
<value>lab1:8033</value>
</property>
<property>
<name>yarn.resourcemanager.ha.admin.address.rm1</name>
<value>lab1:23142</value>
</property>
<property>
<name>yarn.resourcemanager.address.rm2</name>
<value>lab2:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>lab2:8030</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>lab2:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address.rm2</name>
<value>lab2:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address.rm2</name>
<value>lab2:8033</value>
</property>
<property>
<name>yarn.resourcemanager.ha.admin.address.rm2</name>
<value>lab2:23142</value>
</property>
</configuration>
Next, add the Hadoop classpath to yarn-site.xml. Run the following command:
[admin@hdpnn1 hadoop]$ hadoop classpath
/middleware/hadoop/etc/hadoop:/middleware/hadoop/share/hadoop/common/lib/*:/middleware/hadoop/share/hadoop/common/*:/middleware/hadoop/share/hadoop/hdfs:/middleware/hadoop/share/hadoop/hdfs/lib/*:/middleware/hadoop/share/hadoop/hdfs/*:/middleware/hadoop/share/hadoop/mapreduce/lib/*:/middleware/hadoop/share/hadoop/mapreduce/*:/middleware/hadoop/share/hadoop/yarn:/middleware/hadoop/share/hadoop/yarn/lib/*:/middleware/hadoop/share/hadoop/yarn/*
Add the output to yarn-site.xml as well:
<property>
<name>yarn.application.classpath</name>
<value>{the output of the hadoop classpath command above}</value>
</property>
6. Configure httpfs
httpfs is configured mainly so that Hue can use it later, because with NameNode HA this is the only way Hue can access HDFS.
vi /soft/hadoop/etc/hadoop/httpfs-site.xml
<property>
<name>httpfs.proxyuser.admin.hosts</name>
<value>*</value>
</property>
<property>
<name>httpfs.proxyuser.admin.groups</name>
<value>*</value>
</property>
7. Set the log output path
Hadoop logs through log4j, so it is enough to configure etc/hadoop/log4j.properties.
[admin@hdpnn2 hadoop]$ more log4j.properties
# Define some default values that can be overridden by system properties
hadoop.root.logger=INFO,console
# Set the log directory here
hadoop.log.dir=/middleware/hadoop/logs
hadoop.log.file=hadoop.log
Edit the slaves file (it specifies the worker nodes); since Hadoop 3.0 the file is named workers instead of slaves.
The slaves/workers file lists the hostnames of all DataNodes and is used only by the batch start/stop scripts.
Hadoop's batch start command is simply a for loop that starts a daemon on every address listed in the file; an example follows.
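For example, assuming all three lab machines run DataNodes (adjust the list to your cluster), etc/hadoop/workers would simply contain:
lab1
lab2
lab3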
8. Configure the other machines
scp the Hadoop directory to the other two machines; delete the share/doc files before copying to speed things up.
[zk@lab1 hadoop-3.2.1]$ pwd
/opt/hadoop-3.2.1
[zk@lab1 hadoop-3.2.1]$ rm -rf share/doc/*
[zk@lab1 opt]$ scp -r hadoop-3.2.1 zk@10.10.10.6:/opt
[zk@lab1 opt]$ scp -r hadoop-3.2.1 zk@10.10.10.7:/opt
On the ResourceManagers, i.e. lab1 and lab2, add the HA id to yarn-site.xml.
On lab1 add:
<property>
<name>yarn.resourcemanager.ha.id</name>
<value>rm1</value>
</property>
On lab2 add:
<property>
<name>yarn.resourcemanager.ha.id</name>
<value>rm2</value>
</property>
IV. Start Hadoop
1. Start ZooKeeper
On every machine, start ZooKeeper from its installation directory with zkServer.sh start; a quick status check follows below.
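A quick sanity check (if the ensemble is healthy, one node reports leader and the others follower):
zkServer.sh status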
2. Start the JournalNodes
On every JournalNode host, start the JournalNode with:
./sbin/hadoop-daemon.sh start journalnode
3. Format the lab1 NameNode (needed only the first time)
./bin/hdfs namenode -format
If the command prints an error or a usage message, it was entered incorrectly; check the command syntax and arguments carefully.
If formatting succeeds, the output contains no errors; it begins with:
2019-11-19 17:23:50,829 INFO namenode.NameNode: STARTUP_MSG:
and ends with:
SHUTDOWN_MSG: Shutting down NameNode at lab1/10.10.10.5
4. Start the lab1 NameNode
./sbin/hadoop-daemon.sh start namenode
Use jps to check that the NameNode started:
[zk@lab1 journal]$ jps
2713 NameNode
2154 JournalNode
1900 QuorumPeerMain
2780 Jps
5. Sync the lab1 NameNode metadata on the other NameNode
./bin/hdfs namenode -bootstrapStandby
6. Start the lab2 NameNode
./sbin/hadoop-daemon.sh start namenode
7. Format the HA state in ZooKeeper
./bin/hdfs zkfc -formatZK
8. Start all services
1) Start HDFS
The NameNodes achieve HA through the ZKFailoverController (zkfc); the start-dfs.sh script starts zkfc automatically.
./sbin/start-dfs.sh
2) Start YARN
./sbin/start-yarn.sh
3) Start the job history server
./bin/mapred --daemon start historyserver
4) Start the httpfs service
./bin/hdfs --daemon start httpfs
# Note: to start zkfc on its own, use the following command
./bin/hdfs --daemon start zkfc
V. Force a Node to Become Active (Manual Failover)
At this point, open the NameNode web UI on port 50070: both NameNodes are in standby state.
http://10.10.10.5:50070 (web UI)
http://10.10.10.6:50070 (web UI)
You can force one of the nodes to become active manually:
./bin/hdfs haadmin -transitionToActive nn1 --forcemanual
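You can confirm the result with haadmin (nn1 and nn2 are the logical names from hdfs-site.xml); after the forced transition nn1 should report active and nn2 standby:
./bin/hdfs haadmin -getServiceState nn1
./bin/hdfs haadmin -getServiceState nn2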
VI. Configure Automatic Failover
First stop the whole cluster (leave ZooKeeper running) and format ZKFC:
./bin/hdfs zkfc -formatZK
Check that the command and its output are correct; the following output is for reference:
2022-04-08 10:58:37,852 INFO ha.ActiveStandbyElector: Session connected.
The configured parent znode /hadoop-ha/nscluster already exists.
Are you sure you want to clear all failover information from
ZooKeeper?
WARNING: Before proceeding, ensure that all HDFS services and
failover controllers are stopped!
Proceed formatting /hadoop-ha/nscluster? (Y or N) y
2022-04-08 10:58:46,639 INFO ha.ActiveStandbyElector: Recursively deleting /hadoop-ha/nscluster from ZK…
2022-04-08 10:58:46,658 INFO ha.ActiveStandbyElector: Successfully deleted /hadoop-ha/nscluster from ZK.
2022-04-08 10:58:46,684 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/nscluster in ZK.
2022-04-08 10:58:46,689 INFO zookeeper.ZooKeeper: Session: 0x300000279580000 closed
2022-04-08 10:58:46,691 INFO zookeeper.ClientCnxn: EventThread shut down for session: 0x300000279580000
2022-04-08 10:58:46,692 INFO tools.DFSZKFailoverController: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DFSZKFailoverController at vm-10-10-0-1/10.10.0.1
************************************************************/
Log in to ZooKeeper on one of the slave nodes (lab2 here):
[zk@lab2 hadoop]$ zkCli.sh
Connecting to localhost:2181
Run ls /; a new hadoop-ha znode has appeared, which indicates the configuration is correct:
[zk: localhost:2181(CONNECTED) 0] ls /
[zookeeper, hadoop-ha]
Start the cluster: on the master, run sbin/start-dfs.sh.
Now, if you kill the active NameNode process, the other NameNode should switch from standby to active; a test sketch follows below.
If it does not switch over, see the troubleshooting section in the appendix.
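A minimal test sketch, assuming lab1 currently holds the active NameNode (the pid placeholder is whatever jps reports on that host):
jps | grep NameNode
kill -9 <NameNode pid>
./bin/hdfs haadmin -getServiceState nn2
After a few seconds nn2 should report active; if it stays standby, check the zkfc log as described in the appendix.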
VII. Start YARN and the History Server
Test ResourceManager HA.
1. Start the ResourceManagers
sbin/start-yarn.sh
In a browser, entering 10.10.10.1:8088 redirects automatically to the active RM; hitting the standby RM directly shows the redirect notice:
[zk@lab1 bin]$ curl 10.10.10.6:8088
This is standby RM. The redirect url is: /
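The RM states can also be checked from the command line (rm1 and rm2 are the ids from yarn-site.xml); one should report active and the other standby:
./bin/yarn rmadmin -getServiceState rm1
./bin/yarn rmadmin -getServiceState rm2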
2. Start the historyserver
In 3.x use ./bin/mapred --daemon start historyserver
In 2.x use mr-jobhistory-daemon.sh start historyserver
MapReduce runs on YARN, so the history server records the history of jobs that ran on YARN.
Check that the server has started.
VIII. WordCount Test
Upload a text file to the /input directory in HDFS.
Use the following command to test the MapReduce wordcount example.
If the job succeeds, MapReduce writes the result to the /output directory; the output directory is created automatically by MapReduce and must not exist beforehand.
hadoop jar /opt/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /input/ /output
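For example, a minimal end-to-end run (wc.txt is any local text file you choose; part-r-00000 is the default name of the reducer output file):
hdfs dfs -mkdir -p /input
hdfs dfs -put wc.txt /input/
hadoop jar /opt/hadoop-3.2.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /input/ /output
hdfs dfs -cat /output/part-r-00000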
If Hive or a MapReduce job fails with the following error:
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
Edit mapred-site.xml as the message asks:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/middleware/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/middleware/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/middleware/hadoop</value>
</property>
Then edit yarn-site.xml.
Run the following command:
[admin@hdpnn1 hadoop]$ hadoop classpath
/middleware/hadoop/etc/hadoop:/middleware/hadoop/share/hadoop/common/lib/*:/middleware/hadoop/share/hadoop/common/*:/middleware/hadoop/share/hadoop/hdfs:/middleware/hadoop/share/hadoop/hdfs/lib/*:/middleware/hadoop/share/hadoop/hdfs/*:/middleware/hadoop/share/hadoop/mapreduce/lib/*:/middleware/hadoop/share/hadoop/mapreduce/*:/middleware/hadoop/share/hadoop/yarn:/middleware/hadoop/share/hadoop/yarn/lib/*:/middleware/hadoop/share/hadoop/yarn/*
Add the output to yarn-site.xml:
<property>
<name>yarn.application.classpath</name>
<value>{the output of the hadoop classpath command above}</value>
</property>
Distribute the configuration files to all machines and restart the cluster, for example as follows.
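One way to do this, a sketch run from the Hadoop home directory on lab1, assuming the /opt/hadoop-3.2.1 layout and hosts used in the scp step earlier (adjust paths and users to your environment):
scp etc/hadoop/mapred-site.xml etc/hadoop/yarn-site.xml zk@10.10.10.6:/opt/hadoop-3.2.1/etc/hadoop/
scp etc/hadoop/mapred-site.xml etc/hadoop/yarn-site.xml zk@10.10.10.7:/opt/hadoop-3.2.1/etc/hadoop/
./sbin/stop-yarn.sh && ./sbin/start-yarn.sh
./sbin/stop-dfs.sh && ./sbin/start-dfs.sh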
IX. History Server Installation
Purpose: view the history of completed jobs.
If the history server was not set up earlier, it can be installed as follows.
Edit mapred-site.xml under etc/hadoop and add the following.
Note: one address is the internal (RPC) port and the other is the external (web) port.
<!-- History server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop102:10020</value>
</property>
<!-- History server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop102:19888</value>
</property>
Distribute the configuration file to the other machines.
Start the history server on hadoop102.
In 3.x use mapred --daemon start historyserver
In 2.x use mr-jobhistory-daemon.sh start historyserver
MapReduce runs on YARN, so the history server records the history of jobs that ran on YARN.
Check that the server has started, for example as follows.
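For example, on hadoop102 the daemon should show up in jps:
jps | grep JobHistoryServer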
Appendix: Troubleshooting
Automatic HA failover does not work.
After the active NameNode process is killed, the standby NameNode does not become active.
Checking the hadoop-zk-zkfc-lab1.log file shows a Java exception; right before the error, the zkfc logs "Should fence".
As shown in the log below:
2019-11-19 19:55:10,949 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at lab2/10.10.10.6:8020
2019-11-19 19:55:11,961 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: lab2/10.10.10.6:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2019-11-19 19:55:11,969 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at lab2/10.10.10.6:8020 standby (unable to connect)
java.net.ConnectException: Call From lab1/10.10.10.5 to lab2:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:833)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:757)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1549)
at org.apache.hadoop.ipc.Client.call(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1388)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:113)
at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:520)
at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:510)
at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:60)
at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:933)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:992)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:891)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:804)
at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:421)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1606)
at org.apache.hadoop.ipc.Client.call(Client.java:1435)
… 15 more
Looking at the WARN entries, fencing was unsuccessful because fuser could not be found:
2019-11-19 19:18:43,057 WARN org.apache.hadoop.ha.SshFenceByTcpPort: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser: command not found
2019-11-19 19:18:43,074 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
Putting this together, installing fuser should be enough, so find out which package provides it:
[zk@lab1 logs]$ yum whatprovides fuser
Loaded plugins: fastestmirror
Determining fastest mirrors
 * base: mirrors.tuna.tsinghua.edu.cn
 * extras: mirrors.tuna.tsinghua.edu.cn
 * updates: mirrors.tuna.tsinghua.edu.cn
base/7/x86_64/filelists_db | 7.3 MB 00:00:06
extras/7/x86_64/filelists_db | 207 kB 00:00:00
updates/7/x86_64/filelists_db | 2.7 MB 00:00:15
psmisc-22.20-16.el7.x86_64 : Utilities for managing processes on your system
Repo : base
Matched from:
Filename : /usr/sbin/fuser
So simply run yum install -y psmisc; after the installation finishes, run stop-dfs.sh and then start-dfs.sh. Problem solved.