Hadoop Cluster Setup (Part 1)
Preparation
Writing a cluster distribution script
1. First create a bin directory; scripts placed under /home/hadoop/bin can be run by the hadoop user from anywhere on the system
[hadoop@node120 ~]$ mkdir bin
2. Create a file named xsync in the /home/hadoop/bin directory so it can be invoked globally
[hadoop@node120 bin]$ cd /home/hadoop/bin/
[hadoop@node120 bin]$ vim xsync
Write the following code in the file
#!/bin/bash
#1. Check the argument count
if [ $# -lt 1 ]
then
echo "Not Enough Arguments!"
exit;
fi
#2. Loop over every machine in the cluster
for host in node120 node130 node140
do
echo ================== $host ===================
#3. Loop over all files/directories and send them one by one
for file in "$@"
do
#4. Check that the file exists
if [ -e "$file" ]
then
#5. Get the parent directory (physical path, symlinks resolved)
pdir=$(cd -P "$(dirname "$file")"; pwd)
#6. Get the file name
fname=$(basename "$file")
ssh $host "mkdir -p $pdir"
rsync -av "$pdir/$fname" $host:"$pdir"
else
echo "$file does not exist!"
fi
done
done
3. Make the script executable
[hadoop@node120 bin]$ chmod +x xsync
4. Test the script
[hadoop@node120 bin]$ xsync xsync
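The parent-directory and filename resolution inside xsync can be tried locally. This is a sketch with illustrative paths (the /tmp directory below is made up for the demo); it shows what `cd -P`, `dirname`, and `basename` produce for a file:

```shell
# Illustrative demo of the path resolution xsync performs for each argument
mkdir -p /tmp/xsync_demo/sub
touch /tmp/xsync_demo/sub/a.txt
file=/tmp/xsync_demo/sub/a.txt
pdir=$(cd -P "$(dirname "$file")"; pwd)   # physical parent directory, symlinks resolved
fname=$(basename "$file")                  # bare file name
echo "$pdir/$fname"
```

rsync then receives `$pdir/$fname` as the source after `ssh $host "mkdir -p $pdir"` has recreated the parent directory remotely, so relative and symlinked paths distribute predictably.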
Passwordless SSH Configuration
1. Generate a public/private key pair on node120
[hadoop@node120 ~]$ ssh-keygen -t rsa
Press Enter three times; two files are generated: id_rsa (private key) and id_rsa.pub (public key)
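For scripted setups, ssh-keygen can also run without the three Enter presses. This is a hedged sketch: the key path below is a throwaway demo path, not the cluster's real ~/.ssh/id_rsa:

```shell
# Generate a throwaway RSA key pair non-interactively (demo path only)
rm -f /tmp/demo_rsa /tmp/demo_rsa.pub
ssh-keygen -t rsa -N "" -f /tmp/demo_rsa -q   # -N "": empty passphrase, -q: quiet
ls /tmp/demo_rsa /tmp/demo_rsa.pub
```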
2. Copy node120's public key to every machine that should accept passwordless logins
[hadoop@node120 .ssh]$ ssh-copy-id node120
[hadoop@node120 .ssh]$ ssh-copy-id node130
[hadoop@node120 .ssh]$ ssh-copy-id node140
3. Generate a public/private key pair on node130
[hadoop@node130 ~]$ ssh-keygen -t rsa
4. Copy node130's public key to the target machines (node130 runs the ResourceManager and must also reach the other nodes)
[hadoop@node130 ~]$ ssh-copy-id node120
[hadoop@node130 ~]$ ssh-copy-id node130
[hadoop@node130 ~]$ ssh-copy-id node140
JDK Installation
1. Uninstall any existing JDK (all 3 nodes)
sudo rpm -qa | grep -i java | xargs -n1 sudo rpm -e --nodeps
Since I did a minimal OS install, I can skip this step
1) rpm -qa: list all installed packages
2) grep -i: filter case-insensitively
3) xargs -n1: pass the previous command's results one value at a time
4) rpm -e --nodeps: uninstall a package without checking dependencies
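How `xargs -n1` splits the package list can be seen locally without touching rpm; the package names below are made up for the demo:

```shell
# xargs -n1 runs the command once per whitespace-separated item from stdin
out=$(printf 'java-1.8.0-openjdk\njavapackages-tools\n' | xargs -n1 echo would-remove)
echo "$out"
```

In the real pipeline, each line of `rpm -qa | grep -i java` output becomes one `rpm -e --nodeps <package>` invocation.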
2. Upload the JDK package to /opt/software (SecureFX is used here)
3. Extract the JDK into /opt/module
[hadoop@node120 java]$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module/
4. Configure the JDK environment variables
(1) Create the file /etc/profile.d/my_env.sh
[hadoop@node120 java]$ sudo vi /etc/profile.d/my_env.sh
Add the following content
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
(2) Apply the environment variables
[hadoop@node120 java]$ source /etc/profile.d/my_env.sh
5. Verify that the JDK is installed
[hadoop@node120 java]$ java -version
If you see
java version "1.8.0_212"
Java(TM) SE Runtime Environment (build 1.8.0_212-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.212-b10, mixed mode)
the JDK is installed correctly
6. Distribute the JDK
[hadoop@node120 ~]$ xsync /opt/module/jdk1.8.0_212/
7. Distribute the environment variable file
[hadoop@node120 ~]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh
8. Run source /etc/profile.d/my_env.sh on node130 and node140 respectively
Writing a script to check all cluster processes
(1). Create the script xcall.sh in the /home/hadoop/bin directory
[hadoop@node120 bin]$ vi xcall.sh
(2). Write the following content in the script
#!/bin/bash
for i in node120 node130 node140
do
echo ------ $i ------
ssh $i "$*"
done
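xcall.sh quotes `$*` so the whole argument list reaches each node as a single remote command string. The difference between `"$*"` and `"$@"` can be checked locally:

```shell
# "$*" joins all arguments with spaces into one word; "$@" keeps them separate
set -- jps -l
joined="$*"
count=0
for arg in "$@"; do count=$((count + 1)); done
echo "joined='$joined' separate_args=$count"
```

With `ssh $i "$*"`, the full command line (e.g. `jps -l`) is handed to the remote shell intact, which is what a broadcast-and-run script wants.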
(3). Make the script executable
[hadoop@node120 bin]$ chmod 777 xcall.sh
(4). Run the script
[hadoop@node120 bin]$ xcall.sh jps
------ node120 ------
1522 Jps
------ node130 ------
1419 Jps
------ node140 ------
1300 Jps
Hadoop Installation
1.1 Hadoop Deployment
1. Cluster plan
| | node120 | node130 | node140 |
|---|---|---|---|
| HDFS | NameNode, DataNode | DataNode | DataNode, SecondaryNameNode |
| Yarn | NodeManager | ResourceManager, NodeManager | NodeManager |
2. Upload the archive to /opt/software/hadoop
3. Extract the installation file into /opt/module
[hadoop@node120 hadoop]$ tar zxvf hadoop-3.1.3.tar.gz -C /opt/module/
4. Add Hadoop to the environment variables
(1). Get the Hadoop installation path
[hadoop@node120 hadoop-3.1.3]$ pwd
/opt/module/hadoop-3.1.3
(2). Open the /etc/profile.d/my_env.sh file
[hadoop@node120 hadoop-3.1.3]$ sudo vim /etc/profile.d/my_env.sh
Add the following content
##HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
(3). Distribute the environment variable file
[hadoop@node120 hadoop-3.1.3]$ sudo /home/hadoop/bin/xsync /etc/profile.d/my_env.sh
(4). Source the file to apply it (all 3 nodes)
[hadoop@node120 hadoop-3.1.3]$ source /etc/profile.d/my_env.sh
[hadoop@node130 ~]$ source /etc/profile.d/my_env.sh
[hadoop@node140 ~]$ source /etc/profile.d/my_env.sh
1.2 Cluster Configuration
1) Core configuration file
Configure core-site.xml
Go to the directory where the configuration files live
[hadoop@node120 hadoop]$ pwd
/opt/module/hadoop-3.1.3/etc/hadoop
[hadoop@node120 hadoop]$ vim core-site.xml
The file content is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- NameNode address -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://node120:8020</value>
</property>
<!-- Hadoop data storage directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-3.1.3/data</value>
</property>
<!-- Static user for the HDFS web UI: hadoop -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>hadoop</value>
</property>
<!-- Hosts from which the hadoop superuser may proxy -->
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<!-- Groups whose members the hadoop superuser may impersonate -->
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
<!-- Users the hadoop superuser may impersonate -->
<property>
<name>hadoop.proxyuser.hadoop.users</name>
<value>*</value>
</property>
</configuration>
2) HDFS configuration file
Configure hdfs-site.xml
[hadoop@node120 hadoop]$ vim hdfs-site.xml
The file content is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- NameNode web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>node120:9870</value>
</property>
<!-- SecondaryNameNode web UI address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node140:9868</value>
</property>
<!-- HDFS replication factor for this test environment: 3 -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
</configuration>
3) YARN configuration file
Configure yarn-site.xml
[hadoop@node120 hadoop]$ vim yarn-site.xml
The file content is as follows:
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Use the shuffle auxiliary service for MapReduce -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- 指定ResourceManager的地址 -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node130</value>
</property>
<!-- Environment variable inheritance -->
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- Minimum and maximum memory YARN may allocate per container -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>
<!-- Total physical memory the NodeManager may manage -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- Disable YARN's virtual-memory limit check -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Log aggregation server URL -->
<property>
<name>yarn.log.server.url</name>
<value>http://node120:19888/jobhistory/logs</value>
</property>
<!-- Keep aggregated logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
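The retention value above is simply 7 days expressed in seconds:

```shell
# 7 days * 24 hours * 3600 seconds = 604800
echo $((7 * 24 * 3600))
```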
4) MapReduce configuration file
Configure mapred-site.xml
[hadoop@node120 hadoop]$ vim mapred-site.xml
The file content is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- JobHistory server address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>node120:10020</value>
</property>
<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node120:19888</value>
</property>
</configuration>
5) Configure workers
[hadoop@node120 hadoop]$ vim workers
Add to the file
node120
node130
node140
Note: lines in this file must not end with spaces, and the file must not contain blank lines.
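The trailing-space / blank-line rule can be checked mechanically. This sketch writes a sample file to /tmp for the demo; on the cluster you would point the grep at the real workers file:

```shell
# Flag lines with trailing whitespace, or entirely blank lines, in a workers file
printf 'node120\nnode130\nnode140\n' > /tmp/workers_demo
if grep -nE '[[:space:]]+$|^$' /tmp/workers_demo; then
  echo "workers file needs cleanup"
else
  echo "workers file is clean"
fi
```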
1.3 Distribute Hadoop
[hadoop@node120 hadoop]$ xsync /opt/module/hadoop-3.1.3/
1.4 Starting the Cluster
1) Start the cluster
For the very first start, format the NameNode on node120
(Before formatting, be sure to stop any namenode and datanode processes left from a previous run, then delete the data and logs directories)
[hadoop@node120 hadoop-3.1.3]$ bin/hdfs namenode -format
2) Start HDFS
[hadoop@node120 hadoop-3.1.3]$ sbin/start-dfs.sh
3) Start YARN on the node where the ResourceManager is configured (node130)
[hadoop@node130 hadoop-3.1.3]$ sbin/start-yarn.sh
4) View the HDFS web UI at http://node120:9870, then check the processes:
[hadoop@node120 hadoop-3.1.3]$ xcall.sh jps
------ node120 ------
12400 NodeManager
12513 Jps
12081 DataNode
11960 NameNode
------ node130 ------
11655 ResourceManager
12119 Jps
11483 DataNode
11772 NodeManager
------ node140 ------
11350 DataNode
11463 SecondaryNameNode
11547 NodeManager
11660 Jps
Check that the running processes match the plan
1.5 Hadoop Cluster Start/Stop Script
(1). Go to the /home/hadoop/bin directory
[hadoop@node120 ~]$ cd bin/
[hadoop@node120 bin]$ pwd
/home/hadoop/bin
(2). Edit the script
[hadoop@node120 bin]$ vim hdp.sh
Enter the following content:
#!/bin/bash
if [ $# -lt 1 ]
then
echo "No Args Input..."
exit;
fi
case $1 in
"start")
echo "================== starting hadoop cluster =============="
echo "------------------ starting hdfs --------------------"
ssh node120 "/opt/module/hadoop-3.1.3/sbin/start-dfs.sh"
echo "------------------ starting yarn --------------------"
ssh node130 "/opt/module/hadoop-3.1.3/sbin/start-yarn.sh"
echo "------------------ starting historyserver -----------"
ssh node120 "/opt/module/hadoop-3.1.3/bin/mapred --daemon start historyserver"
;;
"stop")
echo "================== stopping hadoop cluster =============="
echo "------------------ stopping historyserver -----------"
ssh node120 "/opt/module/hadoop-3.1.3/bin/mapred --daemon stop historyserver"
echo "------------------ stopping yarn --------------------"
ssh node130 "/opt/module/hadoop-3.1.3/sbin/stop-yarn.sh"
echo "------------------ stopping hdfs --------------------"
ssh node120 "/opt/module/hadoop-3.1.3/sbin/stop-dfs.sh"
;;
*)
echo "Input Args Error....."
;;
esac
(3). Make the script executable
[hadoop@node120 bin]$ chmod +x hdp.sh
1.6 Configure Hadoop LZO Compression Support
1) Building hadoop-lzo
Hadoop itself does not support LZO compression, so we use Twitter's open-source hadoop-lzo component. hadoop-lzo must be compiled against hadoop and lzo; the build steps are as follows:
0. Prepare the environment
maven (download and install, configure environment variables, add the Aliyun mirror to settings.xml)
gcc-c++
zlib-devel
autoconf
automake
libtool
Install the rest via yum: yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
1. Download, build, and install LZO
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
tar -zxvf lzo-2.10.tar.gz
cd lzo-2.10
./configure --prefix=/usr/local/hadoop/lzo/
make
make install
2. Build the hadoop-lzo source
2.1 Download the hadoop-lzo source from https://github.com/twitter/hadoop-lzo/archive/master.zip
2.2 After extracting, edit pom.xml
<hadoop.current.version>3.1.3</hadoop.current.version>
2.3 Export two temporary environment variables
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
2.4 Build
Enter hadoop-lzo-master and run the Maven build command
mvn package -Dmaven.test.skip=true
2.5 In target/, hadoop-lzo-0.4.21-SNAPSHOT.jar is the successfully built hadoop-lzo component
2) Copy the built jar into hadoop/share/hadoop/common/
[hadoop@node120 hadoop]$ cp hadoop-lzo-0.4.21-SNAPSHOT.jar /opt/module/hadoop-3.1.3/share/hadoop/common/
3) Sync the jar to the other nodes
[hadoop@node120 common]$ xsync hadoop-lzo-0.4.21-SNAPSHOT.jar
4) Add LZO support to core-site.xml
[hadoop@node120 hadoop-3.1.3]$ cd etc/hadoop/
[hadoop@node120 hadoop]$ vim core-site.xml
Add the following content
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
5) Sync core-site.xml to node130 and node140
[hadoop@node120 hadoop]$ xsync core-site.xml
6) Restart the cluster
[hadoop@node120 hadoop]$ hdp.sh stop
[hadoop@node120 hadoop]$ hdp.sh start
7) Test: prepare data
[hadoop@node120 hadoop]$ hadoop fs -mkdir /input
[hadoop@node120 hadoop-3.1.3]$ hadoop fs -put README.txt /input
8) Test: compression
[hadoop@node120 hadoop-3.1.3]$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec /input /output
Zookeeper Installation
1. Cluster plan
| | node120 | node130 | node140 |
|---|---|---|---|
| Zookeeper | Zookeeper | Zookeeper | Zookeeper |
2. Extract and install
(1). Extract the Zookeeper package into /opt/module
[hadoop@node120 zookeeper]$ tar zxvf apache-zookeeper-3.5.7-bin.tar.gz -C /opt/module/
(2). Rename /opt/module/apache-zookeeper-3.5.7-bin to zookeeper-3.5.7
[hadoop@node120 module]$ mv apache-zookeeper-3.5.7-bin/ zookeeper-3.5.7
(3). Sync to the other nodes
[hadoop@node120 module]$ xsync zookeeper-3.5.7/
3. Configure the server id
(1). Create a zkData directory under /opt/module/zookeeper-3.5.7
[hadoop@node120 module]$ cd zookeeper-3.5.7/
[hadoop@node120 zookeeper-3.5.7]$ mkdir zkData
(2). Create a file named myid in the /opt/module/zookeeper-3.5.7/zkData directory
[hadoop@node120 zookeeper-3.5.7]$ cd zkData/
[hadoop@node120 zkData]$ vi myid
Add the number corresponding to this server in the file:
2
(3). Copy the configured file to the other machines
[hadoop@node120 zkData]$ xsync myid
Then change the myid content to 3 on node130 and to 4 on node140
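The host-to-id mapping can be scripted rather than edited by hand. This sketch writes demo files under /tmp; on the real cluster you would ssh to each node and write /opt/module/zookeeper-3.5.7/zkData/myid instead:

```shell
# Assign each node its server id and write a demo myid file per host
declare -A ids=([node120]=2 [node130]=3 [node140]=4)
for host in node120 node130 node140; do
  mkdir -p "/tmp/zk_demo/$host/zkData"
  echo "${ids[$host]}" > "/tmp/zk_demo/$host/zkData/myid"
done
cat /tmp/zk_demo/node130/zkData/myid   # prints 3
```

Each id must match the `server.N=` entries added to zoo.cfg below.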
4. Configure the zoo.cfg file
(1). In /opt/module/zookeeper-3.5.7/conf, copy zoo_sample.cfg to zoo.cfg
[hadoop@node120 conf]$ cp zoo_sample.cfg zoo.cfg
(2). Open the zoo.cfg file
[hadoop@node120 conf]$ vim zoo.cfg
Change the data storage path
dataDir=/opt/module/zookeeper-3.5.7/zkData
Add the following configuration
#######################cluster##########################
server.2=node120:2888:3888
server.3=node130:2888:3888
server.4=node140:2888:3888
(3). Sync the configuration file
[hadoop@node120 conf]$ xsync zoo.cfg
(4). Parameter explanation
server.A=B:C:D
A is a number identifying the server. In cluster mode, each server's dataDir holds a myid file containing this value; on startup Zookeeper reads it and matches it against the entries in zoo.cfg to determine which server it is.
B is the server's address.
C is the port this server's Follower uses to exchange data with the cluster's Leader.
D is the port used to elect a new Leader if the current Leader fails; the servers talk to each other over this port during elections.
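The four fields can be pulled apart with plain parameter expansion, which makes the A/B/C/D roles concrete:

```shell
# Split server.A=B:C:D into its parts
line="server.2=node120:2888:3888"
a=${line#server.}; a=${a%%=*}   # A: server id (matches myid)
rest=${line#*=}
b=${rest%%:*}                   # B: host address
ports=${rest#*:}
c=${ports%%:*}                  # C: follower<->leader data port
d=${ports##*:}                  # D: leader-election port
echo "id=$a host=$b data_port=$c election_port=$d"
```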
5. Cluster operations
(1). Start Zookeeper on each node
[hadoop@node120 zookeeper-3.5.7]$ bin/zkServer.sh start
[hadoop@node130 zookeeper-3.5.7]$ bin/zkServer.sh start
[hadoop@node140 zookeeper-3.5.7]$ bin/zkServer.sh start
(2). Check the status
[hadoop@node120 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower
[hadoop@node130 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: leader
[hadoop@node140 zookeeper-3.5.7]$ bin/zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/module/zookeeper-3.5.7/bin/../conf/zoo.cfg
Client port found: 2181. Client address: localhost.
Mode: follower
6. Client command-line operations
| Command | Description |
|---|---|
| help | show all commands |
| ls path | list the children of a znode; -w watch for child changes; -s include secondary info |
| create | create a node; -s sequential; -e ephemeral (removed on restart or session timeout) |
| get path | get a node's value; -w watch for content changes; -s include secondary info |
| set | set a node's value |
| stat | show a node's status |
| delete | delete a node |
| deleteall | delete a node recursively |
Start the client
[hadoop@node130 zookeeper-3.5.7]$ bin/zkCli.sh
7. Zookeeper cluster start/stop script
(1). Create the script in the /home/hadoop/bin directory on node120
[hadoop@node120 bin]$ vim zk.sh
Write the following content in the script:
#!/bin/bash
case $1 in
"start" ){
for i in node120 node130 node140
do
echo -------- starting zookeeper on $i --------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh start"
done
};;
"stop"){
for i in node120 node130 node140
do
echo -------- stopping zookeeper on $i --------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh stop"
done
};;
"status"){
for i in node120 node130 node140
do
echo -------- zookeeper status on $i --------
ssh $i "/opt/module/zookeeper-3.5.7/bin/zkServer.sh status"
done
};;
esac
(2). Make the script executable
[hadoop@node120 bin]$ chmod u+x zk.sh