Requirements
Hadoop can run in three modes:
Local (standalone) mode: for local testing.
Pseudo-distributed mode: sometimes used by small companies.
Fully distributed mode: the usual choice for large companies.
Goal: install a fully distributed 3-node Hadoop cluster, with these placement constraints:
Do not install the NameNode and the SecondaryNameNode on the same server.
For memory reasons, do not place the ResourceManager on the same machine as the NameNode or the SecondaryNameNode.
VM installation
Build a template VM first; the remaining cluster machines are then created by cloning it.
Linux version: CentOS 7.5
Network configuration
Virtual network settings
Host NIC settings
VM network settings
/etc/sysconfig/network-scripts/ifcfg-ens33
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.10.100
GATEWAY=192.168.10.2
DNS1=192.168.10.2
Set the hostname
/etc/hostname
hadoop100
Host mappings
/etc/hosts
192.168.10.100 hadoop100
192.168.10.101 hadoop101
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
Reboot the VM
reboot
# check the VM's IP address
ip addr
ping www.baidu.com  # check outbound connectivity
Install dependencies
# extra packages (EPEL repository)
yum install -y epel-release
# network tools (ifconfig etc.); optional
yum install -y net-tools
# text editor; optional
yum install -y vim
# remote sync tool; optional here, but needed by the xsync script later
yum install -y rsync
Disable the firewall
# stop the firewall now
systemctl stop firewalld
# keep the firewall from starting on boot
systemctl disable firewalld.service
Create a regular user
If a regular user was already created during the CentOS install, skip this step.
useradd atiaisi
passwd atiaisi
Grant the atiaisi user sudo privileges
# /etc/sudoers
# NOPASSWD:ALL means sudo will not prompt this user for a password
atiaisi ALL=(ALL) NOPASSWD:ALL
Create the directories that will hold the Hadoop software
mkdir -p /opt/module
mkdir -p /opt/software
chown -R atiaisi:atiaisi /opt/module
chown -R atiaisi:atiaisi /opt/software
Log in to the VM as atiaisi.
Upload hadoop-3.1.3.tar.gz and jdk-8u212-linux-x64.tar.gz to /opt/software.
Configure the JDK
Extract the archive
tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/module
Configure the JDK environment variables
# /etc/profile.d/my_env.sh
# JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
Apply the environment variables
source /etc/profile
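Why a file under /etc/profile.d works: /etc/profile sources every *.sh file in that directory, so the variables take effect in any login shell (or after `source /etc/profile`). A minimal demonstration of the mechanism, using a throwaway directory under /tmp instead of the real /etc:

```shell
# Demonstration only: mimic /etc/profile picking up a profile.d snippet.
# The real file is /etc/profile.d/my_env.sh with the same two lines.
mkdir -p /tmp/profile.d
cat > /tmp/profile.d/my_env.sh <<'EOF'
export JAVA_HOME=/opt/module/jdk1.8.0_212
export PATH=$PATH:$JAVA_HOME/bin
EOF
# /etc/profile does the equivalent of this loop over /etc/profile.d/*.sh:
for f in /tmp/profile.d/*.sh; do . "$f"; done
echo "JAVA_HOME is now $JAVA_HOME"
```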
Install Hadoop
Extract the archive
tar -zxvf hadoop-3.1.3.tar.gz -C /opt/module/
Configure the Hadoop environment variables
# append to /etc/profile.d/my_env.sh
# HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Apply the environment variables
source /etc/profile
Directory layout of the Hadoop installation
[atiaisi@hadoop100 hadoop-3.1.3]$ tree /opt/module/hadoop-3.1.3/ -I "share|lib|libexec|include|NOTICE.txt|README.txt|LICENSE.txt"
/opt/module/hadoop-3.1.3/
├── bin
│ ├── container-executor
│ ├── hadoop
│ ├── hadoop.cmd
│ ├── hdfs # hdfs command
│ ├── hdfs.cmd
│ ├── mapred # mapreduce command
│ ├── mapred.cmd
│ ├── test-container-executor
│ ├── yarn # yarn command
│ └── yarn.cmd
├── etc
│ └── hadoop
│ ├── capacity-scheduler.xml
│ ├── configuration.xsl
│ ├── container-executor.cfg
│ ├── core-site.xml
│ ├── hadoop-env.cmd
│ ├── hadoop-env.sh
│ ├── hadoop-metrics2.properties
│ ├── hadoop-policy.xml
│ ├── hadoop-user-functions.sh.example
│ ├── hdfs-site.xml # HDFS config file
│ ├── httpfs-env.sh
│ ├── httpfs-log4j.properties
│ ├── httpfs-signature.secret
│ ├── httpfs-site.xml
│ ├── kms-acls.xml
│ ├── kms-env.sh
│ ├── kms-log4j.properties
│ ├── kms-site.xml
│ ├── log4j.properties
│ ├── mapred-env.cmd
│ ├── mapred-env.sh
│ ├── mapred-queues.xml.template
│ ├── mapred-site.xml # MapReduce config file
│ ├── shellprofile.d
│ │ └── example.sh
│ ├── ssl-client.xml.example
│ ├── ssl-server.xml.example
│ ├── user_ec_policies.xml.template
│ ├── workers
│ ├── yarn-env.cmd
│ ├── yarn-env.sh
│ ├── yarnservice-log4j.properties
│ └── yarn-site.xml # YARN config file
└── sbin
├── distribute-exclude.sh
├── FederationStateStore
│ ├── MySQL
│ │ ├── dropDatabase.sql
│ │ ├── dropStoreProcedures.sql
│ │ ├── dropTables.sql
│ │ ├── dropUser.sql
│ │ ├── FederationStateStoreDatabase.sql
│ │ ├── FederationStateStoreStoredProcs.sql
│ │ ├── FederationStateStoreTables.sql
│ │ └── FederationStateStoreUser.sql
│ └── SQLServer
│ ├── FederationStateStoreStoreProcs.sql
│ └── FederationStateStoreTables.sql
├── hadoop-daemon.sh
├── hadoop-daemons.sh
├── httpfs.sh
├── kms.sh
├── mr-jobhistory-daemon.sh
├── refresh-namenodes.sh
├── start-all.cmd
├── start-all.sh
├── start-balancer.sh
├── start-dfs.cmd
├── start-dfs.sh # starts HDFS
├── start-secure-dns.sh
├── start-yarn.cmd
├── start-yarn.sh # starts YARN
├── stop-all.cmd
├── stop-all.sh
├── stop-balancer.sh
├── stop-dfs.cmd
├── stop-dfs.sh
├── stop-secure-dns.sh
├── stop-yarn.cmd
├── stop-yarn.sh
├── workers.sh
├── yarn-daemon.sh
└── yarn-daemons.sh
8 directories, 78 files
This completes the template VM.
Cloning the VMs
Log in as root.
Clone three VMs from the template: hadoop102, hadoop103, hadoop104.
On each clone, update the IP address, hostname, and host mappings.
Reboot each clone and verify the configuration; make sure every step is correct before continuing.
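A sketch of the per-clone edits, using the hostnames and addresses from this guide. On a real clone you would edit /etc/hostname and /etc/sysconfig/network-scripts/ifcfg-ens33 as root; the sed pattern is an illustration, shown here against temp copies so it is safe to try:

```shell
# Values for one particular clone (hadoop102 in this example)
HOST=hadoop102
IP=192.168.10.102

# Stand-ins for /etc/hostname and .../ifcfg-ens33, seeded with template values
echo hadoop100 > /tmp/hostname.sample
printf 'BOOTPROTO=static\nIPADDR=192.168.10.100\n' > /tmp/ifcfg.sample

# The actual edits: replace the hostname, rewrite the IPADDR line
echo "$HOST" > /tmp/hostname.sample
sed -i "s/^IPADDR=.*/IPADDR=$IP/" /tmp/ifcfg.sample
cat /tmp/hostname.sample /tmp/ifcfg.sample
```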
Configure passwordless SSH
Log in as atiaisi.
To let machine A (hostnameA) reach machine B (hostnameB) over SSH without a password, run on A:
# generate a key pair
ssh-keygen -t rsa
# copy the public key to machine B
ssh-copy-id hostnameB
After this, A can ssh to B without entering a password.
For B to reach A, run the same steps on B.
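On a real cluster, every node that must reach the others (at minimum hadoop102 and hadoop103, which run the start scripts) repeats the ssh-copy-id step for each peer. A loop sketch; DRY_RUN defaults to 1 here so it only prints the commands — set it to 0 on an actual node after generating a key pair:

```shell
DRY_RUN=${DRY_RUN:-1}   # 1 = just print the commands; 0 = really run ssh-copy-id
copied=""
for host in hadoop102 hadoop103 hadoop104; do
    if [ "$DRY_RUN" = "1" ]; then
        echo "ssh-copy-id $host"
    else
        ssh-copy-id "$host"
    fi
    copied="$copied $host"
done
```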
Writing the xsync distribution script
The script pushes files to every node in the cluster in one command.
#!/bin/bash

# 1. make sure at least one argument was given
if [ $# -lt 1 ]
then
    echo "Not enough arguments!"
    exit
fi

# 2. loop over every machine in the cluster
for host in hadoop102 hadoop103 hadoop104
do
    echo ========== $host ==========
    # 3. loop over every file/directory argument
    for file in "$@"
    do
        # 4. only sync things that actually exist
        if [ -e "$file" ]
        then
            # 5. resolve the absolute parent directory
            pdir=$(cd -P "$(dirname "$file")"; pwd)
            # 6. get the bare file name
            fname=$(basename "$file")
            # recreate the directory remotely, then sync
            ssh "$host" "mkdir -p $pdir"
            rsync -av "$pdir/$fname" "$host:$pdir"
        else
            echo "$file does not exist!"
        fi
    done
done
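The key trick in the script is steps 5 and 6: each argument is resolved to an absolute, symlink-free parent directory plus a bare file name, so the identical path can be recreated on the remote side even when the argument is relative. The resolution itself can be tried locally, with no ssh involved:

```shell
file=/etc/hostname                       # any path works here
pdir=$(cd -P "$(dirname "$file")"; pwd)  # absolute, symlink-free parent dir
fname=$(basename "$file")                # bare file name
echo "would sync $pdir/$fname"
```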
Configure environment variables
Save the script as xsync under /home/atiaisi/bin and make it executable (chmod +x), so the PATH entry below picks it up.
/etc/profile.d/my_env.sh
# XSYNC_HOME
export XSYNC_HOME=/home/atiaisi
export PATH=$PATH:$XSYNC_HOME/bin
Cluster configuration
Log in as atiaisi.
Config file directory: /opt/module/hadoop-3.1.3/etc/hadoop
core-site.xml
Common settings (the core config file)
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- NameNode address -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:8020</value>
</property>
<!-- Hadoop data storage directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-3.1.3/data</value>
<description>A base for other temporary directories.</description>
</property>
<!-- use atiaisi as the static user for HDFS web UI logins -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>atiaisi</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- NameNode (nn) web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop102:9870</value>
</property>
<!-- SecondaryNameNode (2nn) web UI address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:9868</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- use the shuffle auxiliary service for MapReduce -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- ResourceManager host -->
<property>
<description>The hostname of the RM.</description>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<!-- environment variable inheritance -->
<property>
<description>Environment variables that containers may override rather than use NodeManager's default.</description>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<!-- Without log aggregation, logs of finished YARN jobs cannot be viewed from the web UI. -->
<!-- enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- log aggregation server URL -->
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<!-- keep aggregated logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
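As a sanity check on the retention value above, 604800 is exactly 7 days in seconds:

```shell
echo $((7 * 24 * 60 * 60))   # prints 604800
```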
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- run MapReduce jobs on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- JobHistory server IPC address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop102:10020</value>
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<!-- JobHistory server web UI address -->
<!-- Without this, jumping to job history from the YARN web UI fails. -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop102:19888</value>
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>
</configuration>
Configure workers
# /opt/module/hadoop-3.1.3/etc/hadoop/workers
hadoop102
hadoop103
hadoop104
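One known pitfall: the workers file must contain exactly one hostname per line, with no blank lines and no trailing spaces, or the start scripts misbehave. A quick check, run here against a sample copy (on the cluster, point WORKERS at the real file):

```shell
WORKERS=/tmp/workers.sample              # stand-in for etc/hadoop/workers
printf 'hadoop102\nhadoop103\nhadoop104\n' > "$WORKERS"
# count lines that are blank or end in whitespace; should be 0
bad=$(grep -c -E '^[[:space:]]*$|[[:space:]]$' "$WORKERS" || true)
echo "blank or trailing-whitespace lines: $bad"
```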
Distribute the configuration to all nodes
xsync /opt/module/hadoop-3.1.3/etc/hadoop/
Bringing up the cluster
Start the cluster
- Format the NameNode
Run only on hadoop102, and only the first time the cluster is started:
hdfs namenode -format
- Start HDFS
Run only on hadoop102:
start-dfs.sh
Check the cluster state and verify that the NameNode, DataNodes, and 2NN match the initial cluster plan:
jps
- Start YARN
Run only on hadoop103:
start-yarn.sh
Check the cluster state again:
jps
Verify that the ResourceManager and NodeManagers match the plan.
- Start the history server on hadoop102
# start the history server
mapred --daemon start historyserver
# confirm it started
jps
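Putting the configs above together, the expected daemon layout per node is as follows (derived from the core/hdfs/yarn/mapred-site settings in this guide; on a live cluster, compare each list against `ssh <host> jps`):

```shell
# Expected layout for this guide's cluster (bash associative array)
declare -A expected=(
    [hadoop102]="NameNode DataNode NodeManager JobHistoryServer"
    [hadoop103]="ResourceManager DataNode NodeManager"
    [hadoop104]="SecondaryNameNode DataNode NodeManager"
)
for host in hadoop102 hadoop103 hadoop104; do
    echo "$host: ${expected[$host]}"
    # live check would be: ssh "$host" jps
done
```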
- View the HDFS NameNode web UI
Open http://hadoop102:9870 in a browser
to inspect the data stored on HDFS.
- View the YARN ResourceManager web UI
Open http://hadoop103:8088 in a browser
to see the jobs running on YARN.
Testing the cluster
Upload files to the cluster
# create a directory on HDFS
hadoop fs -mkdir /packages
# upload files
hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /packages
hadoop fs -put /tmp/words.txt /
Check the uploaded files
Run the wordcount example to trigger a MapReduce job:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount /words.txt /output
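What wordcount computes can be mimicked locally on a sample file; the MR job writes the same word/count pairs to /output on HDFS (typically viewable afterwards with `hadoop fs -cat /output/part-r-00000`). A pipeline sketch of the map (split into words) and reduce (count per word) steps:

```shell
printf 'hello world\nhello hadoop\n' > /tmp/words.sample
# split into one word per line, then count occurrences of each word
result=$(tr -s ' ' '\n' < /tmp/words.sample | sort | uniq -c | awk '{print $2"\t"$1}')
echo "$result"
```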
Job record list
Job record details
Job record logs