Hadoop Notes (Part 0)
Notes from studying the 尚硅谷 (Shang Silicon Valley) Hadoop course.
1. Setup
1.1 Three Deployment Modes
- Local (standalone) mode
- Data is stored on the local Linux filesystem; storage relies on Linux itself
- Used occasionally for testing
- Pseudo-distributed mode
- Data is stored in HDFS, but everything runs on a single machine
- Fully distributed mode
- Data is stored in HDFS and the work is spread across multiple servers
The cluster here is built in fully distributed mode.
Environment:
- CentOS 7
- JDK 1.8
- Hadoop 3.3.1 (note: later switched to 3.1.3 for Windows compatibility)
Required packages:
- epel-release: an extra software repository for Red Hat-family distributions
- net-tools
- vim
Preparation steps:
- Disable the firewall
- Create a regular user and grant it privileges
- Create a working directory under /opt and change its owner and group
- Configure a static IP
- Change the hostname
- Add hostname mappings
- Set up passwordless SSH
- Uninstall any pre-installed JDK
- Install the JDK
- Install Hadoop
- Configure environment variables
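The last step, configuring environment variables, can be sketched as follows. The install paths (/opt/jdk1.8, /opt/hadoop) are assumptions for illustration; on a real node the file would be written to /etc/profile.d/my_env.sh rather than the current directory.

```shell
# Assumed install paths -- adjust to wherever the JDK and Hadoop were unpacked.
# On a cluster node this file would live at /etc/profile.d/my_env.sh.
cat > my_env.sh <<'EOF'
export JAVA_HOME=/opt/jdk1.8
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
# Reload the profile and verify
. ./my_env.sh
echo "HADOOP_HOME=$HADOOP_HOME"
```

Putting bin and sbin on PATH is what lets later commands like `hdfs` and `start-dfs.sh` be run without full paths.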
1.2 Testing
- Run the wordcount example in local mode
# Enter the Hadoop installation directory
cd /opt/hadoop
# Create the wcinput folder
mkdir wcinput
# Write some sample data
vim word.txt
# Run it:
# `hadoop jar` runs the wordcount case from the examples jar at the path below;
# input comes from wcinput/, output goes to ./wcoutput (which must not exist yet, or the job errors out)
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount wcinput/ ./wcoutput
Sample data in word.txt
Output:
# Enter wcoutput
cd wcoutput
cat part-r-00000
2. Common Commands
2.1 scp (Secure Copy)
Copies data between servers.
Basic syntax:
# scp -r (recursive) source-path/name destination-user@host:destination-path/name
scp -r $pdir/$fname $user@$host:$pdir/$fname
Examples:
Note: the ownership of the transferred files depends on which user you connect to the destination server as.
# Push from local to remote
scp -r jdk11 root@192.168.111.102:/opt/java/jdk11
# Pull from remote to local
scp -r root@192.168.111.102:/opt/jdk11 /opt/jdk
# Copy from one remote host to another
scp -r root@192.168.111.102:/opt/jdk root@192.168.111.103:/opt/jdk
2.2 rsync (Remote Sync)
The first sync is equivalent to a full copy; from the second sync on, only files that have changed are transferred.
The command is similar to scp:
# -a archive mode, -v verbose output
rsync -av $pdir/$fname $user@$host:$pdir/$fname
2.3 Writing the xsync Script
#!/bin/bash
# 1. Check the argument count
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi
# 2. Loop over every machine in the cluster
for host in rhnode2 rhnode3 rhnode4
do
    echo ==================== $host ====================
    # 3. Loop over every file/directory given and send each one
    for file in $@
    do
        # 4. Check that the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory (with symlinks resolved)
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the file name
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo "$file does not exist!"
        fi
    done
done
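The pdir/fname resolution in steps 5 and 6 can be tried in isolation; /tmp/demo below is just a scratch path for illustration:

```shell
# Resolve a sample path into parent directory and file name, as xsync does.
mkdir -p /tmp/demo/sub && touch /tmp/demo/sub/a.txt
file=/tmp/demo/sub/a.txt
pdir=$(cd -P $(dirname $file); pwd)   # parent directory with symlinks resolved
fname=$(basename $file)               # bare file name
echo "$pdir/$fname"
```

Resolving with `cd -P` matters because a relative argument like `./word.txt` must become an absolute path before it can be recreated with `mkdir -p` on the remote host.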
Remember to make the script executable (chmod +x) and put it somewhere on your PATH.
3. Cluster Configuration
3.1 Planning
Key points:
- Do not install the NameNode and the SecondaryNameNode on the same server
- The ResourceManager also consumes a lot of memory; do not place it on the same machine as the NameNode or SecondaryNameNode
3.2 Configuration
3.2.1 Configuration Files
The four core configuration files live in $HADOOP_HOME/etc/hadoop:
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- mapred-site.xml
3.2.2 Configuring the Cluster
- Core configuration file: core-site.xml
vim $HADOOP_HOME/etc/hadoop/core-site.xml
File contents:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Address of the NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://rhnode2:8020</value>
</property>
<!-- Hadoop data storage directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/data</value>
</property>
<!-- Static user for the HDFS web UI: rh -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>rh</value>
</property>
</configuration>
Port 8020 here is an internal RPC port, used for example by the SecondaryNameNode to communicate with the NameNode.
- HDFS configuration file: hdfs-site.xml
vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- NameNode web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>rhnode2:9870</value>
</property>
<!-- SecondaryNameNode web UI address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>rhnode4:9868</value>
</property>
</configuration>
- YARN configuration file: yarn-site.xml
vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Have MapReduce use the shuffle auxiliary service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Address of the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>rhnode3</value>
</property>
</configuration>
- MapReduce configuration file: mapred-site.xml
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- Without the following settings, jobs can fail with a "could not find or load main class" error -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
- Distribute to every server
xsync /opt/hadoop/etc/hadoop/
# Output
==================== rhnode2 ====================
sending incremental file list
sent 1,003 bytes received 18 bytes 2,042.00 bytes/sec
total size is 111,360 speedup is 109.07
==================== rhnode3 ====================
sending incremental file list
hadoop/
hadoop/core-site.xml
hadoop/hdfs-site.xml
hadoop/mapred-site.xml
hadoop/yarn-site.xml
sent 3,045 bytes received 139 bytes 6,368.00 bytes/sec
total size is 111,360 speedup is 34.97
==================== rhnode4 ====================
sending incremental file list
hadoop/
hadoop/core-site.xml
hadoop/hdfs-site.xml
hadoop/mapred-site.xml
hadoop/yarn-site.xml
sent 3,045 bytes received 139 bytes 6,368.00 bytes/sec
total size is 111,360 speedup is 34.97
3.2.3 Bringing Up the Whole Cluster
- Configure workers
vim /opt/hadoop/etc/hadoop/workers
Delete the default localhost entry and add:
rhnode2
rhnode3
rhnode4
Note: no trailing whitespace at the end of any line and no blank lines anywhere in the file, or parsing will fail later.
Distribute it:
xsync /opt/hadoop/etc/hadoop/workers
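A quick sanity check for that whitespace rule, run against a local copy of the file (the contents below mirror the listing above):

```shell
# Recreate the expected contents, then scan for blank lines or trailing whitespace.
printf 'rhnode2\nrhnode3\nrhnode4\n' > workers
if grep -qE '[[:space:]]$|^$' workers; then
    echo "workers has stray whitespace or blank lines"
else
    echo "workers looks clean"
fi
```

Running the same grep against the real $HADOOP_HOME/etc/hadoop/workers before distributing it catches the formatting mistake early.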
- Starting the cluster
1) If this is the cluster's very first start, format the NameNode on rhnode2 (the node planned as NameNode).
Note: formatting the NameNode generates a new cluster ID. If the NameNode is reformatted while the DataNodes keep the old ID, the IDs no longer match and the cluster cannot find its existing data. To reformat a cluster that has already run, first stop the namenode and datanode processes and delete the data and logs directories on every machine, then format.
hdfs namenode -format
Per the configured paths, this creates data and logs folders under /opt/hadoop.
Record the VERSION file:
cat data/dfs/name/current/VERSION
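For reference, a NameNode VERSION file contains fields along these lines (every value below is a placeholder, not taken from this cluster):

```properties
# The clusterID recorded here must match the clusterID in each DataNode's VERSION file.
namespaceID=<generated>
clusterID=CID-<generated>
cTime=<creation timestamp>
storageType=NAME_NODE
blockpoolID=BP-<generated>
layoutVersion=<negative integer>
```

Reformatting generates a fresh clusterID, which is exactly the mismatch the note above warns about.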
2) Start HDFS
sbin/start-dfs.sh
Then check each of the three nodes against the plan to confirm the right daemons started:
rhnode2
rhnode3
rhnode4
3) Start YARN on the node configured as ResourceManager (rhnode3)
sbin/start-yarn.sh
- Testing
1) Upload a file
hadoop fs -mkdir /input
hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
2) View it
via the web UI,
or directly on the local disk:
cat /opt/hadoop/data/dfs/data/current/BP-457611527-192.168.111.102-1627898916141/current/finalized/subdir0/subdir0/blk_1073741825
3) Run the wordcount program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
Note: this step failed for me; see problem records 1 and 2.
3.2.4 Reformatting After a Crash
# Kill the Hadoop processes
kill -9 xxx
# Confirm with jps
jps
# Delete the data and logs folders
rm -rf data/ logs/
# Reformat
hdfs namenode -format
3.3 Configuring the Job History Server
Purpose: once a finished job's web page is closed it cannot be reopened; the history server makes completed jobs viewable again.
- Edit mapred-site.xml
vim mapred-site.xml
Add the following:
<!-- History server RPC address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>rhnode2:10020</value>
</property>
<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>rhnode2:19888</value>
</property>
- Distribute the config
xsync etc/hadoop/mapred-site.xml
- Start the history server on the configured node
mapred --daemon start historyserver
- Check with jps
- Check the web UI
http://rhnode2:19888/jobhistory
3.4 Enabling Log Aggregation
After an application finishes, its logs are uploaded to HDFS, gathering the logs from all three nodes in one place.
Note: enabling log aggregation requires restarting the NodeManager, ResourceManager, and HistoryServer.
- Edit yarn-site.xml
Add the following:
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Log server URL -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://rhnode2:19888/jobhistory/logs</value>
</property>
<!-- Keep aggregated logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
- Distribute the config
xsync etc/hadoop/yarn-site.xml
- Stop the NodeManager, ResourceManager, and HistoryServer
sbin/stop-yarn.sh
mapred --daemon stop historyserver
- Start them again
start-yarn.sh
mapred --daemon start historyserver
- Delete the existing output directory on HDFS
hadoop fs -rm -r /output
- Rerun the wordcount program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
- View the logs
http://rhnode2:19888/jobhistory
4. Useful Scripts
4.1 Ways to Start and Stop the Cluster
- Start and stop by module
1) Start/stop all of HDFS
start-dfs.sh / stop-dfs.sh
2) Start/stop all of YARN
start-yarn.sh / stop-yarn.sh
- Start and stop individual daemons
1) HDFS daemons
hdfs --daemon start/stop namenode/datanode/secondarynamenode
2) YARN daemons
yarn --daemon start/stop resourcemanager/nodemanager
Unified start/stop script: myhadoop
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi
case $1 in
"start")
    echo " =================== Starting the Hadoop cluster ==================="
    echo " --------------- starting hdfs ---------------"
    ssh rhnode2 "/opt/hadoop/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh rhnode3 "/opt/hadoop/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh rhnode2 "/opt/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " =================== Stopping the Hadoop cluster ==================="
    echo " --------------- stopping historyserver ---------------"
    ssh rhnode2 "/opt/hadoop/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh rhnode3 "/opt/hadoop/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh rhnode2 "/opt/hadoop/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
Script to check the Java processes on all three servers: jpscall
#!/bin/bash
for host in rhnode2 rhnode3 rhnode4
do
    echo =============== $host ===============
    ssh $host jps
done
5. Common Interview Questions
5.1 Common Port Numbers
Hadoop 3.x
- HDFS NameNode internal RPC: 8020 / 9000 / 9820
- HDFS NameNode web UI: 9870
- YARN job monitoring web UI: 8088
- Job history server: 19888
Hadoop 2.x
- HDFS NameNode internal RPC: 8020 / 9000
- HDFS NameNode web UI: 50070
- YARN job monitoring web UI: 8088
- Job history server: 19888
5.2 Common Configuration Files
Hadoop 3.x
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- mapred-site.xml
- workers ⭐️
Hadoop 2.x
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- mapred-site.xml
- slaves ⭐️