Hadoop Notes (Part 0)
Notes from studying the 尚硅谷 (Shang Silicon Valley) Hadoop course.
1. Setup
1.1 Three Deployment Modes
- Local (standalone) mode
- Data is stored on the local Linux filesystem; storage relies on Linux itself
- Used occasionally for testing
- Pseudo-distributed mode
- Data is stored in HDFS, but everything runs on a single machine
- Fully distributed mode
- Data is stored in HDFS and the work is spread across multiple servers
The cluster here is built in fully distributed mode.
Environment:
- CentOS 7
- JDK 1.8
- Hadoop 3.3.1 (note: later switched to 3.1.3 for Windows compatibility)
Required packages:
- epel-release: an extra software repository for Red Hat-family distributions
- net-tools
- vim
Preparation steps:
- Disable the firewall
- Create a regular user and grant it privileges
- Create a working directory under /opt and change its owner and group
- Configure a static IP
- Change the hostname
- Add hostname mappings
- Set up passwordless SSH
- Uninstall any pre-installed JDK
- Install the JDK
- Install Hadoop
- Configure environment variables
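The last step, configuring environment variables, can be sketched as follows. The install paths (/opt/jdk1.8, /opt/hadoop) are assumptions for illustration; on a real node the file would be written to /etc/profile.d/my_env.sh rather than the current directory.

```shell
# Assumed install paths -- adjust to wherever the JDK and Hadoop were unpacked.
# On a cluster node this file would live at /etc/profile.d/my_env.sh.
cat > my_env.sh <<'EOF'
export JAVA_HOME=/opt/jdk1.8
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
EOF
# Reload the profile and verify
. ./my_env.sh
echo "HADOOP_HOME=$HADOOP_HOME"
```

Putting bin and sbin on PATH is what lets later commands like `hdfs` and `start-dfs.sh` be run without full paths.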
1.2 Testing
- Run the wordcount example in local mode
# Enter the Hadoop installation directory
cd /opt/hadoop
# Create the wcinput folder
mkdir wcinput
# Write some sample data
vim word.txt
# Run it:
# `hadoop jar` runs the wordcount case from the examples jar at the path below;
# input comes from wcinput/, output goes to ./wcoutput (which must not exist yet, or the job errors out)
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount wcinput/ ./wcoutput
Sample data in word.txt
Output:
# Enter wcoutput
cd wcoutput
cat part-r-00000
2. Common Commands
2.1 scp (Secure Copy)
Copies data between servers.
Basic syntax:
# scp -r (recursive) source-path/name destination-user@host:destination-path/name
scp -r $pdir/$fname $user@$host:$pdir/$fname
Examples:
Note: the ownership of the transferred files depends on which user you connect to the destination server as.
# Push from local to remote
scp -r jdk11 root@192.168.111.102:/opt/java/jdk11
# Pull from remote to local
scp -r root@192.168.111.102:/opt/jdk11 /opt/jdk
# Copy from one remote host to another
scp -r root@192.168.111.102:/opt/jdk root@192.168.111.103:/opt/jdk
2.2 rsync (Remote Sync)
The first sync is equivalent to a full copy; from the second sync on, only files that have changed are transferred.
The command is similar to scp:
# -a archive mode, -v verbose output
rsync -av $pdir/$fname $user@$host:$pdir/$fname
2.3 Writing the xsync Script
#!/bin/bash
# 1. Check the argument count
if [ $# -lt 1 ]
then
    echo "Not Enough Arguments!"
    exit
fi
# 2. Loop over every machine in the cluster
for host in rhnode2 rhnode3 rhnode4
do
    echo ==================== $host ====================
    # 3. Loop over every file/directory given and send each one
    for file in $@
    do
        # 4. Check that the file exists
        if [ -e $file ]
        then
            # 5. Get the parent directory (with symlinks resolved)
            pdir=$(cd -P $(dirname $file); pwd)
            # 6. Get the file name
            fname=$(basename $file)
            ssh $host "mkdir -p $pdir"
            rsync -av $pdir/$fname $host:$pdir
        else
            echo "$file does not exist!"
        fi
    done
done
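The pdir/fname resolution in steps 5 and 6 can be tried in isolation; /tmp/demo below is just a scratch path for illustration:

```shell
# Resolve a sample path into parent directory and file name, as xsync does.
mkdir -p /tmp/demo/sub && touch /tmp/demo/sub/a.txt
file=/tmp/demo/sub/a.txt
pdir=$(cd -P $(dirname $file); pwd)   # parent directory with symlinks resolved
fname=$(basename $file)               # bare file name
echo "$pdir/$fname"
```

Resolving with `cd -P` matters because a relative argument like `./word.txt` must become an absolute path before it can be recreated with `mkdir -p` on the remote host.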
Remember to make the script executable (chmod +x) and put it somewhere on your PATH.
3. Cluster Configuration
3.1 Planning
Key points:
- Do not install the NameNode and the SecondaryNameNode on the same server
- The ResourceManager also consumes a lot of memory; do not place it on the same machine as the NameNode or SecondaryNameNode
3.2 Configuration
3.2.1 Configuration Files
The four core configuration files live in $HADOOP_HOME/etc/hadoop:
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- mapred-site.xml
3.2.2 Configuring the Cluster
- Core configuration file: core-site.xml
vim $HADOOP_HOME/etc/hadoop/core-site.xml
File contents:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Address of the NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://rhnode2:8020</value>
</property>
<!-- Hadoop data storage directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/data</value>
</property>
<!-- Static user for the HDFS web UI: rh -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>rh</value>
</property>
</configuration>
Port 8020 here is an internal RPC port, used for example by the SecondaryNameNode to communicate with the NameNode.
- HDFS configuration file: hdfs-site.xml
vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- NameNode web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>rhnode2:9870</value>
</property>
<!-- SecondaryNameNode web UI address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>rhnode4:9868</value>
</property>
</configuration>
- YARN configuration file: yarn-site.xml
vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Have MapReduce use the shuffle auxiliary service -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Address of the ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>rhnode3</value>
</property>
</configuration>
- MapReduce configuration file: mapred-site.xml
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- Without the following settings, jobs can fail with a "could not find or load main class" error -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
</configuration>
- Distribute to every server
xsync /opt/hadoop/etc/hadoop/
# Output
==================== rhnode2 ====================
sending incremental file list
sent 1,003 bytes received 18 bytes 2,042.00 bytes/sec
total size is 111,360 speedup is 109.07
==================== rhnode3 ====================
sending incremental file list
hadoop/
hadoop/core-site.xml
hadoop/hdfs-site.xml
hadoop/mapred-site.xml
hadoop/yarn-site.xml
sent 3,045 bytes received 139 bytes 6,368.00 bytes/sec
total size is 111,360 speedup is 34.97
==================== rhnode4 ====================
sending incremental file list
hadoop/
hadoop/core-site.xml
hadoop/hdfs-site.xml
hadoop/mapred-site.xml
hadoop/yarn-site.xml
sent 3,045 bytes received 139 bytes 6,368.00 bytes/sec
total size is 111,360 speedup is 34.97
3.2.3 Bringing Up the Whole Cluster
- Configure workers
vim /opt/hadoop/etc/hadoop/workers
Delete the default localhost entry and add:
rhnode2
rhnode3
rhnode4
Note: no trailing whitespace at the end of any line and no blank lines anywhere in the file, or parsing will fail later.
Distribute it:
xsync /opt/hadoop/etc/hadoop/workers
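A quick sanity check for that whitespace rule, run against a local copy of the file (the contents below mirror the listing above):

```shell
# Recreate the expected contents, then scan for blank lines or trailing whitespace.
printf 'rhnode2\nrhnode3\nrhnode4\n' > workers
if grep -qE '[[:space:]]$|^$' workers; then
    echo "workers has stray whitespace or blank lines"
else
    echo "workers looks clean"
fi
```

Running the same grep against the real $HADOOP_HOME/etc/hadoop/workers before distributing it catches the formatting mistake early.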
- Starting the cluster
1) If this is the cluster's very first start, format the NameNode on rhnode2 (the node planned as NameNode).
Note: formatting the NameNode generates a new cluster ID. If the NameNode is reformatted while the DataNodes keep the old ID, the IDs no longer match and the cluster cannot find its existing data. To reformat a cluster that has already run, first stop the namenode and datanode processes and delete the data and logs directories on every machine, then format.
hdfs namenode -format
Per the configured paths, this creates data and logs folders under /opt/hadoop.
Record the VERSION file:
cat data/dfs/name/current/VERSION
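For reference, a NameNode VERSION file contains fields along these lines (every value below is a placeholder, not taken from this cluster):

```properties
# The clusterID recorded here must match the clusterID in each DataNode's VERSION file.
namespaceID=<generated>
clusterID=CID-<generated>
cTime=<creation timestamp>
storageType=NAME_NODE
blockpoolID=BP-<generated>
layoutVersion=<negative integer>
```

Reformatting generates a fresh clusterID, which is exactly the mismatch the note above warns about.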
2) Start HDFS
sbin/start-dfs.sh
Then check each of the three nodes against the plan to confirm the right daemons started:
rhnode2
rhnode3
rhnode4
3) Start YARN on the node configured as ResourceManager (rhnode3)
sbin/start-yarn.sh
- Testing
1) Upload a file
hadoop fs -mkdir /input
hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
2) View it
via the web UI,
or directly on the local disk:
cat /opt/hadoop/data/dfs/data/current/BP-457611527-192.168.111.102-1627898916141/current/finalized/subdir0/subdir0/blk_1073741825
3) Run the wordcount program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
Note: this step failed for me; see problem records 1 and 2.
3.2.4 Reformatting After a Crash
# Kill the Hadoop processes
kill -9 xxx
# Confirm with jps
jps
# Delete the data and logs folders
rm -rf data/ logs/
# Reformat
hdfs namenode -format
3.3 Configuring the Job History Server
Purpose: once a finished job's web page is closed it cannot be reopened; the history server makes completed jobs viewable again.
- Edit mapred-site.xml
vim mapred-site.xml
Add the following:
<!-- History server RPC address -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>rhnode2:10020</value>
</property>
<!-- History server web UI address -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>rhnode2:19888</value>
</property>
- Distribute the config
xsync etc/hadoop/mapred-site.xml
- Start the history server on the configured node
mapred --daemon start historyserver
- Check with jps
- Check the web UI
http://rhnode2:19888/jobhistory
3.4 Enabling Log Aggregation
After an application finishes, its logs are uploaded to HDFS, gathering the logs from all three nodes in one place.
Note: enabling log aggregation requires restarting the NodeManager, ResourceManager, and HistoryServer.
- Edit yarn-site.xml
Add the following:
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Log server URL -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://rhnode2:19888/jobhistory/logs</value>
</property>
<!-- Keep aggregated logs for 7 days -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
- Distribute the config
xsync etc/hadoop/yarn-site.xml
- Stop the NodeManager, ResourceManager, and HistoryServer
sbin/stop-yarn.sh
mapred --daemon stop historyserver
- Start them again
start-yarn.sh
mapred --daemon start historyserver
- Delete the existing output directory on HDFS
hadoop fs -rm -r /output
- Rerun the wordcount program
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
- View the logs
http://rhnode2:19888/jobhistory
4. Useful Scripts
4.1 Ways to Start and Stop the Cluster
- Start and stop by module
1) Start/stop all of HDFS
start-dfs.sh / stop-dfs.sh
2) Start/stop all of YARN
start-yarn.sh / stop-yarn.sh
- Start and stop individual daemons
1) HDFS daemons
hdfs --daemon start/stop namenode/datanode/secondarynamenode
2) YARN daemons
yarn --daemon start/stop resourcemanager/nodemanager
Unified start/stop script: myhadoop
#!/bin/bash
if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit
fi
case $1 in
"start")
    echo " =================== Starting the Hadoop cluster ==================="
    echo " --------------- starting hdfs ---------------"
    ssh rhnode2 "/opt/hadoop/sbin/start-dfs.sh"
    echo " --------------- starting yarn ---------------"
    ssh rhnode3 "/opt/hadoop/sbin/start-yarn.sh"
    echo " --------------- starting historyserver ---------------"
    ssh rhnode2 "/opt/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
    echo " =================== Stopping the Hadoop cluster ==================="
    echo " --------------- stopping historyserver ---------------"
    ssh rhnode2 "/opt/hadoop/bin/mapred --daemon stop historyserver"
    echo " --------------- stopping yarn ---------------"
    ssh rhnode3 "/opt/hadoop/sbin/stop-yarn.sh"
    echo " --------------- stopping hdfs ---------------"
    ssh rhnode2 "/opt/hadoop/sbin/stop-dfs.sh"
;;
*)
    echo "Input Args Error..."
;;
esac
Script to check the Java processes on all three servers: jpscall
#!/bin/bash
for host in rhnode2 rhnode3 rhnode4
do
    echo =============== $host ===============
    ssh $host jps
done
5. Common Interview Questions
5.1 Common Port Numbers
Hadoop 3.x
- HDFS NameNode internal RPC: 8020 / 9000 / 9820
- HDFS NameNode web UI: 9870
- YARN job monitoring web UI: 8088
- Job history server: 19888
Hadoop 2.x
- HDFS NameNode internal RPC: 8020 / 9000
- HDFS NameNode web UI: 50070
- YARN job monitoring web UI: 8088
- Job history server: 19888
5.2 Common Configuration Files
Hadoop 3.x
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- mapred-site.xml
- workers ⭐️
Hadoop 2.x
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
- mapred-site.xml
- slaves ⭐️