Hadoop Notes (0)

Notes from studying the 尚硅谷 Hadoop course.

1. Setup

1.1 Three Modes

  1. Local mode
    • Data is stored on the local Linux filesystem, relying on Linux itself for storage
    • Used occasionally for testing
  2. Pseudo-distributed mode
    • Data is stored in HDFS, but everything runs on a single machine
  3. Fully distributed mode
    • Data is stored in HDFS and the work is spread across multiple servers

The cluster below is built in fully distributed mode.

Environment:

  • CentOS 7
  • JDK 1.8
  • Hadoop 3.3.1 (note: later switched to 3.1.3 for Windows compatibility)

Required packages

  • epel-release: an extra software repository for Red Hat-family distributions
  • net-tools
  • vim

Turn off the firewall

Create a regular user and grant it sudo privileges

Create a directory under /opt and change its owner and group

Set a static IP

Change the hostname

Add hostname mappings

Set up passwordless SSH

Remove the bundled JDK

Install the JDK

Install Hadoop

Configure the environment variables
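A rough sketch of these preparation steps on CentOS 7 follows. The user name rh, the hostnames rhnode2/rhnode3/rhnode4, and the /opt paths come from the rest of these notes; the IP addresses for rhnode3 and rhnode4 and the exact JDK directory are assumptions.

# run as root on every node; a sketch, not a complete provisioning script
systemctl stop firewalld && systemctl disable firewalld     # turn off the firewall

useradd rh && passwd rh                                     # create a regular user
echo 'rh ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers            # grant sudo (safer: edit via visudo)

mkdir -p /opt/java /opt/hadoop                              # directories under /opt
chown -R rh:rh /opt/java /opt/hadoop                        # change owner and group

# static IP: edit /etc/sysconfig/network-scripts/ifcfg-ens33 (interface name varies)

hostnamectl set-hostname rhnode2                            # different hostname on each node
cat >> /etc/hosts <<EOF
192.168.111.102 rhnode2
192.168.111.103 rhnode3
192.168.111.104 rhnode4
EOF

# as the rh user: passwordless SSH to every node
ssh-keygen -t rsa
for host in rhnode2 rhnode3 rhnode4; do ssh-copy-id $host; done

# remove the bundled OpenJDK
rpm -qa | grep -i java | xargs -n1 rpm -e --nodeps

# after unpacking the JDK and Hadoop under /opt, export environment variables,
# e.g. in /etc/profile.d/my_env.sh:
#   export JAVA_HOME=/opt/java/jdk1.8.0_xxx
#   export HADOOP_HOME=/opt/hadoop
#   export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin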

1.2 Testing

  1. Test the local-mode wordcount example
# go to the Hadoop installation directory
cd /opt/hadoop
# create the wcinput directory
mkdir wcinput
# write some arbitrary data
vim wcinput/word.txt
# run it: the hadoop command executes, as a jar, the wordcount example inside the examples jar at the path below;
# input is read from wcinput/ and output goes to ./wcoutput (the output directory must not already exist, or the job fails)
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount wcinput/ ./wcoutput

Put a few arbitrary words into word.txt. When the job finishes, a wcoutput directory is created with the results.

# go into wcoutput and view the result
cd wcoutput
cat part-r-00000

part-r-00000 lists each word together with its count.

2. Common Commands

2.1 scp (secure copy)

Copies data between servers.

Basic syntax:

# scp -r (recursive)   source path/name   destination_user@host:destination path/name
scp -r $pdir/$fname $user@$host:$pdir/$fname

Examples:

Note: the owner of the files on the destination depends on which user you connect to the destination server as.

# push a local file to a remote host
scp -r jdk11   root@192.168.111.102:/opt/java/jdk11
# pull a file from a remote host to the local machine
scp -r root@192.168.111.102:/opt/jdk11   /opt/jdk
# copy directly from one remote host to another
scp -r root@192.168.111.102:/opt/jdk   root@192.168.111.103:/opt/jdk

2.2 rsync (remote sync)

The first sync is effectively a full copy.

From the second sync onward, only files that have changed are transferred.

The command is similar to scp.

# -a archive mode   -v verbose, show the transfer as it happens
rsync -av $pdir/$fname  $user@$host:$pdir/$fname
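For example, re-syncing the Hadoop configuration directory to another node only transfers the files that changed (the path and user here follow the layout used in these notes):

rsync -av /opt/hadoop/etc/hadoop/ rh@rhnode3:/opt/hadoop/etc/hadoop/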

2.3 Writing the xsync Script

#!/bin/bash
#1. check the number of arguments
if [ $# -lt 1 ]
then
        echo Not Enough Arguments!
        exit;
fi
#2. loop over every machine in the cluster
for host in rhnode2 rhnode3 rhnode4
do
        echo ==================== $host ====================
        #3. loop over every file/directory given and send each one
        for file in $@
        do
                #4. check that the file exists
                if [ -e $file ]
                        then
                                #5. get the parent directory (resolving symlinks)
                                pdir=$(cd -P $(dirname $file); pwd)
                                #6. get the file name
                                fname=$(basename $file)
                                ssh $host "mkdir -p $pdir"
                                rsync -av $pdir/$fname $host:$pdir
                        else
                                echo $file does not exist!
                fi
        done
done

Remember to make the script executable and put it on the PATH.
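A minimal sketch of those two steps; placing the script in ~/bin is an assumption (on CentOS 7 the default ~/.bash_profile already adds ~/bin to the PATH):

mkdir -p ~/bin && cp xsync ~/bin/
chmod +x ~/bin/xsync
# if ~/bin is not already on the PATH:
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc && source ~/.bashrc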

3. Cluster Configuration

3.1 Planning

Key points:

  • Do not install the NameNode and the SecondaryNameNode on the same server
  • The ResourceManager is also memory-hungry; do not put it on the same machine as the NameNode or the SecondaryNameNode

Plan used in these notes (it matches the configuration below):

  • rhnode2: NameNode, DataNode, NodeManager
  • rhnode3: ResourceManager, DataNode, NodeManager
  • rhnode4: SecondaryNameNode, DataNode, NodeManager

3.2 Configuration

3.2.1 Configuration Files

The four core configuration files live in $HADOOP_HOME/etc/hadoop:

  • core-site.xml
  • hdfs-site.xml
  • yarn-site.xml
  • mapred-site.xml
3.2.2 Configuring the Cluster
  1. Core configuration file: core-site.xml
vim $HADOOP_HOME/etc/hadoop/core-site.xml

File contents:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <!-- Address of the NameNode -->
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://rhnode2:8020</value>
        </property>
        <!-- Hadoop data storage directory -->
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/opt/hadoop/data</value>
        </property>
        <!-- Static user for the HDFS web UI: rh -->
        <property>
                <name>hadoop.http.staticuser.user</name>
                <value>rh</value>
        </property>
</configuration>

Port 8020 here is the internal RPC port, used for communication inside the cluster, e.g. with the SecondaryNameNode and the DataNodes.
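A quick sanity check (not part of the course steps) is to read the value back once the configuration is in place:

hdfs getconf -confKey fs.defaultFS    # should print hdfs://rhnode2:8020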

  2. HDFS configuration file: hdfs-site.xml
vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <!-- NameNode web UI address -->
        <property>
                <name>dfs.namenode.http-address</name>
                <value>rhnode2:9870</value>
        </property>
        <!-- SecondaryNameNode web UI address -->
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>rhnode4:9868</value>
        </property>
</configuration>
  3. YARN configuration file: yarn-site.xml
vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->

        <!-- Use the MapReduce shuffle as the NodeManager auxiliary service -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <!-- Hostname of the ResourceManager -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>rhnode3</value>
        </property>

</configuration>
  4. MapReduce configuration file: mapred-site.xml
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <!-- Run MapReduce jobs on YARN -->
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <!-- Add the following, otherwise jobs fail with a "could not find or load main class" error -->
        <property>
                <name>yarn.app.mapreduce.am.env</name>
                <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
                <name>mapreduce.map.env</name>
                <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
        <property>
                <name>mapreduce.reduce.env</name>
                <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
        </property>
</configuration>
  5. Distribute to every server
xsync /opt/hadoop/etc/hadoop/
# output
==================== rhnode2 ====================
sending incremental file list

sent 1,003 bytes  received 18 bytes  2,042.00 bytes/sec
total size is 111,360  speedup is 109.07
==================== rhnode3 ====================
sending incremental file list
hadoop/
hadoop/core-site.xml
hadoop/hdfs-site.xml
hadoop/mapred-site.xml
hadoop/yarn-site.xml

sent 3,045 bytes  received 139 bytes  6,368.00 bytes/sec
total size is 111,360  speedup is 34.97
==================== rhnode4 ====================
sending incremental file list
hadoop/
hadoop/core-site.xml
hadoop/hdfs-site.xml
hadoop/mapred-site.xml
hadoop/yarn-site.xml

sent 3,045 bytes  received 139 bytes  6,368.00 bytes/sec
total size is 111,360  speedup is 34.97
3.2.3 Starting the Whole Cluster
  1. Configure workers
vim /opt/hadoop/etc/hadoop/workers

Delete the default localhost entry and add the following:

rhnode2
rhnode3
rhnode4

Note: no trailing spaces at the end of any line and no blank lines are allowed in this file, otherwise the later scan of worker nodes will fail.
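A quick way to spot stray whitespace is cat -A, which marks every line end with $ and makes invisible characters visible:

cat -A /opt/hadoop/etc/hadoop/workers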

Distribute it:

xsync /opt/hadoop/etc/hadoop/workers
  2. Start the cluster

    1). If this is the very first start of the cluster, format the NameNode on rhnode2 (the node configured as the NameNode).

    Note: formatting the NameNode generates a new cluster ID. If the DataNodes still carry the old one, the IDs no longer match and the cluster cannot find its existing data. If the cluster hits errors while running and you need to re-format the NameNode, first stop the namenode and datanode processes and delete the data and logs directories on every machine, and only then format.

    hdfs namenode -format
    

    According to the configuration, this creates data and logs directories under /opt/hadoop.

    Take a look at the VERSION file:

    cat data/dfs/name/current/VERSION
    

    (The VERSION file records the namespaceID, clusterID, blockpoolID, and so on.)

    2). Start HDFS

    sbin/start-dfs.sh
    


    Then run jps on each of the three nodes and check the processes against the plan. With only HDFS started, you should see:

    rhnode2: NameNode, DataNode

    rhnode3: DataNode

    rhnode4: SecondaryNameNode, DataNode

    3). Start YARN on the node configured with the ResourceManager (rhnode3)

    sbin/start-yarn.sh
    

    jps now additionally shows a ResourceManager on rhnode3 and a NodeManager on every node.

  3. Test

    1) Upload a file

    hadoop fs -mkdir /input
    hadoop fs -put $HADOOP_HOME/wcinput/word.txt /input
    

    2) Check the file

    It can be viewed in the web UI; the NameNode page at http://rhnode2:9870 shows the file under /input.

    It can also be viewed on the local disk:

cat /opt/hadoop/data/dfs/data/current/BP-457611527-192.168.111.102-1627898916141/current/finalized/subdir0/subdir0/blk_1073741825

(The block's content is exactly the text of word.txt.)

    3) Run the wordcount job

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output

Note: this step failed for me; see problem records 1 and 2.

3.3 Re-formatting After a Crash

# kill the processes
kill -9 xxx
# check with jps
jps
# delete the data and logs directories
rm -rf data/ logs/
# re-format the cluster
hdfs namenode -format
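Since the data directories exist on every node, a fully distributed cluster needs them wiped cluster-wide before re-formatting; a sketch using the hostnames and paths from these notes:

for host in rhnode2 rhnode3 rhnode4
do
        ssh $host "rm -rf /opt/hadoop/data /opt/hadoop/logs"
done
hdfs namenode -format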

3.4 Configuring the History Server

Purpose: once a job has finished and its web page is closed, it can no longer be reviewed; the history server makes past jobs viewable again.

  1. Configure mapred-site.xml
vim mapred-site.xml

Add the following:

<!-- JobHistory server address -->
<property>
        <name>mapreduce.jobhistory.address</name>
        <value>rhnode2:10020</value>
</property>
<!-- JobHistory web UI address -->
<property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>rhnode2:19888</value>
</property>
  2. Distribute the configuration
xsync etc/hadoop/mapred-site.xml
  3. Start the history server on the configured node (rhnode2)
mapred --daemon start historyserver
  4. Check with jps
  5. Check the web UI

http://rhnode2:19888/jobhistory

3.5 Enabling Log Aggregation

After an application finishes, its run logs are uploaded to HDFS.

This gathers the logs from all three nodes into one place.

Note: enabling log aggregation requires restarting the NodeManager, ResourceManager, and HistoryServer.

  1. Configure yarn-site.xml

Add the following:

<!-- enable log aggregation -->
<property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
</property>
<!-- log server URL -->
<property>
        <name>yarn.log.server.url</name>
        <value>http://rhnode2:19888/jobhistory/logs</value>
</property>
<!-- keep aggregated logs for 7 days -->
<property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
</property>
  2. Distribute the configuration
xsync etc/hadoop/yarn-site.xml
  3. Stop the NodeManager, ResourceManager, and HistoryServer
sbin/stop-yarn.sh
mapred --daemon stop historyserver
  4. Start the NodeManager, ResourceManager, and HistoryServer
start-yarn.sh
mapred --daemon start historyserver
  5. Delete the output directory that already exists on HDFS
hadoop fs -rm -r /output
  6. Run the wordcount job
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
  7. View the logs

http://rhnode2:19888/jobhistory
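Once aggregation is on, the logs can also be fetched from the command line; the application id below is a placeholder, take the real one from yarn application -list or from the web UI:

yarn logs -applicationId <application_id>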

4. Useful Scripts

4.1 Starting and Stopping the Cluster

  1. Start and stop by module

    1) Start/stop HDFS as a whole

    start-dfs.sh/stop-dfs.sh

    2) Start/stop YARN as a whole

    start-yarn.sh/stop-yarn.sh

  2. Start and stop individual daemons

    1) Start/stop HDFS daemons one at a time

    hdfs --daemon start/stop namenode/datanode/secondarynamenode
    

    2) Start/stop YARN daemons one at a time

    yarn --daemon start/stop resourcemanager/nodemanager
    

A single start/stop script, myhadoop:

#!/bin/bash
if [ $# -lt 1 ]
then
        echo "No Args Input..."
        exit ;
fi
case $1 in
"start")
        echo " =================== starting the hadoop cluster ==================="
        echo " --------------- starting hdfs ---------------"
        ssh rhnode2 "/opt/hadoop/sbin/start-dfs.sh"
        echo " --------------- starting yarn ---------------"
        ssh rhnode3 "/opt/hadoop/sbin/start-yarn.sh"
        echo " --------------- starting historyserver ---------------"
        ssh rhnode2 "/opt/hadoop/bin/mapred --daemon start historyserver"
;;
"stop")
        echo " =================== stopping the hadoop cluster ==================="
        echo " --------------- stopping historyserver ---------------"
        ssh rhnode2 "/opt/hadoop/bin/mapred --daemon stop historyserver"
        echo " --------------- stopping yarn ---------------"
        ssh rhnode3 "/opt/hadoop/sbin/stop-yarn.sh"
        echo " --------------- stopping hdfs ---------------"
        ssh rhnode2 "/opt/hadoop/sbin/stop-dfs.sh"
;;
*)
        echo "Input Args Error..."
;;
esac
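Like xsync, the script needs execute permission and a spot on the PATH; usage is then simply:

chmod +x myhadoop
myhadoop start     # bring up HDFS, YARN, and the history server
myhadoop stop      # shut them all down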

A jpscall script to list the Java processes on all three servers:

#!/bin/bash
for host in rhnode2 rhnode3 rhnode4
do
        echo =============== $host ===============
        ssh $host jps
done

5. Common Interview Questions

5.1 Common Port Numbers

Hadoop 3.x

  • HDFS NameNode internal RPC port: 8020 / 9000 / 9820
  • HDFS NameNode web UI (user query port): 9870
  • YARN web UI for viewing job status: 8088
  • History server web UI: 19888

Hadoop 2.x

  • HDFS NameNode internal RPC port: 8020 / 9000
  • HDFS NameNode web UI (user query port): 50070
  • YARN web UI for viewing job status: 8088
  • History server web UI: 19888
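Since net-tools was installed earlier, it is easy to confirm which daemon is listening on which port; a quick check (run on the relevant node) for the Hadoop 3.x ports used in this cluster:

netstat -tlnp | grep -E '8020|9870|8088|19888'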

5.2 Common Configuration Files

Hadoop 3.x

  • core-site.xml

  • hdfs-site.xml

  • yarn-site.xml

  • mapred-site.xml

  • workers⭐️

Hadoop 2.x

  • core-site.xml

  • hdfs-site.xml

  • yarn-site.xml

  • mapred-site.xml

  • slaves⭐️
