Setting Up a Single-Node Pseudo-Distributed Hadoop on a Linux Server
Official site: https://hadoop.apache.org/
Installing Hadoop
Download location: https://archive.apache.org/dist/hadoop/core/
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.3.2/hadoop-3.3.2.tar.gz
Extract and rename (run from /usr/local/program, the parent of the HADOOP_HOME used below)
tar -zxvf hadoop-3.3.2.tar.gz
mv hadoop-3.3.2 hadoop
Configuring environment variables
vi /etc/profile
Add the Hadoop environment settings
# hadoop
export HADOOP_HOME=/usr/local/program/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Apply the configuration
source /etc/profile
Test
hadoop version
Configuring the hostname alias
Change the hostname
vim /etc/hostname
hostnamectl set-hostname node01
Check the hostname
hostname
hostnamectl
Update /etc/hosts with the mapping from IP to hostname alias
172.22.4.21 node01
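Editing /etc/hosts by hand works, but a setup script that is run more than once should not append the same mapping twice. The following is a minimal sketch of an idempotent append. The function name add_host_entry is made up for illustration; the address and alias are the ones used in this guide.

```shell
# add_host_entry HOSTS_FILE IP NAME
# Appends "IP NAME" to HOSTS_FILE unless NAME is already mapped there.
add_host_entry() {
    local hosts_file="$1" ip="$2" name="$3"
    # Look for NAME as a whole field (after the address) on a non-comment line.
    if awk -v n="$name" '
            !/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file" 2>/dev/null; then
        echo "$name already present in $hosts_file"
    else
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
        echo "added $ip $name to $hosts_file"
    fi
}
```

Usage, as root: add_host_entry /etc/hosts 172.22.4.21 node01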
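Editing the configuration files
Hadoop's configuration files all live in the hadoop/etc/hadoop directory. The main ones to edit are core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
Note:
I hit many pitfalls while following the assorted configurations found online. Each configuration below is therefore split into two parts: first the minimal basic configuration, then optional extensions. Using only the basic configuration of each file is recommended.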
core-site.xml
vim hadoop/etc/hadoop/core-site.xml
Basic configuration
<configuration>
<!-- File system type and HDFS communication address -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://node01:9000</value>
</property>
<!-- Directory for temporary files -->
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/program/hadoop/datas/tmp</value>
</property>
Optional configuration (inside the same <configuration> element)
<!-- I/O buffer size in bytes; tune it to the server's capacity in practice -->
<property>
<name>io.file.buffer.size</name>
<value>8192</value>
</property>
<!-- Enable the HDFS trash so deleted data can be recovered; retention is in minutes (10080 = 7 days) -->
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
</configuration>
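After editing several of these files, it can help to confirm what value a property actually has. The sketch below pulls a property value out of a Hadoop-style XML file with awk. The function name get_prop is hypothetical, and the parsing assumes each <name> and <value> element sits on a single line, as in the listings in this guide; it is not a general XML parser.

```shell
# get_prop CONFIG_FILE PROPERTY_NAME
# Prints the <value> that follows the matching <name>, or nothing if absent.
get_prop() {
    local file="$1" name="$2"
    awk -v want="$name" '
        /<name>/ {
            # Remember the most recent property name seen.
            line = $0
            sub(/.*<name>[ \t]*/, "", line)
            sub(/[ \t]*<\/name>.*/, "", line)
            current = line
        }
        /<value>/ && current == want {
            # Print the value belonging to the wanted name and stop.
            line = $0
            sub(/.*<value>[ \t]*/, "", line)
            sub(/[ \t]*<\/value>.*/, "", line)
            print line
            exit
        }
    ' "$file"
}
```

For example, get_prop hadoop/etc/hadoop/core-site.xml fs.defaultFS should print hdfs://node01:9000 with the basic configuration. On a working installation, hdfs getconf -confKey fs.defaultFS reports the value Hadoop itself resolves.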
hadoop-env.sh
hadoop-env.sh is Hadoop's environment configuration file; set JAVA_HOME in it
vim hadoop/etc/hadoop/hadoop-env.sh
# export JAVA_HOME=
export JAVA_HOME=/usr/local/jdk1.8/
hdfs-site.xml
vim hadoop/etc/hadoop/hdfs-site.xml
<!-- Directory where the NameNode stores its metadata -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/program/hadoop/datas/namenode/namenodedatas</value>
</property>
<!-- Directory where the DataNode stores block data -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/program/hadoop/datas/datanode/datanodeDatas</value>
</property>
<!-- HDFS replication factor; pseudo-distributed mode has a single node, so set it to 1 -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- Block size in bytes; the default is 128 MB (134217728) -->
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
<!-- HDFS web UI address; the port is 50070 in 2.x and 9870 in 3.x -->
<property>
<name>dfs.namenode.http-address</name>
<value>node01:9870</value>
</property>
<!-- Disable HDFS permission checking -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
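To make the dfs.blocksize value concrete: 134217728 is 128 * 1024 * 1024 bytes, and a file occupies one block for every started 128 MB. A small sketch of that arithmetic follows; the function name blocks_needed and the 300 MB example file are illustrative assumptions.

```shell
# blocks_needed FILE_BYTES BLOCK_BYTES
# Prints how many HDFS blocks a file of FILE_BYTES occupies (ceiling division).
blocks_needed() {
    local file_bytes="$1" block_bytes="$2"
    echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

block_bytes=$(( 128 * 1024 * 1024 ))   # 134217728, the dfs.blocksize value
blocks_needed $(( 300 * 1024 * 1024 )) "$block_bytes"   # prints 3
```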
mapred-site.xml
vim hadoop/etc/hadoop/mapred-site.xml
<!-- Framework used to run MapReduce jobs -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- The next three properties are required; without them, MapReduce jobs fail with a message asking you to check these settings -->
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/program/hadoop</value>
</property>
<!-- Memory for each map container, in MB -->
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
</property>
<!-- JVM options (heap size) for map tasks -->
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx512M</value>
</property>
<!-- Memory for each reduce container, in MB -->
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
</property>
<!-- JVM options (heap size) for reduce tasks -->
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx512M</value>
</property>
<!-- Buffer memory for sorting map output, in MB; the default is 100 -->
<property>
<name>mapreduce.task.io.sort.mb</name>
<value>256</value>
</property>
<!-- Number of streams merged at once while sorting files -->
<property>
<name>mapreduce.task.io.sort.factor</name>
<value>100</value>
</property>
<!-- Number of parallel copier threads that fetch map output -->
<property>
<name>mapreduce.reduce.shuffle.parallelcopies</name>
<value>25</value>
</property>
<!-- JobHistory server address, used to retrieve job run information -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>node01:10020</value>
</property>
<!-- JobHistory web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>node01:19888</value>
</property>
yarn-site.xml
vim hadoop/etc/hadoop/yarn-site.xml
<!-- Auxiliary service run on the NodeManager; it must be set to mapreduce_shuffle for MapReduce jobs to run -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Whether to enable log aggregation; the default is false -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Address the RM exposes to clients, which use it to submit applications and so on -->
<property>
<name>yarn.resourcemanager.address</name>
<value>node01:8032</value>
</property>
<!-- Address the RM exposes to ApplicationMasters, which use it to request and release resources -->
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>node01:8030</value>
</property>
<!-- Address the RM exposes to NodeManagers, which use it to send heartbeats and receive tasks -->
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>node01:8031</value>
</property>
<!-- Address through which administrators send management commands to the RM -->
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>node01:8033</value>
</property>
<!-- HTTP address of the RM web UI, where cluster information can be viewed in a browser -->
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>node01:8088</value>
</property>
<!-- Hostname of the RM -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node01</value>
</property>
<!-- Minimum memory a container can request, in MB -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<!-- Maximum memory a container can request, in MB -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>2048</value>
</property>
<!-- Ratio of virtual memory to physical memory allowed for containers -->
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<!-- Disable the virtual memory check; otherwise containers can fail when virtual memory runs short -->
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<!-- Total physical memory, in MB, available to the NodeManager; it cannot be changed dynamically once set -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1024</value>
</property>
<!-- The MapReduce ApplicationMaster requests 1.5 GB by default; on a VM with less memory, lower this value or jobs will fail -->
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1024</value>
</property>
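These memory settings interact: on a single node, the ApplicationMaster and every task container have to fit inside yarn.nodemanager.resource.memory-mb. The sketch below is a rough sanity check under a simplified model (each container uses exactly the memory it requests, and scheduler rounding is ignored). The helper name task_slots is hypothetical.

```shell
# task_slots NM_TOTAL_MB AM_MB TASK_MB
# Prints how many task containers of TASK_MB fit next to one
# ApplicationMaster of AM_MB on a NodeManager offering NM_TOTAL_MB.
task_slots() {
    local nm_total="$1" am_mb="$2" task_mb="$3"
    local free=$(( nm_total - am_mb ))
    # A negative remainder means not even the ApplicationMaster fits.
    if [ "$free" -lt 0 ]; then
        free=0
    fi
    echo $(( free / task_mb ))
}
```

With the values listed here (1024 MB for the NodeManager, 1024 MB for the ApplicationMaster, 1024 MB per map container), task_slots 1024 1024 1024 prints 0. Under this model that suggests raising yarn.nodemanager.resource.memory-mb, or lowering the per-container sizes, if jobs appear to hang.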
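Formatting HDFS
HDFS must be formatted before its first start. Formatting initializes the NameNode metadata directory (dfs.namenode.name.dir), creating the initial metadata (fsimage and edit log) there. Run it only once: formatting again discards the existing metadata.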
hdfs namenode -format
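Since formatting a NameNode that already holds metadata discards that metadata, a setup script may want to format only when the metadata directory has not been initialized yet. A minimal sketch, assuming that a formatted NameNode directory contains current/VERSION. The function name namenode_is_formatted is made up here, and the path in the usage note is the dfs.namenode.name.dir value from the hdfs-site.xml section.

```shell
# namenode_is_formatted NAME_DIR
# Succeeds (exit status 0) if NAME_DIR looks like a formatted NameNode
# directory, judged by the presence of current/VERSION.
namenode_is_formatted() {
    local name_dir="$1"
    [ -f "$name_dir/current/VERSION" ]
}
```

Usage: namenode_is_formatted /usr/local/program/hadoop/datas/namenode/namenodedatas || hdfs namenode -format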
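Starting the services
Starting and stopping HDFS
Starting HDFS launches three processes: NameNode, DataNode, and SecondaryNameNode.
Because the Hadoop environment variables are configured, running start-dfs.sh is enough; otherwise, change to the sbin directory under the Hadoop home and run ./start-dfs.sh.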
start-dfs.sh
stop-dfs.sh
Use jps to list the running Java processes
[root@administrator program]# jps
31348 DataNode
31995 SecondaryNameNode
31069 NameNode
717 Jps
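Starting and stopping individual daemons (in Hadoop 3.x, hadoop-daemon.sh is deprecated in favor of hdfs --daemon start|stop <daemon>)
hadoop-daemon.sh start namenode # start the NameNode
hadoop-daemon.sh start datanode # start the DataNode
hadoop-daemon.sh start secondarynamenode # start the SecondaryNameNode
hadoop-daemon.sh stop namenode # stop the NameNode
hadoop-daemon.sh stop datanode # stop the DataNode
hadoop-daemon.sh stop secondarynamenode # stop the SecondaryNameNode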
Starting and stopping YARN
start-yarn.sh
stop-yarn.sh
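Starting and stopping individual daemons (in Hadoop 3.x, yarn-daemon.sh is deprecated in favor of yarn --daemon start|stop <daemon>)
yarn-daemon.sh start resourcemanager # start the ResourceManager
yarn-daemon.sh start nodemanager # start the NodeManager
yarn-daemon.sh stop resourcemanager # stop the ResourceManager
yarn-daemon.sh stop nodemanager # stop the NodeManager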
Verify
[root@administrator hadoop]# jps
9923 NodeManager
2915 NameNode
15956 Jps
3130 DataNode
9692 ResourceManager
3630 SecondaryNameNode
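Reading the jps listing by eye is easy to get wrong, since SecondaryNameNode also contains the text NameNode. The sketch below reports which of the five expected daemons are absent from jps output read on standard input. The function name missing_daemons is an assumption for illustration; the daemon names are the ones shown in the listing.

```shell
# missing_daemons
# Reads jps output ("PID ClassName" per line) on stdin and prints the name
# of every expected Hadoop daemon that does not appear in it.
missing_daemons() {
    local jps_output name
    jps_output="$(cat)"
    for name in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # Compare against the whole second field so that NameNode does not
        # match SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" |
                awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            echo "$name"
        fi
    done
}
```

Usage: jps | missing_daemons (no output means all five daemons are running).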
Starting or stopping HDFS and YARN together
start-all.sh
stop-all.sh
Troubleshooting startup errors
1. Startup fails with:
Starting namenodes on [IP]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [IP]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
Append the following to the end of hadoop/etc/hadoop/hadoop-env.sh:
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"
2. Starting again produces a different error
[root@administrator program]# start-dfs.sh
Starting namenodes on [IP]
Last login: Sun Mar  6 21:47:15 CST 2022 on pts/4
IP: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Starting datanodes
Last login: Sun Mar  6 21:58:58 CST 2022 on pts/4
localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
Starting secondary namenodes [IP]
Last login: Sun Mar  6 21:58:58 CST 2022 on pts/4
IP: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
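The start scripts connect over SSH even on the local machine, so the host must authorize its own key. Generate a key pair with ssh-keygen, then append the public key to the authorized_keys file:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys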
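Appending with >> adds a duplicate line every time the step is repeated. A minimal sketch of an idempotent variant follows; the function name authorize_key is hypothetical, and it takes the two file paths as arguments so it can be tried on scratch files first.

```shell
# authorize_key PUBLIC_KEY_FILE AUTHORIZED_KEYS_FILE
# Appends the public key only if it is not already listed, then restricts
# the file to owner read/write, which sshd commonly requires.
authorize_key() {
    local pub_file="$1" auth_file="$2" key
    key="$(cat "$pub_file")" || return 1
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -x matches the whole line, -F treats the key as a fixed string.
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}
```

Usage: authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys, followed by ssh localhost to confirm that no password prompt appears.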
3. The NameNode fails to start because the port is reported as in use
java.net.BindException: Port in use:
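Checking the port shows that no process is actually using it, so the likely cause is that the hostname resolves to an address this machine cannot bind to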
lsof -i:port
Set the hostname in /etc/hostname (vim /etc/hostname); in a cluster, node names must be unique
[root@administrator logs]# cat /etc/hostname
node01
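If the configuration uses a hostname alias, map the alias to the internal IP in /etc/hosts; otherwise use the internal IP directly. Run ifconfig to find the internal IP, and use either the internal IP or 127.0.0.1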
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.22.4.21 netmask 255.255.192.0 broadcast 172.22.63.255
ether 00:16:3e:02:73:19 txqueuelen 1000 (Ethernet)
RX packets 59092928 bytes 13499777151 (12.5 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 38224223 bytes 9940817189 (9.2 GiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
cat /etc/hosts
172.22.4.21 node01
Firewall settings
systemctl start firewalld.service # start firewalld
systemctl stop firewalld.service # stop firewalld
systemctl disable firewalld.service # keep firewalld from starting at boot
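As an alternative to stopping firewalld entirely, the ports used by the web UIs in this guide (9870 for HDFS and 8088 for YARN) can be opened individually. The sketch below only prints the firewall-cmd commands so they can be reviewed before running; the function name print_open_port_cmds is made up for illustration.

```shell
# print_open_port_cmds PORT...
# Prints one firewall-cmd command per TCP port, then the reload command
# that applies the permanent rules. Nothing is executed.
print_open_port_cmds() {
    local port
    for port in "$@"; do
        echo "firewall-cmd --permanent --add-port=${port}/tcp"
    done
    echo "firewall-cmd --reload"
}

print_open_port_cmds 9870 8088
```

After checking the printed commands, they can be run as root, for example by piping the output to sh.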
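Accessing the web UIs
Remember to open the corresponding ports, or stop the firewall, before accessing them.
YARN ResourceManager: IP:8088
HDFS NameNode: IP:9870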
Testing with a job
Prepare the data: vim test.txt
MapReduce is a programming
paradigm that enables
massive scalability across
hundreds or thousands of
servers in a Hadoop cluster.
As the processing component,
MapReduce is the heart of Apache Hadoop.
The term "MapReduce" refers to two separate
and distinct tasks that Hadoop programs perform.
Upload test.txt to HDFS
hdfs dfs -mkdir -p /input
hdfs dfs -put test.txt /input
Submit the job
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.2.jar wordcount /input /out
Check the output: hdfs dfs -ls /out/
Found 2 items
-rw-r--r-- 1 root supergroup 0 2022-03-08 13:49 /out/_SUCCESS
-rw-r--r-- 1 root supergroup 332 2022-03-08 13:49 /out/part-r-00000
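To cross-check the job result, the same count can be approximated locally and compared with the contents of /out/part-r-00000. The sketch below assumes the wordcount example splits its input on whitespace only, so punctuation stays attached to words; the function name local_wordcount is hypothetical.

```shell
# local_wordcount [FILE...]
# Counts whitespace-separated tokens in the given files (or stdin) and
# prints "word<TAB>count" lines sorted in byte order.
local_wordcount() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) printf "%s\t%d\n", w, count[w] }' "$@" |
        LC_ALL=C sort
}
```

Usage: local_wordcount test.txt, then compare with hdfs dfs -cat /out/part-r-00000.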