Getting Started with Hadoop
Overview
Hadoop is written in Java and was created to solve the two central problems of big-data workloads: distributed storage and distributed processing. It consists of many components and sub-projects and needs to run on a Linux system. Its two core components are HDFS and MapReduce.
Download and Install
Download
Download page: https://archive.apache.org/dist/hadoop/common/
Pick the tar.gz release that suits you; this document uses v3.2.1.
Hadoop is developed in Java, so it needs a JDK to run; install a JDK first.
The Hadoop/JDK compatibility matrix is:
Hadoop version | JDK version |
---|---|
Hadoop 3.3 and later | Java 8 or Java 11 (runtime only) |
Hadoop 3.0 to 3.2 | Java 8 |
Hadoop 2.7 to 2.10 | Java 7 and Java 8 |
Install
Hadoop can be installed in three modes: standalone mode, pseudo-distributed mode, and fully distributed mode.
Standalone mode is mainly for testing and learning, and it stores data on the local file system. Pseudo-distributed and fully distributed modes store data on HDFS.
Standalone mode installation
Upload the tar.gz package to a directory on the Linux machine, extract it, and rename the extracted directory to hadoop. Then edit the ./etc/hadoop/hadoop-env.sh file and configure the JDK path:
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
# export JAVA_HOME=
export JAVA_HOME=/usr/java/jdk1.8.0_201-amd64
Run the following command to verify the Hadoop installation:
[root@k8s-node-107 hadoop]# bin/hadoop version
Hadoop 3.2.1
Source code repository Unknown -r 7a3bc90b05f257c8ace2f76d74264906f0f7a932
Compiled by hexiaoqiao on 2021-01-03T09:26Z
Compiled with protoc 2.5.0
From source with checksum 5a8f564f46624254b27f6a33126ff4
This command was run using /home/bigData/soft/hadoop/share/hadoop/common/hadoop-common-3.2.2.jar
Run the following commands to exercise Hadoop with one of its bundled example jobs (run them from the directory one level above the hadoop install directory):
mkdir input
cp hadoop/etc/hadoop/*.xml input
hadoop/bin/hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
[root@localhost soft]# cat output/*
1 dfsadmin
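Conceptually, the example job scans the input files for matches of the regex and counts each distinct match. The same idea can be sketched with plain shell tools (the /tmp path and one-line sample file below are illustrative, not the real Hadoop configs):

```shell
# Stand-in for the MapReduce grep example: extract every match of
# 'dfs[a-z.]+' from the input XML files and count distinct matches.
mkdir -p /tmp/grep-demo/input
cat > /tmp/grep-demo/input/sample.xml <<'EOF'
<property><name>dfsadmin</name></property>
EOF
grep -hoE 'dfs[a-z.]+' /tmp/grep-demo/input/*.xml | sort | uniq -c
# prints a count line like "1 dfsadmin"
```

The MapReduce version does the extraction in map tasks and the counting in reduce tasks, which is what lets it scale past a single machine.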
Note: standalone mode really only verifies that a packaged jar runs. The job above does use MapReduce for its computation, but it never touches HDFS.
Pseudo-distributed mode
Most problems in a pseudo-distributed install come from unfamiliarity with routine Linux operations such as creating users, granting permissions, and setting up passwordless ssh.
1. Configure the Hadoop environment variables (append to /etc/profile):
# Hadoop Environment Variables
export HADOOP_HOME=/home/bigData/soft/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
2. Create the user that runs Hadoop
In pseudo-distributed and distributed deployments the master node starts the other nodes over ssh, and ssh asks for a password by default, so passwordless ssh login has to be configured. To avoid disturbing the ssh setup of the machine's existing users, create a dedicated user instead; here it is named hadoop. The commands:
adduser hadoop              # create the hadoop user
passwd hadoop               # set its password
id hadoop                   # show the user and group information
usermod -g root hadoop      # put hadoop into the root group
su hadoop                   # switch to the hadoop account
sudo chmod 777 -R hadoop    # open up the permissions of the hadoop directory
3. Edit the sudoers file
Switch to the root account and edit the sudoers file; otherwise the new account cannot use sudo and fails with the following error:
hadoop is not in the sudoers file. This incident will be reported.
Edit it like this:
[root@localhost hadoop]# chmod a+x /etc/sudoers
[root@localhost hadoop]# vi /etc/sudoers
Add the hadoop entry below to sudoers. When saving you must use :wq! (with the !), otherwise vi reports the file as read-only and refuses to write it:
# Allow root to run any commands anywhere
root    ALL=(ALL)   ALL
hadoop  ALL=(ALL)   ALL
4. Configure passwordless ssh login
After switching to the hadoop user, test whether passwordless login already works by typing ssh localhost:
[hadoop@localhost soft]$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:cbS92o4o5+EzTyMUh93la2K25R2niIP10hRIMmh/zRA.
ECDSA key fingerprint is MD5:d6:3b:b0:e7:6d:6f:b8:57:83:6c:db:9e:88:73:a8:e4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
hadoop@localhost's password:
Permission denied, please try again.
hadoop@localhost's password:
If you are prompted for a password as above, passwordless login is not enabled yet. The following transcript sets it up:
[hadoop@localhost soft]$ cd ~/.ssh/          # if this directory does not exist, run ssh localhost once first
[hadoop@localhost .ssh]$ ssh-keygen -t rsa   # just press Enter at every prompt
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:DptnAzC7LHbHldsrnr0aQrhIc157McjaLXvuM3D0CxQ hadoop@localhost.localdomain
The key's randomart image is:
+---[RSA 2048]----+
|                 |
|        E        |
|       o .       |
|      * .o.      |
|     o + BoS.    |
|    . * O.Xo=.   |
|    + B XoX...   |
|   . o . B==..   |
|      .=*=+.     |
+----[SHA256]-----+
[hadoop@localhost .ssh]$ ls
id_rsa  id_rsa.pub  known_hosts
[hadoop@localhost .ssh]$ cat id_rsa.pub >> authorized_keys   # authorize the key
[hadoop@localhost .ssh]$ ls
authorized_keys  id_rsa  id_rsa.pub  known_hosts
[hadoop@localhost .ssh]$ chmod 600 ./authorized_keys         # fix the file permissions
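The transcript above boils down to four commands. A condensed sketch, assuming OpenSSH with the default key paths, run as the hadoop user:

```shell
# Generate an RSA key pair with an empty passphrase, authorize it for
# this same user, and set the permissions sshd requires.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa -q
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

The -N '' flag supplies the empty passphrase non-interactively, which is equivalent to pressing Enter at every prompt.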
Try ssh localhost again; output like the following means passwordless login now works:
[hadoop@localhost .ssh]$ ssh localhost
Last failed login: Thu Jul 1 17:32:19 CST 2021 from localhost on ssh:notty
There were 2 failed login attempts since the last successful login.
Last login: Thu Jul 1 17:31:21 2021
5. Edit the four key configuration files
vi etc/hadoop/core-site.xml
<configuration>
<!-- Base directory under which Hadoop (including the DataNode) stores its data -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/bigData/soft/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<!-- The HDFS master node (NameNode) address -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:39000</value>
</property>
</configuration>
vi etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/bigData/soft/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/bigData/soft/hadoop/tmp/dfs/data</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>0.0.0.0:9870</value>
</property>
</configuration>
vi etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
vi etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
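A malformed tag in any of these four files will make the daemons fail at startup, so it is worth sanity-checking the edited XML before continuing. A small sketch, assuming python3 is available and that it is run from the Hadoop install root:

```shell
# Verify the four edited config files are well-formed XML using
# python3's stdlib parser; prints "<file>: OK" for each good file.
for f in etc/hadoop/core-site.xml etc/hadoop/hdfs-site.xml \
         etc/hadoop/mapred-site.xml etc/hadoop/yarn-site.xml; do
  python3 -c 'import sys, xml.dom.minidom as m; m.parse(sys.argv[1])' "$f" \
    && echo "$f: OK"
done
```

Any parse error prints a traceback naming the offending file instead of the OK line.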
6. Format HDFS
[hadoop@localhost hadoop]# hdfs namenode -format
WARNING: /home/bigData/soft/hadoop/logs does not exist. Creating.
2021-06-25 12:55:08,910 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.2.2
.....
2021-06-25 12:55:10,799 INFO util.GSet: 0.029999999329447746% max memory 4.3 GB = 1.3 MB
2021-06-25 12:55:10,799 INFO util.GSet: capacity = 2^17 = 131072 entries
2021-06-25 12:55:10,886 INFO namenode.FSImage: Allocated new BlockPoolId: BP-494110815-127.0.0.1-1624596910865
2021-06-25 12:55:10,902 INFO common.Storage: Storage directory /home/bigData/soft/hadoop/datanode/dfs/name has been successfully formatted.
2021-06-25 12:55:10,959 INFO namenode.FSImageFormatProtobuf: Saving image file /home/bigData/soft/hadoop/datanode/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2021-06-25 12:55:11,100 INFO namenode.FSImageFormatProtobuf: Image file /home/bigData/soft/hadoop/datanode/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 399 bytes saved in 0 seconds .
2021-06-25 12:55:11,121 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2021-06-25 12:55:11,129 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
2021-06-25 12:55:11,129 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
************************************************************/
7. Modify the HDFS start/stop scripts
Edit the start-dfs.sh and stop-dfs.sh scripts, adding the following lines at the top of each:
HDFS_DATANODE_USER=hadoop
HADOOP_SECURE_DN_USER=hadoop
HDFS_NAMENODE_USER=hadoop
HDFS_SECONDARYNAMENODE_USER=hadoop
8. Start DFS
Run start-dfs.sh; after it finishes, run jps to list the Java processes. If the three daemons below are all present (jps also lists itself), the startup succeeded:
[hadoop@localhost hadoop]$ jps
114497 Jps
113914 NameNode
114314 SecondaryNameNode
114044 DataNode
If the NameNode or any other process fails to start, always check the logs under logs/ first.
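The jps check above can also be scripted; a sketch, assuming jps (shipped with the JDK) is on PATH:

```shell
# Report whether each expected HDFS daemon shows up in the jps output.
for p in NameNode DataNode SecondaryNameNode; do
  if jps | grep -qw "$p"; then echo "$p running"; else echo "$p MISSING"; fi
done
```

grep -w matches whole words, so SecondaryNameNode does not count as a hit for NameNode.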
Point a browser at http://ip:9870/ to reach the HDFS web UI, where ip is the address of the server Hadoop is deployed on.
Shutdown command:
stop-dfs.sh
9. Start YARN
Run start-yarn.sh; after it finishes, run jps. If the NodeManager and ResourceManager processes are present, the startup succeeded:
[hadoop@localhost hadoop]$ jps
113914 NameNode
114314 SecondaryNameNode
121629 NodeManager
114044 DataNode
121516 ResourceManager
121759 Jps
Point a browser at http://ip:8088/ to reach the YARN web UI, where ip is the address of the server Hadoop is deployed on.
Shutdown command:
stop-yarn.sh
10. Viewing logs
cd logs/
tail -300f hadoop-hadoop-namenode-localhost.localdomain.log
# for more verbose output, uncomment the following before starting:
#export HADOOP_ROOT_LOGGER=DEBUG,console
11. Common problems
Q1: After going through the steps above, jps shows no NameNode process.
A1: Check the NameNode log under logs/; it shows the following error:
2021-07-05 16:23:01,750 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2021-07-05 16:23:01,750 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2021-07-05 16:23:01,751 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2021-07-05 16:23:01,761 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.net.BindException: Problem binding to [localhost:9000] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
The error clearly shows a conflict: port 9000 was already in use, and changing the port resolves it. Because I hadn't developed the habit of checking the logs, this problem cost me a whole afternoon.
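Before restarting, you can probe whether the conflicting port is actually taken, without any extra tools, using bash's /dev/tcp pseudo-device (9000 is the port from the error; substitute whatever fs.defaultFS uses):

```shell
# Attempt a TCP connect to localhost:9000; a successful connect means
# some process is already listening there.
if (exec 3<>/dev/tcp/127.0.0.1/9000) 2>/dev/null; then
  echo "port 9000 is in use"
else
  echo "port 9000 is free"
fi
```

On systems with the iproute2 tools, `ss -ltnp` additionally shows which process holds the port.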
P.S. The other problems I ran into have already been folded back into the installation steps above, so they are not repeated here.
Conclusion
Installing Hadoop demands a fair amount of Linux fundamentals, which makes its learning curve relatively steep. I stepped into quite a few pitfalls myself and shored up my own basics in the process; the notes above are that record. Discussion and feedback are welcome.