# HDFS
Official site: hadoop.apache.org
First, create an ordinary user and set its password:
[root@server21 ~]# useradd hadoop
[root@server21 ~]# echo westos | passwd --stdin hadoop
[root@server21 ~]# su - hadoop
[hadoop@server21 ~]$ pwd
/home/hadoop
1. Installing and Configuring Hadoop
- Unpack the JDK and create a symlink to it (this makes later upgrades easy)
[hadoop@server21 ~]$ pwd
/home/hadoop
[hadoop@server21 ~]$ ls
hadoop-3.2.1.tar.gz jdk-8u181-linux-x64.tar.gz
[hadoop@server21 ~]$ tar zxf jdk-8u181-linux-x64.tar.gz
[hadoop@server21 ~]$ ls
hadoop-3.2.1.tar.gz jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
[hadoop@server21 ~]$ ln -s jdk1.8.0_181/ java
[hadoop@server21 ~]$ ls
hadoop-3.2.1.tar.gz java jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
[hadoop@server21 ~]$ ll
total 532076
-rw-r--r-- 1 hadoop hadoop 359196911 Apr 24 09:56 hadoop-3.2.1.tar.gz
lrwxrwxrwx 1 hadoop hadoop 13 Apr 24 09:57 java -> jdk1.8.0_181/
drwxr-xr-x 7 hadoop hadoop 245 Jul 7 2018 jdk1.8.0_181
-rw-r--r-- 1 hadoop hadoop 185646832 Apr 24 09:56 jdk-8u181-linux-x64.tar.gz
- Configure the JDK environment variables so the system can find the JDK
[hadoop@server21 ~]$ vim .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/java/bin
[hadoop@server21 ~]$ source .bash_profile
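A quick sanity check that the PATH change took effect (the `$HOME/java` path is the symlink created above; adjust it for other layouts):

```shell
# Append the symlinked JDK bin directory to PATH, then confirm the
# shell will actually search it.
export PATH="$PATH:$HOME/java/bin"
case ":$PATH:" in
  *":$HOME/java/bin:"*) echo "PATH ok" ;;
  *)                    echo "PATH missing java/bin" ;;
esac
# Once the symlink exists, 'java -version' should report 1.8.0_181.
```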
- Unpack Hadoop and create a symlink to it
[hadoop@server21 ~]$ tar zxf hadoop-3.2.1.tar.gz
[hadoop@server21 ~]$ ln -s hadoop-3.2.1 hadoop
[hadoop@server21 ~]$ ll
total 532076
lrwxrwxrwx 1 hadoop hadoop 12 Apr 24 09:58 hadoop -> hadoop-3.2.1
drwxr-xr-x 9 hadoop hadoop 149 Sep 11 2019 hadoop-3.2.1
-rw-r--r-- 1 hadoop hadoop 359196911 Apr 24 09:56 hadoop-3.2.1.tar.gz
lrwxrwxrwx 1 hadoop hadoop 13 Apr 24 09:57 java -> jdk1.8.0_181/
drwxr-xr-x 7 hadoop hadoop 245 Jul 7 2018 jdk1.8.0_181
-rw-r--r-- 1 hadoop hadoop 185646832 Apr 24 09:56 jdk-8u181-linux-x64.tar.gz
- Enter Hadoop's configuration directory and edit hadoop-env.sh
to point it at the JDK and Hadoop installation paths
[hadoop@server21 ~]$ cd hadoop/etc/hadoop/
[hadoop@server21 hadoop]$ vim hadoop-env.sh
export JAVA_HOME=/home/hadoop/java
export HADOOP_HOME=/home/hadoop/hadoop
2. Hadoop Mode 1: Standalone (Local) Mode
Standalone mode needs no extra configuration. In this mode Hadoop runs as a single Java process, which makes it convenient for debugging.
- First, create an input directory
[hadoop@server21 hadoop]$ cd /home/hadoop/hadoop
[hadoop@server21 hadoop]$ ls
bin etc include lib libexec LICENSE.txt NOTICE.txt README.txt sbin share
[hadoop@server21 hadoop]$ mkdir input
[hadoop@server21 hadoop]$ cp etc/hadoop/*.xml input
- Run the grep example program that ships with Hadoop
When the job finishes, the match counts have been written to the output directory
(the output directory is created automatically;
in standalone mode it is an ordinary local directory, so cat output/*
shows the results directly)
[hadoop@server21 ~]$ cd hadoop
[hadoop@server21 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
[hadoop@server21 hadoop]$ ls input/
capacity-scheduler.xml hadoop-policy.xml httpfs-site.xml kms-site.xml yarn-site.xml
core-site.xml hdfs-site.xml kms-acls.xml mapred-site.xml
[hadoop@server21 hadoop]$ cd output/
[hadoop@server21 output]$ ls
part-r-00000 _SUCCESS
[hadoop@server21 output]$ cat *
1 dfsadmin
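What the example job computes can be approximated locally with plain grep: count every occurrence of `dfs[a-z.]+` across the input XML files (the `/tmp/grep_demo` directory and its contents below are made up for this sketch):

```shell
# Build a tiny stand-in input and count pattern matches, mimicking the
# grep example's map (extract matches) and reduce (count) phases.
mkdir -p /tmp/grep_demo
printf '<name>dfs.replication</name>\n<name>dfs.hosts</name>\n' > /tmp/grep_demo/a.xml
grep -hoE 'dfs[a-z.]+' /tmp/grep_demo/*.xml | sort | uniq -c
```

Against the real `etc/hadoop/*.xml` inputs, only `dfsadmin` matched, which is why the job's output above shows a single line.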
3. Hadoop Mode 2: Pseudo-Distributed
A pseudo-distributed Hadoop setup can be viewed as a cluster with a single node.
In this cluster, that one node is both the Master and the Slave,
both the NameNode and the DataNode,
both the JobTracker and the TaskTracker
3.1 Basic configuration
- Edit the configuration files
First, Hadoop's core configuration file,
which sets the HDFS address and port
[hadoop@server21 ~]$ cd hadoop/etc/hadoop/
[hadoop@server21 hadoop]$ vim core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- Edit the configuration file
Next, the HDFS configuration. The default replication factor is 3, but on a single-node (pseudo-distributed) setup it must be lowered to 1
[hadoop@server21 hadoop]$ vim hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
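A hedged way to double-check the value actually written, by grepping a copy of the file (the `/tmp` path is a stand-in for this sketch; on a running cluster, `bin/hdfs getconf -confKey dfs.replication` gives the effective value):

```shell
# Write a stand-in copy of the config, then extract the replication
# value that follows the dfs.replication name element.
CONF=/tmp/hdfs-site-check.xml
cat > "$CONF" <<'EOF'
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF
grep -A1 '<name>dfs.replication</name>' "$CONF" | grep -oE '[0-9]+'
```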
- Set up passwordless SSH login
[hadoop@server21 hadoop]$ ssh-keygen
[hadoop@server21 hadoop]$ ssh-copy-id localhost
[hadoop@server21 hadoop]$ ll workers
-rw-r--r-- 1 hadoop hadoop 10 Sep 10 2019 workers
[hadoop@server21 hadoop]$ cat workers
localhost
Verify that passwordless login works
[hadoop@server21 hadoop]$ ssh localhost
Last login: Sat Apr 24 09:55:25 2021
[hadoop@server21 ~]$ pwd
/home/hadoop
- Before starting Hadoop, its file system, HDFS, must be formatted
[hadoop@server21 hadoop]$ pwd
/home/hadoop/hadoop
[hadoop@server21 hadoop]$ bin/hdfs namenode -format
- As a final check, confirm the expected data is in place:
the temporary data directory /tmp
should now contain Hadoop's data
[hadoop@server21 hadoop]$ id
uid=1000(hadoop) gid=1000(hadoop) groups=1000(hadoop)
[hadoop@server21 hadoop]$ ls /tmp/
hadoop hadoop-hadoop hadoop-hadoop-namenode.pid hsperfdata_hadoop
- Now it can be started
Note: at this point I had forgotten to write the local resolution file /etc/hosts
[hadoop@server21 hadoop]$ sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [server21]
server21: Warning: Permanently added 'server21,fe80::5054:ff:fe24:dfa8%eth0' (ECDSA) to the list of known hosts.
- The jps
command
shows whether Hadoop started successfully.
As expected, this host is now both the NameNode and the DataNode,
and port 9000 is open as well
[hadoop@server21 ~]$ jps
15040 DataNode
15448 Jps
14941 NameNode
15183 SecondaryNameNode
[root@server21 ~]# netstat -antpl | grep java
tcp 0 0 127.0.0.1:9000 0.0.0.0:* LISTEN 14941/java
tcp 0 0 0.0.0.0:9864 0.0.0.0:* LISTEN 15040/java
tcp 0 0 0.0.0.0:9866 0.0.0.0:* LISTEN 15040/java
tcp 0 0 0.0.0.0:9867 0.0.0.0:* LISTEN 15040/java
tcp 0 0 0.0.0.0:9868 0.0.0.0:* LISTEN 15183/java
tcp 0 0 0.0.0.0:9870 0.0.0.0:* LISTEN 14941/java
tcp 0 0 127.0.0.1:35557 0.0.0.0:* LISTEN 15040/java
tcp 0 0 127.0.0.1:58384 127.0.0.1:9000 ESTABLISHED 15040/java
tcp 0 0 127.0.0.1:9000 127.0.0.1:58384 ESTABLISHED 14941/java
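The daemons in the netstat output above listen on well-known Hadoop 3.x default ports (9000 is not a default but the `fs.defaultFS` value configured earlier); a small lookup table, handy when checking firewall rules:

```shell
# Map HDFS daemon roles to the ports observed above (bash 4+
# associative array).
declare -A HDFS_PORTS=(
  [namenode_rpc]=9000        # fs.defaultFS endpoint in this setup
  [namenode_webui]=9870
  [datanode_webui]=9864
  [datanode_data]=9866
  [datanode_ipc]=9867
  [secondarynn_webui]=9868
)
for svc in "${!HDFS_PORTS[@]}"; do
  printf '%-18s %s\n' "$svc" "${HDFS_PORTS[$svc]}"
done
```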
3.2 Basic Hadoop commands and a test run
- List the current directory
Clearly, there is no data in Hadoop yet
[hadoop@server21 hadoop]$ bin/hdfs dfs -ls
ls: `.': No such file or directory
- Create the /user directory and the user's home directory
Running ls again
shows that the directories now exist, but are still empty
[hadoop@server21 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@server21 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@server21 hadoop]$ bin/hdfs dfs -ls
[hadoop@server21 hadoop]$
- Upload the input
directory to Hadoop
[hadoop@server21 hadoop]$ ls
bin include lib LICENSE.txt NOTICE.txt README.txt share
etc input libexec logs output sbin
[hadoop@server21 hadoop]$ rm -fr output/
[hadoop@server21 hadoop]$ bin/hdfs dfs -put input
- Even if the local input
directory is deleted, browsing Hadoop's contents through the web page shows that the input
directory still exists in HDFS.
Next, run another bundled example program, wordcount.
The example jar (with the version in its name) can be found under the
share/hadoop/mapreduce directory.
When the job finishes, the word counts have been written to HDFS's output directory; run bin/hdfs dfs -cat output/*
to view them
[hadoop@server21 hadoop]$ rm -fr input/
[hadoop@server21 hadoop]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount input output
[hadoop@server21 hadoop]$ bin/hdfs dfs -cat output/*
[hadoop@server21 hadoop]$ bin/hdfs dfs -ls
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2021-04-24 11:18 input
drwxr-xr-x - hadoop supergroup 0 2021-04-24 11:21 output
- Download data from HDFS to the local file system
[hadoop@server21 hadoop]$ bin/hdfs dfs -get output
2021-04-24 11:26:01,017 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[hadoop@server21 output]$ ls
part-r-00000 _SUCCESS
- Delete data stored in Hadoop (this cannot be done from the web page: the web UI is accessed as an anonymous user, which has no delete permission)
[hadoop@server21 output]$ cd ..
[hadoop@server21 hadoop]$ ls
bin include libexec logs output sbin
etc lib LICENSE.txt NOTICE.txt README.txt share
[hadoop@server21 hadoop]$ rm -fr output/
[hadoop@server21 hadoop]$ bin/hdfs dfs -rm -r output
Deleted output
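The hdfs dfs subcommands used in this section mirror familiar POSIX tools; a side-by-side reminder (just a printed table, no cluster needed):

```shell
# Local commands and their closest HDFS equivalents from this section.
TABLE='local command     HDFS equivalent
ls                bin/hdfs dfs -ls
mkdir DIR         bin/hdfs dfs -mkdir DIR
(upload)          bin/hdfs dfs -put LOCAL
(download)        bin/hdfs dfs -get REMOTE
cat FILE          bin/hdfs dfs -cat FILE
rm -r DIR         bin/hdfs dfs -rm -r DIR'
printf '%s\n' "$TABLE"
```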
4. Hadoop Mode 3: Fully Distributed
Hadoop's configuration files fall into two classes:
(1) Read-only defaults bundled with the distribution: core-default.xml,
hdfs-default.xml,
and mapred-default.xml
(2) Site-specific settings: etc/hadoop/core-site.xml,
etc/hadoop/hdfs-site.xml,
and etc/hadoop/mapred-site.xml
In addition, etc/hadoop/hadoop-env.sh
can be edited to set environment variables for the Hadoop daemons (whose startup scripts live in the sbin
directory)
In Hadoop's design, configuration is organized as resources. Each resource consists of a set of
name/value
pairs in an XML file, and is named either by a string or by a Hadoop-defined Path
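A toy illustration of this resource model: defaults are read first, and a later (site) resource overrides the same name. The `/tmp/conf_demo` files and their key=value format are stand-ins for this sketch, not the real XML loader.

```shell
# Simulate default -> site precedence: later resources in the list win.
mkdir -p /tmp/conf_demo
echo 'dfs.replication=3' > /tmp/conf_demo/core-default.properties
echo 'dfs.replication=2' > /tmp/conf_demo/core-site.properties
lookup() {
  local key=$1 val= v
  for f in core-default.properties core-site.properties; do
    v=$(sed -n "s/^$key=//p" "/tmp/conf_demo/$f")
    [ -n "$v" ] && val=$v
  done
  echo "$val"
}
lookup dfs.replication   # the site value (2) overrides the default (3)
```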
4.1 Preparing the NameNode and DataNodes
- Add two nodes
All nodes must run identical platforms.
Note: here server21 serves as the NameNode, while server22 and server23 serve as DataNodes
(1) Write the local resolution file /etc/hosts
(2) Create the hadoop user
(3) Set the hadoop user's password
[root@server22 ~]# vim /etc/hosts
[root@server22 ~]# useradd hadoop
[root@server22 ~]# echo westos | passwd --stdin hadoop
Changing password for user hadoop.
passwd: all authentication tokens updated successfully.
[root@server23 ~]# vim /etc/hosts
[root@server23 ~]# useradd hadoop
[root@server23 ~]# echo westos | passwd --stdin hadoop
Changing password for user hadoop.
passwd: all authentication tokens updated successfully.
- Stop the processes left over from the pseudo-distributed setup
[hadoop@server21 hadoop]$ sbin/stop-dfs.sh
Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [server21]
server21: Warning: Permanently added the ECDSA host key for IP address '172.25.21.21' to the list of known hosts.
- Install NFS on all three nodes
(one NameNode and two DataNodes)
[root@server21 ~]# yum install -y nfs-utils.x86_64
[root@server22 ~]# yum install -y nfs-utils.x86_64
[root@server23 ~]# yum install -y nfs-utils.x86_64
- After installing NFS, the NameNode (server21) needs a configuration entry exporting the
/home/hadoop
directory
[root@server21 ~]# vim /etc/exports
/home/hadoop *(rw,anonuid=1000,anongid=1000)
[root@server21 ~]# systemctl start nfs
[root@server21 ~]# showmount -e
Export list for server21:
/home/hadoop *
- Mount the shared directory on each DataNode
(only the superuser can mount)
Why mount at all?
When deploying a cluster, every node needs the same configuration edits and operations, and making them one host at a time is tedious. Sharing the directory over NFS means that whatever happens in the source directory (server21's /home/hadoop)
also happens, identically, on the other two hosts
(which avoids a lot of unnecessary mistakes,
and the platforms were required to be identical anyway)
[root@server22 ~]# showmount -e 172.25.21.21
Export list for 172.25.21.21:
/home/hadoop *
[root@server22 ~]# mount 172.25.21.21:/home/hadoop/ /home/hadoop/
[root@server22 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 17811456 1164524 16646932 7% /
devtmpfs 1011448 0 1011448 0% /dev
tmpfs 1023468 0 1023468 0% /dev/shm
tmpfs 1023468 16996 1006472 2% /run
tmpfs 1023468 0 1023468 0% /sys/fs/cgroup
/dev/vda1 1038336 135076 903260 14% /boot
tmpfs 204696 0 204696 0% /run/user/0
172.25.21.21:/home/hadoop 17811456 3001344 14810112 17% /home/hadoop
[root@server22 ~]# su - hadoop
[hadoop@server22 ~]$ ls
hadoop hadoop-3.2.1 hadoop-3.2.1.tar.gz java jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
[root@server23 ~]# showmount -e 172.25.21.21
Export list for 172.25.21.21:
/home/hadoop *
[root@server23 ~]# mount 172.25.21.21:/home/hadoop/ /home/hadoop/
[root@server23 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 17811456 1164528 16646928 7% /
devtmpfs 1011448 0 1011448 0% /dev
tmpfs 1023468 0 1023468 0% /dev/shm
tmpfs 1023468 16996 1006472 2% /run
tmpfs 1023468 0 1023468 0% /sys/fs/cgroup
/dev/vda1 1038336 135076 903260 14% /boot
tmpfs 204696 0 204696 0% /run/user/0
172.25.21.21:/home/hadoop 17811456 3001344 14810112 17% /home/hadoop
[root@server23 ~]# su - hadoop
[hadoop@server23 ~]$ ls
hadoop hadoop-3.2.1 hadoop-3.2.1.tar.gz java jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
- Passwordless SSH is needed between the NameNode and the DataNodes
Because the DataNodes have just mounted server21's /home/hadoop,
the mounted directory already contains the SSH key pair, so there is no need to generate keys and scp them around;
all that remains is to verify passwordless login
[root@server21 ~]# su - hadoop
[hadoop@server21 ~]$ ssh server22
[hadoop@server21 ~]$ ssh server23
[root@server22 ~]# su - hadoop
[hadoop@server22 ~]$ ssh server21
[hadoop@server22 ~]$ ssh server23
[root@server23 ~]# su - hadoop
[hadoop@server23 ~]$ ssh server21
[hadoop@server23 ~]$ ssh server22
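The pairwise checks above can be scripted; a hedged helper where BatchMode makes ssh fail fast instead of prompting, and SSH_CMD is overridable so the loop can be exercised without live hosts (an assumption of this sketch, not a Hadoop feature):

```shell
# Report key-based SSH reachability for each host argument.
check_ssh() {
  local ssh_cmd=${SSH_CMD:-"ssh -o BatchMode=yes -o ConnectTimeout=3"}
  local host
  for host in "$@"; do
    if $ssh_cmd "$host" true 2>/dev/null; then
      echo "$host: key auth OK"
    else
      echo "$host: key auth FAILED"
    fi
  done
}
# On the lab hosts one would run: check_ssh server21 server22 server23
```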
4.2 Configuring the Hadoop cluster
- Edit Hadoop's core configuration file
Note: server21 in the snippet is the hostname from the resolution file (the master's hostname)
[hadoop@server21 hadoop]$ pwd
/home/hadoop/hadoop/etc/hadoop
[hadoop@server21 hadoop]$ vim core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://server21:9000</value>
</property>
</configuration>
- Edit the workers file
server22 and server23 are the slaves (DataNodes)
[hadoop@server21 hadoop]$ vim workers
server22
server23
- Edit Hadoop's HDFS configuration file
[hadoop@server21 hadoop]$ vim hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
- Remove the temporary data left over from the pseudo-distributed run
[hadoop@server21 hadoop]$ rm -fr /tmp/*
- Format the Hadoop file system
[hadoop@server21 hadoop]$ pwd
/home/hadoop/hadoop
[hadoop@server21 hadoop]$ bin/hdfs namenode -format
- Start Hadoop
[hadoop@server21 hadoop]$ sbin/start-dfs.sh
Starting namenodes on [server21]
Starting datanodes
Starting secondary namenodes [server21]
[hadoop@server21 hadoop]$ jps
17892 Jps
17752 SecondaryNameNode
17562 NameNode
- On a DataNode, the running process can be seen
[hadoop@server22 ~]$ jps
4761 Jps
4698 DataNode
Test (storing data)
- Generate a 200 MB bigfile
The file will be split into 2 blocks, because HDFS's default block size is 128 MB
[hadoop@server21 hadoop]$ dd if=/dev/zero of=bigfile bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 0.144164 s, 1.5 GB/s
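The two-block figure follows from HDFS's default 128 MB block size, by ceiling division:

```shell
# ceil(200 / 128) = 2: a 200 MB file occupies two 128 MB blocks.
FILE_MB=200
BLOCK_MB=128
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))   # ceiling division
echo "$FILE_MB MB -> $BLOCKS blocks"
```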
- Use the following command to inspect the state of the Hadoop cluster
[hadoop@server21 hadoop]$ bin/hdfs dfsadmin -report
- Upload it
[hadoop@server21 hadoop]$ bin/hdfs dfs -mkdir /user
[hadoop@server21 hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
[hadoop@server21 hadoop]$ bin/hdfs dfs -put bigfile
2021-04-24 11:59:27,651 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-04-24 11:59:29,933 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
4.3 Adding a node online (hot add)
- Add another DataNode
Remember to keep the platform identical
(1) Write the local resolution entry
(2) Create the hadoop user
(3) Install NFS and mount the NameNode's shared directory (because of the mount, there is no need to set a password for the new hadoop user on this host)
(4) Run the jps
command to check whether the environment is fully deployed
(it is)
[root@server24 ~]# vim /etc/hosts
[root@server24 ~]# useradd hadoop
[root@server24 ~]# yum install -y nfs-utils
[root@server24 ~]# mount 172.25.21.21:/home/hadoop/ /home/hadoop/
[root@server24 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 17811456 1164540 16646916 7% /
devtmpfs 1011448 0 1011448 0% /dev
tmpfs 1023468 0 1023468 0% /dev/shm
tmpfs 1023468 16996 1006472 2% /run
tmpfs 1023468 0 1023468 0% /sys/fs/cgroup
/dev/vda1 1038336 135076 903260 14% /boot
tmpfs 204696 0 204696 0% /run/user/0
172.25.21.21:/home/hadoop 17811456 3206400 14605056 19% /home/hadoop
[root@server24 ~]# su - hadoop
[hadoop@server24 ~]$ ls
hadoop hadoop-3.2.1 hadoop-3.2.1.tar.gz java jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
[hadoop@server24 ~]$ jps
4687 Jps
- Also remember that the newly added server24, as a DataNode (slave), must be added to the
workers
file
[hadoop@server24 ~]$ cd hadoop/etc/hadoop/
[hadoop@server24 hadoop]$ vim workers
server22
server23
server24
- Start the new node
Note: this is started differently from before, because the existing Hadoop cluster was never stopped (that is what makes it a hot add)
[hadoop@server24 hadoop]$ bin/hdfs --daemon start datanode
[hadoop@server24 hadoop]$ jps
4767 DataNode
4799 Jps
Test (simulating a user upload)
- Suppose a client now asks to upload data; the client has already obtained the DataNode information from the NameNode.
The client splits the file demo into 2 blocks and hands them to the designated DataNodes, which replicate and store the blocks
[hadoop@server24 hadoop]$ ls
bigfile etc lib LICENSE.txt NOTICE.txt sbin
bin include libexec logs README.txt share
[hadoop@server24 hadoop]$ mv bigfile demo
[hadoop@server24 hadoop]$ bin/hdfs dfs -put demo
2021-04-24 12:13:04,467 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2021-04-24 12:13:06,858 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
- On the DataNode, the core configuration file shows the NameNode's address and port
[hadoop@server24 hadoop]$ pwd
/home/hadoop/hadoop/etc/hadoop
[hadoop@server24 hadoop]$ cat core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://server21:9000</value>
</property>
</configuration>
- The web page shows that there are 2 blocks, located on server24 and server22.
In Hadoop, the network topology, and the positions of machine nodes and racks within it, are described as a tree. The tree determines the distance between nodes, and this distance is a factor Hadoop weighs when making decisions.
The NameNode also uses this distance to decide where to place data replicas.
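Hadoop learns this rack placement from an admin-supplied topology script (config key net.topology.script.file.name): given hostnames or IPs, it prints one rack path per argument. The mapping below is hypothetical for this lab's hosts.

```shell
# Hypothetical rack mapping for the lab nodes; unknown hosts fall back
# to /default-rack, as Hadoop itself does without a script.
resolve_rack() {
  case "$1" in
    server22|172.25.21.22) echo /rack1 ;;
    server23|172.25.21.23) echo /rack1 ;;
    server24|172.25.21.24) echo /rack2 ;;
    *)                     echo /default-rack ;;
  esac
}
for arg in "$@"; do resolve_rack "$arg"; done
```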