Getting Started with Hadoop
Concepts
Hadoop is a distributed system infrastructure developed by the Apache Foundation, used mainly to solve the storage and the analysis/computation of massive data sets. In the broad sense, "Hadoop" refers to the whole Hadoop ecosystem.
Components
Hadoop 1.x consists of MapReduce, HDFS, and Common (auxiliary tools).
In the 1.x era, MapReduce handled both business-logic computation and resource scheduling, so the two concerns were tightly coupled.
Hadoop 2.x consists of MapReduce, YARN, HDFS, and Common; the newly added YARN takes over resource scheduling, leaving MapReduce responsible only for computation.
HDFS Architecture
HDFS is short for Hadoop Distributed File System. The notes below borrow from Shangguigu (尚硅谷) training materials:
1) NameNode: stores file metadata, such as file names, the directory structure, and file attributes, much like the table of contents of the martial-arts manual in a Stephen Chow film.
2) DataNode: stores file block data on the local file system, along with checksums for the block data.
3) SecondaryNameNode (2NN): an auxiliary background process that monitors HDFS state and takes a snapshot of the HDFS metadata at regular intervals.
YARN Architecture
1) ResourceManager (RM)
Responsibilities:
1. Handle client requests
2. Monitor the NodeManagers
3. Start and monitor ApplicationMasters
4. Allocate and schedule resources
2) NodeManager (NM)
1. Manage the resources on its node
2. Handle commands from the ResourceManager
3. Handle commands from the ApplicationMaster
3) ApplicationMaster (AM)
1. Split the input data
2. Request resources for the application and assign them to its internal tasks
3. Monitor tasks and provide fault tolerance
4) Container
A Container is YARN's abstraction of resources; it encapsulates multi-dimensional resources on a node, such as memory, CPU, and disk.
MapReduce Architecture Overview
MapReduce splits a computation into two phases, Map and Reduce.
The Map phase processes the input data in parallel.
The Reduce phase aggregates the Map results.
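A quick way to get a feel for the two phases is the classic shell pipeline analogy (this is not Hadoop itself, just an illustration): tr plays the role of map (emit one word per line), sort plays the shuffle (bring equal keys together), and uniq -c plays reduce (count each group):

```shell
# map: split lines into words; shuffle: sort so equal words are adjacent; reduce: count each group
printf 'hello world\nhello hadoop\n' | tr ' ' '\n' | sort | uniq -c
```

The output lists each distinct word with its count (hadoop: 1, hello: 2, world: 1), which is exactly what the wordcount job at the end of these notes produces on a real cluster.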
Installing Hadoop
Prerequisites
Three VMs, each with 4 GB of memory and 4 processors.
Create module and software directories under /opt.
Rename the three hosts to node1, node2, and node3:
hostnamectl set-hostname nodeXXX
Then log out and back in; the prompt now shows the new hostname.
停止防火墙
[root@localhost ~]#systemctl stop firewalld
#禁止防火墙随着系统启动而启动
[root@localhost ~]#systemctl disable firewalld
#查看防火墙状态
[root@localhost ~]#systemctl status firewalld
Disable SELinux
# Set the SELINUX value to disabled
[root@localhost ~]# vi /etc/selinux/config
# Verify the change
[root@localhost ~]# cat /etc/selinux/config
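Editing by hand works, but the SELINUX line can also be flipped non-interactively with sed. The sketch below demonstrates the substitution on a temporary file so it is safe to run anywhere; on the VMs you would point the same sed at /etc/selinux/config:

```shell
# Demonstrate the sed edit on a temp copy instead of the real /etc/selinux/config.
tmp_conf=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$tmp_conf"
# Rewrite whatever value the SELINUX= line currently has to "disabled".
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$tmp_conf"
cat "$tmp_conf"
rm -f "$tmp_conf"
```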
Passwordless SSH
First configure the hosts file, identically on all three machines:
vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.200.11 node1
192.168.200.12 node2
192.168.200.13 node3
We create a user bear with the password bear:
[root@node1 module]# useradd bear
[root@node1 module]# passwd bear
Then give it root (sudo) privileges. /etc/sudoers is read-only, so relax its permissions first, then edit it:
[root@localhost opt]# chmod 777 /etc/sudoers
vi /etc/sudoers
Below the existing root line, add a line for bear:
root ALL=(ALL) ALL
bear ALL=(ALL) ALL
After saving with :wq, restore restrictive permissions:
[root@localhost opt]# chmod 400 /etc/sudoers
Do the following first as root, then repeat it as the bear user:
ssh-keygen -t rsa
Then set up passwordless login between all three machines. Note that each machine must also authorize itself:
ssh-copy-id -i node1
ssh-copy-id -i node2
ssh-copy-id -i node3
Try ssh node1, ssh node2, and ssh node3 in turn; if each logs in without a password, it works.
Then do all of the same as the bear user.
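With three nodes the ssh-copy-id invocations get repetitive, so a small loop can issue them all. This is a sketch that assumes the default key path ~/.ssh/id_rsa.pub; it is shown as a dry run that only prints the commands (remove the echo to actually copy the keys):

```shell
# Dry run: print one ssh-copy-id command per node (including this node itself).
for host in node1 node2 node3; do
  echo ssh-copy-id -i ~/.ssh/id_rsa.pub "$host"
done
```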
Important note: when installing Hadoop (both the JDK and Hadoop itself), run everything as the bear user. Many of the screenshots below were taken as root; they come from earlier notes, but the commands are the same (just prefix sudo).
Installing the JDK
Create the module and software directories under /opt (as above), then from /opt run this command
sudo chown bear software/ module/
to give their ownership to the bear user.
Upload the JDK package into /opt/software.
Extract it:
tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt/module/
Then configure the environment variables in /etc/profile.
Run vi /etc/profile, press a capital G to jump to the end of the file, and add these two lines:
export JAVA_HOME=/opt/module/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
Then run source /etc/profile to make it take effect.
Check with the java command (original screenshot omitted); the JDK is now installed.
Install the JDK on the other two machines in the same way.
Installing Hadoop
Extract
Upload the Hadoop tarball to /opt/software,
then extract it:
tar -zxvf hadoop-2.7.2.tar.gz -C /opt/module/
Add these four lines to /etc/profile:
##HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
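As a sanity check on how these exports compose, the snippet below reproduces them (with the JAVA_HOME path from the JDK section) and prints the last entries of PATH; after a source /etc/profile on the real machines you should see the same tail:

```shell
# Reproduce the /etc/profile exports and inspect the resulting PATH.
export JAVA_HOME=/opt/module/jdk1.8.0_144
export HADOOP_HOME=/opt/module/hadoop-2.7.2
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
# Print the three entries that were just appended.
echo "$PATH" | tr ':' '\n' | tail -3
```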
Configuration
All of the following files live in the
/opt/module/hadoop-2.7.2/etc/hadoop
directory.
Configure the core file, core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-2.7.2/data/tmp</value>
</property>
</configuration>
Configure HDFS
As a rule, the *-env.sh files only need the Java path configured.
Configure hadoop-env.sh
$ vi hadoop-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure hdfs-site.xml
$ vi hdfs-site.xml
Write the following configuration into the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- Host for the Hadoop Secondary NameNode -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node3:50090</value>
</property>
</configuration>
Configure YARN
Configure yarn-env.sh
$ vi yarn-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure yarn-site.xml
$ vi yarn-site.xml
Add the following configuration to the file:
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- How reducers fetch data -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Address of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node2</value>
</property>
</configuration>
Configure MapReduce
Configure mapred-env.sh
$ vi mapred-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_144
Configure mapred-site.xml (if only the .template file exists, copy it and rename the copy mapred-site.xml):
$ cp mapred-site.xml.template mapred-site.xml
$ vi mapred-site.xml
Add the following configuration to the file:
<!-- Run MapReduce on YARN -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Configure slaves
/opt/module/hadoop-2.7.2/etc/hadoop/slaves
Replace its contents with:
node1
node2
node3
(delete the original localhost entry)
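Everything above was edited on one machine, but node2 and node3 need identical copies of the whole etc/hadoop directory (and of /opt/module/hadoop-2.7.2 itself if it is not there yet). One hypothetical way to sync them, shown as a dry run that only prints the scp commands (drop the echo to actually run them):

```shell
# Dry run: print the scp command that would push the config to each remaining node.
for host in node2 node3; do
  echo scp -r /opt/module/hadoop-2.7.2/etc/hadoop "$host":/opt/module/hadoop-2.7.2/etc/
done
```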
Startup
The first time the cluster starts, the NameNode must be formatted:
hadoop namenode -format
Start HDFS on node1:
sbin/start-dfs.sh
At this point, running jps on node1, node2, and node3 shows the HDFS daemons (the original screenshots are omitted).
Check node1's web UI at:
http://192.168.200.11:50070/dfshealth.html#tab-overview
Start YARN on node2 (the ResourceManager node):
sbin/start-yarn.sh
View the SecondaryNameNode in the browser
Open: http://node3:50090/status.html
Bonus: kill all Java processes
ps -ef | grep java | grep -v grep | awk '{print $2}' | xargs kill -9
Testing
#1. Take a look; the cluster was just built, so there is nothing yet
[bear@node1 hadoop-2.7.2]$ hdfs dfs -ls /
#2. Create a test directory
[bear@node1 hadoop-2.7.2]$ hdfs dfs -mkdir /test
#3. Look again; now it is there
[bear@node1 hadoop-2.7.2]$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x - bear supergroup 0 2020-07-27 21:53 /test
#4. Upload NOTICE.txt, which is conveniently at hand
[bear@node1 hadoop-2.7.2]$ ls
bin databak include libexec logs NOTICE.txt sbin
data etc lib LICENSE.txt logsbak README.txt share
[bear@node1 hadoop-2.7.2]$ hdfs dfs -put NOTICE.txt /test
[bear@node1 hadoop-2.7.2]$ hdfs dfs -ls /test
Found 1 items
-rw-r--r-- 3 bear supergroup 101 2020-07-27 21:54 /test/NOTICE.txt
#5. Try the classic wordcount example
[bear@node1 hadoop-2.7.2]$ cd share/hadoop/mapreduce/
[bear@node1 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /test/NOTICE.txt /test/output
20/07/27 21:58:31 INFO client.RMProxy: Connecting to ResourceManager at node2/192.168.200.12:8032
20/07/27 21:58:33 INFO input.FileInputFormat: Total input paths to process : 1
20/07/27 21:58:33 INFO mapreduce.JobSubmitter: number of splits:1
20/07/27 21:58:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1595857554264_0001
20/07/27 21:58:34 INFO impl.YarnClientImpl: Submitted application application_1595857554264_0001
20/07/27 21:58:34 INFO mapreduce.Job: The url to track the job: http://node2:8088/proxy/application_1595857554264_0001/
20/07/27 21:58:34 INFO mapreduce.Job: Running job: job_1595857554264_0001
20/07/27 21:58:45 INFO mapreduce.Job: Job job_1595857554264_0001 running in uber mode : false
20/07/27 21:58:45 INFO mapreduce.Job: map 0% reduce 0%
20/07/27 21:58:52 INFO mapreduce.Job: map 100% reduce 0%
20/07/27 21:58:58 INFO mapreduce.Job: map 100% reduce 100%
20/07/27 21:58:59 INFO mapreduce.Job: Job job_1595857554264_0001 completed successfully
20/07/27 21:58:59 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=173
FILE: Number of bytes written=235197
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=199
HDFS: Number of bytes written=123
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=4376
Total time spent by all reduces in occupied slots (ms)=3587
Total time spent by all map tasks (ms)=4376
Total time spent by all reduce tasks (ms)=3587
Total vcore-milliseconds taken by all map tasks=4376
Total vcore-milliseconds taken by all reduce tasks=3587
Total megabyte-milliseconds taken by all map tasks=4481024
Total megabyte-milliseconds taken by all reduce tasks=3673088
Map-Reduce Framework
Map input records=2
Map output records=11
Map output bytes=145
Map output materialized bytes=173
Input split bytes=98
Combine input records=11
Combine output records=11
Reduce input groups=11
Reduce shuffle bytes=173
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=383
CPU time spent (ms)=2520
Physical memory (bytes) snapshot=431595520
Virtual memory (bytes) snapshot=4219613184
Total committed heap usage (bytes)=296222720
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=101
File Output Format Counters
Bytes Written=123
Then check the output:
[bear@node1 mapreduce]$ hdfs dfs -ls /test/output
Found 2 items
-rw-r--r-- 3 bear supergroup 0 2020-07-27 21:58 /test/output/_SUCCESS
-rw-r--r-- 3 bear supergroup 123 2020-07-27 21:58 /test/output/part-r-00000
[bear@node1 mapreduce]$ hdfs dfs -text /test/output/part-r-00000
(http://www.apache.org/). 1
Apache 1
Foundation 1
Software 1
The 1
This 1
by 1
developed 1
includes 1
product 1
software 1
With that, our Hadoop cluster is up and running.