The big data overview covers everything above (except algorithms and development)
1. Big Data
1.1 Definition of big data
1.2 Characteristics of big data
1.3 The relationship between big data and Hadoop
2. Hadoop
2.1 Common Hadoop components - the focus is on the core components (HDFS, MapReduce, Yarn)
2.2 Hadoop core components
In a cluster the machines hold identical data; in a distributed system the machines hold different data
2.3 Hadoop ecosystem
Yellow marks the core components, which are the foundation
3. HDFS
3.1 HDFS architecture
The client splits data into 128 MB blocks. After splitting, each block is placed on a DataNode, but the client does not know which DataNode to use, so it asks the NameNode, which tells it which DataNode(s) to write to (for example the second machine; more than one machine may be returned so that replicas can be stored). When the data is read back later, the client asks the NameNode again, and the NameNode says which DataNode(s) to read from; several machines may be returned (the ones holding replicas). The replicas are identical, so the data can be read from any of them.
After the client writes or reads data, the DataNode (the second machine in the example) reports back to the NameNode, so the NameNode always knows the current state.
The NameNode keeps a ledger recording which block is stored on which DataNode; this ledger is the HDFS namespace and block-mapping information, called the fsimage. The NameNode also decides how many replicas to keep: it returns to the client as many machines as the number of replicas it wants.
If the data is modified but the NameNode has no time to apply the change itself, it sends the pending change to the Secondary NameNode, which applies it as a patch; these patches are called fsedits (the edit log). The fsimage and fsedits are merged periodically (the original metadata is sent first, then the patches, which keeps the data correct) and the result is sent back to the NameNode. Because the merge runs regularly, if the NameNode's data is ever lost, the Secondary NameNode can send it the most recently merged fsimage.
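Once a cluster is up (sections 6-8 below), this block and replica placement can be observed directly. A minimal sketch, where the HDFS path /demo.txt is only an example:
[root@nn01 ~]# /usr/local/hadoop/bin/hdfs dfs -put /etc/passwd /demo.txt    the client writes the file; the NameNode chooses the DataNodes
[root@nn01 ~]# /usr/local/hadoop/bin/hdfs fsck /demo.txt -files -blocks -locations    lists every block of the file and the DataNodes holding its replicas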
3.2 HDFS roles and concepts
A role describes the function a component performs for the system as a whole
4. MapReduce
4.1 MapReduce architecture
The client submits a job to the JobTracker, which assigns tasks to the TaskTrackers (executed in parallel); when a TaskTracker finishes its task, it reports back to the JobTracker
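In practice the client submits a job with hadoop jar; the bundled wordcount example used later in section 6 is one such job (the input and output directories here are placeholders):
[root@nn01 hadoop]# /usr/local/hadoop/bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount <input dir> <output dir>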
4.2 MapReduce roles and concepts
5. Yarn
5.1 Yarn architecture
5.2 Yarn roles and concepts
6. Installing Hadoop
Hadoop modes:
standalone
pseudo-distributed
fully distributed
6.1 Problem
This exercise requires installing Hadoop in standalone mode:
Install Hadoop in standalone mode
Install the Java environment and the jps tool
Set the environment variables and start it up
6.2 Steps
Complete this exercise with the following steps.
Step 1: prepare the environment
1) Set the hostname to nn01 and the IP address to 192.168.1.60, and configure the yum repository (system repository)
Note: these were already done in earlier exercises and are not repeated here; a reference sketch follows
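For reference, a minimal sketch of that preparation on a fresh machine (the connection name eth0 is an assumption; check nmcli connection show for the real name):
[root@localhost ~]# hostnamectl set-hostname nn01
[root@localhost ~]# nmcli connection modify eth0 ipv4.method manual ipv4.addresses 192.168.1.60/24
[root@localhost ~]# nmcli connection up eth0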
2) Install the Java environment
[root@nn01 ~]# yum -y install java-1.8.0-openjdk-devel    java-1.8.0-openjdk is installed automatically as a dependency
[root@nn01 ~]# java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b12)
OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)
[root@nn01 ~]# jps    lists the Java programs running on this machine and their PIDs
1215 Jps    (pid, name)
[root@nn01 ~]# ping localhost    Hadoop depends heavily on name resolution; this must succeed
PING 192.168.1.60 (192.168.1.60) 56(84) bytes of data.
64 bytes from 192.168.1.60: icmp_seq=1 ttl=255 time=0.017 ms
64 bytes from 192.168.1.60: icmp_seq=2 ttl=255 time=0.022 ms
64 bytes from 192.168.1.60: icmp_seq=3 ttl=255 time=0.018 ms
--- 192.168.1.60 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.017/0.019/0.022/0.002 ms
3) Install hadoop
[root@room9pc01 ~]# cd /linux-soft/04/hadoop
[root@room9pc01 hadoop]# scp -r .* root@192.168.1.60:/root
zookeeper-3.4.13.tar.gz 100% 35MB 79.0MB/s 00:00
kafka_2.12-2.1.0.tgz 100% 53MB 83.7MB/s 00:00
hadoop-2.7.7.tar.gz 100% 209MB 32.4MB/s 00:06
scp: error: unexpected filename: ..    the .* pattern also matches .., which scp refuses to copy; the three tarballs above were still transferred
[root@nn01 ~]# ls
hadoop-2.7.7.tar.gz kafka_2.12-2.1.0.tgz zookeeper-3.4.13.tar.gz
[root@nn01 hadoop]# tar -xf hadoop-2.7.7.tar.gz
[root@nn01 hadoop]# mv hadoop-2.7.7 /usr/local/hadoop
[root@nn01 hadoop]# cd /usr/local/hadoop
[root@nn01 hadoop]# ls
bin include libexec NOTICE.txt sbin
etc lib LICENSE.txt README.txt share
[root@nn01 hadoop]# ./bin/hadoop //errors out: JAVA_HOME not found
Error: JAVA_HOME is not set and could not be found.
[root@nn01 hadoop]#
4) Fix the error
[root@nn01 hadoop]# rpm -ql java-1.8.0-openjdk
[root@nn01 hadoop]# cd /usr/local/hadoop/etc/hadoop/
[root@nn01 hadoop]# vim hadoop-env.sh
+25 export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre"    the Java home directory
+33 export HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"    the Hadoop configuration directory
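If the exact JVM path on your machine is unclear, it can be confirmed before editing hadoop-env.sh; strip the trailing /bin/java from the result (the version string will differ from machine to machine):
[root@nn01 hadoop]# readlink -f $(which java)
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-2.b14.el7.x86_64/jre/bin/java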
[root@nn01 ~]# cd /usr/local/hadoop/
[root@nn01 ~]# /usr/local/hadoop/bin/hadoop    path to the executable
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
CLASSNAME run the class named CLASSNAME
or
where COMMAND is one of:
fs run a generic filesystem user client
version print the version
jar <jar> run a jar file
note: please use "yarn jar" to launch
YARN applications, not this command.
checknative [-a|-h] check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
Hadoop jar and the required libraries
credential interact with credential providers
daemonlog get/set the log level for each daemon
trace view and modify Hadoop tracing settings
Most commands print help when invoked w/o parameters.
[root@nn01 ~]# mkdir /usr/local/hadoop/oo    create an arbitrary folder
[root@nn01 ~]# ls
bin etc include lib libexec LICENSE.txt NOTICE.txt oo README.txt sbin share
[root@nn01 ~]# cp /usr/local/hadoop/*.txt /usr/local/hadoop/oo
[root@nn01 hadoop]# /usr/local/hadoop/bin/hadoop jar
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount oo xx
wordcount (word-count statistics) is the argument: count the words in the oo folder and write the result into the xx folder (xx must not already exist; if it does, the job fails, which prevents existing data from being overwritten). jar runs a program already written in Java; the path that follows, share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar, is the jar file to run
[root@nn01 hadoop]# cat xx/part-r-00000 //view the result
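Each line of part-r-00000 is a word followed by its count, separated by a tab, for example (values purely illustrative):
Apache	7
hadoop	12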
7. Pseudo-distributed mode
Same as fully distributed mode, so it is not explained in detail
7.1 Definition of pseudo-distributed mode
A pseudo-distributed installation is similar to a fully distributed one; the difference is that all roles are installed on a single machine and use the local disk. Production environments normally use fully distributed mode; pseudo-distributed mode is mainly used for learning and for testing Hadoop features.
7.2 Configuration file formats
- hadoop-env.sh
JAVA_HOME
HADOOP_CONF_DIR
- xml configuration file format
keyword
value
description
Visit https://hadoop.apache.org/docs/ and open the page
Scroll down to the last section
The last section lists:
core-default.xml      the default core configuration file
hdfs-default.xml      the default HDFS configuration file
mapred-default.xml    the default MapReduce configuration file
yarn-default.xml      the default YARN configuration file
Click core-default.xml to open the link; you will see three columns, name, value and description, which correspond exactly to the three fields of a configuration entry
keyword       look it up in the name column; copy the matching entry exactly, it must not be changed
value         look it up in the value column; find the matching entry, this one may be changed
description   shown in the description column; it is optional and you may write your own
To search quickly, press / to open the find bar (the sketch below shows the resulting entry format)
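Every entry in core-site.xml, hdfs-site.xml and the other xml files therefore follows the same three-field pattern; a generic sketch (the name and its default value are taken from core-default.xml, the description is free text):
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
<description>optional description text</description>
</property>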
8. Fully distributed mode
8.1 Problem
This exercise requires:
Prepare three more virtual machines and install Hadoop
Make sure every node can ping the others and configure SSH trust
Verify the nodes
8.2 Plan
Prepare four virtual machines. One was already prepared earlier, so only three new virtual machines are needed. Install hadoop, make sure all nodes can ping each other, and configure SSH trust, as shown in the figure:
DataNode high availability comes from the NameNode's built-in replication: the number of DataNodes that may fail is at most the replica count minus 1 (with 2 replicas, for example, losing one DataNode still leaves a full copy of every block). The hadoop configuration is identical across the cluster, so configure one machine and scp it to the rest
8.3 Steps
Complete this exercise with the following steps.
Step 1: prepare the environment
1) On the three machines set the hostnames to node1, node2 and node3, configure the IP addresses (as shown in the figure above) and the yum repository (system repository)
Disable selinux and disable firewalld (a sketch follows)
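A minimal sketch of disabling both on each node (node1 shown; repeat on node2 and node3):
[root@node1 ~]# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config    takes effect after the next reboot
[root@node1 ~]# setenforce 0    switch to permissive mode for the current session
[root@node1 ~]# systemctl stop firewalld
[root@node1 ~]# systemctl disable firewalld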
2) Edit /etc/hosts (the same on all four hosts; nn01 shown as the example). Every host must be able to ping the others, by IP and by hostname
[root@nn01 ~]# vim /etc/hosts
192.168.1.60 nn01
192.168.1.61 node1
192.168.1.62 node2
192.168.1.63 node3
3) Install the Java environment on node1, node2 and node3 (node1 shown as the example)
[root@node1 ~]# yum -y install java-1.8.0-openjdk-devel
4) Set up SSH trust
[root@nn01 ~]# vim /etc/ssh/ssh_config    so the first login does not ask for "yes"
Host *
GSSAPIAuthentication yes
StrictHostKeyChecking no //change this entry to no
[root@nn01 .ssh]# ssh-keygen    press Enter at every prompt
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:Ucl8OCezw92aArY5+zPtOrJ9ol1ojRE3EAZ1mgndYQM root@nn01
The key's randomart image is:
+---[RSA 2048]----+
| o*E*=. |
| +XB+. |
| ..=Oo. |
| o.+o... |
| .S+.. o |
| + .=o |
| o+oo |
| o+=.o |
| o==O. |
+----[SHA256]-----+
[root@nn01 .ssh]# for i in {60..63} ; do ssh-copy-id 192.168.1.$i; done
//push the public key to nn01, node1, node2 and node3; .60 (nn01 itself) must be included, every host needs the key
5) Test the trust relationship: ssh login without a password
[root@nn01 .ssh]# ssh node1
Last login: Fri Sep 7 16:52:00 2018 from 192.168.1.60
[root@node1 ~]# exit
logout
Connection to node1 closed.
[root@nn01 .ssh]# ssh node2
Last login: Fri Sep 7 16:52:05 2018 from 192.168.1.60
[root@node2 ~]# exit
logout
Connection to node2 closed.
[root@nn01 .ssh]# ssh node3
[root@node3 ~]# exit
logout
Connection to node3 closed.
Step 2: HDFS fully distributed configuration files
- Environment file: /usr/local/hadoop/etc/hadoop/hadoop-env.sh
- Core file:        /usr/local/hadoop/etc/hadoop/core-site.xml
- HDFS file:        /usr/local/hadoop/etc/hadoop/hdfs-site.xml
- Slaves file:      /usr/local/hadoop/etc/hadoop/slaves
1) Edit the slaves file
[root@nn01 ~]# cd /usr/local/hadoop/etc/hadoop
[root@nn01 hadoop]# ls
capacity-scheduler.xml httpfs-env.sh mapred-env.sh
configuration.xsl httpfs-log4j.properties mapred-queues.xml.template
container-executor.cfg httpfs-signature.secret mapred-site.xml.template
==core-site.xml== httpfs-site.xml ==slaves==
hadoop-env.cmd kms-acls.xml ssl-client.xml.example
==hadoop-env.sh== kms-env.sh ssl-server.xml.example
hadoop-metrics2.properties kms-log4j.properties yarn-env.cmd
hadoop-metrics.properties kms-site.xml yarn-env.sh
hadoop-policy.xml log4j.properties ==yarn-site.xml==
==hdfs-site.xml== mapred-env.cmd
The highlighted configuration files are essentially empty and must be filled in by hand; see 7.2 above for the file format
[root@nn01 hadoop]# vim slaves    node list
node1
node2
node3
2) Hadoop's core configuration file core-site.xml
[root@nn01 hadoop]# vim core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name> ==defines the default file system==
<value>hdfs://nn01:9000</value> ==the default is file:///; change it here to hdfs://nn01:9000==
</property>
<property>
<name>hadoop.tmp.dir</name> ==defines the directory that holds Hadoop's core data==
<value>/var/hadoop</value> ==the default /tmp/hadoop-${user.name} would hold all of Hadoop's core data, but /tmp is cleared on boot, so it is normally placed under /var==
</property>
</configuration>
3) Configure the hdfs-site.xml file
[root@nn01 hadoop]# vim hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.http-address</name> ==defines the NameNode address==
<value>nn01:50070</value> ==dfs.namenode.http-address declares that the NameNode runs on the machine nn01==
</property>
<property>
<name>dfs.namenode.secondary.http-address</name> ==defines the Secondary NameNode address==
<value>nn01:50090</value> ==dfs.namenode.secondary.http-address declares that the Secondary NameNode runs on the machine nn01==
</property>
<property>
<name>dfs.replication</name> ==defines the amount of data redundancy (number of replicas)==
<value>2</value> ==2 copies==
</property>
</configuration>
4) The configuration parameters are identical on every hadoop node, so once one machine is configured, synchronize the configuration files to all the other hosts
Synchronize the configuration to node1, node2 and node3
[root@nn01 hadoop]# for i in {61..63} ; do rsync -aSH --delete /usr/local/hadoop/ 192.168.1.$i:/usr/local/hadoop/ -e 'ssh' & done
==rsync -a preserves attributes, -S handles sparse files, -H preserves hard links, --delete removes files on the target that no longer exist on the source; the trailing & runs the three copies in the background in parallel==
[2]-  Done    rsync -aSH --delete /usr/local/hadoop/ 192.168.1.$i:/usr/local/hadoop/ -e 'ssh'
[3]+  Done    rsync -aSH --delete /usr/local/hadoop/ 192.168.1.$i:/usr/local/hadoop/ -e 'ssh'
Step 3: Format
[root@nn01 hadoop]# cd /usr/local/hadoop/
[root@nn01 hadoop]# ./bin/hdfs namenode -format //format the namenode
"successfully" should appear in the output
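After formatting, the NameNode metadata (fsimage and edit files) should exist under the hadoop.tmp.dir configured above; a quick sanity check (the exact file names vary by version):
[root@nn01 hadoop]# ls /var/hadoop/dfs/name/current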
[root@nn01 hadoop]# ./sbin/start-dfs.sh //start the cluster
[root@nn01 hadoop]# jps    lists the Java programs running on this machine and their PIDs
28897 SecondaryNameNode
28706 NameNode
29006 Jps
[root@nn01 hadoop]# ssh node1
[root@node1 ~]# jps    lists the Java programs running on this machine and their PIDs
24182 DataNode
24285 Jps
[root@nn01 hadoop]# ssh node2
[root@node2 ~]# jps    lists the Java programs running on this machine and their PIDs
24208 Jps
24105 DataNode
[root@nn01 hadoop]# ssh node3
[root@node3 ~]# jps    lists the Java programs running on this machine and their PIDs
24086 DataNode
24189 Jps
[root@nn01 hadoop]# ./bin/hdfs dfsadmin -report //check whether the cluster came up successfully
Configured Capacity: 96602099712 (89.97 GB)
Present Capacity: 91949723648 (85.63 GB)
DFS Remaining: 91949711360 (85.63 GB)
DFS Used: 12288 (12 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (3): //all three datanodes registered successfully
Name: 192.168.1.61:50010 (node1)
Hostname: node1
Decommission Status : Normal
Configured Capacity: 32200699904 (29.99 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 1550790656 (1.44 GB)
DFS Remaining: 30649905152 (28.54 GB)
DFS Used%: 0.00%
DFS Remaining%: 95.18%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Oct 17 17:42:17 CST 2019
Name: 192.168.1.62:50010 (node2)
Hostname: node2
Decommission Status : Normal
Configured Capacity: 32200699904 (29.99 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 1550794752 (1.44 GB)
DFS Remaining: 30649901056 (28.54 GB)
DFS Used%: 0.00%
DFS Remaining%: 95.18%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Oct 17 17:42:17 CST 2019
Name: 192.168.1.63:50010 (node3)
Hostname: node3
Decommission Status : Normal
Configured Capacity: 32200699904 (29.99 GB)
DFS Used: 4096 (4 KB)
Non DFS Used: 1550790656 (1.44 GB)
DFS Remaining: 30649905152 (28.54 GB)
DFS Used%: 0.00%
DFS Remaining%: 95.18%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Thu Oct 17 17:42:17 CST 2019
[root@nn01 hadoop]# ls    if anything fails, check the logs directory
bin include libexec logs oo sbin xx
etc lib LICENSE.txt NOTICE.txt README.txt share
[root@nn01 hadoop]# cd logs
[root@nn01 logs]# ls
hadoop-root-namenode-nn01.log hadoop-root-secondarynamenode-nn01.out
hadoop-root-namenode-nn01.out SecurityAuth-root.audit
hadoop-root-secondarynamenode-nn01.log
First check hadoop-root-namenode-nn01.out, then hadoop-root-namenode-nn01.log; look for WARN or ERROR entries and troubleshoot from there
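A quick way to pull out only the warnings and errors (a simple sketch; substitute the log file of the daemon you are debugging):
[root@nn01 logs]# grep -E 'WARN|ERROR' hadoop-root-namenode-nn01.log | tail -20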