1. Hadoop Overview
Broad sense: the ecosystem built around the Apache Hadoop software, which also includes Hive, Sqoop, HBase, Kafka, Spark, Flink
Narrow sense: the Apache Hadoop software itself
The Apache Hadoop software:
mainstream 2.x releases
hdfs: storage
mapreduce: computation (jobs)
yarn: resource and job scheduling
On a big data platform, storage comes first; storage and computation complement each other
Official sites:
hadoop.apache.org
hive.apache.org
Apache releases:
2.x
3.x
In production, enterprises to this day still mostly run CDH 5.x (from Cloudera):
hadoop-2.6.0-cdh5.16.2.tar.gz
2. Installing Hadoop HDFS
2.1 Create the user and directories
[root@pentaKill ~]# useradd hadoop
[root@pentaKill ~]# su - hadoop
[hadoop@pentaKill ~]$ mkdir tmp sourcecode software shell log lib data app
[hadoop@pentaKill ~]$ cd software/
[hadoop@pentaKill software]$ ll
total 1266604
-rw-r--r-- 1 root root 434354462 Feb 24 14:01 hadoop-2.6.0-cdh5.16.2.tar.gz
-rw-r--r-- 1 hadoop hadoop 185646832 Feb 24 12:03 jdk-8u181-linux-x64.tar.gz
2.2 The JDK was already installed and deployed in an earlier lesson
[hadoop@pentaKill software]$ which java
/usr/java/jdk1.8.0_45/bin/java
[hadoop@pentaKill software]$
Production note: check the supported Java versions:
https://cwiki.apache.org/confluence/display/HADOOP2/HadoopJavaVersions
https://docs.cloudera.com/documentation/enterprise/release-notes/topics/rn_consolidated_pcm.html#pcm_jdk
[root@pentaKill ~]# cd /home/hadoop/software/
[root@pentaKill software]# tar -xzvf jdk-8u181-linux-x64.tar.gz -C /usr/java
Then complete the environment configuration (a sketch follows).
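A minimal sketch of that configuration, assuming the JDK was unpacked to /usr/java/jdk1.8.0_181 as above (the file /etc/profile.d/java.sh is an assumed location; any profile script works):
[root@pentaKill ~]# cat >> /etc/profile.d/java.sh <<'EOF'
export JAVA_HOME=/usr/java/jdk1.8.0_181
export PATH=$JAVA_HOME/bin:$PATH
EOF
[root@pentaKill ~]# source /etc/profile.d/java.sh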
2.3 Unpack Hadoop and create a symlink
[hadoop@pentaKill software]$ tar -xzvf hadoop-2.6.0-cdh5.16.2.tar.gz -C ../app/
[hadoop@pentaKill app]$
[hadoop@pentaKill app]$ ll
total 4
drwxr-xr-x 14 hadoop hadoop 4096 Jun 3 2019 hadoop-2.6.0-cdh5.16.2
[hadoop@pentaKill app]$ ln -s hadoop-2.6.0-cdh5.16.2 hadoop
[hadoop@pentaKill app]$ ll
total 4
lrwxrwxrwx 1 hadoop hadoop 22 May 6 22:05 hadoop -> hadoop-2.6.0-cdh5.16.2
drwxr-xr-x 14 hadoop hadoop 4096 Jun 3 2019 hadoop-2.6.0-cdh5.16.2
[hadoop@pentaKill app]$
2.4 Why the symlink
1. Version switching
/home/hadoop/app/hadoop
/home/hadoop/app/hadoop-2.6.0-cdh5.16.2
To upgrade (say 2.x -> 3.x) without it, every piece of code and every script would have to be carefully checked and modified;
with the symlink set up in advance, code and scripts reference "hadoop" and never care which version sits behind it (see the sketch below).
2. Swapping a small disk for a big one
Say the / root disk was sized small, 20G, and the /app/log/hadoop-hdfs folder has grown to 18G, while /data01 is a big disk:
mv /app/log/hadoop-hdfs /data01/        ==> /data01/hadoop-hdfs
ln -s /data01/hadoop-hdfs /app/log/hadoop-hdfs
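A sketch of the version-switch case (the hadoop-3.3.6 directory name is hypothetical, standing in for whatever release gets unpacked next):
[hadoop@pentaKill app]$ rm hadoop                      # removes only the symlink, not the release directory
[hadoop@pentaKill app]$ ln -s hadoop-3.3.6 hadoop      # repoint; every script that uses ~/app/hadoop follows along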
[hadoop@pentaKill app]$ cd hadoop/etc/hadoop
[hadoop@pentaKill hadoop]$ vi hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_181
(set JAVA_HOME explicitly here rather than relying on the inherited environment)
Local (Standalone) Mode
Pseudo-Distributed Mode (used in this walkthrough)
Fully-Distributed Mode (a real cluster)
2.5 Configure passwordless SSH to pentaKill
[hadoop@pentaKill ~]$ rm -rf .ssh
[hadoop@pentaKill ~]$
[hadoop@pentaKill ~]$
[hadoop@pentaKill ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:fhAts9iahMuFy0r/djKCcAO7m8vPm5lf2ExdkWUqIdw hadoop@pentaKill
The key's randomart image is:
+---[RSA 2048]----+
| .... .oo |
| ..E..+ |
| +..o |
|. o o.=o |
| o o +.S. |
|o oo ==+ . |
| +.o=.o+. . |
|oooo= = .. |
|++oB+=.+ |
+----[SHA256]-----+
[hadoop@pentaKill ~]$ cd .ssh
[hadoop@pentaKill .ssh]$
[hadoop@pentaKill .ssh]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hadoop@pentaKill .ssh]$ chmod 0600 ~/.ssh/authorized_keys
[hadoop@pentaKill .ssh]$
[hadoop@pentaKill .ssh]$ ssh pentaKill date
The authenticity of host 'pentaKill (192.168.0.3)' can't be established.
ECDSA key fingerprint is SHA256:OLqoaMxlGFbCq4sC9pYgF+FdbcXHbEbtSrnMiGGFbVw.
ECDSA key fingerprint is MD5:d3:5b:4a:ef:8e:00:41:a0:5e:80:ef:75:76:8a:a3:49.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'pentaKill,192.168.0.3' (ECDSA) to the list of known hosts.
Wed May 6 22:26:57 CST 2020
[hadoop@pentaKill .ssh]$
[hadoop@pentaKill .ssh]$
[hadoop@pentaKill .ssh]$ ssh pentaKill date
Wed May 6 22:27:07 CST 2020
[hadoop@pentaKill .ssh]$
2.6 Edit the configuration so that all three HDFS daemons start under the hostname pentaKill
Make the NN start on pentaKill:
etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://pentaKill:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp/</value>
</property>
</configuration>
Make the SNN start on pentaKill:
etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>pentaKill:9868</value>
</property>
<property>
<name>dfs.namenode.secondary.https-address</name>
<value>pentaKill:9869</value>
</property>
</configuration>
Make the DN start on pentaKill (the slaves file lists the DN hosts):
[hadoop@pentaKill hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@pentaKill hadoop]$ vi slaves
pentaKill
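A quick sanity check that these settings are being picked up (a sketch using the stock getconf tool, run from /home/hadoop/app/hadoop):
[hadoop@pentaKill hadoop]$ bin/hdfs getconf -confKey fs.defaultFS
hdfs://pentaKill:9000
[hadoop@pentaKill hadoop]$ bin/hdfs getconf -confKey dfs.namenode.secondary.http-address
pentaKill:9868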
2.7 Format the NameNode: needed only the first time; it lays down HDFS's own metadata storage format
[hadoop@pentaKill hadoop]$ pwd
/home/hadoop/app/hadoop
[hadoop@pentaKill hadoop]$ bin/hdfs namenode -format
2.8 Start HDFS
[hadoop@pentaKill hadoop]$ sbin/start-dfs.sh
20/05/06 22:43:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [pentaKill]
pentaKill: starting namenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-namenode-pentaKill.out
pentaKill: starting datanode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-datanode-pentaKill.out
Starting secondary namenodes [pentaKill]
pentaKill: starting secondarynamenode, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-hadoop-secondarynamenode-pentaKill.out
20/05/06 22:43:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@pentaKill hadoop]$ jps
21712 DataNode           dn: stores the data (the underling)
21585 NameNode           nn: decides where data gets stored (the boss)
21871 SecondaryNameNode  snn: the eternal number two; by default it backs up the boss's metadata at an hourly granularity
21999 Jps
[hadoop@pentaKill hadoop]$
2.9 Open the web UI:
http://114.67.101.143:50070/dfshealth.html#tab-overview
2.10 Create directories
[hadoop@pentaKill hadoop]$ bin/hdfs dfs -mkdir /user
20/05/06 23:03:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@pentaKill hadoop]$
[hadoop@pentaKill hadoop]$
[hadoop@pentaKill hadoop]$ bin/hdfs dfs -ls /
20/05/06 23:03:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2020-05-06 23:03 /user
[hadoop@pentaKill hadoop]$
[hadoop@pentaKill hadoop]$
[hadoop@pentaKill hadoop]$ ls /
bin dev home lib64 meta.js opt root sbin sys usr
boot etc lib media mnt proc run srv tmp var
[hadoop@pentaKill hadoop]$
[hadoop@pentaKill hadoop]$
[hadoop@pentaKill hadoop]$
[hadoop@pentaKill hadoop]$ bin/hdfs dfs -mkdir /user/hadoop
2.11 Upload from Linux to HDFS
[hadoop@pentaKill hadoop]$ bin/hdfs dfs -mkdir /wordcount
[hadoop@pentaKill hadoop]$ bin/hdfs dfs -mkdir /wordcount/input
[hadoop@pentaKill hadoop]$ bin/hdfs dfs -put etc/hadoop/*.xml /wordcount/input/
2.12 Run a computation
bin/hadoop jar \
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
grep /wordcount/input /wordcount/output 'dfs[a-z.]+'
2.13 Download from HDFS back to Linux
[hadoop@pentaKill hadoop]$ bin/hdfs dfs -get /wordcount/output output
20/05/06 23:13:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@pentaKill hadoop]$ cat output/
cat: output/: Is a directory
[hadoop@pentaKill hadoop]$ cd output/
[hadoop@pentaKill output]$ ls
part-r-00000 _SUCCESS
[hadoop@pentaKill output]$ ll
total 4
-rw-r--r-- 1 hadoop hadoop 90 May 6 23:13 part-r-00000
-rw-r--r-- 1 hadoop hadoop 0 May 6 23:13 _SUCCESS
[hadoop@pentaKill output]$ cat part-r-00000
1 dfsadmin
1 dfs.replication
1 dfs.namenode.secondary.https
1 dfs.namenode.secondary.http
[hadoop@pentaKill output]$
3. YARN deployment
3.1 Configure and start YARN
https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html
etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>pentaKill:7776</value>
</property>
</configuration>
Ports range from 1 to 65535. Leaving the default web port 8088 exposed to the public internet is dangerous: it is a well-known target, and machines get infected with malware and cryptomining programs (computing bitcoin on your cluster); hence the non-default port 7776 above.
Start ResourceManager daemon and NodeManager daemon:
[hadoop@pentaKill hadoop]$ pwd
/home/hadoop/app/hadoop/etc/hadoop
[hadoop@pentaKill hadoop]$ cd ../../
[hadoop@pentaKill hadoop]$ sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/yarn-hadoop-resourcemanager-pentaKill.out
pentaKill: starting nodemanager, logging to /home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs/yarn-hadoop-nodemanager-pentaKill.out
[hadoop@pentaKill hadoop]$
open web: http://114.67.101.143:7776/cluster
3.2 Word count
[hadoop@pentaKill data]$ hdfs dfs -mkdir /wordcount2
20/05/09 21:33:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@pentaKill data]$
[hadoop@pentaKill data]$
[hadoop@pentaKill data]$ hdfs dfs -mkdir /wordcount2/input/
20/05/09 21:33:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@pentaKill data]$ hdfs dfs -put * /wordcount2/input/
20/05/09 21:34:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@pentaKill data]$ hdfs dfs -ls /wordcount2/input/
20/05/09 21:34:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hadoop supergroup 35 2020-05-09 21:34 /wordcount2/input/1.log
-rw-r--r-- 1 hadoop supergroup 29 2020-05-09 21:34 /wordcount2/input/2.log
[hadoop@pentaKill data]$
[hadoop@pentaKill hadoop]$ hadoop jar \
./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount /wordcount2/input/ /wordcount2/output1/
20/05/09 21:38:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/05/09 21:38:08 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/05/09 21:38:09 INFO input.FileInputFormat: Total input paths to process : 2
20/05/09 21:38:10 INFO mapreduce.JobSubmitter: number of splits:2
Question to ponder: what determines the number of splits in MapReduce (= the number of map tasks)? Here: 2 input files, each smaller than one block, hence 2 splits.
20/05/09 21:38:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1589030910472_0001
20/05/09 21:38:10 INFO impl.YarnClientImpl: Submitted application application_1589030910472_0001
20/05/09 21:38:10 INFO mapreduce.Job: The url to track the job: http://pentaKill:7776/proxy/application_1589030910472_0001/
20/05/09 21:38:10 INFO mapreduce.Job: Running job: job_1589030910472_0001
20/05/09 21:38:17 INFO mapreduce.Job: Job job_1589030910472_0001 running in uber mode : false
20/05/09 21:38:17 INFO mapreduce.Job: map 0% reduce 0%
20/05/09 21:38:22 INFO mapreduce.Job: map 50% reduce 0%
20/05/09 21:38:23 INFO mapreduce.Job: map 100% reduce 0%
20/05/09 21:38:28 INFO mapreduce.Job: map 100% reduce 100%
20/05/09 21:38:28 INFO mapreduce.Job: Job job_1589030910472_0001 completed successfully
20/05/09 21:38:28 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=116
FILE: Number of bytes written=429291
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=288
HDFS: Number of bytes written=41
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=6455
Total time spent by all reduces in occupied slots (ms)=2812
Total time spent by all map tasks (ms)=6455
Total time spent by all reduce tasks (ms)=2812
Total vcore-milliseconds taken by all map tasks=6455
Total vcore-milliseconds taken by all reduce tasks=2812
Total megabyte-milliseconds taken by all map tasks=6609920
Total megabyte-milliseconds taken by all reduce tasks=2879488
Map-Reduce Framework
Map input records=7
Map output records=16
Map output bytes=127
Map output materialized bytes=122
Input split bytes=224
Combine input records=16
Combine output records=11
Reduce input groups=7
Reduce shuffle bytes=122
Reduce input records=11
Reduce output records=7
Spilled Records=22
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=170
CPU time spent (ms)=1350
Physical memory (bytes) snapshot=781856768
Virtual memory (bytes) snapshot=8315179008
Total committed heap usage (bytes)=770179072
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=64
File Output Format Counters
Bytes Written=41
[hadoop@pentaKill hadoop]$
[hadoop@pentaKill hadoop]$ hdfs dfs -ls /wordcount2/output1/
20/05/09 21:41:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2020-05-09 21:38 /wordcount2/output1/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 41 2020-05-09 21:38 /wordcount2/output1/part-r-00000
[hadoop@pentaKill hadoop]$
When the MapReduce job finishes, the number of output files equals the number of reduce tasks. What decides that number? Here it is 1, hence a single output file (see the sketch after the listing below):
part-r-00000
[hadoop@pentaKill hadoop]$ cat part-r-00000
a 4
ab 1
b 1
c 3
data 1
jepson 3
ruoze 3
[hadoop@pentaKill hadoop]$
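The reduce count defaults to 1 but can be set per job. The bundled examples accept Hadoop's generic options, so a sketch (the output path /wordcount2/output2/ is hypothetical):
[hadoop@pentaKill hadoop]$ hadoop jar \
./share/hadoop/mapreduce2/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
wordcount -D mapreduce.job.reduces=2 /wordcount2/input/ /wordcount2/output2/
# two reducers -> two output files: part-r-00000 and part-r-00001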
3.3 The jps command
3.3.1 Location
[hadoop@pentaKill hadoop]$ which jps
/usr/java/jdk1.8.0_181/bin/jps
3.3.2 Usage
[hadoop@pentaKill hadoop]$ jps
21712 DataNode
21585 NameNode
23989 ResourceManager
29877 Jps
24094 NodeManager
21871 SecondaryNameNode
3.3.3 Where the corresponding per-process marker files are stored
/tmp/hsperfdata_hadoop (that is, /tmp/hsperfdata_<username>)
[hadoop@pentaKill tmp]$ cd hsperfdata_hadoop
[hadoop@pentaKill hsperfdata_hadoop]$ ll
total 160
-rw------- 1 hadoop hadoop 32768 May 9 22:01 21585
-rw------- 1 hadoop hadoop 32768 May 9 22:01 21712
-rw------- 1 hadoop hadoop 32768 May 9 22:01 21871
-rw------- 1 hadoop hadoop 32768 May 9 22:01 23989
-rw------- 1 hadoop hadoop 32768 May 9 22:01 24094
[hadoop@pentaKill hsperfdata_hadoop]$ mv 21712 21712.bak
[hadoop@pentaKill hsperfdata_hadoop]$
[hadoop@pentaKill hsperfdata_hadoop]$ ps -ef|grep DataNode
hadoop 21712 1 0 May06 ? 00:03:54 /usr/java/jdk1.8.0_181/bin/java -Dproc_datanode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs -Dhadoop.log.file=hadoop-hadoop-datanode-pentaKill.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
hadoop 30469 30251 0 22:03 pts/1 00:00:00 grep --color=auto DataNode
[hadoop@pentaKill hsperfdata_hadoop]$
[hadoop@pentaKill hadoop]$ jps
21585 NameNode
23989 ResourceManager
30358 Jps
24094 NodeManager
21871 SecondaryNameNode
[hadoop@pentaKill hadoop]$
3.3.4 When jps shows "process information unavailable"
That output alone cannot tell you whether the process exists or not, so be careful, especially if you rely on jps for health checks.
ps -ef|grep xxx is the authoritative way to check whether a process is really running!!!
[root@pentaKill ~]# jps
21712 -- process information unavailable
31713 Jps
23989 -- process information unavailable
24094 NodeManager
[root@pentaKill ~]# ps -ef|grep 23989
root 31791 31644 0 22:11 pts/2 00:00:00 grep --color=auto 23989
[root@pentaKill ~]# ps -ef|grep 21712
hadoop 21712 1 0 May06 ? 00:03:54 /usr/java/jdk1.8.0_181/bin/java -Dproc_datanode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,console -Djava.library.path=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/logs -Dhadoop.log.file=hadoop-hadoop-datanode-pentaKill.log -Dhadoop.home.dir=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2 -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/app/hadoop-2.6.0-cdh5.16.2/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
root 31812 31644 0 22:11 pts/2 00:00:00 grep --color=auto 21712
[root@pentaKill ~]#
3.3.5 Deleting these files does NOT affect restarting the process!!
3.4 pid files
3.4.1 Stored in /tmp by default
[root@pentaKill tmp]# ll
total 20
-rw-rw-r-- 1 hadoop hadoop 4 May 9 22:16 hadoop-hadoop-datanode.pid
-rw-rw-r-- 1 hadoop hadoop 4 May 9 22:16 hadoop-hadoop-namenode.pid
-rw-rw-r-- 1 hadoop hadoop 5 May 9 22:16 hadoop-hadoop-secondarynamenode.pid
3.4.2 They record each daemon's pid; the location is effectively hard-coded
[root@pentaKill tmp]# cat hadoop-hadoop-namenode.pid
792
[root@pentaKill tmp]#
3.4.3 Deleting these files DOES affect restarting the daemons!!
On startup, a daemon writes its pid number into its pid file;
on shutdown, the stop script reads the pid number back from the file and kills that pid.
3.4.4 In production, can pid files really be left to live in /tmp?
No: Linux cleans /tmp periodically (a 30-day deletion policy by default on many systems).
3.4.5 How to change it
In the hadoop-env.sh script (note: no space after the =):
export HADOOP_PID_DIR=/var/hadoop/pids
Summary: in production, do not leave the pid files in /tmp;
know that losing them affects stopping and starting the daemons (a setup sketch follows).
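A one-time setup sketch for the new pid directory (the path matches hadoop-env.sh above; ownership must match the user running the daemons):
[root@pentaKill ~]# mkdir -p /var/hadoop/pids
[root@pentaKill ~]# chown -R hadoop:hadoop /var/hadoop/pids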
3.5 Blocks
Analogy: a vat holding 260ml of water, bottled into 128ml bottles:
260/128 = 2 remainder 4ml
128ml
128ml
4ml
HDFS profits from storing large files and hurts itself with small ones.
It is suited to large files and unsuited to small files,
which does not mean it cannot store small files.
Now take a 260m file:
on upload to hdfs, the file is cut into blocks of dfs.blocksize 134217728 = 128M:
128m
128m
4m
= 3 blocks
Pseudo-distributed, 1 node: replication factor dfs.replication 1.
A production HDFS cluster definitely has >= 3 DNs, with dfs.replication = 3, e.g.:
        DN1   DN2   DN5   DN10
128m     b1    b1    b1
128m     b2    b2    b2
4m       b3    b3    b3
(each block keeps 3 replicas spread across the DNs)
Setting a replication factor is what gives files stored on the big data HDFS platform their fault tolerance.
Why did the block size go from 64M to 128M?
Take the same 260m file:
260/64 = 4 remainder 4M
b1 64m
b2 64m
b3 64m
b4 64m
b5 4m
= 5 blocks
260/128 = 2 remainder 4M
128m
128m
4m
= 3 blocks
With replication 3, the actual storage used = file size * 3 = 260 * 3 = 780m either way,
but maintaining metadata for 5 blocks is more work than maintaining it for 3,
and that heavier, more tiring maintenance falls on the namenode.
This is why the nn dislikes small files:
100 million 10kb files -> 300 million blocks;
merge those 100 million 10kb files into 10 million ~100m files -> 30 million blocks.
How to avoid small files:
1. before the data reaches hdfs, merge it;
2. for data already on hdfs, schedule merges of cold files during the business lull.
Example with daily partitions:
/2020/10/10
/2020/10/11
/2020/10/12   <- current day
On the 20th, merge the files of the 14th; on the 21st, the files of the 15th;
keep trailing by a fixed window, sliding one day at a time.
Rules of thumb:
files under 10m count as small files
10m x 12 = 120m
10m x 13 = 130m
merge toward ~120m per file, not 128m: once a merge actually lands over (say 129m) it becomes 2 blocks, 128m + 1m
(a merge sketch follows)
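A sketch of such an off-peak merge for one cold day-partition, assuming plain-text files (getmerge concatenates everything in the directory; the paths are the example partitions above):
[hadoop@pentaKill hadoop]$ hdfs dfs -getmerge /2020/10/14 /tmp/2020-10-14.merged
[hadoop@pentaKill hadoop]$ hdfs dfs -mkdir -p /2020/10/14_merged
[hadoop@pentaKill hadoop]$ hdfs dfs -put /tmp/2020-10-14.merged /2020/10/14_merged/
# after validating the merged file, remove the original small files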
4. HDFS architecture and internals
4.1 HDFS roles
Master/slave architecture.
Roles:
4.1.1 namenode, the name node (nn)
Stores:
a. file names
b. the directory structure of each file
c. file attributes: permissions, creation time, replication factor
[hadoop@pentaKill ~]$ hdfs dfs -ls /wordcount2/input/
20/05/10 20:12:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 hadoop supergroup 35 2020-05-09 21:34 /wordcount2/input/1.log
-rw-r--r-- 1 hadoop supergroup 29 2020-05-09 21:34 /wordcount2/input/2.log
[hadoop@pentaKill ~]$
d. which data blocks each file is cut into, times the replication factor,
e.g. 3 blocks x 3 replicas = 9 block replicas ==> which nodes each of them lives on:
the blockmap (block mapping). Of course the nn node does not persist this mapping to disk;
it is maintained dynamically in memory: while the cluster starts up and runs, the dn periodically sends a blockreport to the nn.
Purpose:
manages the filesystem namespace; in effect it maintains the filesystem tree of files and directories.
These are persisted permanently on local disk in two kinds of files:
the image file, fsimage
the edit log files, editlogs
[hadoop@pentaKill current]$ pwd
/home/hadoop/tmp/dfs/name/current
[hadoop@pentaKill current]$ ll
edits_0000000000000000375-0000000000000000376
edits_0000000000000000377-0000000000000000378
edits_0000000000000000379-0000000000000000380
edits_0000000000000000381-0000000000000000382
edits_0000000000000000383-0000000000000000384
edits_0000000000000000385-0000000000000000386
edits_inprogress_0000000000000000387
fsimage_0000000000000000384
fsimage_0000000000000000384.md5
fsimage_0000000000000000386
fsimage_0000000000000000386.md5
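These files can be inspected offline with the stock viewers (a sketch; the transaction IDs come from the listing above):
[hadoop@pentaKill current]$ hdfs oiv -p XML -i fsimage_0000000000000000386 -o /tmp/fsimage.xml
[hadoop@pentaKill current]$ hdfs oev -p XML -i edits_0000000000000000385-0000000000000000386 -o /tmp/edits.xml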
4.1.2 secondary namenode, the second name node (snn)
a. periodically pulls over the fsimage and editlog, merges them (a backup), and pushes the result back
[hadoop@pentaKill current]$ pwd
/home/hadoop/tmp/dfs/namesecondary/current
[hadoop@pentaKill current]$
edits_0000000000000000377-0000000000000000378
edits_0000000000000000379-0000000000000000380
edits_0000000000000000381-0000000000000000382
edits_0000000000000000383-0000000000000000384
edits_0000000000000000385-0000000000000000386
fsimage_0000000000000000384
fsimage_0000000000000000384.md5
fsimage_0000000000000000386
fsimage_0000000000000000386.md5
The snn takes the boss's edits_0000000000000000385-0000000000000000386 and, in a checkpoint operation, merges it into
fsimage_0000000000000000386, then pushes that 386 file back to the boss.
New write/read records then go to edits_inprogress_0000000000000000387, a live log file that keeps growing by appends.
dfs.namenode.checkpoint.period 3600s
dfs.namenode.checkpoint.txns 1000000
The snn exists to soften the single point of failure: only the nn serves clients, so a second process was added to keep hourly backups.
That reduces the risk of loss when the nn fails, but production still does not rely on the snn:
11:00 the snn takes its backup
11:30 writes have continued up to here; suddenly the nn node's disk fails beyond recovery
Restoring from the snn's newest fsimage only brings back the 11:00 state; the last half hour is lost.
In production, instead of the snn, another NN process is run (a real-time backup, ready at any moment to replace the nn and become active):
this is called HDFS HA.
4.1.3 datanode, the data node (dn)
Stores the data blocks and the blocks' checksums.
Duties:
a. every 3s, send a heartbeat to the nn saying "I am still alive"
dfs.heartbeat.interval 3s
b. at a fixed interval, send a blockreport
dfs.blockreport.intervalMsec 21600000ms = 6 hours
dfs.datanode.directoryscan.interval 21600s = 6 hours
[hadoop@pentaKill subdir0]$ pwd
/home/hadoop/tmp/dfs/data/current/BP-672629417-192.168.0.3-1588775739687/current/finalized/subdir0/subdir0
[hadoop@pentaKill subdir0]$
// For corrupt-block recovery in production, automatic repair is generally used.
Further reading: https://pentaKill.github.io/2019/06/06/%E7%94%9F%E4%BA%A7HDFS%20Block%E6%8D%9F%E5%9D%8F%E6%81%A2%E5%A4%8D%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5(%E5%90%AB%E6%80%9D%E8%80%83%E9%A2%98)/
(The HDFS replica placement policy is covered in 4.4 below.)
4.2 HDFS write flow
[hadoop@pentaKill ~]$ echo 'pentaKill' >1.log
[hadoop@pentaKill ~]$
[hadoop@pentaKill ~]$ hdfs dfs -put 1.log /
20/05/10 21:19:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@pentaKill ~]$
The whole flow is invisible to the user.
1. The HDFS client calls FileSystem.create(filePath) and communicates with the NN via [RPC]!
The NN checks whether a file already exists at that path and whether the caller has permission to create it!
If everything is ok, it creates a new file entry; at this point no data has been written, so no blocks are associated with it.
From the size of the file being uploaded, the block-size parameter and the replication-factor parameter,
the nn computes how many blocks to upload and on which DNs the blocks will be stored, and finally returns this information to the client
as an [FSDataOutputStream] object.
2. The client calls the [FSDataOutputStream] object's write method:
the first replica of the first block is written to the first DN node;
when that is written, the second replica is written to DN2; when that is written, the third replica is written to DN3.
When the third replica finishes, DN3 returns an ack packet (confirmation) to DN2;
when DN2 receives that ack packet and has itself also finished writing, it returns an ack packet to
the first node, DN1; when DN1 receives that ack packet and has itself also finished writing,
it returns the ack packet to the [FSDataOutputStream] object, which marks all three replicas of the first block as written.
The remaining blocks proceed the same way!
3. When all blocks are fully written, the client calls the [FSDataOutputStream] object's close method
to close the output stream, then calls FileSystem.complete to tell the nn the file was written successfully!
Pseudo-distributed: 1 dn, so the replication parameter must be 1;
if the dn dies, writes are definitely impossible.
Production cluster with exactly 3 dn and replication parameter 3:
if a dn dies, writes are definitely impossible.
Production cluster with more than 3 dn and replication parameter 3:
if a dn dies, writes still work.
Rule: a write succeeds as long as the number of alive dn nodes >= the replication factor (a sketch follows).
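A small sketch for poking at replication from the shell (setrep changes an existing file's replication factor; on this single-DN setup, raising it above 1 would just leave the file under-replicated):
[hadoop@pentaKill ~]$ hdfs dfs -setrep 1 /1.log
[hadoop@pentaKill ~]$ hdfs fsck /1.log -files -blocks
# fsck reports the block count, replication, and any under-replicated blocks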
4.3 HDFS read flow
[hadoop@pentaKill ~]$ hdfs dfs -cat /1.log
20/05/10 21:53:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
pentaKill
[hadoop@pentaKill ~]$
The whole flow is invisible to the user.
1. The client calls FileSystem's open(filePath),
communicates with the NN over RPC, and receives some or all of the file's block list,
returned as an [FSDataInputStream] object.
2. The client calls the [FSDataInputStream] object's read method:
it reads the first block from the nearest DN; when done, the data is checked, and if ok the DN connection is closed.
If a read fails, the DN+block is recorded so that node is not read from next time; the block is read from the second node instead.
Then the second block is read from its nearest DN, and so on.
If the block list is exhausted but the file is not finished, FileSystem is called again to fetch the next batch of blocks from the nn.
3. The client calls the [FSDataInputStream] object's close method, closing the input stream.
4.4 HDFS replica placement policy: not just an interview topic, production needs it too
Racks: e.g. rack1 with 5 nodes, rack2 with 5 nodes.
For production reads and writes, prefer issuing them from a DN node.
First replica:
placed on the DN node doing the upload;
if the client is not on a DN node, a node is picked at random whose disk is not too full and whose cpu is not too busy.
Second replica:
placed on some DN node in a different rack from the first replica.
Third replica:
on a different node in the same rack as the second replica.
If the replication factor is set higher, the extra replicas are placed randomly (a sketch for inspecting actual placement follows).
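To see where a file's replicas actually landed, fsck prints the block locations (a sketch, reusing the /1.log file from earlier):
[hadoop@pentaKill ~]$ hdfs fsck /1.log -files -blocks -locations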
4.5 The SNN checkpoint flow (as walked through in 4.1.2)
4.6 HDFS commands
[hadoop@pentaKill hadoop]$ ll
total 132
drwxr-xr-x 2 hadoop hadoop 128 Jun 3 2019 bin        executable scripts (the commands)
drwxr-xr-x 2 hadoop hadoop 4096 Jun 3 2019 bin-mapreduce1
drwxr-xr-x 3 hadoop hadoop 4096 Jun 3 2019 cloudera
drwxr-xr-x 6 hadoop hadoop 105 Jun 3 2019 etc        configuration directory (conf/config)
drwxr-xr-x 5 hadoop hadoop 40 Jun 3 2019 examples
drwxr-xr-x 3 hadoop hadoop 27 Jun 3 2019 examples-mapreduce1
drwxr-xr-x 2 hadoop hadoop 101 Jun 3 2019 include
drwxr-xr-x 3 hadoop hadoop 19 Jun 3 2019 lib
drwxr-xr-x 3 hadoop hadoop 4096 Jun 3 2019 libexec
-rw-r--r-- 1 hadoop hadoop 85063 Jun 3 2019 LICENSE.txt
drwxrwxr-x 3 hadoop hadoop 4096 May 9 22:39 logs
-rw-r--r-- 1 hadoop hadoop 14978 Jun 3 2019 NOTICE.txt
drwxrwxr-x 2 hadoop hadoop 40 May 6 23:13 output
-rw-r--r-- 1 hadoop hadoop 41 May 9 21:43 part-r-00000
-rw-r--r-- 1 hadoop hadoop 1366 Jun 3 2019 README.txt
drwxr-xr-x 3 hadoop hadoop 4096 Jun 3 2019 sbin       start / stop / restart scripts
drwxr-xr-x 4 hadoop hadoop 29 Jun 3 2019 share
drwxr-xr-x 18 hadoop hadoop 4096 Jun 3 2019 src
[hadoop@pentaKill hadoop]$
4.6.1 Basic HDFS commands
Listing (hadoop fs and hdfs dfs are interchangeable; both launcher scripts dispatch to the same FsShell class):
hadoop fs -ls /
    if [ "$COMMAND" = "fs" ] ; then
      CLASS=org.apache.hadoop.fs.FsShell
hdfs dfs -ls /
    elif [ "$COMMAND" = "dfs" ] ; then
      CLASS=org.apache.hadoop.fs.FsShell
Upload / download:
-get -put
-copyToLocal -copyFromLocal
Create directories:
-mkdir
Move / copy:
-mv
-cp
Delete:
-rm [-f] [-r|-R] [-skipTrash] <src> ...
4.6.2 Enabling the HDFS trash
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
10080 minutes = 7 days
In production: 1. enable the trash! And set its retention to at least 7 days!!
2. Treat -skipTrash with extreme caution; just don't add it!!!
On CDH/HDP, go and check whether the trash is enabled and what the retention period is.
[hadoop@pentaKill hadoop]$ hdfs dfs -rm /1.log
20/05/12 21:45:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/05/12 21:45:27 INFO fs.TrashPolicyDefault: Moved: 'hdfs://pentaKill:9000/1.log' to trash at: hdfs://pentaKill:9000/user/hadoop/.Trash/Current/1.log
[hadoop@pentaKill hadoop]$ hdfs dfs -rm -skipTrash /2.log
20/05/12 21:46:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleted /2.log
[hadoop@pentaKill hadoop]$
Permissions:
chmod
chown
Checking native library support (this is where the recurring NativeCodeLoader warning comes from: no native libraries are available, so the built-in Java classes are used):
[hadoop@pentaKill hadoop]$ hadoop checknative
20/05/12 21:53:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Native library checking:
hadoop: false
zlib: false
snappy: false
lz4: false
bzip2: false
openssl: false
20/05/12 21:53:26 INFO util.ExitUtil: Exiting with status 1
[hadoop@pentaKill hadoop]$
4.6.3 HDFS safe mode and its uses
[-safemode <enter | leave | get | wait>]
In safe mode, writes are rejected but reads still work.
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave
Error message: Name node is in safe mode.
When does safe mode come up:
1. An hdfs fault: check the nn log, see whether the error can be resolved, and try manually leaving safe mode first.
2. Deliberate business/maintenance scenarios: enter it yourself to block writes (a sketch follows).
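A sketch of using safe mode in scripts: wait blocks until the NN leaves safe mode, which is handy before automated writes:
[hadoop@pentaKill hadoop]$ hdfs dfsadmin -safemode get
[hadoop@pentaKill hadoop]$ hdfs dfsadmin -safemode wait && echo "out of safe mode, ok to write"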
4.6.4 DN data balancing
1. Balancing data across the DN nodes
Start balancer daemon.
"$HADOOP_PREFIX"/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script "$bin"/hdfs start balancer $@
[hadoop@pentaKill sbin]$ ./start-balancer.sh
The default threshold is 10.0:
[hadoop@pentaKill sbin]$ ./start-balancer.sh -threshold 10.0
meaning |each node's disk usage - the cluster-average disk usage| < 10%
e.g. three nodes at 90%, 60%, 80% usage: average = 230/3 ~ 76.7%
before balancing:  90 - 76.7 = +13.3 (exceeds 10%)    after: 76 - 76 = 0
                   60 - 76.7 = -16.7 (exceeds 10%)    after: 78 - 76 = 2
                   80 - 76.7 = +3.3  (within 10%)     after: 76 - 76 = 0
In production: from day one, run ./start-balancer.sh -threshold 10.0
during the business lull, e.g. in the small hours,
scheduled daily (a cron sketch follows).
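A hypothetical crontab entry for the hadoop user (paths follow the layout above; the log location is an assumption):
0 1 * * * /home/hadoop/app/hadoop/sbin/start-balancer.sh -threshold 10.0 >> /home/hadoop/log/balancer.log 2>&1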
Also adjust the network bandwidth the balancer may use, in the hdfs-site.xml file:
dfs.datanode.balance.bandwidthPerSec, e.g. raised from 10m to 50m (the value is given in bytes per second).
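The same knob can be turned at runtime without editing configs or restarting (a sketch; 52428800 bytes/s = 50m):
[hadoop@pentaKill sbin]$ hdfs dfsadmin -setBalancerBandwidth 52428800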
2. Balancing data across the multiple disks of a single DN
a. Plan before go-live: say this DN machine has 10 disks of 2T each, no raid ==> 20T;
then configure all the disks:
<property>
<name>dfs.datanode.data.dir</name>
<value>/data01,/data02,/data03</value>
</property>
Why use multiple physical disks?
Because the IO of multiple disks stacks up:
at 30m per second per disk,
what a single disk reads in 3s, three disks read in 1s.
Best price/performance at the time: 2.5-inch 10K 2T drives;
plan the storage capacity for 2 years of growth.
b. Month 1: one 500G disk, with 480G already used.
Month 2: a 2T disk is added:
move the 480G over to the new disk, /data02, and leave a symlink at the old /data01 location pointing to it;
/data01 is then empty.
c. Month 1: one 500G disk, with 480G already used;
month 2: a 500G disk is added.
How do you balance the data across the disks?
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2/
hadoop-2.6.0-cdh5.16.2: has dfs.disk.balancer.enabled
apache hadoop 3.x: has dfs.disk.balancer.enabled
apache hadoop 2.10: not to be found
On vanilla Apache builds only the 3.x line supports this, but most people's companies run 2.x,
so there this feature simply cannot be used!!!
<property>
<name>dfs.disk.balancer.enabled</name>
<value>true</value>
</property>
20/05/12 23:25:48 INFO command.Command: No plan generated.
DiskBalancing not needed for node: pentaKill threshold used: 10.0
hdfs diskbalancer -plan pentaKill
    generates the pentaKill.plan.json file
hdfs diskbalancer -execute pentaKill.plan.json
    executes the plan
hdfs diskbalancer -query pentaKill
    queries the progress