hadoop_hdfs06-hdfs2.x
Note: personal study notes.
(1) Copying data between clusters
- scp performs a remote copy between two hosts (assuming ssh is already configured):
scp -r root@hadoop102:/user/user02/inputs/xiaoming.txt root@hadoop103:/user/user02/inputs
- distcp performs a recursive data copy between two Hadoop clusters:
hadoop distcp hdfs://hadoop102:9000/user/user02/inputs/xiaoming.txt hdfs://hadoop103:9000/user/user02/inputs/xiaoming.txt
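Conceptually, distcp is a distributed recursive copy that by default skips files already present at the destination with a matching length. A minimal single-machine sketch of that skip-if-unchanged behavior (function name and the size-only comparison are simplifications for illustration; the real tool runs as a MapReduce job and can also compare checksums):

```python
import os
import shutil

def dist_cp_like(src_root, dst_root):
    """Recursively copy src_root into dst_root, skipping files whose
    size already matches -- a rough sketch of distcp's default
    skip-if-unchanged behavior (single-machine, no parallelism)."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = dst_root if rel == "." else os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            # Skip files that already exist at the destination with the same length.
            if os.path.exists(dst) and os.path.getsize(dst) == os.path.getsize(src):
                continue
            shutil.copy2(src, dst)
            copied.append(dst)
    return copied
```

Running it a second time over the same trees copies nothing, which is why distcp is convenient for incremental re-syncs.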
(2) Archiving small files
- Drawback of storing small files in HDFS
Every file is stored in blocks, and the metadata for each block is kept in NameNode memory, so a large number of small files consumes NameNode memory.
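To see why this matters, here is a back-of-the-envelope estimate. The per-object heap cost below (~150 bytes per file/block object) is a commonly cited rule of thumb, not an exact Hadoop figure:

```python
# Assumption for illustration: each file object and each block object
# costs on the order of 150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # One file object plus its block objects per file.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# The same ~1 GB of data as one 8-block file vs. 10,000 tiny files:
one_big = namenode_heap_bytes(1, blocks_per_file=8)        # 1,350 bytes
many_small = namenode_heap_bytes(10_000, blocks_per_file=1)  # 3,000,000 bytes
```

The metadata cost scales with the number of files, not the amount of data, which is exactly what HAR archives are designed to mitigate.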
- HAR files: an efficient archival tool
1) Start the YARN processes (archiving runs as a MapReduce job):
start-yarn.sh
2) Archive the files *
[user02@hadoop102 bin]$ hadoop archive -archiveName xiao.har -p /inputs /user/user02/outputs
3) Inspect the archive
[user02@hadoop102 hadoop-2.7.2]$ hadoop fs -ls -R /user/user02/outputs/xiao.har
-rw-r--r-- 3 user02 supergroup 0 0000-00-00 13:53 /user/user02/outputs/xiao.har/_SUCCESS
-rw-r--r-- 5 user02 supergroup 213 0000-00-00 13:53 /user/user02/outputs/xiao.har/_index
-rw-r--r-- 5 user02 supergroup 23 0000-00-00 13:53 /user/user02/outputs/xiao.har/_masterindex
-rw-r--r-- 3 user02 supergroup 0 0000-00-00 13:53 /user/user02/outputs/xiao.har/part-0
[user02@hadoop102 hadoop-2.7.2]$ hadoop fs -ls -R har:///user/user02/outputs/xiao.har
-rw-r--r-- 3 user02 supergroup 0 0000-00-00 13:32 har:///user/user02/outputs/xiao.har/xiaohong.txt
-rw-r--r-- 3 user02 supergroup 0 0000-00-00 13:32 har:///user/user02/outputs/xiao.har/xiaoming.txt
4) Un-archive (extract) the files
[user02@hadoop102 hadoop-2.7.2]$ hadoop fs -cp har:///user/user02/outputs/xiao.har/* /user/user02/outputs/harout
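The listing above shows the essence of a HAR: the small files are concatenated into one large part-0 file, and the _index maps each original name to an offset and length within it. A toy sketch of that layout (function names and the in-memory index are illustrative; a real HAR persists the index in the _index/_masterindex files):

```python
def har_archive(files):
    """Pack a {name: bytes} mapping into one 'part-0' blob plus an
    index, mimicking how a HAR stores many small files as one file."""
    part, index, offset = b"", {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))  # role of the _index file
        part += data
        offset += len(data)
    return part, index

def har_read(part, index, name):
    """Resolve a file through the index, like reading har:///...path."""
    offset, length = index[name]
    return part[offset:offset + length]
```

Because reads go through the index, clients can still address individual files transparently, while the NameNode only tracks the few archive files.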
(3) Trash
- Parameters
- fs.trash.interval=0: trash is disabled by default; any other value is the retention period of a deleted file, in minutes.
- fs.trash.checkpoint.interval=0: interval between trash checkpoints, in minutes; 0 means it defaults to the same value as fs.trash.interval.
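The interaction of the two parameters can be sketched as follows (helper names are hypothetical; times are in minutes to match the configuration units):

```python
def checkpoint_interval(trash_interval_min, checkpoint_min=0):
    """fs.trash.checkpoint.interval=0 falls back to fs.trash.interval."""
    return checkpoint_min if checkpoint_min > 0 else trash_interval_min

def is_expired(deleted_at_min, now_min, trash_interval_min):
    """A trashed file becomes eligible for permanent deletion once it
    has sat in the trash longer than fs.trash.interval."""
    return now_min - deleted_at_min > trash_interval_min
```

So with fs.trash.interval=1 as configured below, a deleted file survives in .Trash for roughly one minute before the emptier permanently removes it.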
- Configuration (stop the cluster before changing it)
As user02, edit core-site.xml to set the trash retention to 1 minute, then distribute the file to hadoop103 and hadoop104.
<configuration>
<!-- Address of the HDFS NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop102:9000</value>
</property>
<!-- Storage directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-2.7.2/data/tmp</value>
</property>
<!-- Set the trash retention period to 1 minute -->
<property>
<name>fs.trash.interval</name>
<value>1</value>
</property>
<!-- Username used to access the trash via the web UI -->
<property>
<name>hadoop.http.staticuser.user</name>
<value>user02</value>
</property>
</configuration>
- Path of the trash within the cluster
/user/user02/.Trash/.....
- Test: delete a file
With trash enabled, rm effectively becomes a mv into the trash. (Files deleted programmatically bypass the trash unless the code calls moveToTrash().)
[user02@hadoop102 hadoop-2.7.2]$ hdfs dfs -rm /user/user02/inputs/xiaohong.txt
21/07/11 15:27:52 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://hadoop102:9000/user/user02/inputs/xiaohong.txt' to trash at: hdfs://hadoop102:9000/user/user02/.Trash/Current
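The "Moved:" line above shows the mapping the trash uses: the file's full original path is appended under the deleting user's .Trash/Current directory. A one-line sketch of that mapping (function name is hypothetical):

```python
def trash_path(user, original_path):
    """With trash enabled, 'hdfs dfs -rm <path>' effectively moves the
    file to /user/<user>/.Trash/Current<path>."""
    return "/user/%s/.Trash/Current%s" % (user, original_path)
```

Preserving the original path under Current is what makes the restore step below a simple mv back out.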
- Restore data from the trash
[user02@hadoop102 hadoop-2.7.2]$ hadoop fs -mv /user/user02/.Trash/Current/user/user02/inputs/xiaohong.txt /user/user02/inputs
- Empty the trash
hadoop fs -expunge
(4) Snapshots
A snapshot is a backup of a directory, loosely comparable to a MySQL binlog.
Enable snapshots on a directory:
[user02@hadoop104 ~]$ hdfs dfsadmin -allowSnapshot /user/user02/inputs
Allowing snaphot on /user/user02/inputs succeeded
Create a snapshot:
[user02@hadoop104 ~]$ hdfs dfs -createSnapshot /user/user02/inputs
Created snapshot /user/user02/inputs/.snapshot/s20210711-160922.612
Compare two snapshot states (here, the current directory vs. the snapshot, after putting a new file):
[user02@hadoop102 hadoop-2.7.2]$ hadoop dfs -put inputs/xiaohong.txt /user/user02/inputs
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
[user02@hadoop102 hadoop-2.7.2]$ hdfs snapshotDiff /user/user02/inputs/ . .snapshot/s20210711-160922.612
Difference between current directory and snapshot s20210711-160922.612 under directory /user/user02/inputs:
M .
- ./xiaohong.txt
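The diff output reads as markers applied between the <from> and <to> states: "+" for files created, "-" for files deleted, "M" for a modified directory. In the session above, the current directory (from) contains xiaohong.txt while the snapshot (to) does not, hence the "-" entry. A sketch of that comparison over two directory listings given as sets of names (function name is hypothetical; real snapshotDiff also reports "R" for renames, which this sketch omits):

```python
def snapshot_diff(from_listing, to_listing):
    """Sketch of 'hdfs snapshotDiff <dir> <from> <to>' over two sets of
    file names: '+' = created, '-' = deleted, 'M' = modified directory
    (here only the root '.')."""
    created = sorted(set(to_listing) - set(from_listing))
    deleted = sorted(set(from_listing) - set(to_listing))
    report = []
    if created or deleted:
        report.append("M\t.")  # the containing directory changed
    report += ["+\t./" + name for name in created]
    report += ["-\t./" + name for name in deleted]
    return report
```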