Hadoop Archive (HAR) Explained

樱浅沐冰

Category: Learning. Tags: hadoop, hdfs, big data

Article link: https://blog.csdn.net/qq_45300786/article/details/104777750

Preface

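HDFS is not good at storing small files. Every file occupies at least one block, and the metadata for each file and each block takes roughly 150 bytes of NameNode memory, so a large number of small files eats up a large share of the NameNode's heap. See also the MR case study on handling small files.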
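To make the memory cost concrete, here is a rough back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes per namespace object, counting one object per file plus one per block; the function name estimate_nn_bytes and the file counts are illustrative, not taken from the original article.

```shell
# Rough NameNode heap estimate, in bytes.
# Assumption: ~150 bytes per namespace object (1 object per file + 1 per block).
# Usage: estimate_nn_bytes FILES [BLOCKS_PER_FILE]   (BLOCKS_PER_FILE defaults to 1)
estimate_nn_bytes() {
  files=$1
  blocks_per_file=${2:-1}
  echo $(( files * (1 + blocks_per_file) * 150 ))
}

# 10 million small files, one block each.
estimate_nn_bytes 10000000 1

# Hypothetical comparison: the same data held in 100 large files of 8 blocks each.
estimate_nn_bytes 100 8
```

The first call gives about 3 GB of heap for metadata alone, while the second stays in the hundreds of kilobytes, which is the gap that packing small files together is meant to close.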

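Hadoop Archive, or HAR, is a file archiving tool that packs small files into HDFS blocks efficiently. It bundles many small files into a single HAR file, which reduces NameNode memory usage while still allowing transparent access to the individual files, for example as MapReduce input.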

[root@master ~]# hadoop fs -rm /c23.txt
20/01/27 04:29:38 INFO fs.TrashPolicyDefault: Moved: 'hdfs://master:8020/c23.txt' to trash at: hdfs://master:8020/user/root/.Trash/Current/c23.txt
[root@master ~]#

Usage

Directory before archiving

The small files under the small-file directory:

[root@master ~]# hadoop fs -ls /1daoyun/small-file
Found 11 items
-rw-r--r--   3 root hdfs         60 2020-01-27 01:22 /1daoyun/small-file/1.txt
-rw-r--r--   3 root hdfs         49 2020-01-27 01:22 /1daoyun/small-file/10.txt
-rw-r--r--   3 root hdfs         54 2020-01-27 01:22 /1daoyun/small-file/11.txt
-rw-r--r--   3 root hdfs         56 2020-01-27 01:22 /1daoyun/small-file/2.txt
-rw-r--r--   3 root hdfs         44 2020-01-27 01:22 /1daoyun/small-file/3.txt
-rw-r--r--   3 root hdfs         53 2020-01-27 01:22 /1daoyun/small-file/4.txt
-rw-r--r--   3 root hdfs         69 2020-01-27 01:22 /1daoyun/small-file/5.txt
-rw-r--r--   3 root hdfs         55 2020-01-27 01:22 /1daoyun/small-file/6.txt
-rw-r--r--   3 root hdfs         59 2020-01-27 01:22 /1daoyun/small-file/7.txt
-rw-r--r--   3 root hdfs         54 2020-01-27 01:22 /1daoyun/small-file/8.txt
-rw-r--r--   3 root hdfs         64 2020-01-27 01:22 /1daoyun/small-file/9.txt

Archive command

The -D har.block.size parameter sets the block size used for the archive's part files.

shell> hadoop archive -archiveName NAME -p <parent path> <src>* <dest>

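-archiveName xiandian.har : name of the archive to create
-p /1daoyun/small-file : parent directory of the files to archive
1.txt 6.txt : the sources to archive, given relative to the parent path; one or more (for example, only 1.txt and 6.txt). If none is given, everything under the parent directory is archived
/user/root : directory in which the archive is created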
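The article does not show the exact command that produced xiandian.har, so the sketch below only assembles and prints the command line from the parameters described above, without running it. The helper name har_archive_cmd is hypothetical; the argument order follows the usage line (sources after the parent path, destination last).

```shell
# Print (do not run) a hadoop archive command line.
# Usage: har_archive_cmd NAME PARENT DEST [SRC...]
# SRC entries are relative to PARENT; with none, the whole parent is archived.
har_archive_cmd() {
  local name=$1 parent=$2 dest=$3
  shift 3
  # ${*:+ $*} adds the sources, preceded by a space, only when some were given.
  echo "hadoop archive -archiveName $name -p $parent${*:+ $*} $dest"
}

# Archive everything under /1daoyun/small-file.
har_archive_cmd xiandian.har /1daoyun/small-file /user/root

# Archive only 1.txt and 6.txt.
har_archive_cmd xiandian.har /1daoyun/small-file /user/root 1.txt 6.txt
```

Printing first makes it easy to review the paths before launching the MapReduce job that actually builds the archive.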

Directory structure after archiving

The listing below shows the files under /user/root/xiandian.har:

[root@master ~]# hadoop fs -lsr /user/root
lsr: DEPRECATED: Please use 'ls -R' instead.
drwx------   - root hdfs          0 2020-01-27 01:23 /user/root/.staging
drwxr-xr-x   - root hdfs          0 2020-01-27 01:23 /user/root/xiandian.har
-rw-r--r--   3 root hdfs          0 2020-01-27 01:23 /user/root/xiandian.har/_SUCCESS
-rw-r--r--   3 root hdfs        735 2020-01-27 01:23 /user/root/xiandian.har/_index
-rw-r--r--   3 root hdfs         23 2020-01-27 01:23 /user/root/xiandian.har/_masterindex
-rw-r--r--   3 root hdfs        617 2020-01-27 01:23 /user/root/xiandian.har/part-0
[root@master ~]#

Viewing the contents of the result file part-0

[root@master ~]# hadoop fs -cat /user/root/xiandian.har/part-0
HTML XHTML CSS JS JQuery ... (the rest is the concatenated content of the archived files; its non-ASCII text was garbled in the captured terminal output)
[root@master ~]#

1) Accessing the original data through a har URI

HAR is a file system layered on top of HDFS, so the usual fs shell commands work on HAR files; only the path format differs.

[root@master ~]# hadoop fs -ls har:///user/root/xiandian.har
Found 11 items
-rw-r--r--   3 root hdfs         60 2020-01-27 01:22 har:///user/root/xiandian.har/1.txt
-rw-r--r--   3 root hdfs         49 2020-01-27 01:22 har:///user/root/xiandian.har/10.txt
-rw-r--r--   3 root hdfs         54 2020-01-27 01:22 har:///user/root/xiandian.har/11.txt
-rw-r--r--   3 root hdfs         56 2020-01-27 01:22 har:///user/root/xiandian.har/2.txt
-rw-r--r--   3 root hdfs         44 2020-01-27 01:22 har:///user/root/xiandian.har/3.txt
-rw-r--r--   3 root hdfs         53 2020-01-27 01:22 har:///user/root/xiandian.har/4.txt
-rw-r--r--   3 root hdfs         69 2020-01-27 01:22 har:///user/root/xiandian.har/5.txt
-rw-r--r--   3 root hdfs         55 2020-01-27 01:22 har:///user/root/xiandian.har/6.txt
-rw-r--r--   3 root hdfs         59 2020-01-27 01:22 har:///user/root/xiandian.har/7.txt
-rw-r--r--   3 root hdfs         54 2020-01-27 01:22 har:///user/root/xiandian.har/8.txt
-rw-r--r--   3 root hdfs         64 2020-01-27 01:22 har:///user/root/xiandian.har/9.txt
[root@master ~]#

2) Accessing an entry one level down with a har URI

[root@master ~]# hadoop fs -ls har:///user/root/xiandian.har/1.txt
-rw-r--r--   3 root hdfs         60 2020-01-27 01:22 har:///user/root/xiandian.har/1.txt
[root@master ~]#

3) For remote access, put the underlying file system scheme and the NameNode address into the URI:

[root@ncst ~]# hadoop fs -lsr har://hdfs-ncst:9000/test/in/har/small.har
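The two URI forms used above (har:///path for the default file system, har://scheme-host:port/path for an explicit one) can be derived mechanically from an ordinary HDFS path. The following is a minimal sketch of that mapping; the helper name to_har_uri is made up for illustration, and it only rewrites strings without contacting any cluster.

```shell
# Convert an absolute path or a full file system URI into the matching har URI.
#   /user/root/xiandian.har           -> har:///user/root/xiandian.har
#   hdfs://host:port/path/to/x.har    -> har://hdfs-host:port/path/to/x.har
to_har_uri() {
  case $1 in
    *://*)
      scheme=${1%%://*}   # text before "://", e.g. hdfs
      rest=${1#*://}      # authority and path, e.g. host:port/path
      echo "har://${scheme}-${rest}"
      ;;
    /*)
      echo "har://$1"
      ;;
    *)
      echo "to_har_uri: expected an absolute path or a URI: $1" >&2
      return 1
      ;;
  esac
}

to_har_uri /user/root/xiandian.har
to_har_uri hdfs://ncst:9000/test/in/har/small.har
```

The result can be passed straight to commands such as hadoop fs -ls.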

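4) A HAR is stored as a directory on HDFS, so deleting it requires a recursive delete (rmr, or rm -r on newer releases where rmr is deprecated); plain rm does not work.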

[root@master ~]# hadoop fs -rmr /user/root/xiandian.har

5) Using a HAR as MapReduce input

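[root@ncst ~]#  hadoop jar /***/hadoop-mapreduce-examples-2.2.0.jar wordcount \
> har:///test/in/har/0825.har/mapjoin \
> /test/out/0825/05

The first path (har:///test/in/har/0825.har/mapjoin) is the input path and the second (/test/out/0825/05) is the output path.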

Limitations

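The source files and directories are not deleted automatically after archiving; they must be removed by hand.
Archiving runs as a MapReduce job, so the cluster needs a working MapReduce setup.
Archives themselves do not support compression.
An archive cannot be modified once created; to add or remove files, the archive has to be rebuilt.
Creating an archive makes a copy of the original files, so free disk space at least equal to the size of the archive is required.
When a HAR is used as MapReduce input, the job can read every file inside it. However, the InputFormat is not aware that it is reading an archive and makes no attempt to group several files into a single input split, so the data is still processed as many small files and efficiency remains low.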
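Because the sources are left in place, a common follow-up is to remove them once the archive is known to be readable. The sketch below is one cautious way to do that, not a procedure from the original article: the run and archive_then_remove helpers and the DRY_RUN switch are assumptions introduced here, and with DRY_RUN left at its default of 1 the commands are only printed.

```shell
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: archive_then_remove NAME PARENT DEST
# DEST must be an absolute path on the default file system.
archive_then_remove() {
  name=$1 parent=$2 dest=$3
  # Step 1: build the archive from everything under PARENT.
  run hadoop archive -archiveName "$name" -p "$parent" "$dest" || return 1
  # Step 2: make sure the archive can be listed through its har URI.
  run hadoop fs -ls "har://$dest/$name" || return 1
  # Step 3: only now remove the original directory.
  run hadoop fs -rm -r "$parent"
}

archive_then_remove xiandian.har /1daoyun/small-file /user/root
```

Setting DRY_RUN=0 executes the three steps for real; each step stops the sequence if the previous command reports a failure, so the originals are kept whenever archiving or listing goes wrong.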

6) HAR structure: a two-level index

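(Figure not available: HAR layout. The _masterindex file indexes the _index file, which in turn records where each archived file is located inside the part-* files.)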
