Chapter 7: New Features in HDFS 2.X

留不住斜阳

Last modified: 2022-06-07 20:51:36

First published: 2022-06-07 15:44:55

Article link: https://blog.csdn.net/lubin2016/article/details/125164907

7.1. Copying data between clusters

(1) Use scp to copy files between remote hosts

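# Push: copy a local file to a remote host
scp -r hello.txt root@hadoop103:/user/testfiles/hello.txt
# Pull: copy a file from a remote host to the local host
scp -r root@hadoop103:/user/testfiles/hello.txt hello.txt
# Copy between two remote hosts, relayed through the local host (-3);
# useful when SSH is not configured between the two remote hosts
scp -3 -r root@hadoop103:/user/testfiles/hello.txt root@hadoop104:/user/testfiles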

(2) Use the distcp command to copy data between two Hadoop clusters

$ hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                       Reuse existing data in target files and
                               append new data to them if possible
 -async                        Should distcp execution be blocking
 -atomic                       Commit all changes or none
 -bandwidth <arg>              Specify bandwidth per map in MB
 -blocksperchunk <arg>         If set to a positive value, fileswith more
                               blocks than this value will be split into
                               chunks of <blocksperchunk> blocks to be
                               transferred in parallel, and reassembled on
                               the destination. By default,
                               <blocksperchunk> is 0 and the files will be
                               transmitted in their entirety without
                               splitting. This switch is only applicable
                               when the source file system implements
                               getBlockLocations method and the target
                               file system implements concat method
 -copybuffersize <arg>         Size of the copy buffer to use. By default
                               <copybuffersize> is 8192B.
 -delete                       Delete from target, files missing in source
 -diff <arg>                   Use snapshot diff report to identify the
                               difference between source and target
 -f <arg>                      List of files that need to be copied
 -filelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n
 -filters <arg>                The path to a file containing a list of
                               strings for paths to be excluded from the
                               copy.
 -i                            Ignore failures during copy
 -log <arg>                    Folder on DFS where distcp execution logs
                               are saved
 -m <arg>                      Max number of concurrent maps to use for
                               copy
 -mapredSslConf <arg>          Configuration for ssl config file, to use
                               with hftps://. Must be in the classpath.
 -numListstatusThreads <arg>   Number of threads to use for building file
                               listing (max 40).
 -overwrite                    Choose to overwrite target files
                               unconditionally, even if they exist.
 -p <arg>                      preserve status (rbugpcaxt)(replication,
                               block-size, user, group, permission,
                               checksum-type, ACL, XATTR, timestamps). If
                               -p is specified with no <arg>, then
                               preserves replication, block size, user,
                               group, permission, checksum type and
                               timestamps. raw.* xattrs are preserved when
                               both the source and destination paths are
                               in the /.reserved/raw hierarchy (HDFS
                               only). raw.* xattrpreservation is
                               independent of the -p flag. Refer to the
                               DistCp documentation for more details.
 -rdiff <arg>                  Use target snapshot diff report to identify
                               changes made on target
 -sizelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n bytes
 -skipcrccheck                 Whether to skip CRC checks between source
                               and target paths.
 -strategy <arg>               Copy strategy to use. Default is dividing
                               work based on file sizes
 -tmp <arg>                    Intermediate work path to be used for
                               atomic commit
 -update                       Update target, copying only missingfiles or
                               Directories

Example

# Copy a file from the cluster at hadoop102 to the cluster at hadoop103
hadoop distcp hdfs://hadoop102:9000/user/testfiles/hello.txt hdfs://hadoop103:9000/user/testfiles/hello.txt

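7.2. HDFS small-file storage schemes

Every file is stored as blocks, and the metadata for each block is held in NameNode memory, so Hadoop stores small files very inefficiently: a large number of small files can consume most of the NameNode's memory. The disk space a small file occupies, however, equals the size of its original content. For example, a 1 MB file stored with a 128 MB block size uses 1 MB of disk space, not 128 MB.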
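
To get a feel for the scale, the sketch below estimates NameNode memory for the same data stored as small files and as block-sized files. The figure of roughly 150 bytes per namespace object (file or block) is a commonly quoted rule of thumb and an assumption here, not a value from this article; the function name is illustrative.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb per file or block object in NameNode memory

def namenode_bytes(num_files, file_size, block_size=128 * 1024 * 1024):
    """Rough NameNode memory estimate: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

one_mb = 1024 * 1024
small = namenode_bytes(10_000_000, one_mb)     # ten million 1 MB files
large = namenode_bytes(78_125, 128 * one_mb)   # the same data as 128 MB files
print(small, large)
```

Under these assumptions the small-file layout needs 128 times as much NameNode memory for the same amount of data, even though the disk usage is identical.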

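7.2.1. HAR storage scheme

HAR is short for Hadoop Archive; archive names end in .har. Archiving packs many small files into a single archive that contains both the metadata and the contents of the small files. To some extent this moves metadata that the NameNode would otherwise manage into the archive stored on the DataNodes, which keeps the NameNode metadata from ballooning.
[Figure: original small files (left) and the HAR layout (right)]

In the figure, the left side shows the original small files and the right side shows what a HAR is made of: _masterindex, _index, part-0, ..., part-n, and _SUCCESS. _masterindex and _index hold the metadata, and part-0, ..., part-n hold the contents of the small files.

For example, suppose the cluster has the following layout: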

[root@lubin01 hadoop-3.3.1]# hdfs dfs -ls /test
Found 3 items
drwxr-xr-x   - root supergroup          0 2022-06-07 16:31 /test/files
-rw-r--r--   3 root supergroup         35 2021-09-28 11:19 /test/test.txt
-rw-r--r--   3 root supergroup         14 2021-12-16 15:48 /test/test2.txt

Create an archive with the hadoop archive command:

hadoop archive <-archiveName <NAME>.har> <-p <parent path>> [-r <replication factor>] <src>* <dest>

-archiveName sets the archive name, -p the parent directory of the source files, and -r the replication factor of the archive, for example:

hadoop archive -archiveName test.har -p /test/files -r 3 test.txt test2.txt /test/files

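The command above creates a test.har directory under /test/files. An archive is a logical concept; physically, a HAR is a directory that stores the metadata and the actual file contents.

Viewing the archive
Method 1:

hadoop fs -ls har://scheme-hostname:port/archivepath/fileinarchive

scheme-hostname takes the form hdfs-hostname:port. If scheme-hostname is omitted, the default file system is used.

Example: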

[root@lubin01 hadoop-3.3.1]# hadoop fs -ls har://hdfs-lubin01:8020/test/files/test.har
Found 2 items
-rw-r--r--   3 root supergroup         35 2021-09-28 11:19 har://hdfs-lubin01:8020/test/files/test.har/test.txt
-rw-r--r--   3 root supergroup         14 2021-12-16 15:48 har://hdfs-lubin01:8020/test/files/test.har/test2.txt

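When the archive is accessed through a har URI, the index and marker files are hidden and only the original files that were archived are shown.

View the contents of a specific file: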

[root@lubin01 hadoop-3.3.1]# hadoop fs -cat har://hdfs-lubin01:8020/test/files/test.har/test.txt
spark hello
world java
linux spark

Method 2:

hdfs dfs -ls /test/files/test.har

Output:

Found 4 items
-rw-r--r--   3 root supergroup          0 2022-06-07 16:31 /test/files/test.har/_SUCCESS
-rw-r--r--   3 root supergroup        196 2022-06-07 16:31 /test/files/test.har/_index
-rw-r--r--   3 root supergroup         23 2022-06-07 16:31 /test/files/test.har/_masterindex
-rw-r--r--   3 root supergroup         49 2022-06-07 16:31 /test/files/test.har/part-0

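Each line of the _index file maps a small file to its location in a part file: which part file holds it, its starting offset, and its length. When a small file is read from the HAR, its content can be fetched directly from that offset: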

[root@lubin01 hadoop-3.3.1]# hdfs dfs -cat /test/files/test.har/_index
%2F dir 1654590446753+493+root+supergroup 0 0 test.txt test2.txt 
%2Ftest2.txt file part-0 35 14 1639640918227+420+root+supergroup 
%2Ftest.txt file part-0 0 35 1632799168349+420+root+supergroup

[root@lubin01 hadoop-3.3.1]# hdfs dfs -cat /test/files/test.har/part-0
spark hello
world java
linux spark
hahaha
ffffff
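
The lookup can be sketched in a few lines. This is only an illustration of how the offset and length fields are used, not Hadoop's own reader: the _index and part-0 contents are copied from the output above, and the function names are made up for the example.

```python
from urllib.parse import unquote

INDEX = """\
%2F dir 1654590446753+493+root+supergroup 0 0 test.txt test2.txt
%2Ftest2.txt file part-0 35 14 1639640918227+420+root+supergroup
%2Ftest.txt file part-0 0 35 1632799168349+420+root+supergroup
"""
PARTS = {"part-0": b"spark hello\nworld java\nlinux spark\nhahaha\nffffff\n"}

def parse_index(text):
    """Map each archived file path to (part file, offset, length)."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[1] == "file":
            # The path is URL-encoded: %2F decodes to "/"
            entries[unquote(fields[0])] = (fields[2], int(fields[3]), int(fields[4]))
    return entries

def read_file(entries, parts, path):
    """Slice one small file out of its part file."""
    part, offset, length = entries[path]
    return parts[part][offset:offset + length]

entries = parse_index(INDEX)
print(read_file(entries, PARTS, "/test2.txt").decode(), end="")
```

Reading /test2.txt returns the 14 bytes that start at offset 35 of part-0, which are the last two lines of the part file.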

Extracting an archive
Extract sequentially (serial):

hadoop fs -cp har:///test/files/test.har/test.txt /test/test-archive

Extract a HAR in parallel with distcp, which runs as a MapReduce job. Specify the HAR path and the output path:

hadoop distcp har:///test/files/test.har/* /test/test-archive

The result:

[root@lubin01 hadoop-3.3.1]# hdfs dfs -ls /test/test-archive
Found 2 items
-rw-r--r--   3 root supergroup         35 2022-06-07 17:09 /test/test-archive/test.txt
-rw-r--r--   3 root supergroup         14 2022-06-07 17:09 /test/test-archive/test2.txt

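Drawbacks of HAR

An archive cannot be modified or appended to once created; if one of the small files is bad, the archive has to be extracted, the file fixed, and a new archive generated;
Archiving does not delete the original files; they must be deleted manually;
Creating a HAR, and extracting one with distcp, depends on MapReduce, and looking up files is slow;
Archives do not support compression;
Creating an archive consumes as much disk space as the original files.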

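7.2.2. SequenceFile storage scheme

A SequenceFile can also solve the small-file storage problem. The idea is to use the file name as the key and the file contents as the value, as shown in the figure below.
[Figure: SequenceFile with file names as keys and file contents as values]

This works very well in practice. Given 10,000 small files of 100 KB each, you can write a program that puts them into a single SequenceFile and then process that file in a streaming fashion, either directly or with MapReduce. This brings two advantages: (1) SequenceFiles are splittable, so MapReduce can divide them into chunks and work on each chunk independently; (2) unlike HAR, they support compression. In most cases block compression is the best choice, because it compresses several records together as one block rather than compressing each record on its own.

A SequenceFile consists of a Header, one or more Records or Blocks, and one or more SYNC markers. The layout depends on the compression type and falls into two modes: Record mode and Block mode.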

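7.2.2.1. Record mode

In a SequenceFile, each key-value pair is treated as a record. SequenceFile supports three compression types (SequenceFile.CompressionType); the two used in Record mode are:

CompressionType.NONE: records are not compressed
CompressionType.RECORD: only the value of each record is compressed

The third type, CompressionType.BLOCK, is covered in 7.2.2.2.

In Record mode, each Record contains the record length, the key length, the key, and the value. Sync markers are interspersed between records and are used to locate positions in the file. If the requested read position is not a record boundary (it may fall in the middle of a record), the reader moves to the first sync point after that position and reads from there; if there is no sync point after that position, it jumps to the end of the file. If the requested position is a record boundary, reading starts directly from it. The other way to position the reader is seek, which requires the given position to be a record boundary; otherwise reading the next record fails.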
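
The positioning rule can be illustrated with a toy byte layout. This is a simplified sketch, not the real SequenceFile format: the header is omitted, the sync marker is a fixed 16-byte value chosen for the demo, the escape value of -1 before each marker is an assumption, and the function names are hypothetical.

```python
import struct

SYNC_ESCAPE = struct.pack(">i", -1)  # assumed escape value written before each sync marker
SYNC = bytes(range(16))              # fixed 16-byte marker for the demo

def write_records(pairs, sync_every=2):
    """Serialize (key, value) byte pairs, inserting a sync marker every sync_every records."""
    out = bytearray()
    for i, (key, value) in enumerate(pairs):
        if i % sync_every == 0:
            out += SYNC_ESCAPE + SYNC
        # record length, key length, key bytes, value bytes
        out += struct.pack(">ii", len(key) + len(value), len(key)) + key + value
    return bytes(out)

def read_from(data, position):
    """Return the records found from the first sync marker at or after position."""
    marker = SYNC_ESCAPE + SYNC
    i = data.find(marker, position)
    if i < 0:
        return []  # no sync point after position: behave as if at end of file
    records = []
    while i < len(data):
        if data.startswith(marker, i):
            i += len(marker)  # skip over the sync marker itself
            continue
        record_len, key_len = struct.unpack_from(">ii", data, i)
        body = data[i + 8:i + 8 + record_len]
        records.append((body[:key_len], body[key_len:]))
        i += 8 + record_len
    return records

pairs = [(b"a.txt", b"alpha"), (b"b.txt", b"beta"), (b"c.txt", b"gamma"), (b"d.txt", b"delta")]
data = write_records(pairs)
print(read_from(data, 0))  # all four records
print(read_from(data, 1))  # resumes at the second sync marker: c.txt and d.txt
```

Starting from position 1, which is not a record boundary, the reader skips ahead to the next sync marker and loses the two records before it. This is why an arbitrary offset is only safe when the reader resynchronizes first.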

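7.2.2.2. Block mode

The compression type is CompressionType.BLOCK. Unlike Record mode, Block mode compresses in units of blocks: multiple records are written into a block, and once the block reaches a certain size it is compressed as a whole. Block compression is therefore much more efficient than record compression and avoids spending large amounts of I/O and CPU. Its logical structure is as follows:
[Figure: logical structure of Block mode]

As the figure shows, the unit of organization becomes the block. Each block contains the number of records in the block, the key lengths, the keys, the value lengths, and the values. Sync markers also sit between blocks and serve the same purpose as in Record mode.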
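
The gap between the two approaches is easy to see with a general-purpose compressor. The sketch below uses Python's zlib as a stand-in for a Hadoop codec, and the sample records are invented for the demonstration.

```python
import zlib

# 1,000 short, similar records, typical of small log-like values
records = [f"user={i % 7};action=click;page=/home".encode() for i in range(1000)]

raw_size = sum(len(r) for r in records)
record_size = sum(len(zlib.compress(r)) for r in records)  # each record compressed on its own
block_size = len(zlib.compress(b"".join(records)))         # all records compressed together

print(raw_size, record_size, block_size)
```

Compressing each record separately cannot exploit the repetition across records and pays the compressor's fixed overhead every time, while compressing them together as one block does both better.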

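Both modes have a header, which contains the version, the key class name, the value class name, whether compression is enabled, whether block compression is enabled, the codec class, metadata, and the sync marker. Its structure is as follows:
[Figure: SequenceFile header structure]

Pros and cons of SequenceFile
Pros:

Supports record-level or block-level compression;
Splittable, so it can serve as input splits for MapReduce;
No need to deal with the on-disk format; writing and reading are simple;

Cons:

Requires a step that merges the files;
Depends on MapReduce;
It is a binary format, so the merged file is not easy to inspect;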

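7.2.3. CombinedFile storage scheme

This scheme uses MapReduce to convert the original files. The CombineFileInputFormat class packs multiple files into a single split, and each mapper processes one split, which improves parallel processing efficiency; for workloads with a large number of small files, this consolidates them quickly. The merged output combines the contents of many small files into one file, and each line begins with the full HDFS path of the small file it came from. This creates a problem: if many small files are merged, the output carries a lot of extra information and wastes space, so this scheme is used relatively rarely today.

Its advantage is that it suits merging large numbers of files that are smaller than a block or have little content, especially text files and SequenceFiles. Its drawback is that if maxSplitSize, minSizeNode, and minSizeRack are not set sensibly, a map task may have to read many non-local blocks, and the resulting network overhead can make it slower than the ordinary non-combined approach.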
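
The idea of packing many files into one split can be sketched with a simple greedy loop. This is a simplified illustration, not the algorithm CombineFileInputFormat actually uses: it ignores node and rack locality (the role of minSizeNode and minSizeRack), and the function name and sample sizes are made up.

```python
def combine_into_splits(files, max_split_size):
    """Greedily pack (path, size) pairs into splits of at most max_split_size bytes."""
    splits, current, current_size = [], [], 0
    for path, size in files:
        # Close the current split when adding this file would exceed the limit
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits

files = [("/in/a", 40), ("/in/b", 50), ("/in/c", 30), ("/in/d", 100), ("/in/e", 10)]
splits = combine_into_splits(files, max_split_size=128)
print(splits)
```

With a limit of 128, the five files end up in two splits, so two mappers run instead of five. A limit that is too small degenerates to one file per split, and one that is too large pulls distant blocks into the same mapper.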

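7.3. Snapshot management

A snapshot is effectively a backup of a directory. Creating one does not copy the files immediately; the snapshot points to the same files, and new data is written only when a later write occurs.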

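Basic syntax
(1) Enable snapshots on a directory

hdfs dfsadmin -allowSnapshot <path>

(2) Disable snapshots on a directory (snapshots are disabled by default)

hdfs dfsadmin -disallowSnapshot <path>

(3) Create a snapshot of a directory

hdfs dfs -createSnapshot <path>

(4) Create a snapshot with a given name

hdfs dfs -createSnapshot <path> <snapshotName>

(5) Rename a snapshot

hdfs dfs -renameSnapshot <path> <oldName> <newName>

(6) List all snapshottable directories for the current user

hdfs lsSnapshottableDir

(7) Show the differences between two snapshots

hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot>

(8) Delete a snapshot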

hdfs dfs -deleteSnapshot <path> <snapshotName>
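
What a snapshot diff reports can be sketched with two plain directory listings. This is a toy model, not how HDFS computes the report: it compares only file paths and modification times, uses the +, -, and M markers, and leaves out renames; the function name and sample data are hypothetical.

```python
def snapshot_diff(old, new):
    """Compare two {path: modification_time} listings and report +, -, and M entries."""
    report = []
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            report.append(("+", path))   # created after the first snapshot
        elif path not in new:
            report.append(("-", path))   # deleted after the first snapshot
        elif old[path] != new[path]:
            report.append(("M", path))   # modified between the snapshots
    return report

s1 = {"log1.txt": 100, "log2.txt": 100}
s2 = {"log1.txt": 250, "log3.txt": 300}
for mark, path in snapshot_diff(s1, s2):
    print(mark, path)
```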

Hands-on examples
(1) Enable or disable snapshots on a directory

hdfs dfsadmin -allowSnapshot /teaching/hdfs3
hdfs dfsadmin -disallowSnapshot /teaching/hdfs3

(2) Create a snapshot of a directory

hdfs dfs -createSnapshot /teaching/hdfs3

Browse to http://xxxxxx:50070/explorer.html#/teaching/hdfs3/.snapshot/s20190815-113957.335 in the web UI. The snapshot and the source files use the same data blocks.

$ hdfs dfs -lsr /teaching/hdfs3/.snapshot/
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - hdfs supergroup          0 2019-08-15 11:39 /teaching/hdfs3/.snapshot/s20190815-113957.335
-rw-r--r--   3 hdfs supergroup         15 2019-08-15 11:36 /teaching/hdfs3/.snapshot/s20190815-113957.335/log1.txt

(3) Create a snapshot with a given name

$ hdfs dfs -createSnapshot /teaching/hdfs3 teaching-test
Created snapshot /teaching/hdfs3/.snapshot/teaching-test

(4) Rename a snapshot

$ hdfs dfs -renameSnapshot /teaching/hdfs3/ teaching-test teaching-test2
$ hdfs dfs -lsr /teaching/hdfs3/.snapshot
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - hdfs supergroup          0 2019-08-15 11:50 /teaching/hdfs3/.snapshot/teaching-test2
-rw-r--r--   3 hdfs supergroup         15 2019-08-15 11:36 /teaching/hdfs3/.snapshot/teaching-test2/log1.txt

(5) List all snapshottable directories for the current user

$ hdfs lsSnapshottableDir
drwxr-xr-x 0 hdfs supergroup 0 2019-08-15 11:50 2 65536 /teaching/hdfs3

(6) Copy a snapshot

$ hdfs dfs -cp /teaching/hdfs3/.snapshot/teaching-test2 /teaching/hdfs3
$ hdfs dfs -ls /teaching/hdfs3
Found 2 items
-rw-r--r--  3 hdfs supergroup  15 2019-08-15 11:36 /teaching/hdfs3/log1.txt
drwxr-xr-x  - hdfs supergroup            0 2019-08-15 14:14 /teaching/hdfs3/teaching-test2

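7.4. Trash

(1) Default trash settings

fs.trash.interval: the trash retention time, in minutes. Data that has been in the trash longer than this is deleted. The default is 0, which disables the trash. It can be configured on both the server and the client: if trash is disabled on the server side, the client-side configuration is checked; if it is enabled on the server side, the client-side configuration is ignored.
fs.trash.checkpoint.interval: the interval between trash checkpoints, in minutes. It should be less than or equal to fs.trash.interval. The default is 0, which means the value of fs.trash.interval is used. Each time the checkpointer runs, it creates a new checkpoint.

(2) Enable the trash
Edit core-site.xml and set the trash retention time to 1 minute.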

<property>
    <name>fs.trash.interval</name>
    <value>1</value>
</property>

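(3) View the trash
Path of the trash in the cluster: /user/$USER/.Trash/
Note: trash data on HDFS lives under /user/$USER/.Trash/Current. If checkpointing is enabled, the Current directory is periodically renamed to a timestamp. Files in .Trash are permanently deleted once the user-configured retention time has elapsed.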
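
The checkpoint naming and expiry rule can be sketched as follows. The timestamp format is inferred from the checkpoint name /user/hdfs/.Trash/190815145552 that appears in the -expunge output in this section; the function names and the exact expiry comparison are assumptions for illustration, not Hadoop's implementation.

```python
from datetime import datetime, timedelta

CHECKPOINT_FORMAT = "%y%m%d%H%M%S"  # inferred from names such as 190815145552

def checkpoint_name(when):
    """Name that the Current directory would be renamed to at the given time."""
    return when.strftime(CHECKPOINT_FORMAT)

def is_expired(name, now, interval_minutes):
    """True if the checkpoint is older than fs.trash.interval minutes."""
    created = datetime.strptime(name, CHECKPOINT_FORMAT)
    return now - created > timedelta(minutes=interval_minutes)

name = checkpoint_name(datetime(2019, 8, 15, 14, 55, 52))
print(name)
print(is_expired(name, datetime(2019, 8, 15, 14, 57, 0), interval_minutes=1))
```

With fs.trash.interval set to 1, a checkpoint created at 14:55:52 is past its retention time by 14:57:00 and becomes eligible for permanent deletion.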

(4) Change the user name used to browse the trash
The default user name for browsing the trash is dr.who; change it to the lubin user.
Edit core-site.xml:

<property>
  <name>hadoop.http.staticuser.user</name>
  <value>lubin</value>
</property>

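(5) Files deleted programmatically do not go through the trash; call moveToTrash() to send them there

import org.apache.hadoop.fs.Trash;
// conf is an existing Configuration; path is the Path to delete
Trash trash = new Trash(conf);
trash.moveToTrash(path);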

(6) Restore data from the trash

hadoop fs -mv /user/hdfs/.Trash/Current/teaching/hdfs/test3.txt /teaching/hdfs/test3.txt

or

hdfs dfs -mv /user/hdfs/.Trash/Current/teaching/hdfs/test3.txt /teaching/hdfs/test3.txt

Note: because the data is moved, it no longer exists in the trash after it is restored.
(7) Empty the trash

hadoop fs -expunge

or

$ hdfs dfs -expunge
19/08/15 14:55:52 INFO fs.TrashPolicyDefault: TrashPolicyDefault#deleteCheckpoint for trashRoot: hdfs://rc-fhcb-09-hd001:8020/user/hdfs/.Trash
19/08/15 14:55:52 INFO fs.TrashPolicyDefault: TrashPolicyDefault#deleteCheckpoint for trashRoot: hdfs://rc-fhcb-09-hd001:8020/user/hdfs/.Trash
19/08/15 14:55:52 INFO fs.TrashPolicyDefault: TrashPolicyDefault#createCheckpoint for trashRoot: hdfs://rc-fhcb-09-hd001:8020/user/hdfs/.Trash
19/08/15 14:55:52 INFO fs.TrashPolicyDefault: Created trash checkpoint: /user/hdfs/.Trash/190815145552