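The Hadoop DistCp Command

1. Overview

DistCp (distributed copy) is a tool for large-scale copying within a cluster and between clusters. It uses Map/Reduce to handle file distribution, error handling and recovery, and reporting. The list of files and directories to copy becomes the input to the map tasks, and each map task copies a portion of the files.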

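2. Usage

Copying between clusters:

$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo    # a single source directory

The command above copies everything under /foo/bar on the nn1 cluster to /bar/foo on the nn2 cluster. DistCp expands the names of the files and directories to copy, stores them in a temporary file, and divides that list among several map tasks. Each TaskTracker then performs its share of the copy from nn1 to nn2.

$ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo    # multiple source directories

Alternatively, write the source paths into a file and pass that file: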

$ hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo

Note: the -f option names the file that holds the source list. Here srclist contains:

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b

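3. Caveats:

  • Copy conflicts

        When copying from multiple sources, DistCp aborts the copy and prints an error message if two sources collide.

        Collisions at the destination are resolved according to the options specified.

  • The two clusters must run the same version, or versions whose communication protocols are compatible.

  • The copy may fail while a source file is being written.

  • The copy may fail when it tries to overwrite a file on HDFS that is being written.

  • If a source file does not exist, the copy fails with a FileNotFoundException.

4. Options: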

| Flag | Description | Notes |
| --- | --- | --- |
| -p[rbugpcaxt] | Preserve r: replication number, b: block size, u: user, g: group, p: permission, c: checksum-type, a: ACL, x: XAttr, t: timestamp | When -update is specified, status updates will not be synchronized unless the file sizes also differ (i.e. unless the file is re-created). If -pa is specified, DistCp preserves the permissions also because ACLs are a super-set of permissions. |
| -i | Ignore failures | As explained in the Appendix, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted. |
| -log <logdir> | Write logs to <logdir> | DistCp keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed. |
| -v | Log additional info (path, size) in the SKIP/COPY log | This option can only be used with -log option. |
| -m <num_maps> | Maximum number of simultaneous copies | Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. |
| -overwrite | Overwrite destination | If a map fails and -i is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully. |
| -update | Overwrite if source and destination differ in size, blocksize, or checksum | As noted in the preceding, this is not a "sync" operation. The criteria examined are the source and destination file sizes, blocksizes, and checksums; if they differ, the source file replaces the destination file. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully. |
| -append | Incremental copy of file with same name but different length | If the source file is greater in length than the destination file, the checksum of the common length part is compared. If the checksum matches, only the difference is copied using read and append functionalities. The -append option only works with -update without -skipcrccheck. |
| -f <urilist_uri> | Use list at <urilist_uri> as src list | This is equivalent to listing each source on the command line. The urilist_uri list should be a fully qualified URI. |
| -filters | The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy. | Support regular expressions specified by java.util.regex.Pattern. |
| -filelimit <n> | Limit the total number of files to be <= n | Deprecated! Ignored in the new DistCp. |
| -sizelimit <n> | Limit the total size to be <= n bytes | Deprecated! Ignored in the new DistCp. |
| -delete | Delete the files existing in the dst but not in src | The deletion is done by FS Shell, so the trash will be used if it is enabled. Delete is applicable only with update or overwrite options. |
| -strategy {dynamic\|uniformsize} | Choose the copy-strategy to be used in DistCp. | By default, uniformsize is used (i.e. maps are balanced on the total size of files copied by each map, similar to legacy). If "dynamic" is specified, DynamicInputFormat is used instead. (This is described in the Architecture section, under InputFormats.) |
| -bandwidth | Specify bandwidth per map, in MB/second. | Each map will be restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy, such that the net bandwidth used tends towards the specified value. |
| -atomic {-tmp <tmp_dir>} | Specify atomic commit, with optional tmp directory. | -atomic instructs DistCp to copy the source data to a temporary target location, and then move the temporary target to the final location atomically. Data will either be available at final target in a complete and consistent form, or not at all. Optionally, -tmp may be used to specify the location of the tmp-target. If not specified, a default is chosen. Note: tmp_dir must be on the final target cluster. |
| -mapredSslConf <ssl_conf_file> | Specify SSL Config file, to be used with HSFTP source | When using the hsftp protocol with a source, the security-related properties may be specified in a config file and passed to DistCp. <ssl_conf_file> needs to be in the classpath. |
| -async | Run DistCp asynchronously. Quits as soon as the Hadoop Job is launched. | The Hadoop Job-id is logged, for tracking. |
| -diff <oldSnapshot> <newSnapshot> | Use snapshot diff report between given two snapshots to identify the difference between source and target, and apply the diff to the target to make it in sync with source. | This option is valid only with -update option and the following conditions should be satisfied. (1) Both the source and the target FileSystem must be DistributedFileSystem. (2) Two snapshots <oldSnapshot> and <newSnapshot> have been created on the source FS, and <oldSnapshot> is older than <newSnapshot>. (3) The target has the same snapshot <oldSnapshot>. No changes have been made on the target since <oldSnapshot> was created, thus <oldSnapshot> has the same content as the current state of the target. All the files/directories in the target are the same with source's <oldSnapshot>. |
| -rdiff <newSnapshot> <oldSnapshot> | Use snapshot diff report between given two snapshots to identify what has been changed on the target since the snapshot <oldSnapshot> was created on the target, and apply the diff reversely to the target, and copy modified files from the source's <oldSnapshot>, to make the target the same as <oldSnapshot>. | This option is valid only with -update option and the following conditions should be satisfied. (1) Both the source and the target FileSystem must be DistributedFileSystem. The source and the target can be two different clusters/paths, or they can be exactly the same cluster/path. (In the latter case, modified files are copied from target's <oldSnapshot> to target's current state.) (2) Two snapshots <newSnapshot> and <oldSnapshot> have been created on the target FS, and <oldSnapshot> is older than <newSnapshot>. No change has been made on target since <newSnapshot> was created on the target. (3) The source has the same snapshot <oldSnapshot>, which has the same content as the <oldSnapshot> on the target. All the files/directories in the target's <oldSnapshot> are the same with source's <oldSnapshot>. |
| -numListstatusThreads | Number of threads to use for building file listing | At most 40 threads. |
| -skipcrccheck | Whether to skip CRC checks between source and target paths. | |
| -blocksperchunk <blocksperchunk> | Number of blocks per chunk. When specified, split files into chunks to copy in parallel | If set to a positive value, files with more blocks than this value will be split into chunks of <blocksperchunk> blocks to be transferred in parallel, and reassembled on the destination. By default, <blocksperchunk> is 0 and the files will be transmitted in their entirety without splitting. This switch is only applicable when the source file system implements getBlockLocations method and the target file system implements concat method. |
| -copybuffersize <copybuffersize> | Size of the copy buffer to use. By default, <copybuffersize> is set to 8192B | |
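The splitting rule quoted for -blocksperchunk can be worked through with a little arithmetic. The sketch below is not DistCp code; the class and method names are made up for illustration, and it assumes that the last chunk simply holds whatever blocks remain (ceiling division), which the option description does not spell out.

```java
// Illustrative arithmetic only; ChunkCountSketch and chunkCount are hypothetical names.
final class ChunkCountSketch {

    /**
     * Number of pieces a file would be copied as under the quoted rule.
     * fileBlocks is the number of blocks in the file, blocksPerChunk the option value.
     */
    static long chunkCount(long fileBlocks, long blocksPerChunk) {
        // 0 (the default) disables splitting; files with no more blocks than the
        // option value are also transferred whole.
        if (blocksPerChunk <= 0 || fileBlocks <= blocksPerChunk) {
            return 1;
        }
        // Assumption: ceiling division, so the last chunk may hold fewer blocks.
        return (fileBlocks + blocksPerChunk - 1) / blocksPerChunk;
    }

    public static void main(String[] args) {
        System.out.println(chunkCount(10, 4)); // 3 chunks: 4 + 4 + 2 blocks
        System.out.println(chunkCount(3, 4));  // 1: the file is not larger than one chunk
        System.out.println(chunkCount(10, 0)); // 1: splitting is disabled by default
    }
}
```

A small option value on a file with many blocks yields many chunks, so the value is worth choosing together with -m.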

 

 

  • -update & -overwrite

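-update: copies a file from the source cluster only when it does not exist on the target cluster or differs from the copy there

-overwrite: overwrites files on the target cluster

Caution: take particular care with -update and -overwrite, because both handle source paths in a way that differs subtly from the default. Consider the following case:

Source paths: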

hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20

Without -update or -overwrite, DistCp creates the first/ and second/ directories under the target directory:

$ hadoop distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

Target paths:

hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20

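With -update, DistCp copies the contents of the source directories into the target, not the source directories themselves:

$ hadoop distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

Target paths:

hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20

Going further: if first/ and second/ contain files with the same name (say both contain a file named 0), both source files map to the same file at the target. DistCp does not allow this kind of conflict and aborts the copy.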
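The path mapping and the resulting conflict can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not DistCp's implementation: the class TargetPathSketch, its plan method, and the choice of IllegalStateException are all made up for illustration, and only files directly inside each source directory are considered.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; it mimics the mapping described in the text, not DistCp's code.
final class TargetPathSketch {

    /** Returns the last path component, e.g. "first" for ".../source/first". */
    static String baseName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /**
     * Maps every source file to its target path and returns target -> source.
     * sources maps a source directory to the file names directly inside it.
     * update = true models -update (directory contents are copied),
     * update = false models the default (the directory itself is copied).
     * Throws IllegalStateException when two source files map to the same target.
     */
    static Map<String, String> plan(Map<String, List<String>> sources, String target, boolean update) {
        Map<String, String> targetToSource = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> dir : sources.entrySet()) {
            String prefix = update ? target : target + "/" + baseName(dir.getKey());
            for (String file : dir.getValue()) {
                String src = dir.getKey() + "/" + file;
                String dst = prefix + "/" + file;
                String previous = targetToSource.put(dst, src);
                if (previous != null) {
                    throw new IllegalStateException(
                        "Duplicate target " + dst + " for " + previous + " and " + src);
                }
            }
        }
        return targetToSource;
    }

    public static void main(String[] args) {
        Map<String, List<String>> sources = new LinkedHashMap<>();
        sources.put("hdfs://nn1:8020/source/first", Arrays.asList("1", "2"));
        sources.put("hdfs://nn1:8020/source/second", Arrays.asList("10", "20"));
        String target = "hdfs://nn2:8020/target";

        // Default: target/first/1, target/first/2, target/second/10, target/second/20
        System.out.println(plan(sources, target, false).keySet());
        // -update: target/1, target/2, target/10, target/20
        System.out.println(plan(sources, target, true).keySet());
    }
}
```

In this model, adding a file named 0 to both directories leaves the default mapping valid (target/first/0 and target/second/0) but makes the -update mapping throw, which mirrors the conflict described above.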

  • -filters

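By default DistCp copies every file under the source path. When you do not want to copy everything, use the -filters option to exclude specific paths and files from the copy.

The option is described as follows: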

-filters

The path to a file containing a list of pattern strings, one string per line, such that paths matching the pattern will be excluded from the copy.

Support regular expressions specified by java.util.regex.Pattern.

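In other words, write the path names you want to skip into a file and pass that file with -filters. The path names may be regular expressions.

The source code that -filters uses to exclude files is shown below. Matcher.matches returns true only when the regular expression matches the entire file path; in every other case it returns false, meaning no match.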

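@Override
public boolean shouldCopy(Path path) {
  // A path is excluded as soon as any pattern matches the whole path string.
  for (Pattern filter : filters) {
    if (filter.matcher(path.toString()).matches()) {
      return false;
    }
  }
  // No pattern matched, so the path is copied.
  return true;
}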
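The full-match behaviour is easy to check with plain java.util.regex, without any Hadoop classes. The following is a minimal stand-alone sketch: the class FilterMatchSketch and the sample path are made up for illustration, and the method takes a String instead of a Hadoop Path.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Stand-alone illustration of full-match filtering; no Hadoop classes are used.
final class FilterMatchSketch {

    /** Returns false when any pattern matches the entire path string. */
    static boolean shouldCopy(String path, List<Pattern> filters) {
        for (Pattern filter : filters) {
            if (filter.matcher(path).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String path = "hdfs://nn1:8020/data/source/part-0"; // hypothetical path
        List<Pattern> bare = Arrays.asList(Pattern.compile("source"));
        List<Pattern> wrapped = Arrays.asList(Pattern.compile(".*source.*"));

        // true: "source" alone does not match the whole path, so the file is copied
        System.out.println(shouldCopy(path, bare));
        // false: ".*source.*" matches the whole path, so the file is skipped
        System.out.println(shouldCopy(path, wrapped));
    }
}
```

This is why a pattern meant to match a substring has to be wrapped in .* on both sides.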

For example, to skip every path and file that contains the string "source", use this regular expression:

.*source.*

This matches every path under hdfs://source/path that contains the string source, and DistCp skips those paths:

$ hadoop distcp -filters /path/to/filterfile.txt hdfs://source/path hdfs://destination/path

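The filter file may of course also list absolute paths.

Note: filterfile.txt must reside on the local filesystem, not on the cluster.

Summary:

Example: to skip paths on the source cluster that contain test:

.*test.*                         # regular expression
/source/test                       # on the source cluster
hdfs://0.0.0.0:8020/source/test    # on the target cluster; the full path is required

References: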

[1]  https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

[2]  https://www.ericlin.me/2016/01/how-to-use-filters-to-exclude-files-when-in-distcp/

[3]  https://sapser.github.io/bigdata/2016/09/30/distcp-filters-usage
