黑猴子的家: Copying Data Between Hadoop Clusters

Data can be copied between clusters with scp, rsync, distcp, and similar tools. Here I will only cover distcp; scp and rsync were already covered in the Linux chapter, so I won't repeat them.

1. URL

http://hadoop.apache.org/docs/r2.8.2/hadoop-distcp/DistCp.html

2. Overview

DistCp version 2 (distributed copy) is a tool for large inter-cluster and intra-cluster copying. It uses MapReduce for its distribution, error handling and recovery, and reporting. It expands a list of files and directories into the input of map tasks, each of which copies one partition of the files named in the source list.

The previous DistCp implementation had its share of quirks and drawbacks, both in its usage and in its extensibility and performance. The goal of the DistCp refactor was to fix these shortcomings and make it usable and extensible programmatically. New paradigms were introduced to improve runtime and setup performance, while the legacy behaviour is preserved by default.

That document describes the design of the new DistCp, its new features, its optimal usage, and its deviations from the legacy implementation.

3. Help Command

[victor@node1 hadoop]$ hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                       Reuse existing data in target files and
                               append new data to them if possible
 -async                        Should distcp execution be blocking
 -atomic                       Commit all changes or none
 -bandwidth <arg>              Specify bandwidth per map in MB
 -delete                       Delete from target, files missing in source
 -diff <arg>                   Use snapshot diff report to identify the
                               difference between source and target
 -f <arg>                      List of files that need to be copied
 -filelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n
 -filters <arg>                The path to a file containing a list of
                               strings for paths to be excluded from the
                               copy.
 -i                            Ignore failures during copy
 -log <arg>                    Folder on DFS where distcp execution logs
                               are saved
 -m <arg>                      Max number of concurrent maps to use for
                               copy
 -mapredSslConf <arg>          Configuration for ssl config file, to use
                               with hftps://. Must be in the classpath.
 -numListstatusThreads <arg>   Number of threads to use for building file
                               listing (max 40).
 -overwrite                    Choose to overwrite target files
                               unconditionally, even if they exist.
 -p <arg>                      preserve status (rbugpcaxt)(replication,
                               block-size, user, group, permission,
                               checksum-type, ACL, XATTR, timestamps). If
                               -p is specified with no <arg>, then
                               preserves replication, block size, user,
                               group, permission, checksum type and
                               timestamps. raw.* xattrs are preserved when
                               both the source and destination paths are
                               in the /.reserved/raw hierarchy (HDFS
                               only). raw.* xattr preservation is
                               independent of the -p flag. Refer to the
                               DistCp documentation for more details.
 -sizelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n bytes
 -skipcrccheck                 Whether to skip CRC checks between source
                               and target paths.
 -strategy <arg>               Copy strategy to use. Default is dividing
                               work based on file sizes
 -tmp <arg>                    Intermediate work path to be used for
                               atomic commit
 -update                       Update target, copying only missing files
                               or directories
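The options above are easiest to absorb in combination. Below is a hypothetical sketch (nn1/nn6 are the same example NameNodes used throughout this post) that assembles a throttled copy: at most 20 concurrent maps, each limited to 10 MB/s, with execution logs saved on DFS. The command is only echoed, since running it requires a live cluster:

```shell
# Hypothetical example hosts; substitute your own NameNode addresses.
SRC="hdfs://nn1:9000/foo/bar"
DST="hdfs://nn6:9000/bar/foo"
# -m caps concurrent map tasks; -bandwidth caps MB/s per map;
# -log stores the distcp execution logs under a DFS folder.
CMD="hadoop distcp -m 20 -bandwidth 10 -log /tmp/distcp-logs"
# Echo the assembled command; on a real cluster you would run it directly.
echo "$CMD $SRC $DST"
```

Throttling with -m and -bandwidth is the usual way to keep a large copy from saturating the network between the two clusters.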

4. Usage

[victor@node1 hadoop]$ bin/hadoop distcp hdfs://nn1:9000/foo/bar hdfs://nn6:9000/bar/foo

This expands the namespace under /foo/bar on the nn1 cluster, stores the list on the nn6 cluster, and partitions the copy work among multiple map tasks; each map task then copies its share of the files from nn1 to nn6. Note that DistCp operates on absolute paths.
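When the target already holds an earlier copy, the -update and -overwrite flags from the help output above change what gets copied. Here is a hedged sketch contrasting the two; the commands are echoed rather than executed, since they need live nn1/nn6 clusters:

```shell
# Same hypothetical clusters as in the example above.
SRC="hdfs://nn1:9000/foo/bar"
DST="hdfs://nn6:9000/bar/foo"
# Incremental sync: copy only files missing or changed at the target.
echo "hadoop distcp -update $SRC $DST"
# Full rewrite: copy every file unconditionally, even if it exists at nn6.
echo "hadoop distcp -overwrite $SRC $DST"
```

For repeated synchronization jobs, -update is normally what you want, since unchanged files are skipped entirely.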

5. Specifying Multiple Source Directories on the Command Line

[victor@node1 hadoop]$ bin/hadoop distcp \
hdfs://nn1:9000/foo/a \
hdfs://nn1:9000/foo/b \
hdfs://nn6:9000/bar/foo

6. Reading Multiple Sources from a File with -f

[victor@node1 hadoop]$ hadoop distcp -f \
hdfs://nn1:9000/srclist \
hdfs://nn6:9000/bar/foo 

// Tip: the contents of srclist are hdfs://nn1:9000/foo/a and hdfs://nn1:9000/foo/b
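As a sketch, srclist is just a plain-text file listing one fully qualified source URI per line. The snippet below builds such a file locally; uploading it to HDFS and invoking distcp are shown as comments, since they require a live cluster:

```shell
# Build the source list: one fully qualified URI per line.
printf '%s\n' \
  "hdfs://nn1:9000/foo/a" \
  "hdfs://nn1:9000/foo/b" > srclist
# On a real cluster, upload it and point -f at it:
#   hdfs dfs -put srclist hdfs://nn1:9000/srclist
#   hadoop distcp -f hdfs://nn1:9000/srclist hdfs://nn6:9000/bar/foo
cat srclist
```

The -f option is handy when the set of sources is produced by another job or script, since the list can be regenerated without editing the distcp command itself.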