DistCp Migration in Practice: Migrating Data Across Hadoop Clusters (Edited Version)

1. What is DistCp

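DistCp (distributed copy) is a tool for large-scale copying within and between clusters. It uses Map/Reduce to implement file distribution, error handling and recovery, and report generation. It takes a list of files and directories as the input to its map tasks, and each task copies a portion of the files in the source list. Because it is built on Map/Reduce, the tool has some particularities in both its semantics and its execution.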

1.1 Notes on using DistCp

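1. DistCp tries to divide the content to be copied evenly, so that each map copies roughly the same number of bytes. However, because a file is the smallest unit of copying, configuring more simultaneous copiers (i.e. maps) does not necessarily increase the number of copies that actually run at the same time, nor the total throughput.

2. If the -m option is not used, DistCp tries to set the number of maps at scheduling time to min(total_bytes / bytes.per.map, 20 * num_task_trackers), where bytes.per.map defaults to 256MB.

3. For long-running or regularly scheduled jobs, it is advisable to tune the number of maps according to the sizes of the source and destination clusters, the amount of data to copy, and the available bandwidth.

4. For copying between different versions of Hadoop, use HftpFileSystem. This is a read-only file system, so DistCp must run on the destination cluster (more precisely, on TaskTrackers that can write to the destination cluster). The source is given in the form hftp:/// (by default dfs.http.address is :50070).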
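To make the formula in note 2 concrete, here is a minimal shell sketch of the arithmetic. The helper name default_map_count and the sample figures (100 GiB, 3 TaskTrackers) are assumptions for illustration only. The integer division is also an assumption, and the real DistCp may round or clamp the value differently.

```shell
# Hypothetical helper (not part of Hadoop): estimates the default number
# of maps as min(total_bytes / bytes.per.map, 20 * num_task_trackers).
# The function body runs in a subshell, so its variables stay local.
default_map_count() (
  total_bytes=$1
  num_task_trackers=$2
  bytes_per_map=${3:-268435456}   # 256MB default (256 * 1024 * 1024)

  by_size=$(( total_bytes / bytes_per_map ))   # integer division (assumption)
  by_nodes=$(( 20 * num_task_trackers ))

  # Print the smaller of the two candidates.
  if [ "$by_size" -lt "$by_nodes" ]; then
    echo "$by_size"
  else
    echo "$by_nodes"
  fi
)

# 100 GiB on 3 TaskTrackers: 400 maps by size, capped at 60 by node count.
default_map_count $((100 * 1024 * 1024 * 1024)) 3   # prints 60
```

With a small data set the size term wins instead: 1 GiB on the same 3 nodes gives 4 maps, which is why adding nodes alone does not speed up a small copy.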

2. Using the Hadoop DistCp API

[root@node105 ~]# hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                       Reuse existing data in target files and
                               append new data to them if possible
 -async                        Should distcp execution be blocking
 -atomic                       Commit all changes or none
 -bandwidth <arg>              Specify bandwidth per map in MB
 -blocksperchunk <arg>         If set to a positive value, files with more
                               blocks than this value will be split into
                               chunks of <blocksperchunk> blocks to be
                               transferred in parallel, and reassembled on
                               the destination. By default,
                               <blocksperchunk> is 0 and the files will be
                               transmitted in their entirety without
                               splitting. This switch is only applicable
                               when the source file system implements
                               getBlockLocations method and the target
                               file system implements concat method
 -copybuffersize <arg>         Size of the copy buffer to use. By default
                               <copybuffersize> is 8192B.
 -delete                       Delete from target, files missing in source
 -diff <arg>                   Use snapshot diff report to identify the
                               difference between source and target
 -f <arg>                      List of files that need to be copied
 -filelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n
 -filters <arg>                The path to a file containing a list of
                               strings for paths to be excluded from the
                               copy.
 -i                            Ignore failures during copy
 -log <arg>                    Folder on DFS where distcp execution logs
                               are saved
 -m <arg>                      Max number of concurrent maps to use for
                               copy
 -mapredSslConf <arg>          Configuration for ssl config file, to use
                               with hftps://. Must be in the classpath.
 -numListstatusThreads <arg>   Number of threads to use for building file
                               listing (max 40).
 -overwrite                    Choose to overwrite target files
                               unconditionally, even if they exist.
 -p <arg>                      preserve status (rbugpcaxt)(replication,
                               block-size, user, group, permission,
                               checksum-type, ACL, XATTR, timestamps). If
                               -p is specified with no <arg>, then
                               preserves replication, block size, user,
                               group, permission, checksum type and
                               timestamps. raw.* xattrs are preserved when
                               both the source and destination paths are
                               in the /.reserved/raw hierarchy (HDFS
                               only). raw.* xattr preservation is
                               independent of the -p flag. Refer to the
                               DistCp documentation for more details.
 -rdiff <arg>                  Use target snapshot diff report to identify
                               changes made on target
 -sizelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n bytes
 -skipcrccheck                 Whether to skip CRC checks between source
                               and target paths.
 -strategy <arg>               Copy strategy to use. Default is dividing
                               work based on file sizes
 -tmp <arg>                    Intermediate work path to be used for
                               atomic commit
 -update                       Update target, copying only missing files or
                               directories
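As an illustration of how a few of the options listed above combine, the sketch below assembles an incremental copy command with a capped map count and per-map bandwidth. build_distcp_cmd is a hypothetical helper that only prints the command (a dry run) rather than executing it, and the values 20 maps and 10 MB per map are placeholder assumptions, not recommendations.

```shell
# Hypothetical dry-run helper: prints a distcp command line without
# running it, so the flags can be reviewed before touching a cluster.
build_distcp_cmd() (
  src=$1            # source URI
  dst=$2            # target URI
  maps=$3           # value for -m (max concurrent maps)
  bandwidth_mb=$4   # value for -bandwidth (MB per map)

  echo "hadoop distcp -update -m $maps -bandwidth $bandwidth_mb $src $dst"
)

# Sample values only; the hosts and paths are the ones used in this article.
build_distcp_cmd \
  hdfs://calculation101:8020/test/2018/10/23 \
  hdfs://node105:8020/yangjianqiu/data \
  20 10
```

One thing to keep in mind: according to the DistCp documentation, -update copies the contents of the source directory into the target rather than the source directory itself, so the resulting layout can differ from a plain copy into an existing directory.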

3. Test case

1. Check the files that are going to be migrated:

[root@calculation101 ~]# hdfs dfs -du -h /test/2018/10/

2. Create a test directory on the new cluster:

[hdfs@node105 root]$ hdfs dfs -mkdir -p /yangjianqiu/data/
[hdfs@node105 root]$ hdfs dfs -chown -R root:root /yangjianqiu/data/
[hdfs@node105 root]$ exit
exit
[root@node105 ~]# hdfs dfs -ls /yangjianqiu
Found 1 items
drwxr-xr-x   - root root          0 2018-10-29 03:29 /yangjianqiu/data

3. Start migrating the data, writing the output to a log and recording how long the migration takes:

[root@node105 ~]# mkdir /yangjianqiu
[root@node105 ~]# nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
[1] 11125
[root@node105 ~]# jobs
[1]+  Running                 nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
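Once the background job has finished, one possible sanity check (not part of the original walkthrough) is to compare the file count and total size on both sides. The sketch below assumes each input is a single line of `hdfs dfs -count` output, whose columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME. same_files_and_bytes is a hypothetical helper, and the sample lines use made-up numbers and an assumed target path.

```shell
# Hypothetical helper: succeeds when two `hdfs dfs -count` lines report
# the same FILE_COUNT (column 2) and CONTENT_SIZE (column 3).
# DIR_COUNT and PATHNAME are ignored on purpose.
same_files_and_bytes() (
  src_key=$(echo "$1" | awk '{print $2 ":" $3}')
  dst_key=$(echo "$2" | awk '{print $2 ":" $3}')
  [ "$src_key" = "$dst_key" ]
)

# Made-up sample lines standing in for real `hdfs dfs -count` output.
src='           3          120          987654321 /test/2018/10/23'
dst='           3          120          987654321 /yangjianqiu/data/23'

if same_files_and_bytes "$src" "$dst"; then
  echo "match"
else
  echo "mismatch"
fi
```

On a real cluster the two lines would come from running `hdfs dfs -count` against the source and target paths. Matching counts and sizes are only a coarse check and do not prove that file contents are identical.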

4. Calling the DistCp interface from an application

Summary
