distcp: a tool for migrating data between clusters

Latest recommended article published on 2023-12-07 17:28:04

KLordy

Categories: Hadoop, Shell

Article link: https://blog.csdn.net/klordy_123/article/details/81844394

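Our company recently switched clusters, and the data on the old cluster had to be migrated to the new one. That is how I came across distcp, which handled the migration well.
Basic usage: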

    hadoop distcp hdfs://cluster1:9000/stat hdfs://cluster2:9000/

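This copies stat from cluster1 to cluster2. Note that both the source path and the destination path must be absolute.
If there are many source paths, you can list them in a text file and pass that file with -f: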

    hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination

Here srclist holds the absolute paths of the source directories, one per line.
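
As a minimal sketch of the -f workflow, the helper below prints one fully qualified source URI per line, which is the format the list file needs. The function name `make_srclist` and the day directories are hypothetical; the upload and distcp steps are left as comments because they need a live cluster.

```shell
# make_srclist NAMENODE_URI PATH...
# Prints one fully qualified source URI per line (hypothetical helper).
make_srclist() {
    nn=$1
    shift
    for p in "$@"; do
        printf '%s%s\n' "$nn" "$p"
    done
}

# Example: three day directories under /stat on the source cluster.
make_srclist hdfs://nn1:8020 /stat/20180710 /stat/20180711 /stat/20180712

# With a live cluster, save the output to a file, upload it and pass it to -f:
#   make_srclist hdfs://nn1:8020 /stat/20180710 /stat/20180711 > srclist
#   hadoop fs -put srclist hdfs://nn1:8020/srclist
#   hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination
```

Generating the list from a helper keeps the NameNode address in one place, so a typo cannot leave a single entry unqualified.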
These are only the simplest forms. distcp also accepts a number of options, for example:

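Flag               Description                        Notes
-p[rbugp]          Preserve
                     r: replication number
                     b: block size
                     u: user
                     g: group
                     p: permission
-i                 Ignore failures                    Keeps more accurate statistics about the copy than the default behavior.
-log <logdir>      Write logs to <logdir>
-m <num_maps>      Number of maps used for the copy   More maps does not necessarily mean higher throughput.
-overwrite         Overwrite the destination          Existing content at the destination path is overwritten.
-update            Overwrite only if the source and destination sizes differ
-f <urilist_uri>   Use <urilist_uri> as the source list
                                                      Equivalent to listing every source on the command line. The entries in urilist_uri should be fully qualified URIs.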
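
To show how these options combine, here is a hedged sketch that only prints a distcp command, so it can be checked without a cluster. The function name `build_distcp_cmd`, the default of 20 maps and the example paths are illustrative assumptions, not values from the migration described here. With -update, distcp copies the contents of the source directory into the target, which is why the day directory appears on both sides.

```shell
# build_distcp_cmd SRC DST [MAPS]
# Prints (does not run) a distcp command that combines:
#   -update : overwrite only files that differ at the destination
#   -pugp   : preserve user, group and permission
#   -i      : ignore individual failures
#   -m      : number of map tasks (default 20, an arbitrary choice)
build_distcp_cmd() {
    src=$1
    dst=$2
    maps=${3:-20}
    echo "hadoop distcp -update -pugp -i -m $maps $src $dst"
}

build_distcp_cmd hdfs://cluster1:9000/stat/20180712 hdfs://cluster2:9000/stat/20180712
```

Printing the command first makes it easy to review the flags before running the line on a host that has the Hadoop client installed.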

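The official documentation notes that if a client is writing to a source file while distcp is copying it, the job is likely to fail. In our cluster, clients may well be writing to HDFS during the copy, so the copy has to be split at the smallest sensible directory granularity to avoid this. Our production data is partitioned by day, and normally only the current day's data is modified. So under each top-level directory I split the work into one small task per day directory and run distcp on each day separately, which avoids the failures.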
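
The splitting idea can be tried as a dry run before touching a cluster. In the sketch below, the function name `list_day_tasks` and the hard-coded day names are hypothetical; it prints one distcp command per day directory and skips the current day, on the assumption stated above that only today's data is still being written.

```shell
# list_day_tasks SRC_ROOT DST_ROOT DIR TODAY
# Reads day-directory names (YYYYMMDD, one per line) on stdin and prints one
# distcp command per day, skipping TODAY because it may still be written to.
# Assumes the destination directory DIR already exists on the new cluster.
list_day_tasks() {
    src_root=$1
    dst_root=$2
    dir=$3
    today=$4
    while IFS= read -r day; do
        if [ -z "$day" ] || [ "$day" = "$today" ]; then
            continue
        fi
        echo "hadoop distcp $src_root$dir/$day $dst_root$dir"
    done
}

# Dry run with made-up day names; on a real cluster the names would come
# from a directory listing of the source path.
printf '%s\n' 20180710 20180711 20180712 |
    list_day_tasks hdfs://cluster1:9000 hdfs://cluster2:9000 /stat 20180712
```

The skipped day can be copied later, once no client is writing to it.
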
I did this with a script, shown below:

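#!/bin/bash
# distcp is likely to fail if a client writes under the source path during the copy,
# so the data is migrated one second-level (per-day) directory at a time.
# $1 is a top-level directory under "/".  Source: hdfs://cluster1:9000/$1/subxxx  Destination: hdfs://cluster2:9000/$1
# $2 is a grep pattern that selects which subdirectories to migrate.
# Run this script on the old NameNode (host 225).
orig=$1        # e.g. orig=/stat
filter=$2
sourcePath=    # e.g. sourcePath=hdfs://cluster1:9000/stat/20180712
destPath=      # e.g. destPath=hdfs://cluster2:9000/stat
destRootPath=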

# Find the active NameNode, in case an active/standby failover happens mid-copy.
getSourcePath(){
    statenn1=`hdfs haadmin -getServiceState nn1`
    statenn2=`hdfs haadmin -getServiceState nn2`
    if [ "$statenn1" = "active" ];then
        sourcePath="hdfs://hdfsname:9000$1"
    elif [ "$statenn2" = "active" ];then
        sourcePath="hdfs://cluster1:9000$1"
    else
        echo "nn1 state is $statenn1 and nn2 state is $statenn2, no active namenode, please check old hadoop cluster!"
        exit 1
    fi
}

getDestPath(){
    hdfsBin=/home/mmtrix/Application/Gosun/enterprise/hadoop/hadoop-2.6.0-cdh5.4.1/bin/hdfs
    newStatenn1=`ssh xxx.xxx.xxx.xxx $hdfsBin haadmin -getServiceState nn1`
    newStatenn2=`ssh xxx.xxx.xxx.xxx $hdfsBin haadmin -getServiceState nn2`
    if [ "$newStatenn1" = "active" ];then
        destPath="hdfs://cluster2:9000$1"
        destRootPath="hdfs://cluster2:9000"
    elif [ "$newStatenn2" = "active" ];then
        destPath="hdfs://xxx:9000$1"
        destRootPath="hdfs://xxx:9000"
    else
        echo "nn1 state is $newStatenn1 and nn2 state is $newStatenn2, no active namenode, please check new hadoop cluster!"
        exit 1
    fi
}

oldifs="$IFS"
IFS=$'\n'
for line in `hadoop fs -du "$orig"|grep "$filter"`
do
    # The path is the third column of this cluster's -du output.
    subPath=`echo "$line"|awk '{print $3}'`   # e.g. subPath=/stat/20180712
    fileName=`echo "$subPath"|awk -F "/" '{print $NF}'`   # e.g. fileName=20180712
    getSourcePath "$subPath"
    getDestPath "$orig"
    echo "from $sourcePath to $destPath"

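    # -update compares the size of each source file with its counterpart at the
    # destination and overwrites only when they differ, so a rerun can skip what is
    # already copied. It is not enabled here because it is still untested. With
    # -update the contents of the source directory are copied into the destination,
    # so destPath would also need the day directory appended.
    hadoop distcp -m 300 "$sourcePath" "$destPath"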

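    # Verify the migrated directory by comparing its total size on both clusters.
    # -du -s prints one summary line; the first column is the size in bytes.
    localSize=`hadoop fs -du -s "$subPath"|awk '{print $1}'`
    remoteSize=`hadoop fs -du -s "$destRootPath$subPath"|awk '{print $1}'`
    if [ -n "$localSize" ] && [ "$localSize" = "$remoteSize" ];then
        echo "$subPath validate OK!"
    else
        echo "$subPath copy error!"
    fi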
done
IFS="$oldifs"
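
The size check at the end of the script depends on the column layout of the `hadoop fs -du` output, so it is worth testing the parsing separately. The following is a minimal sketch on canned output lines; the helper names `du_size` and `sizes_match` and the sample sizes are made up for illustration.

```shell
# du_size: reads `hadoop fs -du -s` style output on stdin and prints the
# first column of the first line, which is the size in bytes.
du_size() {
    awk 'NR == 1 { print $1 }'
}

# sizes_match SRC_SIZE DST_SIZE
# Succeeds only when both sizes are non-empty and numerically equal.
sizes_match() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" -eq "$2" ]
}

# Dry run on canned lines (sizes and paths are made up).
src=$(echo "1048576  3145728  /stat/20180712" | du_size)
dst=$(echo "1048576  3145728  hdfs://cluster2:9000/stat/20180712" | du_size)
if sizes_match "$src" "$dst"; then
    echo "/stat/20180712 validate OK!"
else
    echo "/stat/20180712 copy error!"
fi

# On a live cluster the input would come from, for example:
#   hadoop fs -du -s /stat/20180712 | du_size
```

Treating an empty size as a mismatch means a failed `hadoop fs -du` call is reported as a copy error instead of passing silently.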