distcp is typically used to transfer data between two HDFS clusters. When both clusters run the same Hadoop version, the hdfs:// scheme is the right choice:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
The hadoop distcp command copies data between two different clusters. Its strength is that the copy runs as a MapReduce job, so the work is parallelized across the cluster and the transfer is much faster than a single-machine copy. Keep the following in mind when using distcp:
1) Every node in the source cluster must be able to resolve the IP-to-hostname mapping of every node in the target cluster
2) The target path must already exist
3) Use hostnames in the command, not IP addresses
4) Multiple source paths may be specified, and all of them are copied to the target path; source paths must be absolute
5) By default, distcp skips files that already exist at the target; -overwrite forces them to be overwritten, and -update copies only files that have changed
6) When copying a single table, run show create table tab_name in Hive to find the table's file location
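As a sketch of points 4) and 5) above, an invocation with two absolute source paths and incremental update might look like this (hostnames and paths are hypothetical; the command line is only assembled and echoed here, not run against a cluster):

```shell
# Hypothetical hostnames/paths. Both sources land under $DST;
# -update copies only files whose size differs at the destination.
SRC1="hdfs://namenode1/user/hive/warehouse/db1.db/t1"
SRC2="hdfs://namenode1/user/hive/warehouse/db1.db/t2"
DST="hdfs://namenode2/user/hive/warehouse/db1.db"
CMD="hadoop distcp -update $SRC1 $SRC2 $DST"
echo "$CMD"
```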
OPTIONS:
-p[rbugp]         Preserve status
                  r: replication number
                  b: block size
                  u: user
                  g: group
                  p: permission
                  -p alone is equivalent to -prbugp
-i                Ignore failures
-log <logdir>     Write logs to <logdir>
-m <num_maps>     Maximum number of simultaneous copies (number of concurrent map tasks)
-overwrite        Overwrite destination
-update           Overwrite if src size different from dst size (copy only changed files)
-f <urilist_uri>  Use list at <urilist_uri> as src list
-filelimit <n>    Limit the total number of files to be <= n
-sizelimit <n>    Limit the total size to be <= n bytes
-delete           Delete the files existing in the dst but not in src
-skipcrccheck     Skip CRC checks between source and destination
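A sketch combining several of the options above (cluster addresses are hypothetical; the command is only assembled and printed here):

```shell
# -prbugp preserves replication/block size/user/group/permission,
# -m 20 caps concurrency at 20 map tasks, and -skipcrccheck skips the
# CRC comparison (useful when the two clusters use different checksum types).
SRC="hdfs://src-nn:9000/user/hive/warehouse/mydb.db/mytable"
DST="hdfs://dst-nn:8020/user/hive/warehouse/mydb.db/mytable"
CMD="hadoop distcp -prbugp -m 20 -update -skipcrccheck $SRC $DST"
echo "$CMD"
```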
Command and session log:
[hadoop@emr-worker-9 ~]$ hadoop distcp 'hdfs://iZbp1i7owezn4tfeft11m3Z:9000/user/hive/warehouse/bi_access_mini_log/' 'hdfs://emr-cluster/user/hive/warehouse/bi_access_mini_log/'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/apps/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/apps/hbase-1.1.1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/12/15 11:36:46 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs://iZbp1i7owezn4tfeft11m3Z:9000/user/hive/warehouse/bi_access_mini_log], targetPath=hdfs://emr-cluster/user/hive/warehouse/bi_access_mini_log, targetPathExists=false, preserveRawXattrs=false}
16/12/15 11:36:48 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
16/12/15 11:36:48 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
16/12/15 11:36:48 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
16/12/15 11:36:48 INFO mapreduce.JobSubmitter: number of splits:21
16/12/15 11:36:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1469669894019_1217827
16/12/15 11:36:49 INFO impl.YarnClientImpl: Submitted application application_1469669894019_1217827
16/12/15 11:36:49 INFO mapreduce.Job: The url to track the job: http://iZ23tdmgoi9Z:20888/proxy/application_1469669894019_1217827/
16/12/15 11:36:49 INFO tools.DistCp: DistCp job-id: job_1469669894019_1217827
16/12/15 11:36:49 INFO mapreduce.Job: Running job: job_1469669894019_1217827
16/12/15 11:36:55 INFO mapreduce.Job: Job job_1469669894019_1217827 running in uber mode : false
16/12/15 11:36:55 INFO mapreduce.Job: map 0% reduce 0%
16/12/15 11:37:05 INFO mapreduce.Job: map 2% reduce 0%
16/12/15 11:37:06 INFO mapreduce.Job: map 5% reduce 0%
16/12/15 11:37:07 INFO mapreduce.Job: map 8% reduce 0%
16/12/15 11:37:08 INFO mapreduce.Job: map 11% reduce 0%
16/12/15 11:37:09 INFO mapreduce.Job: map 12% reduce 0%
16/12/15 11:37:10 INFO mapreduce.Job: map 14% reduce 0%
16/12/15 11:37:11 INFO mapreduce.Job: map 17% reduce 0%
16/12/15 11:37:12 INFO mapreduce.Job: map 18% reduce 0%
16/12/15 11:37:13 INFO mapreduce.Job: map 20% reduce 0%
16/12/15 11:37:14 INFO mapreduce.Job: map 24% reduce 0%
16/12/15 11:37:15 INFO mapreduce.Job: map 25% reduce 0%
16/12/15 11:37:16 INFO mapreduce.Job: map 26% reduce 0%
16/12/15 11:37:17 INFO mapreduce.Job: map 28% reduce 0%
16/12/15 11:37:18 INFO mapreduce.Job: map 30% reduce 0%
16/12/15 11:37:19 INFO mapreduce.Job: map 31% reduce 0%
16/12/15 11:37:20 INFO mapreduce.Job: map 32% reduce 0%
16/12/15 11:37:21 INFO mapreduce.Job: map 37% reduce 0%
16/12/15 11:37:22 INFO mapreduce.Job: map 38% reduce 0%
16/12/15 11:37:23 INFO mapreduce.Job: map 41% reduce 0%
16/12/15 11:37:24 INFO mapreduce.Job: map 42% reduce 0%
16/12/15 11:37:25 INFO mapreduce.Job: map 43% reduce 0%
16/12/15 11:37:26 INFO mapreduce.Job: map 46% reduce 0%
16/12/15 11:37:27 INFO mapreduce.Job: map 48% reduce 0%
16/12/15 11:37:29 INFO mapreduce.Job: map 52% reduce 0%
16/12/15 11:37:30 INFO mapreduce.Job: map 54% reduce 0%
16/12/15 11:37:31 INFO mapreduce.Job: map 55% reduce 0%
16/12/15 11:37:32 INFO mapreduce.Job: map 57% reduce 0%
16/12/15 11:37:33 INFO mapreduce.Job: map 59% reduce 0%
16/12/15 11:37:34 INFO mapreduce.Job: map 60% reduce 0%
16/12/15 11:37:35 INFO mapreduce.Job: map 62% reduce 0%
16/12/15 11:37:36 INFO mapreduce.Job: map 63% reduce 0%
16/12/15 11:37:37 INFO mapreduce.Job: map 64% reduce 0%
16/12/15 11:37:38 INFO mapreduce.Job: map 67% reduce 0%
16/12/15 11:37:39 INFO mapreduce.Job: map 68% reduce 0%
16/12/15 11:37:40 INFO mapreduce.Job: map 69% reduce 0%
16/12/15 11:37:41 INFO mapreduce.Job: map 72% reduce 0%
16/12/15 11:37:43 INFO mapreduce.Job: map 74% reduce 0%
16/12/15 11:37:44 INFO mapreduce.Job: map 76% reduce 0%
16/12/15 11:37:45 INFO mapreduce.Job: map 78% reduce 0%
16/12/15 11:37:46 INFO mapreduce.Job: map 79% reduce 0%
16/12/15 11:37:47 INFO mapreduce.Job: map 80% reduce 0%
16/12/15 11:37:48 INFO mapreduce.Job: map 81% reduce 0%
16/12/15 11:37:49 INFO mapreduce.Job: map 83% reduce 0%
16/12/15 11:37:50 INFO mapreduce.Job: map 85% reduce 0%
16/12/15 11:37:52 INFO mapreduce.Job: map 87% reduce 0%
16/12/15 11:37:53 INFO mapreduce.Job: map 90% reduce 0%
16/12/15 11:37:54 INFO mapreduce.Job: map 91% reduce 0%
16/12/15 11:37:55 INFO mapreduce.Job: map 92% reduce 0%
16/12/15 11:37:57 INFO mapreduce.Job: map 93% reduce 0%
16/12/15 11:37:58 INFO mapreduce.Job: map 94% reduce 0%
16/12/15 11:38:00 INFO mapreduce.Job: map 95% reduce 0%
16/12/15 11:38:01 INFO mapreduce.Job: map 96% reduce 0%
16/12/15 11:38:05 INFO mapreduce.Job: map 97% reduce 0%
16/12/15 11:38:10 INFO mapreduce.Job: map 98% reduce 0%
16/12/15 11:38:23 INFO mapreduce.Job: map 99% reduce 0%
16/12/15 11:38:41 INFO mapreduce.Job: map 100% reduce 0%
16/12/15 11:38:47 INFO mapreduce.Job: Job job_1469669894019_1217827 completed successfully
16/12/15 11:38:48 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=2707436
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=13628130712
HDFS: Number of bytes written=13627077226
HDFS: Number of read operations=41210
HDFS: Number of large read operations=0
HDFS: Number of write operations=8011
Job Counters
Launched map tasks=21
Other local map tasks=21
Total time spent by all maps in occupied slots (ms)=1200847
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=1200847
Total vcore-milliseconds taken by all map tasks=1200847
Total megabyte-milliseconds taken by all map tasks=1229667328
Map-Reduce Framework
Map input records=4173
Map output records=0
Input split bytes=2877
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=7504
CPU time spent (ms)=351120
Physical memory (bytes) snapshot=9003696128
Virtual memory (bytes) snapshot=63894507520
Total committed heap usage (bytes)=10645143552
File Input Format Counters
Bytes Read=1050609
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=13627077226
BYTESEXPECTED=13627077226
COPY=4173
[hadoop@emr-worker-9 ~]$
Afterwards, simply create the tables and add the partitions on the target Hive.
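For a partitioned table, one way to register the copied partitions is MSCK REPAIR TABLE, which asks Hive to scan the table directory on HDFS and add any partitions it finds. This is a sketch, assuming the table from the log above has already been recreated from its captured DDL; the command string is only assembled and printed here:

```shell
# Hypothetical: after distcp the files exist on HDFS, but the metastore
# does not yet know the partitions. MSCK REPAIR TABLE registers them.
HQL="use default; MSCK REPAIR TABLE bi_access_mini_log;"
CMD="hive -e \"$HQL\""
echo "$CMD"
```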
[hadoop@emr-worker-9 distcp]$ ls
datacp_from2hive.sh
[hadoop@emr-worker-9 distcp]$ cat datacp_from2hive.sh
#!/bin/bash
#set -x
DB=$1
# Dump the DDL of every eligible table (double quotes so ${DB} expands;
# the original single quotes would pass the literal string ${DB} to Hive)
ret=$(hive -e "use ${DB};show tables;"|grep -v _es|grep -v _hb|grep -v importinfo)
for tem in $ret;
do
hive -e "use ${DB};show create table $tem" >> /tmp/secha.sh
# Append a statement terminator without a trailing newline
echo -e ';\c' >> /tmp/secha.sh
done
# Migrate each table's data with distcp (reuse the table list from above)
for tem in $ret;
do
hadoop distcp hdfs://master:9000/user/hive/warehouse/${DB}.db/$tem hdfs://192.168.0.21:8020/user/hive/warehouse/${DB}.db/$tem
done