Testing DistCp Migration of Hive Data When the Source Cluster Adds or Deletes Files

1 Overview

  Because upstream jobs keep running during the data migration, files may be added to or deleted from the source cluster while the copy is in progress, so these scenarios were tested.

2 Data files to be synchronized

  Source cluster: 192.168.40.100:8020; target cluster: 192.168.40.200:8020.
  The data directory and its size are as follows:

1.8 G    5.5 G    /user/hive/warehouse/iot.db/dwd_pollution_distcp

  This directory backs the table dwd_pollution_distcp, which contains the partitions 20140801-20140930.

3 Metadata migration

  Both the source and target clusters use PostgreSQL for the Hive metastore, so the PostgreSQL database is migrated as follows:
  ① On the new host, create a PostgreSQL user and database with the same names as on the original host.
  ② Back up the database on the original host, in the form pg_dump -U <user> -p <port> <dbname> > <dumpfile>.
  Export the data from the source cluster's PostgreSQL instance:

./pg_dump -U hive -p 5432 hive > ./hive_meta.sql

  ③ Import the data on the new host.
  Run the import against the target cluster's PostgreSQL instance:

./psql -U hive -p 5432 hive < hive_meta.sql
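
  To sanity-check the import, one option is to count rows in a standard metastore table such as TBLS on both sides (a hedged check; adjust the connection flags to your environment):

# TBLS holds one row per Hive table; matching counts on the source and target
# PostgreSQL instances suggest the metadata arrived intact.
./psql -U hive -p 5432 -d hive -c 'SELECT COUNT(*) FROM "TBLS";'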

4 Source cluster file-addition test

4.1 Run the distcp command

hadoop distcp -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/ 

  While DistCp was running, the following SQL was executed to insert data into a new partition (the -p option above preserves file attributes such as replication, block size, user, group, and permissions):

insert overwrite table iot.dwd_pollution_distcp partition(stat_date) 
select ozone,particullate_matter,carbon_monoxide,sulfure_dioxide,nitrogen_dioxide,longitude,latitude,date_time,"20201021" as stat_date from iot.dwd_pollution;
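
  Note: a dynamic-partition insert like this generally requires the following Hive settings (an assumption about this cluster's configuration; they are not shown in the original session):

-- Allow dynamic partitions, including inserts with no static partition key.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;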

  The job completed without errors:

...
20/10/21 11:16:21 INFO mapreduce.Job: Job job_1602207790062_0240 completed successfully
20/10/21 11:16:21 INFO mapreduce.Job: Counters: 38
	File System Counters
		FILE: Number of bytes read=0
		FILE: Number of bytes written=4923884
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=5921518162
		HDFS: Number of bytes written=5921449140
		HDFS: Number of read operations=1962
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=1249
		HDFS: Number of bytes read erasure-coded=0
	Job Counters 
		Launched map tasks=22
		Other local map tasks=22
		Total time spent by all maps in occupied slots (ms)=3720826
		Total time spent by all reduces in occupied slots (ms)=0
		Total time spent by all map tasks (ms)=3720826
		Total vcore-milliseconds taken by all map tasks=3720826
		Total megabyte-milliseconds taken by all map tasks=3810125824
	Map-Reduce Framework
		Map input records=245
		Map output records=0
		Input split bytes=2530
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=4309
		CPU time spent (ms)=167940
		Physical memory (bytes) snapshot=8456990720
		Virtual memory (bytes) snapshot=59629797376
		Total committed heap usage (bytes)=13333692416
		Peak Map Physical memory (bytes)=418676736
		Peak Map Virtual memory (bytes)=2735362048
	File Input Format Counters 
		Bytes Read=66492
	File Output Format Counters 
		Bytes Written=0
	DistCp Counters
		Bandwidth in Btyes=69105400
		Bytes Copied=5921449140
		Bytes Expected=5921449140
		Files Copied=182
		DIR_COPY=63

4.2 Verifying the data files

  File size on the source cluster:

[hdfs@hadoop-local-02 ~]$ hdfs dfs -du -s -h /user/hive/warehouse/iot.db/dwd_pollution_distcp
5.7 G  17.1 G  /user/hive/warehouse/iot.db/dwd_pollution_distcp

[hdfs@hadoop-local-02 ~]$ hdfs dfs -ls /user/hive/warehouse/iot.db/dwd_pollution_distcp/
Found 63 items

  File size on the target cluster:

[hdfs@hadoop-cdh-03 ~]$ hdfs dfs -du -s -h /user/hive/warehouse/iot.db/dwd_pollution_distcp
5.5 G  16.5 G  /user/hive/warehouse/iot.db/dwd_pollution_distcp
[hdfs@hadoop-cdh-03 ~]$ hdfs dfs -ls /user/hive/warehouse/iot.db/dwd_pollution_distcp/
Found 62 items

  This shows that data inserted while DistCp is running does not break the copy, but the inserted data is not detected in time: the new partition never reaches the target.
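
  To pinpoint the missing path, the two directory listings can be diffed (a hedged helper; the sed strips the cluster prefix so the paths are comparable):

hdfs dfs -ls hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp/ | awk '{print $NF}' | sed 's|^hdfs://[^/]*||' > src_paths.txt
hdfs dfs -ls hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp/ | awk '{print $NF}' | sed 's|^hdfs://[^/]*||' > dst_paths.txt
diff src_paths.txt dst_paths.txt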

4.3 Handling files added on the source cluster

  Re-run the command with DistCp's -update option to pick up the incremental data:

hadoop distcp -update -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp
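
  A quick cross-check after the update is to compare directory counts, file counts, and content size on both clusters in one command (a hedged check):

# Output columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
hdfs dfs -count hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp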

4.4 Verifying that the files match after the update

  (1) File check
  File size on the source cluster:

[hdfs@hadoop-local-02 ~]$ hdfs dfs -du -s -h /user/hive/warehouse/iot.db/dwd_pollution_distcp
5.7 G  17.1 G  /user/hive/warehouse/iot.db/dwd_pollution_distcp

[hdfs@hadoop-local-02 ~]$ hdfs dfs -ls /user/hive/warehouse/iot.db/dwd_pollution_distcp/
Found 63 items

  File size on the target cluster:

[hdfs@hadoop-cdh-03 ~]$ hdfs dfs -du -s -h /user/hive/warehouse/iot.db/dwd_pollution_distcp
5.7 G  17.1 G  /user/hive/warehouse/iot.db/dwd_pollution_distcp
[hdfs@hadoop-cdh-03 ~]$ hdfs dfs -ls /user/hive/warehouse/iot.db/dwd_pollution_distcp/
Found 63 items

  (2) SQL check
  First repair the partitions on the target cluster, because the newly inserted partition does not yet exist in the target cluster's Hive metastore:

hive> MSCK REPAIR TABLE dwd_pollution_distcp;
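
  To confirm the repaired partition is now visible, the partition list can be inspected (an optional check):

hive> SHOW PARTITIONS dwd_pollution_distcp;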

  Row count from Hive on the source cluster:

hive> select count(*) from iot.dwd_pollution_distcp;
OK
138250266

  Row count from Hive on the target cluster:

hive> select count(*) from iot.dwd_pollution_distcp;
OK
138250266

5 Source cluster file-deletion test

  Deletion here means that a SQL statement performed an overwrite, which first deletes the existing data and then writes the new data. Before the test, the previously migrated data was cleared from the target cluster and the source cluster was restored to its original state.
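
  One way to clear the previously migrated data on the target (a hypothetical command, since the one actually used is not shown here; -skipTrash bypasses the HDFS trash, so use it with care):

hdfs dfs -rm -r -skipTrash hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp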

5.1 Run the distcp command

hadoop distcp -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/ 

  While DistCp was running, the following SQL was executed; it first deletes the table's 20140801-20140930 partitions and then re-inserts the data:

insert overwrite table iot.dwd_pollution_distcp partition(stat_date) 
select ozone,particullate_matter,carbon_monoxide,sulfure_dioxide,nitrogen_dioxide,longitude,latitude,date_time,stat_date as stat_date from iot.dwd_pollution;

  The job failed with the following error:

Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140810/000008_0 to hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140810/000008_0
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:256)
	... 10 more
Caused by: java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140810/000008_0
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:735)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:415)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:870)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:816)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2680)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:855)
	at org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1768)
	at org.apache.hadoop.hdfs.DFSClient.getFileChecksumInternal(DFSClient.java:1714)
	at org.apache.hadoop.hdfs.DFSClient.getFileChecksumWithCombineMode(DFSClient.java:1744)
	at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1610)
	at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1607)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1619)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:119)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
	... 11 more

5.2 Verifying the data

  Some of the files on the source cluster:

[hdfs@hadoop-local-02 ~]$ hdfs dfs -du -h /user/hive/warehouse/iot.db/dwd_pollution_distcp
969.3 K  2.8 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140801
1.7 M    5.0 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140802
2.2 M    6.7 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140803
2.5 M    7.4 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140804
2.7 M    8.0 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140805

  Some of the files on the target cluster:

[hdfs@hadoop-cdh-03 ~]$ hdfs dfs -du -h /user/hive/warehouse/iot.db/dwd_pollution_distcp
969.3 K  2.8 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140801
16.8 M   50.4 M   /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140802
2.2 M    6.7 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140803
2.5 M    7.4 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140804
26.6 M   79.8 M   /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20140805

  The two sides clearly do not match.

5.3 Repairing with the -update and -delete options

  Re-run the command, adding the -update and -delete options:

hadoop distcp -p -update -delete  hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/ 

  File size and row count on the source cluster:

[hdfs@hadoop-local-02 ~]$ hdfs dfs -du -s -h /user/hive/warehouse/iot.db/dwd_pollution_distcp/
376.5 M  1.1 G  /user/hive/warehouse/iot.db/dwd_pollution_distcp

hive> select count(*) from dwd_pollution_distcp;
OK
8919106
Time taken: 7.39 seconds, Fetched: 1 row(s)

  File size and row count on the target cluster:

[hdfs@hadoop-cdh-03 ~]$  hdfs dfs -du -s -h /user/hive/warehouse/iot.db/dwd_pollution_distcp/
376.5 M  1.1 G  /user/hive/warehouse/iot.db/dwd_pollution_distcp

hive> select count(*) from dwd_pollution_distcp;
OK
8919106
Time taken: 7.39 seconds, Fetched: 1 row(s)

  Note: run MSCK REPAIR TABLE dwd_pollution_distcp in Hive first to repair the partitions, because the earlier tests left the partition metadata out of date.

5.4 Another overwrite scenario

  The previous test covered an overwrite that fires before the data has finished copying; the other case is an overwrite that fires after the data has already been copied. Before this test, the previously migrated data was again cleared from the target cluster.
  (1) Run distcp:

hadoop distcp -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/

  (2) Near the end of the DistCp run, execute the SQL:

insert overwrite table iot.dwd_pollution_distcp partition(stat_date) select ozone,particullate_matter,carbon_monoxide,sulfure_dioxide,nitrogen_dioxide,longitude,latitude,date_time,'20141001' as stat_date from iot.dwd_pollution;

  The job completed successfully.
  (3) Compare the size of partition 20141001 before and after the SQL ran:

  On the source cluster, before the SQL:

345.3 K  1.0 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20141001

  On the source cluster, after the SQL:

188.2 M  564.7 M  /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20141001

  On the target cluster:

345.3 K  1.0 M    /user/hive/warehouse/iot.db/dwd_pollution_distcp/stat_date=20141001

  (4) Fix
  Re-run the distcp command with the -update and -delete options, as in section 5.3 (the command below targets the table directory itself, matching the -update form used in section 4.3):
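
hadoop distcp -p -update -delete hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp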

6 Summary

  (1) Adding files under a directory that DistCp is copying from does not affect the copy, but the added files are not in the copy list, which is fixed when the job starts. They can be migrated afterwards by re-running DistCp with the -update option.

  (2) What if a SQL overwrite deletes files from a directory on the old cluster during the migration?
  There are two cases. ① The overwrite fires after the affected partition's data has already been copied: the distcp job does not fail, and the discrepancy can only be detected by comparing row counts with SQL.

  ② The overwrite fires before the affected partition's data has finished copying: the distcp job fails. The -i option can be used to ignore the failures so that the unaffected partitions are still copied normally. Afterwards, run -update to refresh the files (a destination file is overwritten when it differs from the source file in size, and skipped when the name and size match), together with -delete (files that exist at the destination but not in the source are removed, going through the HDFS trash). The -delete option matters because an overwrite that shrinks a partition leaves files at the destination that no longer exist on the old cluster, and those files must be removed; see the sketch below.
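
  A sketch of this two-step recovery, assembled from the commands used in the tests above (the -i pass ignores per-file copy failures; the second pass reconciles once the overwrite has finished):

# Pass 1: copy what can be copied, ignoring failures on files deleted mid-copy.
hadoop distcp -i -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/

# Pass 2: refresh changed files and remove destination-only files (they go to the HDFS trash).
hadoop distcp -p -update -delete hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp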

  (3) What if the job fails when the migration is half done?

  ① Re-run the command with -update together with -delete.

  ② Delete the dirty data on the new cluster and re-run the migration command. Without the -overwrite option, files that already exist at the destination are skipped.
