The previous article covered snapshot-based migration: "HBase data migration: based on HBase Snapshot". This article covers migration with CopyTable.
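CopyTable is one of HBase's data migration tools and works at the table level. Under the hood it is a MapReduce job: the job scans the source table and writes the scanned data to the table on the destination cluster with put operations.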
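CopyTable's strengths are that it is simple and convenient: it can copy within a cluster or between clusters, supports incremental copies, and can rename the table along the way. However, because it works by scan and put, its performance is relatively poor, and it is not recommended for large data volumes.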
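As a minimal sketch of the within-cluster case mentioned above: when `--peer.adr` is omitted, CopyTable writes to the current cluster, so `--new.name` alone is enough to copy a table under a different name. The helper name `copy_within_cluster_cmd` and the table name `TestTableCopy` are made up for illustration, and the sketch assumes the destination table has already been created with the same column families.

```shell
#!/usr/bin/env bash
# Build a CopyTable command that copies a table to a new name in the same cluster.
# Assumption: the destination table already exists with matching column families.
copy_within_cluster_cmd() {
  local src="$1" dst="$2"
  echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=${dst} ${src}"
}

# Print the command for review before running it on a cluster.
copy_within_cluster_cmd TestTable TestTableCopy
```

The function only prints the command, so it can be checked before it is run against a real cluster.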
First, take a look at CopyTable's parameter list, which is fairly rich:
$ ./hbase org.apache.hadoop.hbase.mapreduce.CopyTable
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
Options:
 rs.class     hbase.regionserver.class of the peer cluster
              specify if different from current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster
 startrow     the start row
 stoprow      the stop row
 starttime    beginning of the time range (unixtime in millis)
              without endtime means from starttime to forever
 endtime      end of the time range. Ignored if no starttime specified.
 versions     number of cell versions to copy
 new.name     new table's name
 peer.adr     Address of the peer cluster given in the format
              hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     comma-separated list of families to copy
              To copy from cf1 to cf2, give sourceCfName:destCfName.
              To keep the same name, just give "cfName"
 all.cells    also copy delete markers and deleted cells
 bulkload     Write input into HFiles and bulk load to the destination table
Args:
 tablename    Name of the table to copy
Examples:
To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
For performance consider the following general option:
It is recommended that you set the following to >=100. A higher value uses more memory but
decreases the round trip time to the server and may increase performance.
-Dhbase.client.scanner.caching=100
The following should always be set to false, to prevent writing data twice, which may produce
inaccurate results.
-Dmapreduce.map.speculative=false
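Example: migrate the table TestTable to the cluster server1,server2,server3:2181:/hbase, specifying a start and end time (incremental copy) and the column families, and renaming the table: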
bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 --new.name='newTestTable' TestTable
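The `--starttime` and `--endtime` values are Unix timestamps in milliseconds, which are awkward to write by hand. The following sketch computes them for a window of a given length; the helper name `time_window_args` is hypothetical, and only shell arithmetic and `date +%s` are used.

```shell
#!/usr/bin/env bash
# Print --starttime/--endtime arguments for a window of HOURS hours ending at END_MS.
# END_MS is a Unix timestamp in milliseconds, the unit CopyTable expects.
time_window_args() {
  local end_ms="$1" hours="$2"
  local start_ms=$(( end_ms - hours * 3600 * 1000 ))
  echo "--starttime=${start_ms} --endtime=${end_ms}"
}

# The one-hour window used in the example above.
time_window_args 1265878794289 1

# A window covering the last 24 hours, ending now (second precision scaled to millis).
now_ms=$(( $(date +%s) * 1000 ))
time_window_args "$now_ms" 24
```

The first call reproduces the two timestamps from the example command, which are exactly 3,600,000 ms apart.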
For a full migration, simply drop the start and end times:
bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=200 -Dmapreduce.local.map.tasks.maximum=16 -Dmapred.map.tasks.speculative.execution=false --peer.adr=server1,server2,server3:2181:/hbase --new.name='newTestTable' TestTable
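Parameter notes
- mapreduce.local.map.tasks.maximum
The maximum number of map tasks that the local job runner executes in parallel. If it is not specified, the default is 1 and all tasks run serially.
- hbase.client.scanner.caching
Setting this to 100 or higher is recommended. The larger the value, the more memory is used, but the fewer round trips the scan makes to the server, which helps read performance.
- mapred.map.tasks.speculative.execution
Setting this to false is recommended, so that speculative execution does not cause data to be written twice. This is the older name of mapreduce.map.speculative, which the usage text above lists.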
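To keep the three tuning parameters above in one place, the full-copy command can be assembled by a small helper. This is only a sketch: the function name `full_copy_cmd` is hypothetical, and the defaults of 200 and 16 are simply the values from the full-migration example, not recommendations for any particular cluster.

```shell
#!/usr/bin/env bash
# Assemble a full-copy CopyTable command with the tuning parameters described above.
# Arguments: peer address, source table, destination table,
#            optional scanner caching (default 200), optional max local maps (default 16).
full_copy_cmd() {
  local peer="$1" src="$2" dst="$3"
  local caching="${4:-200}" max_maps="${5:-16}"
  echo "hbase org.apache.hadoop.hbase.mapreduce.CopyTable" \
    "-Dhbase.client.scanner.caching=${caching}" \
    "-Dmapreduce.local.map.tasks.maximum=${max_maps}" \
    "-Dmapred.map.tasks.speculative.execution=false" \
    "--peer.adr=${peer}" "--new.name=${dst}" "${src}"
}

# Print the command for review; speculative execution is always disabled.
full_copy_cmd server1,server2,server3:2181:/hbase TestTable newTestTable
```

Speculative execution is hard-coded to false because it should never be enabled for this job, while the other two values stay adjustable per run.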
For more parameters, see the CopyTable section of the official documentation.
Reference: https://yq.aliyun.com/articles/176546?utm_content=m_29050