The previous article, 《HBase备份之导入导出》 (HBase backup via Export and Import), described how to copy a table from a master cluster to a slave cluster with HBase's built-in Export and Import tools. This post introduces a faster backup method: ExportSnapshot.
1. ExportSnapshot
Like Export, ExportSnapshot copies a table with a MapReduce job. Unlike Export, however, what it copies is a snapshot of the table. We can first use ExportSnapshot to export the snapshot data to the slave cluster, and then run the restore_snapshot command on the slave cluster to restore it, which replicates the table from the master cluster to the slave cluster. The concrete steps are as follows:
1) Create a snapshot of the table on the master cluster
$ cd $HBASE_HOME/
$ bin/hbase shell
2014-08-13 15:59:12,495 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.2-hadoop2, r1591526, Wed Apr 30 20:17:33 PDT 2014
hbase(main):001:0> snapshot 'test_table', 'test_table_snapshot'
0 row(s) in 0.3370 seconds
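Optionally, before exporting you can confirm that the snapshot exists with the list_snapshots shell command (a quick sanity check; it is not required for the steps below):
$ hbase(main):002:0> list_snapshots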
2) Export the snapshot data with the ExportSnapshot command
$ cd $HBASE_HOME/
$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot test_table_snapshot -copy-to hdfs://follow_cluster_namenode:8082/hbase
Here, test_table_snapshot is the snapshot we just created, and hdfs://follow_cluster_namenode:8082/hbase is the full path of the HBase root directory on the slave cluster's HDFS.
The ExportSnapshot command can also limit the number of mappers, as follows:
$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot test_table_snapshot -copy-to hdfs://follow_cluster_namenode:8082/hbase -mappers n
It can also throttle the copy bandwidth, as follows:
$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot test_table_snapshot -copy-to hdfs://follow_cluster_namenode:8082/hbase -mappers n -bandwidth 200
The example above limits the copy bandwidth to 200 MB/s.
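If a snapshot with the same name has already been exported to the destination, the job will normally refuse to replace it; in that case the -overwrite option can be added. A sketch, reusing the snapshot and destination from the examples above:
$ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot test_table_snapshot -copy-to hdfs://follow_cluster_namenode:8082/hbase -overwrite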
The output of the ExportSnapshot command is quite long; part of it is shown below:
2014-08-13 16:08:26,318 INFO [main] mapreduce.Job: Running job: job_1407910396081_0027
2014-08-13 16:08:33,494 INFO [main] mapreduce.Job: Job job_1407910396081_0027 running in uber mode : false
2014-08-13 16:08:33,495 INFO [main] mapreduce.Job: map 0% reduce 0%
2014-08-13 16:08:41,567 INFO [main] mapreduce.Job: map 100% reduce 0%
2014-08-13 16:08:42,581 INFO [main] mapreduce.Job: Job job_1407910396081_0027 completed successfully
2014-08-13 16:08:42,677 INFO [main] mapreduce.Job: Counters: 30
File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=116030
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1386
    HDFS: Number of bytes written=988
    HDFS: Number of read operations=7
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=3
Job Counters
    Launched map tasks=1
    Rack-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=13518
    Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
    Map input records=1
    Map output records=0
    Input split bytes=174
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=23
    CPU time spent (ms)=1860
    Physical memory (bytes) snapshot=323575808
    Virtual memory (bytes) snapshot=1867042816
    Total committed heap usage (bytes)=1029177344
org.apache.hadoop.hbase.snapshot.ExportSnapshot$Counter
    BYTES_COPIED=988
    BYTES_EXPECTED=988
    FILES_COPIED=1
File Input Format Counters
    Bytes Read=224
File Output Format Counters
    Bytes Written=0
2014-08-13 16:08:42,685 INFO [main] snapshot.ExportSnapshot: Finalize the Snapshot Export
2014-08-13 16:08:42,697 INFO [main] snapshot.ExportSnapshot: Verify snapshot validity
2014-08-13 16:08:42,698 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-13 16:08:42,713 INFO [main] snapshot.ExportSnapshot: Export Completed: test_table_snapshot
3) Restore the snapshot on the slave cluster
$ cd $HBASE_HOME/
$ bin/hbase shell
2014-08-13 16:16:13,817 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.2-hadoop2, r1591526, Wed Apr 30 20:17:33 PDT 2014
hbase(main):001:0> restore_snapshot 'test_table_snapshot'
0 row(s) in 16.4940 seconds
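Note that restore_snapshot replaces the current state of the table, so if a table with the same name already exists and is enabled on the slave cluster, it has to be disabled first and re-enabled afterwards. Alternatively, clone_snapshot materializes the snapshot under a new table name without touching any existing table. A sketch (the clone name test_table_clone is chosen only for illustration):
hbase(main):002:0> disable 'test_table'
hbase(main):003:0> restore_snapshot 'test_table_snapshot'
hbase(main):004:0> enable 'test_table'
hbase(main):005:0> clone_snapshot 'test_table_snapshot', 'test_table_clone'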
4) Check whether the table was restored successfully
hbase(main):002:0> list
TABLE
test_table
1 row(s) in 1.0460 seconds
=> ["test_table"]
In addition, you can verify the data with the scan or count command.
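For example, a quick check from the slave cluster's shell might look like the following (the LIMIT option is only there to keep the scan output short):
hbase(main):003:0> count 'test_table'
hbase(main):004:0> scan 'test_table', {LIMIT => 5}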
Restoring a snapshot is usually fast. Compared with Export and Import, which need two MapReduce jobs (one export, one import) to copy a table, ExportSnapshot is much quicker.
2. CopyTable
First, let's look at how the CopyTable command is used:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable
Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>

Options:
 rs.class     hbase.regionserver.class of the peer cluster
              specify if different from current cluster
 rs.impl      hbase.regionserver.impl of the peer cluster
 startrow     the start row
 stoprow      the stop row
 starttime    beginning of the time range (unixtime in millis)
              without endtime means from starttime to forever
 endtime      end of the time range. Ignored if no starttime specified.
 versions     number of cell versions to copy
 new.name     new table's name
 peer.adr     Address of the peer cluster given in the format
              hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 families     comma-separated list of families to copy
              To copy from cf1 to cf2, give sourceCfName:destCfName.
              To keep the same name, just give "cfName"
 all.cells    also copy delete markers and deleted cells

Args:
 tablename    Name of the table to copy

Examples:
 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable

For performance consider the following general options:
-Dhbase.client.scanner.caching=100
-Dmapred.map.tasks.speculative.execution=false
As you can see, it lets you restrict the copy to a time range and a number of cell versions, select specific column families, specify the address of the peer cluster, and so on.
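For example, an incremental copy that only transfers cells written in a given time window and writes them into a differently named table on the peer cluster might look like this (the timestamps and the new table name are placeholders for illustration):
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1407900000000 --endtime=1407910000000 --new.name=test_table_backup --peer.adr=slave1,slave2,slave3:2181:/hbase test_table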
For the test_table table above, we can copy it with the following command:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=slave1,slave2,slave3:2181:/hbase test_table
Note: before running the statement above, you must create a table on the slave cluster with the same schema as test_table on the master cluster.
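For instance, if test_table has a single column family, creating it on the slave cluster could look like the following (the column family name 'cf' is assumed here; use whatever families the source table actually has):
hbase(main):001:0> create 'test_table', 'cf'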
Part of the output of the CopyTable command is shown below:
2014-08-13 16:18:21,812 INFO [main] mapreduce.Job: Running job: job_1407910396081_0062
2014-08-13 16:18:29,955 INFO [main] mapreduce.Job: Job job_1407910396081_0062 running in uber mode : false
2014-08-13 16:18:29,957 INFO [main] mapreduce.Job: map 0% reduce 0%
2014-08-13 16:18:36,005 INFO [main] mapreduce.Job: map 100% reduce 0%
2014-08-13 16:18:37,029 INFO [main] mapreduce.Job: Job job_1407910396081_0062 completed successfully
2014-08-13 16:18:37,137 INFO [main] mapreduce.Job: Counters: 37
File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=117527
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=88
    HDFS: Number of bytes written=0
    HDFS: Number of read operations=1
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=0
Job Counters
    Launched map tasks=1
    Rack-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=9740
    Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
    Map input records=1
    Map output records=1
    Input split bytes=88
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=254
    CPU time spent (ms)=1810
    Physical memory (bytes) snapshot=345137152
    Virtual memory (bytes) snapshot=1841782784
    Total committed heap usage (bytes)=1029177344
HBase Counters
    BYTES_IN_REMOTE_RESULTS=34
    BYTES_IN_RESULTS=34
    MILLIS_BETWEEN_NEXTS=254
    NOT_SERVING_REGION_EXCEPTION=0
    NUM_SCANNER_RESTARTS=0
    REGIONS_SCANNED=1
    REMOTE_RPC_CALLS=3
    REMOTE_RPC_RETRIES=0
    RPC_CALLS=3
    RPC_RETRIES=0
File Input Format Counters
    Bytes Read=0
File Output Format Counters
    Bytes Written=0
Afterwards, you can compare the table on the master cluster with the corresponding table on the slave cluster to confirm that the data is consistent.
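One simple way to do this is to run the MapReduce-based RowCounter tool against the table on each cluster and compare the resulting row counts (they should match if the copy succeeded; for small tables the shell's count command works just as well):
$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter test_table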
Please credit the source when reposting: http://blog.csdn.net/iAm333