Online Apache HBase Backups with CopyTable

源自:http://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/

CopyTable is a simple Apache HBase utility that, unsurprisingly, can be used for copying individual tables within an HBase cluster or from one HBase cluster to another. In this blog post, we’ll talk about what this tool is, why you would want to use it, how to use it, and some common configuration caveats.

Use cases:

CopyTable is at its core an Apache Hadoop MapReduce job that uses the standard HBase Scan read-path interface to read records from an individual table and writes them to another table (possibly on a separate cluster) using the standard HBase Put write-path interface. It can be used for many purposes:

  • Internal copy of a table (Poor man’s snapshot)

  • Remote HBase instance backup

  • Incremental HBase table copies

  • Partial HBase table copies and HBase table schema changes

Assumptions and limitations:

The CopyTable tool has some basic assumptions and limitations. First, if being used in the multi-cluster situation, both clusters must be online and the target instance needs to have the target table present with the same column families defined as the source table.

Since the tool uses standards scans and puts, the target cluster doesn’t have to have the same number of nodes or regions.  In fact, it can have different numbers of tables, different numbers of region servers, and could have completely different region split boundaries. Since we are copying entire tables, you can use performance optimization settings like setting larger scanner caching values for more efficiency. Using the put interface also means that copies can be made between clusters of different minor versions. (0.90.4 -> 0.90.6, CDH3u3 -> CDH3u4) or versions that are wire compatible (0.92.1 -> 0.94.0).

Finally, HBase only provides row-level ACID guarantees; this means while a CopyTable is going on, newly inserted or updated rows may occur and these concurrent edits will either be completely included or completely excluded. While rows will be consistent, there is no guarantees about the consistency, causality, or order of puts on the other rows.

Internal copy of a table (Poor man’s snapshot)

Versions of HBase up to and including the most recent 0.94.x versions do not support table snapshotting. Despite HBase’s ACID limitations, CopyTable can be used as a naive snapshotting mechanism that makes a physical copy of a particular table.

Let’s say that we have a table, tableOrig with column-families cf1 and cf2. We want to copy all its data to tableCopy. We need to first create tableCopy with the same column families:

1

srcCluster$ echo "create 'tableOrig', 'cf1', 'cf2'" | hbase shell

We can then create and copy the table with a new name on the same HBase instance:

1

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy tableOrig

This starts an MR job that will copy the data.

Remote HBase instance backup

Let’s say we want to copy data to another cluster. This could be a one-off backup, a periodic job or could be for bootstrapping for cross-cluster replication. In this example, we’ll have two separate clusters: srcCluster and dstCluster.

In this multi-cluster case, CopyTable is a push process — your source will be the HBase instance your current hbase-site.xml refers to and the added arguments point to the destination cluster and table. This also assumes that all of the MR TaskTrackers can access all the HBase and ZK nodes in the destination cluster. This mechanism for configuration also means that you could run this as a job on a remote cluster by overriding the hbase/mr configs to use settings from any accessible remote cluster and specify the ZK nodes in the destination cluster. This could be useful if you wanted to copy data from an HBase cluster with lower SLAs and didn’t want to run MR jobs on them directly.

You will use the the –peer.adr setting to specify the destination cluster’s ZK ensemble (e.g. the cluster you are copying to). For this we need the ZK quorum’s IP and port as well as the HBase root ZK node for our HBase instance. Let’s say one of these machine is srcClusterZK (listed in hbase.zookeeper.quorum) and that we are using the default zk client port 2181 (hbase.zookeeper.property.clientPort) and the default ZK znode parent /hbase (zookeeper.znode.parent). (Note: If you had two HBase instances using the same ZK, you’d need a different zookeeper.znode.parent for each cluster.

1

2

3

4

5

# create new tableOrig on destination cluster

dstCluster$ echo "create 'tableOrig', 'cf1', 'cf2'" | hbase shell

# on source cluster run copy table with destination ZK quorum specified using --peer.adr

# WARNING: In older versions, you are not alerted about any typo in these arguments!

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase tableOrig

Note that you can use the –new.name argument with the –peer.adr to copy to a differently named table on the dstCluster.

1

2

3

4

# create new tableCopy on destination cluster

dstCluster$ echo "create 'tableCopy', 'cf1', 'cf2'" | hbase shell

# on source cluster run copy table with destination --peer.adr and --new.name arguments.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase --new.name=tableCopy tableOrig

This will copy data from tableOrig on the srcCluster to the dstCluster’s tableCopy table.

Incremental HBase table copies

Once you have a copy of a table on a destination cluster, how do you do copy new data that is later written to the source cluster? Naively, you could run the CopyTable job again and copy over the entire table. However, CopyTable provides a more efficient incremental copy mechanism that just copies the updated rows from the srcCluster to the backup dstCluster specified in a window of time. Thus, after the initial copy, you could then have a periodic cron job that copies data from only the previous hour from srcCluster to the dstCuster.

This is done by specifying the –starttime and –endtime arguments. Times are specified as decimal milliseconds since unix epoch time.

1

2

3

4

5

6

7

8

# WARNING: In older versions, you are not alerted about any typo in these arguments!

# copy from beginning of time until timeEnd 

# NOTE: Must include start time for end time to be respected. start time cannot be 0.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=1 --endtime=timeEnd ...

# Copy from starting from and including timeStart until the end of time.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart ...

# Copy entries rows with start time1 including time1 and ending at timeStart excluding timeEnd.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timestart --endtime=timeEnd

Partial HBase table copies and HBase table schema changes

By default, CopyTable will copy all column families from matching rows. CopyTable provides options for only copying data from specific column-families. This could be useful for copying original source data and excluding derived data column families that are added by follow on processing.

By adding these arguments we only copy data from the specified column families.

  • –families=srcCf1

  • –families=srcCf1,srcCf2

Starting from 0.92.0 you can copy while changing the column family name:

  • –families=srcCf1:dstCf1

    • copy from srcCf1 to dstCf1 

  • –families=srcCf1:dstCf1,dstCf2,srcCf3:dstCf3

    • copy from srcCf1 to destCf1, copy dstCf2 to dstCf2 (no rename), and srcCf3 to dstCf3

Please note that dstCf* must be present in the dstCluster table!

Starting from 0.94.0 new options are offered to copy delete markers and to include a limited number of overwritten versions. Previously, if a row is deleted in the source cluster, the delete would not be copied — instead that a stale version of that row would remain in the destination cluster. This takes advantage of some of the 0.94.0 release’s advanced features.

  • –versions=vers

    • where vers is the number of cell versions to copy (default is 1 aka the latest only)

  • –all.cells 

    • also copy delete markers and deleted cells

Common Pitfalls

The HBase client in the 0.90.x, 0.92.x, and 0.94.x versions always use zoo.cfg if it is in the classpath, even if an hbase-site.xml file specifies other ZooKeeper quorum configuration settings. This “feature” causes a problem common in CDH3 HBase because its packages default to including a directory where zoo.cfg lives in HBase’s classpath. This can and has lead to frustration when trying to use CopyTable (HBASE-4614). The workaround for this is to exclude the zoo.cfg file from your HBase’s classpath and to specify ZooKeeper configuration properties in your hbase-site.xml file. http://hbase.apache.org/book.html#zookeeper

Conclusion

CopyTable provides simple but effective disaster recovery insurance for HBase 0.90.x (CDH3) deployments. In conjunction with the replication feature found and supported in CDH4’s HBase 0.92.x based HBase, CopyTable’s incremental features become less valuable but its core functionality is important for bootstrapping a replicated table. While more advanced features such as HBase snapshots (HBASE-50) may aid with disaster recovery when it gets implemented, CopyTable will still be a useful tool for the HBase administrator.


  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
目标检测(Object Detection)是计算机视觉领域的一个核心问题,其主要任务是找出图像中所有感兴趣的目标(物体),并确定它们的类别和位置。以下是对目标检测的详细阐述: 一、基本概念 目标检测的任务是解决“在哪里?是什么?”的问题,即定位出图像中目标的位置并识别出目标的类别。由于各类物体具有不同的外观、形状和姿态,加上成像时光照、遮挡等因素的干扰,目标检测一直是计算机视觉领域最具挑战性的任务之一。 二、核心问题 目标检测涉及以下几个核心问题: 分类问题:判断图像中的目标属于哪个类别。 定位问题:确定目标在图像中的具体位置。 大小问题:目标可能具有不同的大小。 形状问题:目标可能具有不同的形状。 三、算法分类 基于深度学习的目标检测算法主要分为两大类: Two-stage算法:先进行区域生成(Region Proposal),生成有可能包含待检物体的预选框(Region Proposal),再通过卷积神经网络进行样本分类。常见的Two-stage算法包括R-CNN、Fast R-CNN、Faster R-CNN等。 One-stage算法:不用生成区域提议,直接在网络中提取特征来预测物体分类和位置。常见的One-stage算法包括YOLO系列(YOLOv1、YOLOv2、YOLOv3、YOLOv4、YOLOv5等)、SSD和RetinaNet等。 四、算法原理 以YOLO系列为例,YOLO将目标检测视为回归问题,将输入图像一次性划分为多个区域,直接在输出层预测边界框和类别概率。YOLO采用卷积网络来提取特征,使用全连接层来得到预测值。其网络结构通常包含多个卷积层和全连接层,通过卷积层提取图像特征,通过全连接层输出预测结果。 五、应用领域 目标检测技术已经广泛应用于各个领域,为人们的生活带来了极大的便利。以下是一些主要的应用领域: 安全监控:在商场、银行
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值