Implementing Bulk Delete with a Coprocessor Endpoint

I recently had to use HBase coprocessor endpoints for work and ran into a few pitfalls. Using a bulk delete feature as an example, this post records how to build and use an endpoint. Background knowledge on HBase itself and on coprocessors in general is not covered here.

1. Install the protobuf compiler

Download protobuf-2.5.0.tar.gz.
Pick a directory, for example Downloads, and extract the source archive:
tar -zxvf protobuf-2.5.0.tar.gz
Enter the extracted source directory:
cd protobuf-2.5.0
Choose a directory to hold the compiled protobuf files:
./configure --prefix=/home/work/protobuf/
Build and install protobuf-2.5.0:
make && make install
If make fails, check whether the required C++ build dependencies are installed; otherwise it errors out with: error: C++ preprocessor "/lib/cpp" fails sanity check
Install the C++ dependencies the error message asks for.
On CentOS:
yum install glibc-headers
yum install gcc-c++
On Ubuntu:
apt-get install build-essential
apt-get install g++

After installation, configure the environment variable by adding one line to your shell profile:
export PATH=/home/work/protobuf/bin:$PATH
Verify the installation; on success the version is printed:
protoc --version

2. Write the protobuf interface file

Download the HBase source package hbase-1.1.4-src.tar.gz and extract it. Under
hbase-1.1.4\hbase-examples\src\main\protobuf
you will find BulkDelete.proto, the interface file for the bulk delete example. It can be used as-is or modified as needed. Its content is:

option java_package = "org.apache.hadoop.hbase.coprocessor.example.generated";
option java_outer_classname = "BulkDeleteProtos";
option java_generic_services = true;
option java_generate_equals_and_hash = true;
option optimize_for = SPEED;

import "Client.proto";

message BulkDeleteRequest {
  required Scan scan = 1;
  required DeleteType deleteType = 2;
  optional uint64 timestamp = 3;
  required uint32 rowBatchSize = 4;

  enum DeleteType {
    ROW = 0;
    FAMILY = 1;
    COLUMN = 2;
    VERSION = 3;
  }
}

message BulkDeleteResponse {
  required uint64 rowsDeleted = 1; 
  optional uint64 versionsDeleted = 2;
}

service BulkDeleteService {
  rpc delete(BulkDeleteRequest)
    returns (BulkDeleteResponse);
} 

3. Compile the interface file

Edit the BulkDelete.proto file and delete the following two lines:

import "Client.proto";
required Scan scan = 1;

Save the file and copy BulkDelete.proto to the VM, into a protobuf directory under the protobuf-2.5.0 source tree from step 1, i.e. ~/Downloads/protobuf-2.5.0/protobuf (create the protobuf directory if it does not exist).
Switch into that directory: cd ~/Downloads/protobuf-2.5.0/protobuf
Run:
protoc --java_out=~/Downloads BulkDelete.proto
This generates BulkDeleteProtos.java, the class file named by the java_outer_classname option, under
~/Downloads/org/apache/hadoop/hbase/coprocessor/example/generated
a directory layout determined by the package name given in the java_package option.

Why "required Scan scan = 1;" was removed from the proto file:
Scan is not a primitive type, so compiling the unmodified BulkDelete.proto from the HBase sources with protoc fails with an error that the type Scan cannot be resolved.
Just as with Java source files, the compilation only succeeds once the dependency is imported,
and for a .proto file that dependency is itself a .proto file. HBase ships the required dependency files in the source package hbase-1.1.4-src, under:
hbase-1.1.4\hbase-protocol\src\main\protobuf
For convenience, copy everything in that directory into the VM's
Downloads/protobuf-2.5.0/protobuf directory.
Now restore the two lines that were removed from BulkDelete.proto above and compile again; this time the compilation goes through.
Alternatively, you can skip the compilation entirely and take the pre-generated sources shipped in the HBase source package, at:
hbase-1.1.4-src\hbase-1.1.4\hbase-examples\src\main\java\org\apache\hadoop\hbase\coprocessor\example\generated

The generated file is too long to include here.

4. Write the endpoint class that implements bulk delete

This step is just writing the implementation class. The BulkDeleteEndpoint class extends BulkDeleteProtos.BulkDeleteService, generated in the previous step, and implements the Coprocessor and CoprocessorService interfaces. You can use the code HBase itself ships as a reference; it lives at
hbase-1.1.4-src\hbase-1.1.4\hbase-examples\src\main\java\org\apache\hadoop\hbase\coprocessor\example
The code is as follows:

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.hadoop.hbase.coprocessor.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HConstants.OperationStatusCode;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.CoprocessorException;
import org.apache.hadoop.hbase.coprocessor.CoprocessorService;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteRequest;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteRequest.DeleteType;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteResponse;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteResponse.Builder;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteService;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.ResponseConverter;
import org.apache.hadoop.hbase.regionserver.OperationStatus;
import org.apache.hadoop.hbase.regionserver.Region;
import org.apache.hadoop.hbase.regionserver.RegionScanner;
import org.apache.hadoop.hbase.util.Bytes;

import com.google.protobuf.RpcCallback;
import com.google.protobuf.RpcController;
import com.google.protobuf.Service;

/**
 * Defines a protocol to delete data in bulk based on a scan. The scan can be range scan or with
 * conditions(filters) etc.This can be used to delete rows, column family(s), column qualifier(s) 
 * or version(s) of columns.When delete type is FAMILY or COLUMN, which all family(s) or column(s)
 * getting deleted will be determined by the Scan. Scan need to select all the families/qualifiers
 * which need to be deleted.When delete type is VERSION, Which column(s) and version(s) to be
 * deleted will be determined by the Scan. Scan need to select all the qualifiers and its versions
 * which needs to be deleted.When a timestamp is passed only one version at that timestamp will be
 * deleted(even if Scan fetches many versions). When timestamp passed as null, all the versions
 * which the Scan selects will get deleted.
 * 
 * </br> Example: <code><pre>
 * Scan scan = new Scan();
 * // set scan properties(rowkey range, filters, timerange etc).
 * HTable ht = ...;
 * long noOfDeletedRows = 0L;
 * Batch.Call&lt;BulkDeleteService, BulkDeleteResponse&gt; callable = 
 *     new Batch.Call&lt;BulkDeleteService, BulkDeleteResponse&gt;() {
 *   ServerRpcController controller = new ServerRpcController();
 *   BlockingRpcCallback&lt;BulkDeleteResponse&gt; rpcCallback = 
 *     new BlockingRpcCallback&lt;BulkDeleteResponse&gt;();
 *
 *   public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
 *     Builder builder = BulkDeleteRequest.newBuilder();
 *     builder.setScan(ProtobufUtil.toScan(scan));
 *     builder.setDeleteType(DeleteType.VERSION);
 *     builder.setRowBatchSize(rowBatchSize);
 *     // Set optional timestamp if needed
 *     builder.setTimestamp(timeStamp);
 *     service.delete(controller, builder.build(), rpcCallback);
 *     return rpcCallback.get();
 *   }
 * };
 * Map&lt;byte[], BulkDeleteResponse&gt; result = ht.coprocessorService(BulkDeleteService.class, scan
 *     .getStartRow(), scan.getStopRow(), callable);
 * for (BulkDeleteResponse response : result.values()) {
 *   noOfDeletedRows += response.getRowsDeleted();
 * }
 * </pre></code>
 */
public class BulkDeleteEndpoint extends BulkDeleteService implements CoprocessorService,
    Coprocessor {
  private static final String NO_OF_VERSIONS_TO_DELETE = "noOfVersionsToDelete";
  private static final Log LOG = LogFactory.getLog(BulkDeleteEndpoint.class);

  private RegionCoprocessorEnvironment env;

  @Override
  public Service getService() {
    return this;
  }

  @Override
  public void delete(RpcController controller, BulkDeleteRequest request,
      RpcCallback<BulkDeleteResponse> done) {
    long totalRowsDeleted = 0L;
    long totalVersionsDeleted = 0L;
    Region region = env.getRegion();
    int rowBatchSize = request.getRowBatchSize();
    Long timestamp = null;
    if (request.hasTimestamp()) {
      timestamp = request.getTimestamp();
    }
    DeleteType deleteType = request.getDeleteType();
    boolean hasMore = true;
    RegionScanner scanner = null;
    try {
      Scan scan = ProtobufUtil.toScan(request.getScan());
      if (scan.getFilter() == null && deleteType == DeleteType.ROW) {
        // What we need is just the rowkeys. So only 1st KV from any row is enough.
        // Only when it is a row delete, we can apply this filter.
        // In other types we rely on the scan to know which all columns to be deleted.
        scan.setFilter(new FirstKeyOnlyFilter());
      }
      // Here by assume that the scan is perfect with the appropriate
      // filter and having necessary column(s).
      scanner = region.getScanner(scan);
      while (hasMore) {
        List<List<Cell>> deleteRows = new ArrayList<List<Cell>>(rowBatchSize);
        for (int i = 0; i < rowBatchSize; i++) {
          List<Cell> results = new ArrayList<Cell>();
          hasMore = scanner.next(results);
          if (results.size() > 0) {
            deleteRows.add(results);
          }
          if (!hasMore) {
            // There are no more rows.
            break;
          }
        }
        if (deleteRows.size() > 0) {
          Mutation[] deleteArr = new Mutation[deleteRows.size()];
          int i = 0;
          for (List<Cell> deleteRow : deleteRows) {
            deleteArr[i++] = createDeleteMutation(deleteRow, deleteType, timestamp);
          }
          OperationStatus[] opStatus = region.batchMutate(deleteArr, HConstants.NO_NONCE,
            HConstants.NO_NONCE);
          for (i = 0; i < opStatus.length; i++) {
            if (opStatus[i].getOperationStatusCode() != OperationStatusCode.SUCCESS) {
              break;
            }
            totalRowsDeleted++;
            if (deleteType == DeleteType.VERSION) {
              byte[] versionsDeleted = deleteArr[i].getAttribute(
                  NO_OF_VERSIONS_TO_DELETE);
              if (versionsDeleted != null) {
                totalVersionsDeleted += Bytes.toInt(versionsDeleted);
              }
            }
          }
        }
      }
    } catch (IOException ioe) {
      LOG.error(ioe);
      // Call ServerRpcController#getFailedOn() to retrieve this IOException at client side.
      ResponseConverter.setControllerException(controller, ioe);
    } finally {
      if (scanner != null) {
        try {
          scanner.close();
        } catch (IOException ioe) {
          LOG.error(ioe);
        }
      }
    }
    Builder responseBuilder = BulkDeleteResponse.newBuilder();
    responseBuilder.setRowsDeleted(totalRowsDeleted);
    if (deleteType == DeleteType.VERSION) {
      responseBuilder.setVersionsDeleted(totalVersionsDeleted);
    }
    BulkDeleteResponse result = responseBuilder.build();
    done.run(result);
  }

  private Delete createDeleteMutation(List<Cell> deleteRow, DeleteType deleteType,
      Long timestamp) {
    long ts;
    if (timestamp == null) {
      ts = HConstants.LATEST_TIMESTAMP;
    } else {
      ts = timestamp;
    }
    // We just need the rowkey. Get it from 1st KV.
    byte[] row = CellUtil.cloneRow(deleteRow.get(0));
    Delete delete = new Delete(row, ts);
    if (deleteType == DeleteType.FAMILY) {
      Set<byte[]> families = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);
      for (Cell kv : deleteRow) {
        if (families.add(CellUtil.cloneFamily(kv))) {
          delete.deleteFamily(CellUtil.cloneFamily(kv), ts);
        }
      }
    } else if (deleteType == DeleteType.COLUMN) {
      Set<Column> columns = new HashSet<Column>();
      for (Cell kv : deleteRow) {
        Column column = new Column(CellUtil.cloneFamily(kv), CellUtil.cloneQualifier(kv));
        if (columns.add(column)) {
          // Making deleteColumns() calls more than once for the same cf:qualifier is not correct
          // Every call to deleteColumns() will add a new KV to the familymap which will finally
          // get written to the memstore as part of delete().
          delete.deleteColumns(column.family, column.qualifier, ts);
        }
      }
    } else if (deleteType == DeleteType.VERSION) {
      // When some timestamp was passed to the delete() call only one version of the column (with
      // given timestamp) will be deleted. If no timestamp passed, it will delete N versions.
      // How many versions will get deleted depends on the Scan being passed. All the KVs that
      // the scan fetched will get deleted.
      int noOfVersionsToDelete = 0;
      if (timestamp == null) {
        for (Cell kv : deleteRow) {
          delete.deleteColumn(CellUtil.cloneFamily(kv), CellUtil.cloneQualifier(kv), kv.getTimestamp());
          noOfVersionsToDelete++;
        }
      } else {
        Set<Column> columns = new HashSet<Column>();
        for (Cell kv : deleteRow) {
          Column column = new Column(CellUtil.cloneFamily(kv), CellUtil.cloneQualifier(kv));
          // Only one version of particular column getting deleted.
          if (columns.add(column)) {
            delete.deleteColumn(column.family, column.qualifier, ts);
            noOfVersionsToDelete++;
          }
        }
      }
      delete.setAttribute(NO_OF_VERSIONS_TO_DELETE, Bytes.toBytes(noOfVersionsToDelete));
    }
    return delete;
  }

  private static class Column {
    private byte[] family;
    private byte[] qualifier;

    public Column(byte[] family, byte[] qualifier) {
      this.family = family;
      this.qualifier = qualifier;
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof Column)) {
        return false;
      }
      Column column = (Column) other;
      return Bytes.equals(this.family, column.family)
          && Bytes.equals(this.qualifier, column.qualifier);
    }

    @Override
    public int hashCode() {
      int h = 31;
      h = h + 13 * Bytes.hashCode(this.family);
      h = h + 13 * Bytes.hashCode(this.qualifier);
      return h;
    }
  }

  @Override
  public void start(CoprocessorEnvironment env) throws IOException {
    if (env instanceof RegionCoprocessorEnvironment) {
      this.env = (RegionCoprocessorEnvironment) env;
    } else {
      throw new CoprocessorException("Must be loaded on a table region!");
    }
  }

  @Override
  public void stop(CoprocessorEnvironment env) throws IOException {
    // nothing to do
  }
}

5. Write a client class that invokes the BulkDeleteEndpoint coprocessor

The official example code, taken from the Javadoc of BulkDeleteEndpoint (a self-contained, runnable sketch follows after the snippet):

 Scan scan = new Scan();
 // set scan properties(rowkey range, filters, timerange etc).
 HTable ht = ...; // fill in your own HBase table here and make sure the connection to HBase has been initialized
 long noOfDeletedRows = 0L;
 Batch.Call<BulkDeleteService, BulkDeleteResponse> callable = 
     new Batch.Call<BulkDeleteService, BulkDeleteResponse>() {
   ServerRpcController controller = new ServerRpcController();
   BlockingRpcCallback<BulkDeleteResponse> rpcCallback = 
     new BlockingRpcCallback<BulkDeleteResponse>();

   public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
     Builder builder = BulkDeleteRequest.newBuilder();
     builder.setScan(ProtobufUtil.toScan(scan));
     builder.setDeleteType(DeleteType.VERSION);
     builder.setRowBatchSize(rowBatchSize);
     // Set optional timestamp if needed
     builder.setTimestamp(timeStamp);
     service.delete(controller, builder.build(), rpcCallback);
     return rpcCallback.get();
   }
 };
 Map<byte[], BulkDeleteResponse> result = ht.coprocessorService(BulkDeleteService.class, scan
     .getStartRow(), scan.getStopRow(), callable);
 for (BulkDeleteResponse response : result.values()) {
   noOfDeletedRows += response.getRowsDeleted();
 }
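
The Javadoc snippet above is not self-contained: ht, rowBatchSize and timeStamp are placeholders. Below is a minimal, self-contained sketch of the same call against HBase 1.1.x using the Connection/Table API; the table name test_table and the batch size 500 are assumptions for illustration, and DeleteType.ROW without a timestamp is used so that whole rows selected by the scan are deleted.

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteRequest;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteRequest.DeleteType;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteResponse;
import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteService;
import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
import org.apache.hadoop.hbase.ipc.ServerRpcController;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;

public class BulkDeleteClient {

  public static void main(String[] args) throws Throwable {
    Configuration conf = HBaseConfiguration.create();           // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("test_table"))) {

      // Scan that selects the rows to delete; set rowkey range / filters as needed.
      final Scan scan = new Scan();
      final int rowBatchSize = 500;                              // assumed batch size, tune for your data

      Batch.Call<BulkDeleteService, BulkDeleteResponse> callable =
          new Batch.Call<BulkDeleteService, BulkDeleteResponse>() {
        ServerRpcController controller = new ServerRpcController();
        BlockingRpcCallback<BulkDeleteResponse> rpcCallback =
            new BlockingRpcCallback<BulkDeleteResponse>();

        public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
          BulkDeleteRequest.Builder builder = BulkDeleteRequest.newBuilder();
          builder.setScan(ProtobufUtil.toScan(scan));
          builder.setDeleteType(DeleteType.ROW);                 // delete whole rows matched by the scan
          builder.setRowBatchSize(rowBatchSize);
          // builder.setTimestamp(...) could be set here for VERSION-type deletes
          service.delete(controller, builder.build(), rpcCallback);
          return rpcCallback.get();
        }
      };

      // The endpoint runs on every region whose key range overlaps [startRow, stopRow).
      Map<byte[], BulkDeleteResponse> results = table.coprocessorService(
          BulkDeleteService.class, scan.getStartRow(), scan.getStopRow(), callable);

      long rowsDeleted = 0L;
      for (BulkDeleteResponse response : results.values()) {
        rowsDeleted += response.getRowsDeleted();
      }
      System.out.println("rows deleted: " + rowsDeleted);
    }
  }
}

Compile and run it with the HBase client jars and the generated BulkDeleteProtos class on the classpath; the per-region responses are summed to get the total number of deleted rows.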

6. Deploy the coprocessor

Package BulkDeleteProtos.java and BulkDeleteEndpoint.java into a jar named coprocessor.jar and upload it to HDFS, for example under the absolute path /test.
Deploying the coprocessor from the HBase shell is recommended (an equivalent Java Admin API sketch follows these steps):
1. First disable the table that will load the coprocessor, 'test_table'.
2. Run the following command:
alter 'test_table',METHOD=>'table_att','coprocessor'=>'hdfs:///test/coprocessor.jar|org.apache.hadoop.hbase.coprocessor.example.BulkDeleteEndpoint|1001|'
Once the command succeeds, enable the table again.
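
If you would rather drive the same change from Java, the HBase Admin API can attach the coprocessor to the table descriptor. This is only a sketch under the same assumptions as the shell command above (jar at hdfs:///test/coprocessor.jar, priority 1001, table test_table).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class DeployBulkDeleteCoprocessor {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("test_table");
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      admin.disableTable(tableName);                      // same as 'disable' in the shell
      HTableDescriptor htd = admin.getTableDescriptor(tableName);
      htd.addCoprocessor(
          "org.apache.hadoop.hbase.coprocessor.example.BulkDeleteEndpoint",
          new Path("hdfs:///test/coprocessor.jar"),       // jar location in HDFS
          1001,                                           // coprocessor priority
          null);                                          // no extra key/value arguments
      admin.modifyTable(tableName, htd);                  // same as the 'alter' command
      admin.enableTable(tableName);
    }
  }
}
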
Notes:
1. 1001 is the priority of the deployed coprocessor; set it to whatever you need.
2. Deploying an endpoint coprocessor is a risky operation: if the deployment fails, even because the file name in the command is misspelled, the RegionServers will abort and the whole HBase cluster goes down (learned this the painful way).
The mitigation HBase documents is to set hbase.coprocessor.abortonerror to false in hbase-site.xml, which is effectively a debug mode: coprocessor load failures are logged instead of aborting the RegionServer.

<property>
  <name>hbase.coprocessor.abortonerror</name>
  <value>false</value>
</property>

3. After deployment, grant execute permission on the deployed jar in HDFS (coprocessor.jar in this article); otherwise the coprocessor may appear to be deployed successfully, yet calls return no result at all and the RegionServer logs show no error.
For example: sudo -u hdfs hadoop fs -chmod -R 777 /test/coprocessor.jar

How to unload the coprocessor (a Java Admin API sketch follows these steps):

First look at the table descriptor:
desc 'test_table'
The descriptor shows which slot the deployed coprocessor occupies; since only one was deployed here, it shows up as coprocessor$1.
Then run the unload command:
alter 'test_table', METHOD=>'table_att_unset', NAME=>'coprocessor$1'
Note: the value in NAME=>'coprocessor$1' must match what the table descriptor shows.
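
The unload can also be driven from Java. The sketch below mirrors the shell command; it assumes that your HBase client version provides HTableDescriptor.removeCoprocessor(String) and that taking the table offline for the schema change is acceptable.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RemoveBulkDeleteCoprocessor {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("test_table");
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      admin.disableTable(tableName);
      HTableDescriptor htd = admin.getTableDescriptor(tableName);
      // Drops the coprocessor entry for this class from the table descriptor
      // (assumed API; check availability in your HBase version).
      htd.removeCoprocessor("org.apache.hadoop.hbase.coprocessor.example.BulkDeleteEndpoint");
      admin.modifyTable(tableName, htd);
      admin.enableTable(tableName);
    }
  }
}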
