HBase Coprocessor Programming Examples

Latest recommended article published on 2022-09-02 16:24:45

u013063153

Category: HBase

Article link: https://blog.csdn.net/u013063153/article/details/76974611


1. Enabling the Aggregation Coprocessor

There are two ways to do this:

(1) Enable aggregation globally, so that it can operate on the data of all tables. This is done by editing hbase-site.xml:

<property>
  <name>hbase.coprocessor.user.region.classes</name>
  <value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation</value>
</property>

(2) Enable aggregation for one table, so that it only takes effect on that table. This is done in the HBase Shell:

2.1 Disable the table: hbase> disable 'mytable'

2.2 Add the aggregation coprocessor: hbase> alter 'mytable', METHOD => 'table_att', 'coprocessor' => '|org.apache.hadoop.hbase.coprocessor.AggregateImplementation||'

2.3 Re-enable the table: hbase> enable 'mytable'

2. Row Count Code (Code Snippet)

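import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

// Targets the HBase 0.94-era client API (table names passed as byte[]).
public class MyAggregationClient {

    private static final byte[] TABLE_NAME = Bytes.toBytes("mytable");
    private static final byte[] CF = Bytes.toBytes("vent");

    public static void main(String[] args) throws Throwable {
        Configuration customConf = new Configuration();
        customConf.setStrings("hbase.zookeeper.quorum",
                "node0,node1,node2");
        // Raise the RPC timeout (in ms); aggregating a large table can take a long time
        customConf.setLong("hbase.rpc.timeout", 600000);
        // Scanner caching: number of rows fetched per RPC
        customConf.setLong("hbase.client.scanner.caching", 1000);
        Configuration configuration = HBaseConfiguration.create(customConf);
        AggregationClient aggregationClient = new AggregationClient(
                configuration);
        Scan scan = new Scan();
        // The scan must name exactly one column family
        scan.addFamily(CF);
        // Pass a concrete interpreter instead of null; the count does not depend on the value type
        long rowCount = aggregationClient.rowCount(TABLE_NAME, new LongColumnInterpreter(), scan);
        System.out.println("row count is " + rowCount);
    }
}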

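3. Example: Building a Secondary Index

One use of HBase coprocessors is to build a secondary index with an Observer.

Example:
Suppose we need to query the orders placed by a given customer at a given shop. First there is an order detail table, whose rowkey is the processed order id.

Second there is an index table whose rowkey is based on the customer nick, with the following structure: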
rowkey             family
dp_id+buy_nick1    tid1:null  tid2:null
dp_id+buy_nick2    tid3:null
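To make the layout concrete, here is a minimal in-memory sketch of the index structure. It is plain Java, not HBase code: the class name IndexTableModel and the '+' separator in the rowkey are assumptions made for illustration, and the real index table would store the tids as column qualifiers with empty values.

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for the index table (hypothetical, for illustration only).
final class IndexTableModel {
    // rowkey -> sorted set of column qualifiers (tids); cell values are unused.
    private final Map<String, NavigableSet<String>> rows = new TreeMap<>();

    // Assumed key layout: dp_id and buy_nick joined by '+'.
    static String rowKey(String dpId, String buyNick) {
        return dpId + "+" + buyNick;
    }

    // Adds one tid as a new "column" on the customer's row (the row grows wider).
    void addOrder(String dpId, String buyNick, String tid) {
        rows.computeIfAbsent(rowKey(dpId, buyNick), k -> new TreeSet<>()).add(tid);
    }

    // Returns a copy of the tids stored on the row, or an empty set if the row is absent.
    NavigableSet<String> tids(String dpId, String buyNick) {
        NavigableSet<String> found = rows.get(rowKey(dpId, buyNick));
        return found == null ? new TreeSet<>() : new TreeSet<>(found);
    }
}
```

Each new order for the same shop and customer adds one more column to the same row, which is why the row width matters for this design.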

This table can be built by a coprocessor. Sample code:

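import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Targets the HBase 0.94-era observer API (prePut with a writeToWAL flag).
public class TestCoprocessor extends BaseRegionObserver {
    @Override
    public void prePut(final ObserverContext<RegionCoprocessorEnvironment> e,
                       final Put put, final WALEdit edit, final boolean writeToWAL)
            throws IOException {
        // Reuse the region server's configuration so the cluster settings apply
        HTable table = new HTable(e.getEnvironment().getConfiguration(), "index_table");
        try {
            List<KeyValue> kvs = put.get(Bytes.toBytes("data"), Bytes.toBytes("name"));
            for (KeyValue kv : kvs) {
                // The indexed value becomes the index rowkey; the source rowkey becomes the qualifier
                Put indexPut = new Put(kv.getValue());
                indexPut.add(Bytes.toBytes("index"), kv.getRow(), Bytes.toBytes(System.currentTimeMillis()));
                table.put(indexPut);
            }
        } finally {
            table.close(); // close the table even if an index put fails
        }
    }
}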

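That is, extend the BaseRegionObserver class and implement the prePut method so that index data is written to the index table before the row is inserted into the order detail table.

4. Using the Index Table

First issue a get against the index table to obtain the tids, then query the order detail table by those tids.

When there are several query conditions (several index tables), determine the final set of tids by applying the logical operators (and, or) to the results from each index table.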
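Combining the results of several index lookups can be sketched as plain set operations. The snippet below is a minimal illustration that assumes the tids from each index table have already been fetched into sets; TidCombiner and its method names are hypothetical and not part of HBase.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: combines tid sets fetched from several index tables.
final class TidCombiner {
    private TidCombiner() {}

    // AND: keep only the tids that appear in every index result.
    static Set<String> and(List<Set<String>> results) {
        if (results.isEmpty()) {
            return new LinkedHashSet<>();
        }
        Set<String> out = new LinkedHashSet<>(results.get(0));
        for (Set<String> r : results.subList(1, results.size())) {
            out.retainAll(r);
        }
        return out;
    }

    // OR: keep every tid that appears in at least one index result.
    static Set<String> or(List<Set<String>> results) {
        Set<String> out = new LinkedHashSet<>();
        for (Set<String> r : results) {
            out.addAll(r);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> byNick = new LinkedHashSet<>(Arrays.asList("tid1", "tid2", "tid3"));
        Set<String> byOther = new LinkedHashSet<>(Arrays.asList("tid2", "tid3", "tid4"));
        System.out.println(and(Arrays.asList(byNick, byOther)));
        System.out.println(or(Arrays.asList(byNick, byOther)));
    }
}
```

With an AND condition the intersection is usually much smaller than any single result, so it pays to intersect first and only then fetch the rows from the order detail table.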

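5. Things to Watch Out For

(1) The index table is an ordinary HBase table; for safety, keep the HLog (write-ahead log) enabled for it.
(2) The rowkey of the index table should preferably be an immutable value, to avoid accumulating large amounts of stale data in the index table.
(3) As in the example above, columns grow horizontally (a wide table). The rowkey design must take into account not only region balance but also the number of columns, so that rows do not become too wide. A column count of no more than three digits is recommended.
(4) As in the code above, one put operation actually writes to two tables in sequence. To preserve consistency you need to handle exceptions; retrying on failure is recommended.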
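The retry suggested in item (4) can be sketched with a small generic helper. This is only an illustration in plain Java: the Retry class, its parameters, and the linear backoff are assumptions, not something HBase provides, and retrying is only safe when the index write is idempotent (writing the same row and column again).

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper; not part of the HBase API.
final class Retry {
    private Retry() {}

    // Runs the action up to maxAttempts times, sleeping backoffMillis * attempt
    // between failed attempts, and rethrows the last failure if all attempts fail.
    static <T> T withRetry(Callable<T> action, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception ex) {
                last = ex;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }
}
```

Inside a coprocessor the wrapped action would be the write to the index table. Keep the attempt count and backoff small there, because prePut blocks the original write while it retries.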

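6. Performance

Put performance is not high. In the code above, every inserted row opens a new connection to the index table (this can be optimized with HTablePool) and then writes to it. The latency is therefore roughly doubled, and so is the load on the HBase cluster. With several index tables the load is higher still.
Query performance is better than with a filter, at the millisecond level, because every lookup is a rowkey lookup.
These are only estimates. Actual performance depends on the business scenario and the cluster, so it is best to test in advance.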