HBase 2.0 Java API Guide (Part 4): Filters

高达一号

Article link: https://blog.csdn.net/u010003835/article/details/105708504

Filter execution flow

Filter class hierarchy

Comparison operators

Comparators available to CompareFilter

Filter subclasses based on CompareFilter

Filter subclasses based on FilterBase

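The previous article covered scanners in HBase. This one covers filters.

Filters narrow down the data returned by a scan, and the filtering itself runs on the server side. The base type is Filter. Beyond that, HBase ships a number of ready-made filter classes that can be used without writing any filter code.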

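Filter execution flow

All filters take effect on the server side, a technique known as predicate pushdown. This guarantees that filtered-out data is never shipped to the client.

[Figure: how a filter is executed]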

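Filter class hierarchy

At the base of the hierarchy is Filter (an abstract class in HBase 2.x). Above it sits the abstract class FilterBase, and above that the abstract class CompareFilter.

FilterBase:

Some filter implementations extend FilterBase directly.

CompareFilter:

Other, more specialized filters extend CompareFilter. They require the user to supply at least two specific arguments: a comparison operator and a comparator.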

Compared with its base class FilterBase, CompareFilter mainly adds a compare() method. The operator values it accepts are listed below.

Comparison operators

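LESS
  Matches values less than the given value.

LESS_OR_EQUAL
  Matches values less than or equal to the given value.

EQUAL
  Matches values equal to the given value.

NOT_EQUAL
  Matches values not equal to the given value.

GREATER_OR_EQUAL
  Matches values greater than or equal to the given value.

GREATER
  Matches values greater than the given value.

NO_OP
  Excludes everything.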

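Comparators available to CompareFilter

Comparators provide several ways to compare keys and values. In older HBase releases they all extended WritableByteArrayComparable, which implemented the Writable and Comparable interfaces. In HBase 2.0 the common base class is ByteArrayComparable, as the SingleColumnValueFilter constructor signatures quoted later in this article show.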

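BinaryComparator
  Compares the current value with the threshold using Bytes.compareTo().

BinaryPrefixComparator
  Similar to the above and also based on Bytes.compareTo(), but performs a prefix match starting from the left end.

NullComparator
  Does no matching; it only checks whether the current value is null.

BitComparator
  Performs a bit-level comparison using the bitwise AND, OR, and XOR operations provided by the BitwiseOp class.

RegexStringComparator
  Matches table data against a regular expression supplied when the comparator is instantiated.

SubstringComparator
  Treats the threshold and the table data as String instances and matches them with a contains() operation.
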
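BitComparator, RegexStringComparator, and SubstringComparator can only be combined with the EQUAL and NOT_EQUAL operators, because their compareTo() methods return 0 on a match and 1 on a mismatch.

Tip: string-based comparators such as RegexStringComparator and SubstringComparator are slower and more resource-intensive than byte-based comparators. Every comparison converts the given value to a String, and substring extraction and regular-expression processing cost additional time.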
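
To see why the ordering operators carry no information for these three comparators, here is a minimal plain-Java sketch. It uses no HBase classes: the Op enum, substringCompare(), and passes() are hypothetical names invented for this illustration, and the sign convention in passes() is an assumption of the sketch that may be the reverse of what HBase uses internally. The point is only that a comparator limited to the results 0 and 1 cannot express "less than" or "greater than".

```java
import java.nio.charset.StandardCharsets;

/** Plain-Java model, no HBase classes involved. All names here are hypothetical. */
final class ComparatorOperatorSketch {

    /** Stand-in for the comparison operators listed earlier (NO_OP omitted). */
    enum Op { LESS, LESS_OR_EQUAL, EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, GREATER }

    /** Models a substring-style comparator: 0 if the cell value contains the needle, 1 otherwise. */
    static int substringCompare(String needle, byte[] cellValue) {
        String text = new String(cellValue, StandardCharsets.UTF_8);
        return text.contains(needle) ? 0 : 1;
    }

    /** Applies an operator to a comparator result (sign convention chosen for this sketch). */
    static boolean passes(Op op, int cmp) {
        switch (op) {
            case LESS:             return cmp < 0;
            case LESS_OR_EQUAL:    return cmp <= 0;
            case EQUAL:            return cmp == 0;
            case NOT_EQUAL:        return cmp != 0;
            case GREATER_OR_EQUAL: return cmp >= 0;
            case GREATER:          return cmp > 0;
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        String[] values = {"hadoop", "spark"};
        for (String value : values) {
            int cmp = substringCompare("hadoop", value.getBytes(StandardCharsets.UTF_8));
            for (Op op : Op.values()) {
                System.out.println(value + " " + op + " -> " + passes(op, cmp));
            }
        }
    }
}
```

With only 0 and 1 as inputs, LESS never passes and GREATER_OR_EQUAL always passes in this model, while LESS_OR_EQUAL and GREATER merely duplicate EQUAL and NOT_EQUAL. Only the last two describe what the comparator actually tested.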

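Filter subclasses based on CompareFilter

1. Row filter

RowFilter

RowFilter filters data by row key. Rows whose keys satisfy the filter condition are returned; all other rows are filtered out.

2. Column family filter

FamilyFilter

FamilyFilter selects results by comparing column family names.

3. Column qualifier filter

QualifierFilter

QualifierFilter selects results by column qualifier.

4. Value filter

ValueFilter

ValueFilter selects cells that hold a particular value.

5. Dependent column filter

DependentColumnFilter does more than filter on user-supplied information. It lets the user designate a reference column and uses that column to control the filtering of the other columns.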

Combined example code

package hbase_2;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Created by szh on 2020/4/23.
 * @author szh
 */
public class HBase_MultiCompareFilter {

    public static void main(String[] args) throws Exception {

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "cdh-manager,cdh-node1,cdh-node2");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // The existing table "test3" has two column families: "article" and "author".
        TableName tableName = TableName.valueOf("test3");

        // try-with-resources closes the table and the connection on exit.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName)) {

            // Rows whose row key equals "ce_shi1" exactly.
            Filter rowFilter = new RowFilter(CompareOperator.EQUAL,
                    new BinaryComparator(Bytes.toBytes("ce_shi1")));
            printScan(table, rowFilter);
            printSeparator();

            // Cells whose column family name starts with "arti".
            Filter familyFilter = new FamilyFilter(CompareOperator.EQUAL,
                    new BinaryPrefixComparator(Bytes.toBytes("arti")));
            printScan(table, familyFilter);
            printSeparator();

            // Cells whose value does not contain the substring "hadoop".
            Filter valueFilter = new ValueFilter(CompareOperator.NOT_EQUAL,
                    new SubstringComparator("hadoop"));
            printScan(table, valueFilter);
        }
    }

    /** Scans the whole table with the given filter and prints every Result. */
    private static void printScan(Table table, Filter filter) throws IOException {
        try (ResultScanner scanner = table.getScanner(new Scan().setFilter(filter))) {
            for (Result result : scanner) {
                System.out.println(result);
            }
        }
    }

    /** Prints the three-line separator used between result sets. */
    private static void printSeparator() {
        for (int i = 0; i < 3; i++) {
            System.out.println("========================================");
        }
    }

}

========================================

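Filter subclasses based on FilterBase

These filters are more restrictive than the CompareFilter-based ones.

1. Single column value filter

SingleColumnValueFilter

SingleColumnValueFilter fits the case where the value of one column decides whether an entire row is filtered out. First specify the column to check, then the value that column is compared against.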

/**
   * Constructor for binary compare of the value of a single column.  If the
   * column is found and the condition passes, all columns of the row will be
   * emitted.  If the condition fails, the row will not be emitted.
   * <p>
   * Use the filterIfColumnMissing flag to set whether the rest of the columns
   * in a row will be emitted if the specified column to check is not found in
   * the row.
   *
   * @param family name of column family
   * @param qualifier name of column qualifier
   * @param compareOp operator
   * @param value value to compare column values against
   * @deprecated Since 2.0.0. Will be removed in 3.0.0. Use
   * {@link #SingleColumnValueFilter(byte[], byte[], CompareOperator, byte[])} instead.
   */
  @Deprecated
  public SingleColumnValueFilter(final byte [] family, final byte [] qualifier,
      final CompareOp compareOp, final byte[] value) {
    this(family, qualifier, CompareOperator.valueOf(compareOp.name()),
      new org.apache.hadoop.hbase.filter.BinaryComparator(value));
  }

  /**
   * Constructor for binary compare of the value of a single column.  If the
   * column is found and the condition passes, all columns of the row will be
   * emitted.  If the condition fails, the row will not be emitted.
   * <p>
   * Use the filterIfColumnMissing flag to set whether the rest of the columns
   * in a row will be emitted if the specified column to check is not found in
   * the row.
   *
   * @param family name of column family
   * @param qualifier name of column qualifier
   * @param op operator
   * @param value value to compare column values against
   */
  public SingleColumnValueFilter(final byte [] family, final byte [] qualifier,
                                 final CompareOperator op, final byte[] value) {
    this(family, qualifier, op,
      new org.apache.hadoop.hbase.filter.BinaryComparator(value));
  }

  /**
   * Constructor for binary compare of the value of a single column.  If the
   * column is found and the condition passes, all columns of the row will be
   * emitted.  If the condition fails, the row will not be emitted.
   * <p>
   * Use the filterIfColumnMissing flag to set whether the rest of the columns
   * in a row will be emitted if the specified column to check is not found in
   * the row.
   *
   * @param family name of column family
   * @param qualifier name of column qualifier
   * @param compareOp operator
   * @param comparator Comparator to use.
   * @deprecated Since 2.0.0. Will be removed in 3.0.0. Use
   * {@link #SingleColumnValueFilter(byte[], byte[], CompareOperator, ByteArrayComparable)} instead.
   */
  @Deprecated
  public SingleColumnValueFilter(final byte [] family, final byte [] qualifier,
      final CompareOp compareOp,
      final org.apache.hadoop.hbase.filter.ByteArrayComparable comparator) {
    this(family, qualifier, CompareOperator.valueOf(compareOp.name()), comparator);
  }

  /**
   * Constructor for binary compare of the value of a single column.  If the
   * column is found and the condition passes, all columns of the row will be
   * emitted.  If the condition fails, the row will not be emitted.
   * <p>
   * Use the filterIfColumnMissing flag to set whether the rest of the columns
   * in a row will be emitted if the specified column to check is not found in
   * the row.
   *
   * @param family name of column family
   * @param qualifier name of column qualifier
   * @param op operator
   * @param comparator Comparator to use.
   */
  public SingleColumnValueFilter(final byte [] family, final byte [] qualifier,
      final CompareOperator op,
      final org.apache.hadoop.hbase.filter.ByteArrayComparable comparator) {
    this.columnFamily = family;
    this.columnQualifier = qualifier;
    this.op = op;
    this.comparator = comparator;
  }

  /**
   * Constructor for protobuf deserialization only.
   * @param family
   * @param qualifier
   * @param compareOp
   * @param comparator
   * @param filterIfMissing
   * @param latestVersionOnly
   * @deprecated Since 2.0.0. Will be removed in 3.0.0. Use
   * {@link #SingleColumnValueFilter(byte[], byte[], CompareOperator, ByteArrayComparable,
   *   boolean, boolean)} instead.
   */
  @Deprecated
  protected SingleColumnValueFilter(final byte[] family, final byte[] qualifier,
      final CompareOp compareOp, org.apache.hadoop.hbase.filter.ByteArrayComparable comparator,
      final boolean filterIfMissing,
      final boolean latestVersionOnly) {
    this(family, qualifier, CompareOperator.valueOf(compareOp.name()), comparator, filterIfMissing,
      latestVersionOnly);
  }

  /**
   * Constructor for protobuf deserialization only.
   * @param family
   * @param qualifier
   * @param op
   * @param comparator
   * @param filterIfMissing
   * @param latestVersionOnly
   */
  protected SingleColumnValueFilter(final byte[] family, final byte[] qualifier,
      final CompareOperator op, org.apache.hadoop.hbase.filter.ByteArrayComparable comparator,
       final boolean filterIfMissing, final boolean latestVersionOnly) {
    this(family, qualifier, op, comparator);
    this.filterIfMissing = filterIfMissing;
    this.latestVersionOnly = latestVersionOnly;
  }

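2. Single column value exclude filter

(SingleColumnValueExcludeFilter)

This filter extends SingleColumnValueFilter and offers slightly different semantics: the reference column itself is not included in the result.

3. Prefix filter

(PrefixFilter)

public PrefixFilter(final byte [] prefix)

A prefix is passed in when the filter is constructed, and every row whose key matches that prefix is returned to the client.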

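4. Page filter

PageFilter

This filter pages results by row. The pageSize argument supplied when the filter is created controls how many rows each page returns.

Paging places a strict limit on the number of rows returned at once. A scan may well cover more rows than the page size; when that happens, the filter has a mechanism for telling the region server to stop scanning.

public PageFilter(final long pageSize)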

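5. Key-only filter

KeyOnlyFilter

Some applications only need the key of each KeyValue in the result, not the actual data.

This filter strips the value from every cell so that only the key is returned (in older releases via the KeyValue.convertToKeyOnly(boolean) call).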

public KeyOnlyFilter() { this(false); }
public KeyOnlyFilter(boolean lenAsVal) { this.lenAsVal = lenAsVal; }

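6. First-key-only filter

FirstKeyOnlyFilter

This filter serves the case where only the first column of each row is needed (columns are implicitly sorted by HBase). It is typically used for row counting, where all that matters is whether a row exists.

In a column-oriented store, a row that exists must contain at least one column.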

7. Inclusive stop filter

InclusiveStopFilter

By default a scan includes the start row but excludes the stop row. With this filter the stop row can be included in the result as well.

public InclusiveStopFilter(final byte [] stopRowKey) {
  this.stopRowKey = stopRowKey;
}
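
The difference between an exclusive and an inclusive stop row can be pictured with a sorted map. The sketch below is plain Java, not the HBase API: scan() and sampleTable() are hypothetical helpers, the row keys are borrowed from this article's test table, and String ordering stands in for HBase's byte ordering (the two agree for these ASCII keys).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Plain-Java model of exclusive vs. inclusive stop rows. Not HBase code. */
final class InclusiveStopSketch {

    /** A tiny sorted "table": row key -> some value. */
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("ce_shi1", "zhouyuqin");
        table.put("ce_shi2", "spark");
        table.put("ce_shi3", "sunzhenhua");
        table.put("test33", "sunzhenhua");
        return table;
    }

    /** Returns the row keys from startRow (inclusive) up to stopRow (inclusive only if requested). */
    static List<String> scan(NavigableMap<String, String> table,
                             String startRow, String stopRow, boolean inclusiveStop) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, inclusiveStop).keySet());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        // Default scan semantics: the stop row is excluded.
        System.out.println(scan(table, "ce_shi1", "ce_shi3", false));
        // What the inclusive-stop behaviour adds: the stop row itself.
        System.out.println(scan(table, "ce_shi1", "ce_shi3", true));
    }
}
```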

8. Timestamps filter

TimestampsFilter

This filter gives fine-grained control over which versions appear in the scan result. The user passes in a List of timestamps.

/**
 * Constructor for filter that retains only the specified timestamps in the list.
 * @param timestamps
 */
public TimestampsFilter(List<Long> timestamps) {
  this(timestamps, false);
}

/**
 * Constructor for filter that retains only those
 * cells whose timestamp (version) is in the specified
 * list of timestamps.
 *
 * @param timestamps list of timestamps that are wanted.
 * @param canHint should the filter provide a seek hint? This can skip
 * past delete tombstones, so it should only be used when that
 * is not an issue ( no deletes, or don't care if data
 * becomes visible)
 */
public TimestampsFilter(List<Long> timestamps, boolean canHint) {
  for (Long timestamp : timestamps) {
    Preconditions.checkArgument(timestamp >= 0, "must be positive %s", timestamp);
  }
  this.canHint = canHint;
  this.timestamps = new TreeSet<>(timestamps);
  init();
}

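9. Column count filter

ColumnCountGetFilter

ColumnCountGetFilter limits the maximum number of columns retrieved per row. Once a row reaches the configured maximum, the filter stops the entire scan, so it is a poor fit for scans and works better with get().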

public ColumnCountGetFilter(final int n) {
  Preconditions.checkArgument(n >= 0, "limit be positive %s", n);
  this.limit = n;
}

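10. Column pagination filter

ColumnPaginationFilter

Like PageFilter, but this filter pages through the columns of a row. It skips every column whose position is below offset and then returns at most limit columns from that point on.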

/**
 * Initializes filter with an integer offset and limit. The offset is arrived at
 * scanning sequentially and skipping entries. @limit number of columns are
 * then retrieved. If multiple column families are involved, the columns may be spread
 * across them.
 *
 * @param limit Max number of columns to return.
 * @param offset The integer offset where to start pagination.
 */
public ColumnPaginationFilter(final int limit, final int offset)

Filter s = new ColumnPaginationFilter(5,15);

The code above starts at offset 15 and returns up to 5 columns of each row.
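
The limit/offset arithmetic can be checked with a small plain-Java sketch. It does not use HBase: page() and columns() are hypothetical helpers, and the column names are made up. It assumes the columns of a row are already sorted and numbered from 0, which is how the filter's Javadoc above describes the offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of column pagination within one row. Not HBase code. */
final class ColumnPaginationSketch {

    /** Builds n made-up, already sorted column names: col-00, col-01, ... */
    static List<String> columns(int n) {
        List<String> cols = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            cols.add((i < 10 ? "col-0" : "col-") + i);
        }
        return cols;
    }

    /** Skips the first `offset` columns, then returns at most `limit` columns. */
    static List<String> page(List<String> sortedColumns, int limit, int offset) {
        if (limit < 0 || offset < 0) {
            throw new IllegalArgumentException("limit and offset must not be negative");
        }
        int from = Math.min(offset, sortedColumns.size());
        // Compute the end index in long arithmetic so a huge limit cannot overflow.
        int to = (int) Math.min((long) from + limit, sortedColumns.size());
        return new ArrayList<>(sortedColumns.subList(from, to));
    }

    public static void main(String[] args) {
        // Same arguments as new ColumnPaginationFilter(5, 15): limit 5, offset 15.
        System.out.println(page(columns(30), 5, 15));
        // A row with only 17 columns yields a short page.
        System.out.println(page(columns(17), 5, 15));
    }
}
```

A row with fewer than offset columns contributes nothing to the result in this model.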

11. Column prefix filter

ColumnPrefixFilter filters by prefix-matching on the column qualifier. The user creates the filter with the prefix to match.

public ColumnPrefixFilter(final byte [] prefix) {
  this.prefix = prefix;
}

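12. Random row filter

RandomRowFilter

This filter includes a random selection of rows in the result. Its constructor takes a chance argument whose value lies between 0.0 and 1.0.

A negative chance causes every row to be filtered out. Conversely, a chance greater than 1.0 makes the result set contain every row.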

/**
 * Create a new filter with a specified chance for a row to be included.
 * 
 * @param chance
 */
public RandomRowFilter(float chance) {
  this.chance = chance;
}
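
The effect of chance, including the two out-of-range cases just described, can be modelled in a few lines of plain Java. This is a sketch of the described behaviour under the assumption that a row is kept when a uniform random number falls below chance. It is not taken from the HBase source, and includeRow() and countIncluded() are hypothetical names.

```java
import java.util.Random;

/** Plain-Java model of probabilistic row selection. Not HBase code. */
final class RandomRowSketch {

    /** Decides whether one row is kept for the given chance. */
    static boolean includeRow(float chance, Random random) {
        if (chance < 0) {
            return false;            // negative chance: every row is filtered out
        }
        if (chance > 1) {
            return true;             // chance above 1.0: every row is kept
        }
        return random.nextFloat() < chance;  // nextFloat() is uniform in [0.0, 1.0)
    }

    /** Counts how many of `rows` rows are kept, using a seeded generator for repeatability. */
    static int countIncluded(float chance, int rows, long seed) {
        Random random = new Random(seed);
        int kept = 0;
        for (int i = 0; i < rows; i++) {
            if (includeRow(chance, random)) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("chance=-0.1 keeps " + countIncluded(-0.1f, 1000, 42L));
        System.out.println("chance=0.5  keeps " + countIncluded(0.5f, 1000, 42L));
        System.out.println("chance=1.5  keeps " + countIncluded(1.5f, 1000, 42L));
    }
}
```

Because selection is per row and random, a small table scanned with chance 0.5 can return anything from no rows to all rows on a given run.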

Combined example code

Test data

hbase(main):010:0> scan 'test3',{VERSIONS=>3}
ROW                            COLUMN+CELL                                                                           
 ce_shi1                       column=article:title, timestamp=1587656211263, value=2                                
 ce_shi1                       column=author:name, timestamp=1587558488841, value=zhouyuqin                          
 ce_shi1                       column=author:name, timestamp=1587402132957, value=nicholas                           
 ce_shi1                       column=author:name, timestamp=1587402040153, value=nicholas                           
 ce_shi1                       column=author:nickname, timestamp=1587402040153, value=lee                            
 ce_shi2                       column=author:name, timestamp=1587402132957, value=spark                              
 ce_shi2                       column=author:name, timestamp=1587402040153, value=spark                              
 ce_shi2                       column=author:nickname, timestamp=1587402132957, value=hadoop                         
 ce_shi2                       column=author:nickname, timestamp=1587402040153, value=hadoop                         
 ce_shi3                       column=author:age, timestamp=1587558488841, value=12                                  
 ce_shi3                       column=author:name, timestamp=1587558488841, value=sunzhenhua                         
 test33                        column=author:name, timestamp=1587402133000, value=sunzhenhua                         
 test33                        column=author:name, timestamp=1587402040188, value=sunzhenhua                         
 test33                        column=author:name, timestamp=1587400015581, value=sunzhenhua                         
4 row(s)
Took 0.0753 seconds

Code

package hbase_2;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Created by szh on 2020/4/23.
 * @author szh
 */
public class Hbase_MultiFilterBase {

    public static void main(String[] args) throws Exception {

        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "cdh-manager,cdh-node1,cdh-node2");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // The existing table "test3" has two column families: "article" and "author".
        TableName tableName = TableName.valueOf("test3");

        // try-with-resources closes the table and the connection on exit.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName)) {

            // Rows whose author:name value is not "zhouyuqin".
            Filter a1 = new SingleColumnValueFilter(
                    Bytes.toBytes("author"),
                    Bytes.toBytes("name"),
                    CompareOperator.NOT_EQUAL,
                    Bytes.toBytes("zhouyuqin"));
            printScan(table, a1);
            printSeparator();

            // Keys only: every cell comes back with an empty value (vlen=0).
            Filter a2 = new KeyOnlyFilter();
            printScan(table, a2);
            printSeparator();

            // Each row is kept with probability 0.5, so the output varies between runs.
            Filter a3 = new RandomRowFilter(0.5f);
            printScan(table, a3);
            printSeparator();

            // Only columns whose qualifier starts with "nick".
            Filter a4 = new ColumnPrefixFilter(Bytes.toBytes("nick"));
            printScan(table, a4);
        }
    }

    /** Scans the whole table with the given filter and prints every Result. */
    private static void printScan(Table table, Filter filter) throws IOException {
        try (ResultScanner scanner = table.getScanner(new Scan().setFilter(filter))) {
            for (Result result : scanner) {
                System.out.println(result);
            }
        }
    }

    /** Prints the three-line separator used between result sets. */
    private static void printSeparator() {
        for (int i = 0; i < 3; i++) {
            System.out.println("=================================");
        }
    }

}

Output

keyvalues={ce_shi2/author:name/1587402132957/Put/vlen=5/seqid=0, ce_shi2/author:nickname/1587402132957/Put/vlen=6/seqid=0}
keyvalues={ce_shi3/author:age/1587558488841/Put/vlen=2/seqid=0, ce_shi3/author:name/1587558488841/Put/vlen=10/seqid=0}
keyvalues={test33/author:name/1587402133000/Put/vlen=10/seqid=0}
=================================
=================================
=================================
keyvalues={ce_shi1/article:title/1587656211263/Put/vlen=0/seqid=0, ce_shi1/author:name/1587558488841/Put/vlen=0/seqid=0, ce_shi1/author:nickname/1587402040153/Put/vlen=0/seqid=0}
keyvalues={ce_shi2/author:name/1587402132957/Put/vlen=0/seqid=0, ce_shi2/author:nickname/1587402132957/Put/vlen=0/seqid=0}
keyvalues={ce_shi3/author:age/1587558488841/Put/vlen=0/seqid=0, ce_shi3/author:name/1587558488841/Put/vlen=0/seqid=0}
keyvalues={test33/author:name/1587402133000/Put/vlen=0/seqid=0}
=================================
=================================
=================================
keyvalues={test33/author:name/1587402133000/Put/vlen=10/seqid=0}
=================================
=================================
=================================
keyvalues={ce_shi1/author:nickname/1587402040153/Put/vlen=3/seqid=0}
keyvalues={ce_shi2/author:nickname/1587402132957/Put/vlen=6/seqid=0}

======================================================

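Decorating filters

Some filters do little on their own but add behavior when applied to another filter. This is what decorating filters provide.

Skip filter

SkipFilter

This filter wraps a user-supplied filter. When the wrapped filter decides that a KeyValue should be filtered out, SkipFilter extends that decision to the whole row and filters out the entire row.

While-match filter

WhileMatchFilter

This decorating filter is similar to the previous one, except that as soon as one piece of data is filtered out it abandons the scan altogether.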
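
The contrast between the two decorators can be sketched in plain Java. This is a simplified model, not HBase code: skip() and whileMatch() are hypothetical helpers, predicates stand in for the wrapped filters, and the rows hold the latest values from this article's test table. The skip model works on cell values, the while-match model on row keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Plain-Java model of the two decorating behaviours. Not HBase code. */
final class DecoratingFilterSketch {

    /** Row key -> cell values, in scan order. */
    static Map<String, List<String>> sampleRows() {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("ce_shi1", Arrays.asList("2", "zhouyuqin", "lee"));
        rows.put("ce_shi2", Arrays.asList("spark", "hadoop"));
        rows.put("ce_shi3", Arrays.asList("12", "sunzhenhua"));
        rows.put("test33", Arrays.asList("sunzhenhua"));
        return rows;
    }

    /** Skip behaviour: a row survives only if every one of its cells passes the wrapped filter. */
    static List<String> skip(Map<String, List<String>> rows, Predicate<String> keepCell) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (row.getValue().stream().allMatch(keepCell)) {
                kept.add(row.getKey());
            }
        }
        return kept;
    }

    /** While-match behaviour: the scan ends at the first row the wrapped filter rejects. */
    static List<String> whileMatch(List<String> rowKeys, Predicate<String> keepRow) {
        List<String> kept = new ArrayList<>();
        for (String key : rowKeys) {
            if (!keepRow.test(key)) {
                break;  // abandon the rest of the scan
            }
            kept.add(key);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = sampleRows();
        // ce_shi2 contains a "hadoop" cell, so the whole row is dropped.
        System.out.println(skip(rows, value -> !value.contains("hadoop")));
        // The scan stops at ce_shi2, so later rows are never reached.
        System.out.println(whileMatch(new ArrayList<>(rows.keySet()), key -> !key.equals("ce_shi2")));
    }
}
```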

==================================================

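In real applications several filters often need to work together to restrict what is returned to the client. FilterList provides this.

Put simply, each filter above implements a single filtering function. When several filters need to be combined, use FilterList.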

FilterList(List<Filter> rowFilters)
FilterList(Operator operator)
FilterList(Operator operator, List<Filter> rowFilters)

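The operator argument determines how the results of the individual filters are combined. The available operators are listed below; the default is MUST_PASS_ALL.

FilterList.Operator enum values

MUST_PASS_ALL: a value is included in the result only if every filter allows it, that is, no filter rejects it.

MUST_PASS_ONE: a value is included in the result if at least one filter allows it.

After creating a FilterList instance, filters can be added with the following method:

void addFilter(Filter filter)

Note:

Using an ArrayList guarantees that the filters are executed in the order in which they were added.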
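
The two operators behave like logical AND and OR over the member filters. The sketch below models that in plain Java with predicates. It is not the HBase FilterList class: combine() and demo() are hypothetical helpers, the Operator enum merely mirrors the names above, the row keys come from this article's test table, and the result for an empty list is a choice made by the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Plain-Java model of combining filters with AND / OR semantics. Not HBase code. */
final class FilterListSketch {

    /** Mirrors the operator names described in the text. */
    enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

    /** Combines the filters in list order; evaluation stops as soon as the outcome is known. */
    static <T> Predicate<T> combine(Operator op, List<Predicate<T>> filters) {
        return value -> {
            for (Predicate<T> filter : filters) {
                boolean ok = filter.test(value);
                if (op == Operator.MUST_PASS_ALL && !ok) {
                    return false;   // one rejection is enough to exclude the value
                }
                if (op == Operator.MUST_PASS_ONE && ok) {
                    return true;    // one acceptance is enough to include the value
                }
            }
            // Reached when no filter decided early (sketch-specific result for an empty list).
            return op == Operator.MUST_PASS_ALL;
        };
    }

    /** Applies two fixed example filters to the sample row keys. */
    static List<String> demo(Operator op) {
        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(key -> key.startsWith("ce_"));  // added first, evaluated first
        filters.add(key -> key.endsWith("1"));
        Predicate<String> combined = combine(op, filters);

        List<String> kept = new ArrayList<>();
        for (String key : Arrays.asList("ce_shi1", "ce_shi2", "ce_shi3", "test33")) {
            if (combined.test(key)) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(Operator.MUST_PASS_ALL + " -> " + demo(Operator.MUST_PASS_ALL));
        System.out.println(Operator.MUST_PASS_ONE + " -> " + demo(Operator.MUST_PASS_ONE));
    }
}
```

Evaluating in list order is what makes the ordering note above matter: with short-circuiting, a cheap and selective filter placed first spares the later ones some work.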

======================

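Custom filters

Users can implement Filter directly or extend the FilterBase class; the latter already provides default implementations for all of its methods.

Filter contains a public enum named ReturnCode. filterKeyValue() (deprecated in HBase 2.0 in favor of filterCell()) returns it to tell the execution framework what to do next. A filter can skip a single value, the rest of a column, or the rest of a row without iterating over all the data, which makes data retrieval considerably more efficient.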

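Filter.ReturnCode values

INCLUDE
  Include this KeyValue in the result.

SKIP
  Skip this KeyValue and continue processing.

NEXT_COL
  Skip the current column and continue with the next one. TimestampsFilter, for example, uses this return code.

NEXT_ROW
  Similar to the above: skip the current row and continue with the next one. RowFilter, for example, uses this return code.

SEEK_NEXT_USING_HINT
  Some filters need to skip a whole range of values. This return code tells the execution framework to call getNextKeyHint() to decide where to jump to. ColumnPrefixFilter, for example, uses this feature.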