彷徨 | An Introduction to 20 HBase Filters

俊杰梓

Article link: https://blog.csdn.net/weixin_35353187/article/details/82431885

This article shows how to filter scans with the TScan filter string and introduces 20 filters.

English reference:

Using filters with TScan

1. Compare operators: The client should use the symbols (<, <=, =, !=, >, >=) to express compare operators

2. Comparators: BinaryComparator - binary; BinaryPrefixComparator - binaryprefix; RegexStringComparator - regexstring; SubstringComparator - substring

Example comparator values

1. binary:abc compares the whole value with "abc"; combined with the > operator it matches everything that is lexicographically greater than "abc"

2. binaryprefix:abc compares only the first 3 bytes; combined with the = operator it matches everything whose first 3 characters are equal to "abc"

3. regexstring:ab*yz is a regular expression, not a wildcard pattern; it matches every value that contains an "a", followed by zero or more "b", followed by "yz"

4. substring:abc123 will match everything that contains the substring "abc123"
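The four comparator prefixes are easier to tell apart when modelled side by side. The sketch below is plain Java and does not use the HBase classes; the class name `ComparatorSketch` and its methods are hypothetical, and the case-insensitive substring match and "find anywhere" regex behavior reflect my understanding of the HBase comparators rather than anything stated in this article.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.regex.Pattern;

// Plain-Java model of the four comparator prefixes; NOT the HBase classes themselves.
class ComparatorSketch {
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // binary: lexicographic comparison of the whole value (bytes treated as unsigned).
    static int binary(String value, String operand) {
        return Arrays.compareUnsigned(bytes(value), bytes(operand));
    }

    // binaryprefix with "=": only the first prefix.length bytes of the value are compared.
    static boolean binaryPrefixEquals(String value, String prefix) {
        byte[] v = bytes(value);
        byte[] p = bytes(prefix);
        return v.length >= p.length && Arrays.equals(Arrays.copyOf(v, p.length), p);
    }

    // regexstring with "=": the regular expression may match anywhere in the value.
    static boolean regexString(String value, String regex) {
        return Pattern.compile(regex).matcher(value).find();
    }

    // substring with "=": the value contains the operand (assumed case-insensitive).
    static boolean substring(String value, String needle) {
        return value.toLowerCase(Locale.ROOT).contains(needle.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(binary("abd", "abc") > 0);
        System.out.println(binaryPrefixEquals("abcdef", "abc"));
        System.out.println(regexString("xayz", "ab*yz"));
        System.out.println(substring("xxabc123yy", "abc123"));
    }
}
```

Each call in `main` corresponds to one of the four example comparator values listed above.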

3. Filter types (individual filter syntax):

KeyOnlyFilter : This filter doesn’t take any arguments. It returns only the key component of each key-value.

FirstKeyOnlyFilter : This filter doesn’t take any arguments. It returns only the first key-value from each row.

PrefixFilter : This filter takes one argument – a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix

ColumnPrefixFilter : This filter takes one argument – a column prefix. It returns only those key-values present in a column that starts with the specified column prefix. The column prefix must be of the form: “qualifier”.

MultipleColumnPrefixFilter : This filter takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes. Each of the column prefixes must be of the form: “qualifier”.

ColumnCountGetFilter : This filter takes one argument – a limit. It returns the first limit number of columns in the table.

PageFilter : This filter takes one argument – a page size. It returns page size number of rows from the table.

ColumnPaginationFilter : This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows.

InclusiveStopFilter : This filter takes one argument – a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row.

TimestampsFilter : This filter takes a list of timestamps. It returns those key-values whose timestamps match any of the specified timestamps.

RowFilter : This filter takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that row.

FamilyFilter : This filter takes a compare operator and a comparator. It compares each column family name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that family.

QualifierFilter : This filter takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column.

ValueFilter : This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value.

DependentColumnFilter : This filter takes two arguments – a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If the row doesn’t contain the specified column – none of the key-values in that row will be returned.

SingleColumnValueFilter : This filter takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found – all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted. If the condition fails, the row will not be emitted.

SingleColumnValueExcludeFilter : This filter takes the same arguments and behaves the same as SingleColumnValueFilter. However, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.

ColumnRangeFilter : This filter is used for selecting only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.

My own tests

1 Row key filter (RowFilter)

scan.setStartRow(hbaseService.wrap(startRowKey));

scan.setStopRow(hbaseService.wrap(endRowKey));

scan.setColumns(columnList);

filter = "RowFilter (=, 'binary:baiyc_20150701_0002') ";

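2 Value filter (ValueFilter)

filter = "ValueFilter (=, 'binary:33')"; // exact match

filter = "ValueFilter (=, 'binaryprefix:baiyc')"; // prefix match

filter = "ValueFilter (=, 'regexstring:baiyc*2')"; // regular expression match (here c* means zero or more "c", not a wildcard)

filter = "ValueFilter (=, 'substring:aiyc')"; // contains the substring

filter = "(ValueFilter (=, 'substring:aiyc') OR ValueFilter (=, 'binaryprefix:baiyc'))"; // contains the substring OR starts with the prefix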

3 Column qualifier filter (QualifierFilter)

filter = "QualifierFilter(=,'substring:name')";

filter = "QualifierFilter(=,'binary:name')";

filter = "(QualifierFilter(=,'binary:name') OR QualifierFilter(=,'binary:age'))";
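Since every test above is just a string, a small helper can keep the quoting and parentheses consistent when filters are combined. This is a minimal sketch in plain Java; `FilterStrings` and its methods are hypothetical names, not part of HBase, and the rule of doubling a single quote inside an argument is my understanding of the filter language rather than something this article tests.

```java
// Hypothetical helper for assembling filter strings; not part of HBase.
class FilterStrings {
    // Builds e.g. QualifierFilter(=, 'binary:name').
    static String compare(String filterName, String operator, String comparator, String value) {
        // Assumption: a single quote inside an argument is escaped by doubling it.
        String escaped = value.replace("'", "''");
        return filterName + "(" + operator + ", '" + comparator + ":" + escaped + "')";
    }

    // Parenthesize explicitly so the result does not depend on operator precedence.
    static String and(String left, String right) {
        return "(" + left + " AND " + right + ")";
    }

    static String or(String left, String right) {
        return "(" + left + " OR " + right + ")";
    }

    public static void main(String[] args) {
        String byName = compare("QualifierFilter", "=", "binary", "name");
        String byAge = compare("QualifierFilter", "=", "binary", "age");
        System.out.println(or(byName, byAge));
    }
}
```

The resulting string can be passed to the scan the same way as the hand-written filters in this section.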

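4 Single column value filter (SingleColumnValueFilter)

SingleColumnValueFilter uses the value of one column to decide whether a whole row is filtered out.

// Pick a column family and a column, then compare the column's value; rows that pass are returned in full. Note that a row that does not contain the column is returned as well, unless filterIfColumnMissing is set to true.

// If the filterIfColumnMissing flag is true, none of the columns are emitted for a row that lacks the specified column. The default is false.

// If the setLatestVersionOnly flag is false, earlier versions are checked as well. The default is true.

filter = "SingleColumnValueFilter('base_info','name',>=,'binary:zhangsan_20150701_0000')"; // returns data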
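The interaction between the comparison and the filterIfColumnMissing flag can be sketched as a small decision function. This is a plain-Java model under stated assumptions, not the HBase implementation: rows are represented as a `Map` from qualifier to value, only the `>=` operator with a binary comparator is modelled, and `SingleColumnValueSketch` and `emitRow` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the row-level decision; NOT the HBase implementation.
class SingleColumnValueSketch {
    // Returns true when the whole row should be emitted.
    static boolean emitRow(Map<String, String> row, String qualifier,
                           String threshold, boolean filterIfColumnMissing) {
        String value = row.get(qualifier);
        if (value == null) {
            // Column missing: the row is emitted unless filterIfColumnMissing is true.
            return !filterIfColumnMissing;
        }
        // Models ">=" with a binary comparator (String order equals byte order for ASCII).
        return value.compareTo(threshold) >= 0;
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("age", "30"); // this row has no "name" column
        System.out.println(emitRow(row, "name", "zhangsan_20150701_0000", false));
        System.out.println(emitRow(row, "name", "zhangsan_20150701_0000", true));
    }
}
```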

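5 Single column value exclude filter (SingleColumnValueExcludeFilter)

This filter extends SingleColumnValueFilter; the reference column is not included in the result.

It applies the same test as the filter above, but when the condition matches, the tested column itself is left out of the returned row.

filter= "SingleColumnValueExcludeFilter('base_info','name',>=,'binary:baiyc_20150701_0002')";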

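6 Row key prefix filter (PrefixFilter)

PrefixFilter takes the row key prefix itself as its argument, with no comparator prefix such as regexstring: and no wildcard.

filter = "PrefixFilter ('baiyc_20150701') ";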

7 Page filter (PageFilter)

PageFilter paginates the result by row. The client has to remember the row key of the last row it read.

filter = "PageFilter(12)"; // returns data
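The "remember the last row key" step can be sketched with an in-memory sorted map standing in for the table. This is plain Java under that assumption; `PagingSketch` and `nextPage` are hypothetical names, and a real client would instead restart the scan just after the remembered row key with the same PageFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for row-based paging; NOT an HBase client.
class PagingSketch {
    // Returns up to pageSize row keys that sort strictly after lastRowSeen (null = from the start).
    static List<String> nextPage(NavigableMap<String, String> table, String lastRowSeen, int pageSize) {
        NavigableMap<String, String> remaining =
                lastRowSeen == null ? table : table.tailMap(lastRowSeen, false);
        List<String> page = new ArrayList<>();
        for (String rowKey : remaining.keySet()) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(rowKey);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 1; i <= 5; i++) {
            table.put("r" + i, "value" + i);
        }
        String last = null;
        List<String> page;
        // Keep asking for the next page, starting after the last row key seen.
        while (!(page = nextPage(table, last, 2)).isEmpty()) {
            System.out.println(page);
            last = page.get(page.size() - 1);
        }
    }
}
```

As far as I know, the real filter is evaluated separately on each region server, so a client may receive more rows than the page size and should still cut off on its own side.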

8 Key-only filter (KeyOnlyFilter)

KeyOnlyFilter returns only the key part of each KeyValue in the result, without the actual value data.

filter = "KeyOnlyFilter()"; // returns data

9 First-key-only filter (FirstKeyOnlyFilter)

FirstKeyOnlyFilter only needs to access the first column of each row. It is commonly used for row counting.

filter = "FirstKeyOnlyFilter()"; // returns data

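10 Inclusive stop filter (InclusiveStopFilter)

By default a scan includes the start row but excludes the stop row. With InclusiveStopFilter the stop row is included in the result as well. The argument is the stop row key itself, without a comparator prefix.

filter = "InclusiveStopFilter('baiyc_20150701_0016')";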

11 Timestamps filter (TimestampsFilter)

TimestampsFilter gives fine-grained control over versions in the scan result. A version is the value of a column at a specific point in time.

filter = "TimestampsFilter (1435747469212, 1435738500459) ";

12 Column count filter (ColumnCountGetFilter)

ColumnCountGetFilter limits how many columns are returned per row, set with ColumnCountGetFilter(int n). It is not well suited to scans and fits get() better.

filter = "ColumnCountGetFilter(3)";

13 Column pagination filter (ColumnPaginationFilter)

ColumnPaginationFilter paginates over the columns within each row.

ColumnPaginationFilter(int limit, int offset) skips the first offset columns of a row and then returns at most limit columns.

filter = "ColumnPaginationFilter(1,2)";
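The limit/offset arithmetic for one row can be sketched as follows. This is plain Java that models a row as a sorted list of qualifier names; `ColumnPagingSketch` and `page` are hypothetical names, and the qualifiers in `main` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Models ColumnPaginationFilter(limit, offset) on a single row; NOT the HBase class.
class ColumnPagingSketch {
    static List<String> page(List<String> sortedQualifiers, int limit, int offset) {
        if (offset >= sortedQualifiers.size()) {
            return new ArrayList<>(); // nothing left after skipping offset columns
        }
        int end = Math.min(sortedQualifiers.size(), offset + limit);
        return new ArrayList<>(sortedQualifiers.subList(offset, end));
    }

    public static void main(String[] args) {
        // Hypothetical qualifiers, in sorted order as HBase would store them.
        List<String> row = List.of("age", "city", "name", "phone");
        // Same arguments as ColumnPaginationFilter(1,2): skip two columns, return one.
        System.out.println(page(row, 1, 2));
    }
}
```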

14 Column prefix filter (ColumnPrefixFilter)

ColumnPrefixFilter matches on the prefix of the column name.

filter = "ColumnPrefixFilter ('name')"; // returns data

filter = "ColumnPrefixFilter ('age')"; // returns data

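15 Column range filter (ColumnRangeFilter)

The test ran and returned data, but what it actually filters still needs to be verified. Note that, per the English description above, this filter compares column names against minColumn and maxColumn, not row keys, whereas the bounds used in this test look like row keys.

filter = "ColumnRangeFilter ('baiyc_20150701_0000', true, 'zhangsan_20150701_0015', false)"; // returned all the data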
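To help check what such a range selects, the comparison can be modelled on qualifier names alone. This is a plain-Java sketch, not the HBase class; `ColumnRangeSketch` and `inRange` are hypothetical names, and the qualifiers `name` and `age` are borrowed from other tests in this article only as sample inputs.

```java
// Models the ColumnRangeFilter bounds check on one qualifier; NOT the HBase class.
class ColumnRangeSketch {
    static boolean inRange(String qualifier, String minColumn, boolean minInclusive,
                           String maxColumn, boolean maxInclusive) {
        int lower = qualifier.compareTo(minColumn);
        int upper = qualifier.compareTo(maxColumn);
        boolean aboveMin = minInclusive ? lower >= 0 : lower > 0;
        boolean belowMax = maxInclusive ? upper <= 0 : upper < 0;
        return aboveMin && belowMax;
    }

    public static void main(String[] args) {
        String min = "baiyc_20150701_0000";
        String max = "zhangsan_20150701_0015";
        // Qualifiers are compared with the bounds, regardless of what the row keys are.
        System.out.println(inRange("name", min, true, max, false));
        System.out.println(inRange("age", min, true, max, false));
    }
}
```

With these bounds a qualifier such as `name` falls inside the range while `age` sorts before the lower bound, which is one way to cross-check the result of the test above.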

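16 Dependent column filter (DependentColumnFilter)

Queries the values tied to one specific column.

// This filter takes two arguments: a column family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If a row does not contain the specified column, none of the key-values in that row are returned.

// The filter also accepts an optional boolean argument, dropDependentColumn. If it is true, the dependent column itself is not returned.

// The filter also accepts two further optional arguments, a compare operator and a value comparator, for an additional check on the dependent column. If the dependent column is found, its value must also pass this check before the timestamps are considered.

filter = "DependentColumnFilter('base_info','name')"; // returns data

filter = "DependentColumnFilter('base_info','age')"; // returns data

filter = "DependentColumnFilter('base_info','name') AND DependentColumnFilter('base_info','age')"; // returns data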

17 Column family filter (FamilyFilter)

FamilyFilter compares each column family name with the comparator.

filter = "FamilyFilter(=, 'binary:base_info')"; // returns data

filter = "FamilyFilter(=, 'binary:extra_info')"; // returns data

18 Multiple column prefix filter (MultipleColumnPrefixFilter)

MultipleColumnPrefixFilter matches columns whose names start with any of the given prefixes.

filter = "MultipleColumnPrefixFilter('name','age')"; // returns data

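19 Skip filter (SkipFilter)

SkipFilter wraps another filter, in this case a ValueFilter. On its own, a ValueFilter returns every row that has cells satisfying the condition, together with those matching columns.

// Once SkipFilter is added, if any column in a row fails the condition, the whole row is dropped. Usage: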

Filter filter1 = new ValueFilter(CompareFilter.CompareOp.NOT_EQUAL, new BinaryComparator(Bytes.toBytes("val-0")));

Filter filter2 = new SkipFilter(filter1);
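The difference between the bare ValueFilter and the wrapped version can be sketched on an in-memory row. This is plain Java, not the HBase API; rows are modelled as a `Map` from qualifier to value, and `SkipSketch`, `valueFilter` and `skipFilter` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Models ValueFilter(!=, excluded) with and without a SkipFilter wrapper; NOT the HBase API.
class SkipSketch {
    // ValueFilter alone: keep only the cells whose value differs from `excluded`.
    static Map<String, String> valueFilter(Map<String, String> row, String excluded) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> cell : row.entrySet()) {
            if (!cell.getValue().equals(excluded)) {
                kept.put(cell.getKey(), cell.getValue());
            }
        }
        return kept;
    }

    // SkipFilter(ValueFilter): if any cell fails the test, the whole row is dropped.
    static Map<String, String> skipFilter(Map<String, String> row, String excluded) {
        if (row.containsValue(excluded)) {
            return new LinkedHashMap<>();
        }
        return new LinkedHashMap<>(row);
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("col-0", "val-0");
        row.put("col-1", "val-1");
        System.out.println(valueFilter(row, "val-0")); // only col-1 survives
        System.out.println(skipFilter(row, "val-0"));  // the row is dropped entirely
    }
}
```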

20 Random row filter (RandomRowFilter)

RandomRowFilter includes a random subset of rows in the result. It is constructed as RandomRowFilter(float chance), where chance lies between 0 and 1.

//filter = "RandomRowFilter(0.8)";
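The effect of the chance parameter can be sketched as a per-row coin flip. This is a plain-Java model under the assumption that each row is kept independently with probability chance; `RandomRowSketch` and `sample` are hypothetical names, and the `Random` instance is passed in only to make the sketch easy to test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Models per-row random selection; NOT the HBase RandomRowFilter class.
class RandomRowSketch {
    static List<String> sample(List<String> rowKeys, float chance, Random random) {
        List<String> kept = new ArrayList<>();
        for (String rowKey : rowKeys) {
            // nextFloat() is in [0, 1): chance 0 keeps nothing, chance 1 keeps everything.
            if (random.nextFloat() < chance) {
                kept.add(rowKey);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("r1", "r2", "r3", "r4", "r5");
        // Roughly 80% of the rows on average; the exact subset varies from run to run.
        System.out.println(sample(rows, 0.8f, new Random()));
    }
}
```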

    @Override
    @Transactional(readOnly = false, propagation = Propagation.REQUIRED)
    public List<HBaseRow> scannerOpenWithScan(String table, TScan scan, Map<String, String> attributes) throws DataBaseException {
        List<HBaseRow> rowList = new ArrayList<HBaseRow>();
        ByteBuffer tableName = wrap(table);
        Map<ByteBuffer, ByteBuffer> wrappedAttributes = encodeAttributes(attributes);

        logger.info("scannerOpenWithScan start...");
        TTransport transport = null;
        int scanId = -1;
        try {
            // Open the Thrift connection
            transport = new TSocket(host, port);
            TProtocol protocol = new TBinaryProtocol(transport, true, true);
            Hbase.Client client = new Hbase.Client(protocol);
            transport.open();
            logger.info("HBase connection opened...");

            scanId = client.scannerOpenWithScan(tableName, scan, wrappedAttributes);
            logger.info("scanId:" + scanId);
            List<TRowResult> results = client.scannerGetList(scanId, Const.HBASE_NUMBER_ROW);
            int row = 0;
            // Fetch batch after batch until the scanner is exhausted or the display limit is reached
            while (results != null && !results.isEmpty()) {
                // Log the batch size only after the null check above
                logger.info("results:" + results.size());
                for (TRowResult result : results) {
                    if (row >= Const.HBASE_DISPLAY_ROW) {
                        break;
                    }
                    this.iterateResults(tableName, result, rowList);
                    row++;
                }
                if (row >= Const.HBASE_DISPLAY_ROW) {
                    break;
                }
                results = client.scannerGetList(scanId, Const.HBASE_NUMBER_ROW);
            }
            logger.info("rowList:" + rowList.size());
            client.scannerClose(scanId);
        } catch (TTransportException e) {
            throw new DataBaseException(e.getMessage(), e);
        } catch (IOError e) {
            throw new DataBaseException(e.getMessage(), e);
        } catch (TException e) {
            if (e instanceof TApplicationException && ((TApplicationException) e).getType() == TApplicationException.MISSING_RESULT) {
                logger.info("scannerOpenWithScan returned no result (MISSING_RESULT)");
            }
            throw new DataBaseException(e.getMessage(), e);
        }
        /*
        catch (IOException e) {
            throw new DataBaseException(e.getMessage(), e);
        }
        */
        finally {
            // Always close the transport, whether or not the scan succeeded
            if (transport != null) {
                transport.close();
                logger.info("HBase connection closed.");
            }
        }
        logger.info("scannerOpenWithScan end...");
        return rowList;
    }

6.4 Calling the method from testScan

    public void testScan() {
        String tableName = "user_info";
        try {
            // 12. Query data
            // 12.1 Full-table query
            TScan scan = new TScan();
            Map<String, String> attributesStr = new HashMap<String, String>();
            // Filters can be combined with each other
            String filter = "RowFilter (=, 'binary:baiyc_20150701_0002') AND ValueFilter (=, 'binaryprefix:baiyc')"; // row key match AND value prefix
            scan.setFilterString(getByteBuffer(filter));
            scan.setCaching(10);
            this.scannerOpenWithScan(tableName, scan, attributesStr);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
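The helpers getByteBuffer, wrap and encodeAttributes are called above but not shown in the article. The following is only a guess at what they might look like, assuming they simply convert strings to UTF-8 `ByteBuffer`s for the Thrift API; the class name `ByteBufferHelpers` and the `asString` method are hypothetical and added for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical versions of the conversion helpers; the author's real code is not shown.
class ByteBufferHelpers {
    // Encodes a string (table name, row key, filter string) as UTF-8 bytes.
    static ByteBuffer getByteBuffer(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes without disturbing the caller's buffer position.
    static String asString(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    // Converts scan attributes into the ByteBuffer map form used by the scan call.
    static Map<ByteBuffer, ByteBuffer> encodeAttributes(Map<String, String> attributes) {
        Map<ByteBuffer, ByteBuffer> encoded = new HashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            encoded.put(getByteBuffer(entry.getKey()), getByteBuffer(entry.getValue()));
        }
        return encoded;
    }

    public static void main(String[] args) {
        ByteBuffer filter = getByteBuffer("PageFilter(12)");
        System.out.println(asString(filter));
    }
}
```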