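HBase Filters

Filters can make table reads more efficient. HBase's two read operations, get() and scan(), both accept filters. On their own, these operations support direct access by row key and scans that begin at a given start row key, but they lack fine-grained selection, such as matching row keys or values against a regular expression. Filters add that capability.

You can use the predefined filters or implement a custom filter.

A filter is created on the client, sent to the server via RPC, and applied on the server side, so only the data that passes the filter is returned to the client.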



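A comparison filter generally takes two arguments: a comparison operator and a comparator.

Comparison operators (CompareOp):

EQUAL        equal to
GREATER        greater than
GREATER_OR_EQUAL        greater than or equal to
LESS        less than
LESS_OR_EQUAL        less than or equal to
NOT_EQUAL        not equal to

Comparators (Comparator):

BinaryComparator        Matches against the complete byte array.
BinaryPrefixComparator    Matches against a prefix of the byte array.
BitComparator        Performs a bitwise comparison (AND, OR, or XOR) against the given byte array.
NullComparator        Does not compare against an actual value, but checks whether a given one is null or not null.
RegexStringComparator        Matches against a regular expression.
SubstringComparator        Matches against a substring.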
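
To make the difference between a full-array match and a prefix match concrete, here is a minimal plain-Java sketch. It is a model rather than HBase code: the class ByteOrderSketch and its methods are hypothetical names, and it assumes byte arrays are compared lexicographically with each byte treated as unsigned. Under that assumption the key "row10" sorts before "row3", which matters for operators such as LESS and GREATER.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical plain-Java model of byte-wise comparison; not part of the HBase API.
class ByteOrderSketch {

    // Lexicographic comparison of two byte arrays, treating each byte as unsigned.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal, so the shorter array sorts first.
        return a.length - b.length;
    }

    // Prefix match: only the first prefix.length bytes of value are compared.
    static boolean hasPrefix(byte[] value, byte[] prefix) {
        if (value.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (value[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "row10" sorts before "row3" because '1' < '3' at the fourth byte.
        System.out.println(compare(utf8("row10"), utf8("row3")) < 0); // prints true
        // "row10" starts with "row1", so a prefix match succeeds.
        System.out.println(hasPrefix(utf8("row10"), utf8("row1"))); // prints true
    }
}
```

If row keys need to sort numerically, a common approach is to zero-pad them (for example "row03" and "row10") so that byte order and numeric order agree.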


1. Comparison Filter


First, scan all records in the "blog" table:
hbase(main):015:0> scan 'blog'
ROW                    COLUMN+CELL                                                   
 row1                  column=article:title, timestamp=1457150148590, value=hadoop   
 row1                  column=author:name, timestamp=1457150148645, value=tom        
 row1                  column=author:nickname, timestamp=1457150148692, value=tt     
 row2                  column=article:title, timestamp=1457161619563, value=hive     
 row2                  column=author:name, timestamp=1457161619593, value=kitty      
 row2                  column=author:nickname, timestamp=1457161619647, value=kk     
 row3                  column=article:title, timestamp=1457161619680, value=hbase    
 row3                  column=author:name, timestamp=1457161619695, value=jerry      
 row3                  column=author:nickname, timestamp=1457161619702, value=jj     
 row4                  column=article:title, timestamp=1457161619709, value=sqoop    
 row4                  column=author:name, timestamp=1457161619717, value=ken        
 row4                  column=author:nickname, timestamp=1457161619726, value=kk     
4 row(s) in 0.1730 seconds

1.1 RowFilter

    Filters data based on the row key.

Constructor:

public RowFilter(final CompareOp rowCompareOp, final ByteArrayComparable rowComparator)

Test code:

		Scan scan = new Scan();
		// Restrict the scan to the name column of the author column family
		scan.addColumn(Bytes.toBytes("author"), Bytes.toBytes("name"));
		
		// Select rows whose row key is less than or equal to "row3"
		Filter filter = new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
				new BinaryComparator(Bytes.toBytes("row3")));
		scan.setFilter(filter);

		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res);
		}
		scanner.close();
		table.close();


Result:

keyvalues={row1/author:name/1457150148645/Put/vlen=3/mvcc=0}
keyvalues={row2/author:name/1457161619593/Put/vlen=5/mvcc=0}
keyvalues={row3/author:name/1457161619695/Put/vlen=5/mvcc=0}

1.2 FamilyFilter

    Filters data based on the column family.

Constructor:
public FamilyFilter(final CompareOp familyCompareOp, final ByteArrayComparable familyComparator)

		Scan scan = new Scan();

		// Select column families whose name is less than "author"
		Filter filter = new FamilyFilter(CompareOp.LESS, new BinaryComparator(
				Bytes.toBytes("author")));

		scan.setFilter(filter);

		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res);
		}
		scanner.close();
		table.close();


Result:

keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0}
keyvalues={row3/article:title/1457161619680/Put/vlen=5/mvcc=0}
keyvalues={row4/article:title/1457161619709/Put/vlen=5/mvcc=0}

1.3 QualifierFilter

    Filters data based on the column qualifier.

Constructor:

public QualifierFilter(final CompareOp op,final ByteArrayComparable qualifierComparator)

		Scan scan = new Scan();

		// Select columns whose qualifier is greater than "name"
		Filter filter = new QualifierFilter(CompareOp.GREATER, new BinaryComparator(
				Bytes.toBytes("name")));
		scan.setFilter(filter);

		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res);
		}
		scanner.close();
		table.close();

Result:

keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0, row1/author:nickname/1457150148692/Put/vlen=2/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0, row2/author:nickname/1457161619647/Put/vlen=2/mvcc=0}
keyvalues={row3/article:title/1457161619680/Put/vlen=5/mvcc=0, row3/author:nickname/1457161619702/Put/vlen=2/mvcc=0}
keyvalues={row4/article:title/1457161619709/Put/vlen=5/mvcc=0, row4/author:nickname/1457161619726/Put/vlen=2/mvcc=0}

1.4 ValueFilter

    Filters data based on the cell value.

Constructor:

public ValueFilter(final CompareOp valueCompareOp,final ByteArrayComparable valueComparator)

		Scan scan = new Scan();

		// Select cells whose value equals "kk"
		Filter filter = new ValueFilter(CompareOp.EQUAL, new BinaryComparator(
				Bytes.toBytes("kk")));
		scan.setFilter(filter);

		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res);
		}
		scanner.close();
		table.close();

Result:

keyvalues={row2/author:nickname/1457161619647/Put/vlen=2/mvcc=0}
keyvalues={row4/author:nickname/1457161619726/Put/vlen=2/mvcc=0}

1.5 DependentColumnFilter

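    This filter takes two arguments: a column family and a column qualifier. It looks for that column in every row and returns all key-value pairs in the row that share the column's timestamp. If a row does not contain the specified column, none of that row's key-value pairs are returned.

    The filter also accepts an optional boolean argument, dropDependentColumn. If it is true, the dependent column itself is not returned.

    The filter can additionally take two optional arguments, a comparison operator and a value comparator, which apply a further check to the dependent column. If the dependent column is found, its value must also pass this check before its timestamp is taken into account.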
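
The timestamp rule is easier to follow on a small example. The sketch below is a plain-Java model of the behavior described above for a single row. It is not HBase code: DependentColumnSketch, Cell, and filterRow are hypothetical names, the timestamps 100 and 200 are made up for illustration, and the optional value check is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical plain-Java model of the dependent-column rule; not HBase code.
class DependentColumnSketch {

    // A simplified cell of one row: "family:qualifier" plus a timestamp.
    static final class Cell {
        final String column;
        final long timestamp;

        Cell(String column, long timestamp) {
            this.column = column;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return column + "@" + timestamp;
        }
    }

    // Applies the rule to the cells of a single row.
    static List<Cell> filterRow(List<Cell> row, String dependentColumn, boolean dropDependentColumn) {
        // Collect the timestamps at which the dependent column occurs in this row.
        Set<Long> timestamps = new HashSet<>();
        for (Cell cell : row) {
            if (cell.column.equals(dependentColumn)) {
                timestamps.add(cell.timestamp);
            }
        }
        // If the column is absent the set is empty, so nothing from this row is kept.
        List<Cell> kept = new ArrayList<>();
        for (Cell cell : row) {
            boolean isDependent = cell.column.equals(dependentColumn);
            if (timestamps.contains(cell.timestamp) && !(isDependent && dropDependentColumn)) {
                kept.add(cell);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Cell> row = Arrays.asList(
                new Cell("article:title", 100L),
                new Cell("author:name", 100L),
                new Cell("author:nickname", 200L));
        System.out.println(filterRow(row, "author:name", false)); // prints [article:title@100, author:name@100]
        System.out.println(filterRow(row, "author:name", true));  // prints [article:title@100]
        System.out.println(filterRow(row, "author:email", false)); // prints []
    }
}
```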


2. Dedicated Filters

2.1 SingleColumnValueFilter

    Returns a row if the value of the specified column satisfies the condition.

2.2 SingleColumnValueExcludeFilter

    Works like SingleColumnValueFilter, except that the reference column is left out of the result.

2.3 PrefixFilter

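    Returns every row whose row key starts with the given prefix.

		Scan scan = new Scan();
		// Restrict the scan to the name column of the author column family
		scan.addColumn(Bytes.toBytes("author"), Bytes.toBytes("name"));
		// Select rows whose row key starts with "row1"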
		Filter filter = new PrefixFilter(Bytes.toBytes("row1"));
		scan.setFilter(filter);

		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res);
		}
		scanner.close();
		table.close();

Result:

keyvalues={row1/author:name/1457150148645/Put/vlen=3/mvcc=0}

2.4 PageFilter

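    Paginates results: the pageSize argument sets how many rows are returned per page. Note that the client has to remember the key of the last row it received so that the next page can start right after it.

Constructor: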

public PageFilter(final long pageSize) 


		Filter filter = new PageFilter(2);
		final byte[] POSTFIX = new byte[] { 0x00 };
		byte[] lastRow = null;
		int totalRows = 0;
		while (true) {
			Scan scan = new Scan();
			scan.setFilter(filter);
			if (lastRow != null) {
				// Append POSTFIX so the next scan starts just after lastRow; without it the loop never terminates
				byte[] startRow = Bytes.add(lastRow, POSTFIX);
				System.out.println("start row: "
						+ Bytes.toStringBinary(startRow));
				scan.setStartRow(startRow);
			}
			ResultScanner scanner = table.getScanner(scan);
			
			int localRows = 0;
			Result result;
			while ((result = scanner.next()) != null) {
				System.out.println(localRows++ + ":" + result);
				totalRows++;
				lastRow = result.getRow();
			}
			scanner.close();
			if (localRows == 0)
				break;
		}
		System.out.println("total rows:" + totalRows);

Result:

0:keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0, row1/author:name/1457150148645/Put/vlen=3/mvcc=0, row1/author:nickname/1457150148692/Put/vlen=2/mvcc=0}
1:keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0, row2/author:name/1457161619593/Put/vlen=5/mvcc=0, row2/author:nickname/1457161619647/Put/vlen=2/mvcc=0}
start row: row2\x00
0:keyvalues={row3/article:title/1457161619680/Put/vlen=5/mvcc=0, row3/author:name/1457161619695/Put/vlen=5/mvcc=0, row3/author:nickname/1457161619702/Put/vlen=2/mvcc=0}
1:keyvalues={row4/article:title/1457161619709/Put/vlen=5/mvcc=0, row4/author:name/1457161619717/Put/vlen=3/mvcc=0, row4/author:nickname/1457161619726/Put/vlen=2/mvcc=0}
start row: row4\x00
total rows:4
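
The POSTFIX step in the paging loop works because the last key with a zero byte appended is the smallest key that sorts strictly after it. The following plain-Java sketch models that idea with a sorted set standing in for the table. PagingSketch, page, and allPages are hypothetical names, and String keys with a '\u0000' suffix are an assumption that stands in for byte arrays with a 0x00 suffix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical plain-Java model of the paging loop; a sorted set stands in for the table.
class PagingSketch {

    // Returns at most pageSize keys that are >= startKey, in sorted order.
    static List<String> page(TreeSet<String> keys, String startKey, int pageSize) {
        List<String> out = new ArrayList<>();
        for (String key : keys.tailSet(startKey, true)) {
            if (out.size() == pageSize) {
                break;
            }
            out.add(key);
        }
        return out;
    }

    // Walks through all pages, remembering the last key of each page.
    static List<List<String>> allPages(TreeSet<String> keys, int pageSize) {
        List<List<String>> pages = new ArrayList<>();
        String startKey = ""; // the empty key sorts before every other key
        while (true) {
            List<String> current = page(keys, startKey, pageSize);
            if (current.isEmpty()) {
                break;
            }
            pages.add(current);
            String lastKey = current.get(current.size() - 1);
            // Smallest key strictly greater than lastKey, so lastKey is not returned again.
            startKey = lastKey + '\u0000';
        }
        return pages;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList("row1", "row2", "row3", "row4"));
        System.out.println(allPages(keys, 2)); // prints [[row1, row2], [row3, row4]]
    }
}
```

If startKey were set to lastKey without the suffix, every page would begin with the previous page's last row, and the final row would be returned forever.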

2.5 KeyOnlyFilter

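    Returns only the key of each cell, with every value left empty. This suits applications that only care about keys: dropping the values reduces the amount of data sent to the client, which can be a useful optimization.

Constructor:

public KeyOnlyFilter(boolean lenAsVal)

lenAsVal defaults to false, meaning the length of the original value is not written out as the returned value.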

		Scan scan = new Scan();
		Filter filter = new KeyOnlyFilter(false);
		scan.setFilter(filter);
		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res);
		}


Result:

keyvalues={row1/article:title/1457150148590/Put/vlen=0/mvcc=0, row1/author:name/1457150148645/Put/vlen=0/mvcc=0, row1/author:nickname/1457150148692/Put/vlen=0/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=0/mvcc=0, row2/author:name/1457161619593/Put/vlen=0/mvcc=0, row2/author:nickname/1457161619647/Put/vlen=0/mvcc=0}
keyvalues={row3/article:title/1457161619680/Put/vlen=0/mvcc=0, row3/author:name/1457161619695/Put/vlen=0/mvcc=0, row3/author:nickname/1457161619702/Put/vlen=0/mvcc=0}
keyvalues={row4/article:title/1457161619709/Put/vlen=0/mvcc=0, row4/author:name/1457161619717/Put/vlen=0/mvcc=0, row4/author:nickname/1457161619726/Put/vlen=0/mvcc=0}


Note that every value has length 0 (vlen=0).

2.6 FirstKeyOnlyFilter

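    The result set contains only the first column of each row, so FirstKeyOnlyFilter can improve performance for aggregate-style operations such as count or sum that need just one cell per row.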

		Scan scan = new Scan();
		Filter filter = new FirstKeyOnlyFilter();
		scan.setFilter(filter);
		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res);
		}


Result:

keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0}
keyvalues={row3/article:title/1457161619680/Put/vlen=5/mvcc=0}
keyvalues={row4/article:title/1457161619709/Put/vlen=5/mvcc=0}

2.7 InclusiveStopFilter

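    A scan can be given a start row key and a stop row key. By default the range is half-open: the start row is included but the stop row is not. To include both the start row and the stop row, use this filter.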

		Scan scan = new Scan();
		scan.addColumn(Bytes.toBytes("article"), Bytes.toBytes("title"));
		scan.setStartRow(Bytes.toBytes("row1"));
		// scan.setStopRow(Bytes.toBytes("row2"));
		Filter filter = new InclusiveStopFilter(Bytes.toBytes("row2"));
		scan.setFilter(filter);
		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res);
		}

Result:

keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0}
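
The difference between the default half-open range and an inclusive stop row can be modeled in plain Java with a sorted set. This is only a sketch of the range semantics, not HBase code; StopRowSketch and range are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical plain-Java model of start/stop row ranges; not HBase code.
class StopRowSketch {

    // Returns the keys from start (always included) to stop (included only if includeStop is true).
    static List<String> range(TreeSet<String> keys, String start, String stop, boolean includeStop) {
        return new ArrayList<>(keys.subSet(start, true, stop, includeStop));
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList("row1", "row2", "row3", "row4"));
        // Default scan behavior: the stop row is excluded.
        System.out.println(range(keys, "row1", "row2", false)); // prints [row1]
        // Inclusive stop: the stop row is part of the result.
        System.out.println(range(keys, "row1", "row2", true)); // prints [row1, row2]
    }
}
```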

2.8 TimestampsFilter

    Use TimestampsFilter when you need to return only the cell versions that have specific timestamps.

Constructor:

public TimestampsFilter(List<Long> timestamps)

2.9 ColumnCountGetFilter

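    Specifies the maximum number of columns returned per row. As soon as a row reaches that maximum, the whole scan stops, so this filter is mainly useful for get operations.

Constructor: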

public ColumnCountGetFilter(final int n)

2.10 ColumnPaginationFilter

    Returns at most limit columns per row, starting at the zero-based column index offset.

Constructor:

public ColumnPaginationFilter(final int limit, final int offset)

		Filter filter = new ColumnPaginationFilter(1,2);
		
		Scan scan = new Scan();
		scan.setFilter(filter);
		ResultScanner scanner = table.getScanner(scan);
		for (Result result : scanner) {
			System.out.println(result);
		}

Result:

keyvalues={row1/author:nickname/1457150148692/Put/vlen=2/mvcc=0}
keyvalues={row2/author:nickname/1457161619647/Put/vlen=2/mvcc=0}
keyvalues={row3/author:nickname/1457161619702/Put/vlen=2/mvcc=0}
keyvalues={row4/author:nickname/1457161619726/Put/vlen=2/mvcc=0}
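
Why limit 1 and offset 2 return only author:nickname can be seen by treating the sorted columns of a row as a list. The sketch below is a plain-Java model of that slicing, assuming a zero-based offset. ColumnPagingSketch and page are hypothetical names and do not come from HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical plain-Java model of per-row column pagination; not HBase code.
class ColumnPagingSketch {

    // Returns at most limit columns of one row, starting at the zero-based index offset.
    static List<String> page(List<String> sortedColumns, int limit, int offset) {
        if (offset >= sortedColumns.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(sortedColumns.size(), offset + limit);
        return new ArrayList<>(sortedColumns.subList(offset, end));
    }

    public static void main(String[] args) {
        // The columns of one row of the "blog" table, in sorted order.
        List<String> columns = Arrays.asList("article:title", "author:name", "author:nickname");
        System.out.println(page(columns, 1, 2)); // prints [author:nickname]
        System.out.println(page(columns, 2, 0)); // prints [article:title, author:name]
    }
}
```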

2.11 ColumnPrefixFilter

    Similar to PrefixFilter, except that the prefix is matched against the column name instead of the row key.

2.12 RandomRowFilter

    Returns a random selection of rows.

Constructor:

public RandomRowFilter(float chance)

chance ranges from 0 to 1.0. A value below 0 returns no rows, and a value above 1 returns all rows.
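
The effect of chance can be sketched in plain Java as a per-row coin flip. This is a model of the rule stated above, not HBase code: RandomRowSketch, includeRow, and countIncluded are hypothetical names, and the fixed seed is only there to make a run repeatable.

```java
import java.util.Random;

// Hypothetical plain-Java model of random row selection; not HBase code.
class RandomRowSketch {

    // Decides whether a single row is included for the given chance.
    static boolean includeRow(float chance, Random random) {
        if (chance < 0) {
            return false; // negative chance: no row is included
        }
        if (chance > 1) {
            return true; // chance above 1: every row is included
        }
        // nextFloat() is in [0, 1), so a larger chance includes more rows.
        return random.nextFloat() < chance;
    }

    // Counts how many of totalRows rows would be included.
    static int countIncluded(int totalRows, float chance, long seed) {
        Random random = new Random(seed);
        int included = 0;
        for (int i = 0; i < totalRows; i++) {
            if (includeRow(chance, random)) {
                included++;
            }
        }
        return included;
    }

    public static void main(String[] args) {
        System.out.println(countIncluded(1000, -1f, 42L)); // prints 0
        System.out.println(countIncluded(1000, 2f, 42L)); // prints 1000
        // Roughly half of the rows; the exact count depends on the seed.
        System.out.println(countIncluded(1000, 0.5f, 42L));
    }
}
```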

3. Decorating Filters

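3.1 SkipFilter

    Wraps another filter and skips an entire row as soon as the wrapped filter rejects any cell in that row.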

3.2 WhileMatchFilter

    Wraps another filter and stops the whole scan as soon as the wrapped filter rejects a row or a column.
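
The two decorators differ in what happens after the wrapped filter rejects something: one drops the affected row and continues, the other ends the scan. The sketch below models both behaviors in plain Java. It is a simplified illustration, not HBase code: DecoratorSketch, skipRows, and whileMatch are hypothetical names, the wrapped filters are represented by plain predicates, and the whileMatch model works on whole rows only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;
import java.util.function.Predicate;

// Hypothetical plain-Java model of the two decorating behaviors; not HBase code.
class DecoratorSketch {

    // Skip-style: a row is kept only if every one of its cell values passes the wrapped check.
    static List<String> skipRows(Map<String, List<Integer>> rows, IntPredicate cellPasses) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> row : rows.entrySet()) {
            boolean allPass = true;
            for (int value : row.getValue()) {
                if (!cellPasses.test(value)) {
                    allPass = false;
                    break;
                }
            }
            if (allPass) {
                kept.add(row.getKey());
            }
        }
        return kept;
    }

    // While-match-style: rows are returned until the first rejection, then the scan ends.
    static List<String> whileMatch(List<String> rowKeys, Predicate<String> rowPasses) {
        List<String> kept = new ArrayList<>();
        for (String key : rowKeys) {
            if (!rowPasses.test(key)) {
                break; // later rows are never examined
            }
            kept.add(key);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> rows = new LinkedHashMap<>();
        rows.put("row1", Arrays.asList(1, 2));
        rows.put("row2", Arrays.asList(0, 3));
        rows.put("row3", Arrays.asList(4));
        // row2 contains a 0, so the whole row is skipped.
        System.out.println(skipRows(rows, v -> v != 0)); // prints [row1, row3]

        List<String> keys = Arrays.asList("row1", "row2", "row3", "row4");
        // row4 would pass on its own, but the scan has already stopped at row3.
        System.out.println(whileMatch(keys, k -> !k.equals("row3"))); // prints [row1, row2]
    }
}
```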

