使用过滤器可以提高操作表的效率,HBase中两种数据读取函数get()和scan()都支持过滤器,支持直接访问和通过指定起始行键来访问,但是缺少细粒度的筛选功能,如基于正则表达式对行键或值进行筛选的功能。
可以使用预定义好的过滤器或者是实现自定义过滤器。
过滤器在客户端创建,通过RPC传送到服务器端,在服务器端执行过滤操作,把数据返回给客户端。
Comparison Filter一般有两个参数:比较运算符和比较器
比较运算符(CompareOp):
EQUAL 相等
GREATER 大于
GREATER_OR_EQUAL 大于等于
LESS 小于
LESS_OR_EQUAL 小于等于
NOT_EQUAL 不等于
比较器(Comparator):
BinaryComparator 匹配完整字节数组
BinaryPrefixComparator 匹配字节数组前缀
BitComparator Does not compare against an actual value but whether a given one is null, or not null.
NullComparator Does not compare against an actual value but whether a given one is null, or not null.
RegexStringComparator 正则表达式匹配
SubstringComparator 子串匹配
1. Comparison Filter
下面先来查询下“blog”表的所有记录:
hbase(main):015:0> scan 'blog'
ROW COLUMN+CELL
row1 column=article:title, timestamp=1457150148590, value=hadoop
row1 column=author:name, timestamp=1457150148645, value=tom
row1 column=author:nickname, timestamp=1457150148692, value=tt
row2 column=article:title, timestamp=1457161619563, value=hive
row2 column=author:name, timestamp=1457161619593, value=kitty
row2 column=author:nickname, timestamp=1457161619647, value=kk
row3 column=article:title, timestamp=1457161619680, value=hbase
row3 column=author:name, timestamp=1457161619695, value=jerry
row3 column=author:nickname, timestamp=1457161619702, value=jj
row4 column=article:title, timestamp=1457161619709, value=sqoop
row4 column=author:name, timestamp=1457161619717, value=ken
row4 column=author:nickname, timestamp=1457161619726, value=kk
4 row(s) in 0.1730 seconds
1.1 RowFilter
基于rowkey来过滤数据
构造函数:
public RowFilter(final CompareOp rowCompareOp, final ByteArrayComparable rowComparator)
实测代码:
Scan scan = new Scan();
// 指定扫描author列族的anme列
scan.addColumn(Bytes.toBytes("author"), Bytes.toBytes("name"));
//选出所匹配(行键值小于等于“row3”)的行
Filter filter = new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
new BinaryComparator(Bytes.toBytes("row3")));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
scanner.close();
table.close();
结果:
keyvalues={row1/author:name/1457150148645/Put/vlen=3/mvcc=0}
keyvalues={row2/author:name/1457161619593/Put/vlen=5/mvcc=0}
keyvalues={row3/author:name/1457161619695/Put/vlen=5/mvcc=0}
1.2 FamilyFilter
基于列族的过滤
构造函数:public FamilyFilter(final CompareOp familyCompareOp, final ByteArrayComparable familyComparator)
Scan scan = new Scan();
// 选出所匹配(列族名称小于“author3”)的列族
Filter filter = new FamilyFilter(CompareOp.LESS, new BinaryComparator(
Bytes.toBytes("author")));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
scanner.close();
table.close();
结果:
keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0}
keyvalues={row3/article:title/1457161619680/Put/vlen=5/mvcc=0}
keyvalues={row4/article:title/1457161619709/Put/vlen=5/mvcc=0}
1.3 QualifierFilter
基于列的过滤
构造函数:
public QualifierFilter(final CompareOp op,final ByteArrayComparable qualifierComparator)
Scan scan = new Scan();
// 选出所匹配(列名称大于“name”)的列
Filter filter = new QualifierFilter(CompareOp.GREATER, new BinaryComparator(
Bytes.toBytes("name")));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
scanner.close();
table.close();
结果:
keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0, row1/author:nickname/1457150148692/Put/vlen=2/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0, row2/author:nickname/1457161619647/Put/vlen=2/mvcc=0}
keyvalues={row3/article:title/1457161619680/Put/vlen=5/mvcc=0, row3/author:nickname/1457161619702/Put/vlen=2/mvcc=0}
keyvalues={row4/article:title/1457161619709/Put/vlen=5/mvcc=0, row4/author:nickname/1457161619726/Put/vlen=2/mvcc=0}
1.4 ValueFilter
基于值的过滤
构造函数:
public ValueFilter(final CompareOp valueCompareOp,final ByteArrayComparable valueComparator)
Scan scan = new Scan();
// 选出所匹配(值为“kk”)的记录
Filter filter = new ValueFilter(CompareOp.EQUAL, new BinaryComparator(
Bytes.toBytes("kk")));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
scanner.close();
table.close();
结果:
keyvalues={row2/author:nickname/1457161619647/Put/vlen=2/mvcc=0}
keyvalues={row4/author:nickname/1457161619726/Put/vlen=2/mvcc=0}
1.5 DependentColumnFilter
该过滤器有两个参数 —— 列族和列修饰。 尝试找到该列所在的每一行,并返回该行具有相同时间戳的全部键值对。如果某一行不包含指定的列,则该行的任何键值对都不返回。
该过滤器还可以有一个可选布尔参数 —— dropDependentColumn. 如果为true, 从属的列不返回。
该过滤器还可以有两个可选参数 —— 一个比较操作符和一个值比较器,用于列族和修饰的进一步检查。如果从属的列找到,其值还必须通过值检查,然后就是时间戳必须考虑。
2. Dedicated Filters
2.1 SingleColumnValueFiler
SingleColumnValueFilter
如果一个列满足条件就返回一行。
2.2 SingleColumnValueExcludeFilter
结果中不包含referencecolumn
2.3 PrefixFilter
所有匹配前缀的行都会返回
Scan scan = new Scan();
//指定扫描author列族的name列
scan.addColumn(Bytes.toBytes("author"), Bytes.toBytes("name"));
// 选出所匹配(前缀为“row1”的行)的行
Filter filter = new PrefixFilter(Bytes.toBytes("row1"));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
scanner.close();
table.close();
结果:
keyvalues={row1/author:name/1457150148645/Put/vlen=3/mvcc=0}
2.4 PageFilter
页过滤,通过设置pagesize参数可以返回每一页page的数量。需要注意的是,客户端需要记住上一次访问的row的key值。
构造函数:
public PageFilter(final long pageSize)
Filter filter = new PageFilter(2);
final byte[] POSTFIX = new byte[] { 0x00 };
byte[] lastRow = null;
int totalRows = 0;
while (true) {
Scan scan = new Scan();
scan.setFilter(filter);
if (lastRow != null) {
// 注意这里添加了POSTFIX操作,不然死循环了
byte[] startRow = Bytes.add(lastRow, POSTFIX);
System.out.println("start row: "
+ Bytes.toStringBinary(startRow));
scan.setStartRow(startRow);
}
ResultScanner scanner = table.getScanner(scan);
int localRows = 0;
Result result;
while ((result = scanner.next()) != null) {
System.out.println(localRows++ + ":" + result);
totalRows++;
lastRow = result.getRow();
}
scanner.close();
if (localRows == 0)
break;
}
System.out.println("total rows:" + totalRows);
结果:
0:keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0, row1/author:name/1457150148645/Put/vlen=3/mvcc=0, row1/author:nickname/1457150148692/Put/vlen=2/mvcc=0}
1:keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0, row2/author:name/1457161619593/Put/vlen=5/mvcc=0, row2/author:nickname/1457161619647/Put/vlen=2/mvcc=0}
start row: row2\x00
0:keyvalues={row3/article:title/1457161619680/Put/vlen=5/mvcc=0, row3/author:name/1457161619695/Put/vlen=5/mvcc=0, row3/author:nickname/1457161619702/Put/vlen=2/mvcc=0}
1:keyvalues={row4/article:title/1457161619709/Put/vlen=5/mvcc=0, row4/author:name/1457161619717/Put/vlen=3/mvcc=0, row4/author:nickname/1457161619726/Put/vlen=2/mvcc=0}
start row: row4\x00
total rows:4
2.5 KeyOnlyFilter
只返回每行的行键,值全部为空,这对于只关注于行键的应用场景来说非常合适,这样忽略掉其值就可以减少传递到客户端的数据量,能起到一定的优化作用。
构造函数:
public KeyOnlyFilter(boolean lenAsVal)
默认lenAsVal为false,表示不会把value的长度作为输出的value
Scan scan = new Scan();
Filter filter = new KeyOnlyFilter(false);
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
结果:
keyvalues={row1/article:title/1457150148590/Put/vlen=0/mvcc=0, row1/author:name/1457150148645/Put/vlen=0/mvcc=0, row1/author:nickname/1457150148692/Put/vlen=0/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=0/mvcc=0, row2/author:name/1457161619593/Put/vlen=0/mvcc=0, row2/author:nickname/1457161619647/Put/vlen=0/mvcc=0}
keyvalues={row3/article:title/1457161619680/Put/vlen=0/mvcc=0, row3/author:name/1457161619695/Put/vlen=0/mvcc=0, row3/author:nickname/1457161619702/Put/vlen=0/mvcc=0}
keyvalues={row4/article:title/1457161619709/Put/vlen=0/mvcc=0, row4/author:name/1457161619717/Put/vlen=0/mvcc=0, row4/author:nickname/1457161619726/Put/vlen=0/mvcc=0}
发现value都为0
2.6 FirstKeyOnlyFilter
返回的结果集中只包含第一列的数据,所以进行count,sum操作等集合操作的时候,使用FirstKeyOnlyFilter会带来性能上的提升。
Scan scan = new Scan();
Filter filter = new FirstKeyOnlyFilter();
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
结果:
keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0}
keyvalues={row3/article:title/1457161619680/Put/vlen=5/mvcc=0}
keyvalues={row4/article:title/1457161619709/Put/vlen=5/mvcc=0}
2.7 InclusiveStopFilter
扫描的时候,我们可以设置一个开始行键和一个终止行键,默认情况下,这个行键的返回是前闭后开区间,即包含起始行,单不包含中指行,如果我们想要同时包含起始行和终止行,那么我们可以使用此过滤器。
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("article"), Bytes.toBytes("title"));
scan.setStartRow(Bytes.toBytes("row1"));
// scan.setStopRow(Bytes.toBytes("row2"));
Filter filter = new InclusiveStopFilter(Bytes.toBytes("row2"));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
结果:
keyvalues={row1/article:title/1457150148590/Put/vlen=6/mvcc=0}
keyvalues={row2/article:title/1457161619563/Put/vlen=4/mvcc=0}
2.8 TimestampsFilter
需要输出指定版本的数据时,可以考虑使用TimestampsFilter。
构造方法:
public TimestampsFilter(List<Long> timestamps)
2.9 ColumnCountGetFilter
说明每行最大的column数量,如果发现一行匹配最大数量则停止整个scan。它对get十分有用。
构造方法:
public ColumnCountGetFilter(final int n)
2.10 ColumnCountGetFilter
只显示[limit,offset>的列的数据
构造方法:
public ColumnPaginationFilter(final int limit, final int offset)
Filter filter = new ColumnPaginationFilter(1,2);
Scan scan = new Scan();
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
System.out.println(result);
}
结果:
keyvalues={row1/author:nickname/1457150148692/Put/vlen=2/mvcc=0}
keyvalues={row2/author:nickname/1457161619647/Put/vlen=2/mvcc=0}
keyvalues={row3/author:nickname/1457161619702/Put/vlen=2/mvcc=0}
keyvalues={row4/author:nickname/1457161619726/Put/vlen=2/mvcc=0}
2.11 ColumnPrefixFilter
跟prefxiFilter相似,只是改成了Column
2.12 RandomRowFilter
随即的返回row的数据。
构造函数:
public RandomRowFilter(float chance)
chance取值为0到1.0,如果<0则为空,如果>1则包含所有的行。
3. Decorating Filters
3.1 SkipFilter
3.2 WhileMatchFilter
一旦遇到过滤掉一个row或者column,scan停止