A discussion of HBase Bloom filters

mango_song

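At work we got into a discussion about whether HBase's Bloom filter can help a scan. Before that discussion I had never really thought about the question, and simply assumed that a Bloom filter would let a scan skip the HFiles it does not need. A bit of Googling showed that this is not the case!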

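First, I read the following two articles:

Understanding and using Bloom filters in HBase

http://zjushch.iteye.com/blog/1530143

An explanation of the HBase Bloom filter by one of HBase's main code committers

http://blog.csdn.net/macyang/article/details/6182629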

With a rough understanding of Bloom filters in hand, I then found two optimizations that HBase applies when querying a table that has a Bloom filter:

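1. A get uses the Bloom filter to help skip StoreFiles that will not be used

When a scan is initialized (a get is wrapped as a scan), shouldSeek is checked for each StoreFile. If it returns false, the StoreFile contains nothing we are looking for and is skipped outright.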

    if (memOnly == false
        && ((StoreFileScanner) kvs).shouldSeek(scan, columns)) {
      scanners.add(kvs);
    }

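The shouldSeek method: if the request is a scan rather than a get, it returns true right away, meaning the StoreFile cannot be skipped. Otherwise it checks according to the Bloom filter type.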

    if (!scan.isGetScan()) {
      return true;
    }

    byte[] row = scan.getStartRow();
    switch (this.bloomFilterType) {
      case ROW:
        return passesBloomFilter(row, 0, row.length, null, 0, 0);

      case ROWCOL:
        if (columns != null && columns.size() == 1) {
          byte[] column = columns.first();
          return passesBloomFilter(row, 0, row.length, column, 0,
              column.length);
        }

        // For multi-column queries the Bloom filter is checked from the
        // seekExact operation.
        return true;

      default:
        return true;
    }
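To make the decision flow above concrete, here is a minimal, self-contained sketch. It is not HBase code: `ShouldSeekSketch`, `ToyStoreFile`, `BloomType`, and `select` are hypothetical names, and an exact `HashSet` stands in for the Bloom filter, so unlike a real Bloom filter it never returns a false positive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the shouldSeek decision; not HBase code. */
class ShouldSeekSketch {
    enum BloomType { NONE, ROW, ROWCOL }

    /** A store file reduced to its Bloom type and the keys its filter has seen. */
    static class ToyStoreFile {
        final String name;
        final BloomType bloomType;
        // Assumption: an exact set stands in for the Bloom filter (no false positives).
        final Set<String> bloomKeys = new HashSet<>();

        ToyStoreFile(String name, BloomType bloomType) {
            this.name = name;
            this.bloomType = bloomType;
        }

        /** Records a cell in the stand-in filter, keyed according to the Bloom type. */
        void put(String row, String qualifier) {
            if (bloomType == BloomType.ROW) bloomKeys.add(row);
            if (bloomType == BloomType.ROWCOL) bloomKeys.add(row + "/" + qualifier);
        }

        /** Mirrors the excerpt: range scans always seek; gets consult the filter. */
        boolean shouldSeek(boolean isGet, String row, List<String> columns) {
            if (!isGet) return true;
            switch (bloomType) {
                case ROW:
                    return bloomKeys.contains(row);
                case ROWCOL:
                    if (columns != null && columns.size() == 1) {
                        return bloomKeys.contains(row + "/" + columns.get(0));
                    }
                    return true; // multi-column: left to the later exact seek
                default:
                    return true;
            }
        }
    }

    /** Keeps only the store files that the get or scan still has to read. */
    static List<String> select(List<ToyStoreFile> files, boolean isGet,
                               String row, List<String> columns) {
        List<String> kept = new ArrayList<>();
        for (ToyStoreFile f : files) {
            if (f.shouldSeek(isGet, row, columns)) kept.add(f.name);
        }
        return kept;
    }

    public static void main(String[] args) {
        ToyStoreFile a = new ToyStoreFile("a", BloomType.ROW);
        a.put("row1", "q1");
        ToyStoreFile b = new ToyStoreFile("b", BloomType.ROW);
        b.put("row9", "q1");
        List<ToyStoreFile> files = Arrays.asList(a, b);
        System.out.println(select(files, true, "row1", null));  // get: prints [a]
        System.out.println(select(files, false, "row1", null)); // scan: prints [a, b]
    }
}
```

In this toy model the get on `row1` drops file `b`, while the range scan starting at `row1` keeps both files, which is the behavior the excerpt describes.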

2. A scan with explicit qualifiers, on a table configured with ROWCOL, skips StoreFiles that will not be used.

A scan or get that names its qualifiers is checked through seekExactly:

    // Seek all scanners to the start of the Row (or if the exact matching row
    // key does not exist, then to the start of the next matching Row).
    if (matcher.isExactColumnQuery()) {
      for (KeyValueScanner scanner : scanners)
        scanner.seekExactly(matcher.getStartKey(), false);
    } else {
      for (KeyValueScanner scanner : scanners)
        scanner.seek(matcher.getStartKey());
    }

If the Bloom filter misses, a very large fake KeyValue is created (createLastOnRowCol), indicating that this StoreFile does not need an actual scan.

    public boolean seekExactly(KeyValue kv, boolean forward)
        throws IOException {
      if (reader.getBloomFilterType() != StoreFile.BloomType.ROWCOL ||
          kv.getRowLength() == 0 || kv.getQualifierLength() == 0) {
        return forward ? reseek(kv) : seek(kv);
      }

      boolean isInBloom = reader.passesBloomFilter(kv.getBuffer(),
          kv.getRowOffset(), kv.getRowLength(), kv.getBuffer(),
          kv.getQualifierOffset(), kv.getQualifierLength());
      if (isInBloom) {
        // This row/column might be in this store file. Do a normal seek.
        return forward ? reseek(kv) : seek(kv);
      }

      // Create a fake key/value, so that this scanner only bubbles up to the top
      // of the KeyValueHeap in StoreScanner after we scanned this row/column in
      // all other store files. The query matcher will then just skip this fake
      // key/value and the store scanner will progress to the next column.
      cur = kv.createLastOnRowCol();
      return true;
    }
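The effect of the fake key can be sketched with a toy scanner that counts how often it really touches its data. Everything here is hypothetical (`SeekExactlySketch`, `realSeeks`, `curIsFake`); cells are plain "row/qualifier" strings without timestamps, and an exact `HashSet` again stands in for the ROWCOL Bloom filter.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of seekExactly with a ROWCOL filter; not HBase code. */
class SeekExactlySketch {
    final TreeSet<String> cells = new TreeSet<>();   // sorted "row/qualifier" keys
    // Assumption: an exact set stands in for the ROWCOL Bloom filter.
    final Set<String> rowColBloom = new HashSet<>();
    int realSeeks = 0;   // how many times the sorted data was actually searched
    String cur;          // current position of the scanner
    boolean curIsFake;   // true when cur is only a placeholder

    void put(String row, String qualifier) {
        String key = row + "/" + qualifier;
        cells.add(key);
        rowColBloom.add(key);
    }

    boolean seekExactly(String row, String qualifier) {
        String key = row + "/" + qualifier;
        if (rowColBloom.contains(key)) {
            // The row/column might be here: do a normal seek into the data.
            realSeeks++;
            cur = cells.ceiling(key);
            curIsFake = false;
            return cur != null;
        }
        // Filter miss: park the scanner on a placeholder without reading data.
        cur = key;
        curIsFake = true;
        return true;
    }

    public static void main(String[] args) {
        SeekExactlySketch scanner = new SeekExactlySketch();
        scanner.put("row1", "q1");
        scanner.seekExactly("row1", "q2");
        System.out.println(scanner.realSeeks + " " + scanner.curIsFake); // prints 0 true
        scanner.seekExactly("row1", "q1");
        System.out.println(scanner.realSeeks + " " + scanner.cur);       // prints 1 row1/q1
    }
}
```

The miss on `row1/q2` still returns true, as in the excerpt, but costs no real seek; only the hit on `row1/q1` reads the sorted cells.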

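Why does it take a ROWCOL filter to skip a StoreFile here? The reason is simple. A scan covers a range: a miss in a ROW Bloom filter only says that this particular rowkey is not in the StoreFile, but the next rowkey may well be. A ROWCOL Bloom filter is different: a miss means that the requested row/qualifier combination is not in the StoreFile, so the scan does not need to actually read that StoreFile for it!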
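The argument about ROW filters and range scans can be checked with a few lines of Java. This is only an illustration under simplifying assumptions: `RowBloomScanSketch` and its methods are hypothetical, rows are plain strings, and an exact `HashSet` stands in for the ROW Bloom filter.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of why a ROW filter cannot prune a range scan. */
class RowBloomScanSketch {
    final TreeSet<String> rows = new TreeSet<>();  // sorted rowkeys in the store file
    // Assumption: an exact set stands in for the ROW Bloom filter.
    final Set<String> rowBloom = new HashSet<>();

    void addRow(String row) {
        rows.add(row);
        rowBloom.add(row);
    }

    /** What a ROW filter can answer: might this exact rowkey be in the file? */
    boolean mightContainRow(String row) {
        return rowBloom.contains(row);
    }

    /** What a range scan needs to know: is any rowkey in [start, stop)? */
    boolean hasRowInRange(String start, String stop) {
        String first = rows.ceiling(start);
        return first != null && first.compareTo(stop) < 0;
    }

    public static void main(String[] args) {
        RowBloomScanSketch file = new RowBloomScanSketch();
        file.addRow("row1");
        file.addRow("row5");
        System.out.println(file.mightContainRow("row3"));       // prints false
        System.out.println(file.hasRowInRange("row3", "row9")); // prints true
    }
}
```

The filter correctly reports that `row3` is absent, yet the scan starting at `row3` still finds `row5` in the same file, so the miss on the start row says nothing about the rest of the range.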

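Conclusions:

1. The Bloom filter works for any type of get (by rowkey, or by row + col); the key is that the type of get must match the type of Bloom filter.

2. A row-based scan cannot be optimized this way.

3. A scan that specifies row + col + qualifier can skip StoreFiles that do not contain the qualifier, which is a decent optimization. Specifying qualifiers also reduces the amount of data transferred, so name the qualifiers in a scan whenever possible.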