How are bloom filters used in HBase?

Two HBase committers, Lars George and Nicolas Spiegelberg, answered on Quora how bloom filters are used in HBase, in the context of issue HBASE-1200: https://issues.apache.org/jira/browse/HBASE-1200

Source: http://www.quora.com/How-are-bloom-filters-used-in-HBase

Lars George, HBase Committer

The bloom filters in HBase are good in a few different use-cases. One is access patterns where you will have a lot of misses during reads. The other is to speed up reads by cutting down internal lookups.

They are stored in the metadata of each HFile when it is written and then never need to be updated, because HFiles are immutable. While I have no empirical data on how much extra space they require (this also depends on the error rate you choose, etc.), they obviously add some overhead. When an HFile is opened, typically when a region is deployed to a RegionServer, the bloom filter is loaded into memory and used to determine if a given key is in that store file. They can be scoped at the row-key or column-key level, where the latter needs more space as it has to store many more keys compared to just the row keys (unless you have exactly one column per row).
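To make the membership check concrete, here is a minimal, self-contained bloom filter sketch in Python (illustrative only; HBase's actual implementation is in Java and differs in hashing and layout). A negative answer means the key is definitely not in the store file; a positive answer means it may be.

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch; not HBase's actual implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a single int

    def _positions(self, key: bytes):
        # Derive k bit positions from k independent hashes of the key.
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h, "big") % self.num_bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: bytes) -> bool:
        # False: definitely absent. True: possibly present (false positives allowed).
        return all(self.bits >> p & 1 for p in self._positions(key))
```

A reader would call `might_contain(row_key)` before touching a store file and skip the file entirely on a `False`.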

In terms of computational overhead, the bloom filters in HBase are very efficient: they employ folding to keep the size down and combinatorial generation to speed up their creation. They add about 1 byte per entry and are mainly useful when your entry size is on the larger end, say a few kilobytes. Otherwise the size of the filter compared to the size of the data is prohibitive.
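The folding mentioned here can be sketched as follows: when a filter's bit array turns out to be sparsely populated, its two halves are OR-ed together, halving the size while preserving every membership answer, provided positions are then taken modulo the new, smaller size. This is a simplified illustration of the idea, not HBase's actual code.

```python
def fold_bloom(bits):
    """OR the two halves of a bloom bit array together, halving its size.

    A bit set at position p in the original ends up set at p % (n/2)
    in the folded array, so queries that take positions modulo the
    current size keep returning the same answers.
    """
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]
```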

What also matters is how you actually update data: regular changes to cells will spread them across all store files, which means you will have to scan all files anyway. Better suited is some form of batched update per entity, so that specific row keys have a chance of being in only a few store files. That way, and given larger stores (for example, 1 GB), you can skip a substantial amount of disk I/O during the low-level scan to find a specific row.

Keep in mind that HBase only has a block index per file, which is rather coarse-grained and only tells the reader that a key *may* be in the file because it falls into a start/end key range in the block index. Whether the key is actually present can only be determined by loading that block and scanning it.
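The "may be in the file" semantics of such a block index can be sketched like this (a simplified illustration; the function name is hypothetical):

```python
import bisect

def candidate_block(block_start_keys, key):
    """Find the block whose key range could contain `key`.

    block_start_keys: sorted first keys of each block in the file.
    Returns a block index ("the key *may* be in this block") or
    None ("the key is definitely before the first block").
    Either way, the block itself must still be loaded and scanned
    to learn whether the key is actually present.
    """
    i = bisect.bisect_right(block_start_keys, key) - 1
    return i if i >= 0 else None
```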

This also places a burden on the block cache, and you may create a lot of unnecessary churn that the bloom filters would help avoid: to perform the actual check, the RegionServer has to load the matching block and scan it to see if the key is actually present.

In a very busy system, using bloom filters with the matching update or read patterns can obviously save a huge amount of I/O. Bloom filters are also easily turned on or off, so you can try them out and closely observe how they improve your read performance. Let us know how you do!

Nicolas Spiegelberg, HBase Committer

I added the HBASE-1200 patch, so hopefully I can help answer any questions that you may have. Lars did a good job explaining the bloom filter overview. Bloom filters are currently hard to benchmark for performance, although FB has internally seen benefit in production from before/after perf graphs. Get/Scan(Row) currently does a parallel N-way get of that row from all StoreFiles in a region. This means that you are doing N read requests from disk. Bloom filters provide a lightweight in-memory structure to reduce those N disk reads to only the files likely to contain that row (N-B).
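That N-to-(N-B) reduction can be sketched as follows. `FakeBloom` here is a hypothetical stand-in for a per-HFile bloom filter; a real bloom answers the same question but may also return false positives.

```python
class FakeBloom:
    """Stand-in for a per-HFile bloom: an exact set (no false positives)."""

    def __init__(self, keys):
        self.keys = set(keys)

    def might_contain(self, key):
        return key in self.keys

def files_to_read(store_files, row_key):
    """Consult each store file's in-memory bloom before issuing a disk read.

    store_files: list of (name, bloom) pairs for one region's StoreFiles.
    Returns only the files whose bloom says the row may be present,
    turning N disk reads into N - B.
    """
    return [name for name, bloom in store_files if bloom.might_contain(row_key)]
```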

Since the reads are in parallel, you don't see much perf gain on an individual get, which is dominated by disk read latency. If we later allow a serial-get option (beneficial for data that is overwritten), then you'll see bloom filters have an immediate benefit on read latency. Where will you see benefit? Check out 'blockCacheHitRatio' in your RegionServer metrics (http://hbase.apache.org/docs/r0.... ). The block cache hit ratio should go up with bloom filters enabled, because the bloom filter is filtering out blocks that you don't need, which frees more room for valid blocks.

The reason why bloom filters aren't enabled by default is that the bloom data itself might be more heavyweight than your actual data. Think about it: say you're using a counter and have tiny row names, so maybe 20 bytes per KV. Then blooms would be 1/20th of your HFile (assuming no duplicates). In contrast, we're storing messages, so roughly 1 KB for the data + 100 B for the key, which makes the blooms roughly 1/1100 (or about 0.09%) of the HFile. In addition, we have duplicate keys, so that number is even lower. Additionally, we do multiple columns and have a row-level bloom. We have roughly a 56 KB bloom for a 5 GB file!
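The arithmetic behind these ratios, at roughly one bloom byte per unique key:

```python
def bloom_fraction(kv_bytes_per_entry, bloom_bytes_per_key=1.0):
    """Bloom size as a fraction of the KV data it covers,
    assuming ~1 bloom byte per unique key and no duplicate keys."""
    return bloom_bytes_per_key / kv_bytes_per_entry

# Tiny counter KVs, ~20 bytes each: the bloom is 1/20 (5%) of the file.
# A 1 KB message + 100 B key: the bloom is 1/1100, about 0.09%.
```

Duplicate keys shrink the fraction further, since each unique key contributes only one bloom entry no matter how many versions it has.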

Another big question: row bloom or row+col? It depends on your query pattern + put pattern + the size problem above. If you are doing row scans, then row+col doesn't buy you anything. A row bloom can filter on a row+col get, but not the other way around. However, the problem with row blooms is that you may have a ton of column-level puts and end up with a scenario where the row is present in every StoreFile. Then the bloom filter is wasteful because it's always positive.
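That compatibility rule can be summarized in a tiny sketch. The type names echo HBase's ROW and ROWCOL bloom types, but the function itself is an illustration, not an HBase API:

```python
def bloom_can_filter(bloom_type, query_type):
    """Which bloom type can filter which kind of point lookup.

    A ROW bloom answers "is this row in the file?", so it helps both
    whole-row gets and row+col gets. A ROWCOL bloom stores row+col
    pairs, so it can only answer exact row+col gets; it cannot rule
    out a whole row.
    """
    if bloom_type == "ROW":
        return query_type in ("ROW", "ROWCOL")
    if bloom_type == "ROWCOL":
        return query_type == "ROWCOL"
    return False  # "NONE" or unknown: no filtering
```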

Finally, if some other DB tells you that they have a one-size-fits-all bloom implementation, don't believe them! We might be able to later, but auto-tuning will require a lot of data mining analysis of your specific workload to get right. Ah, the fun of being a DBA! :)

