# Implementing a Bloom Filter with MurmurHash


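There are many introductions to Bloom filters online, but the ones I have seen do not cover the concrete implementation, in particular how to obtain the k mutually independent hash functions.

(For an introduction to Bloom filters and the related proofs, the Wikipedia article is the best I have read: http://en.wikipedia.org/wiki/Bloom_filter )

The concrete implementation steps are as follows: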

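(1) Determine the filter size:

Suppose the total number of items to handle is N and the tolerable false-positive rate is P. We first determine the number of slots in the filter: M = -N * ln(P) / (ln 2)^2. With M known, we can declare the filter array.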
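The sizing formula can be sketched in a few lines of C++. This is a minimal illustration, not part of the original post; the function name `bloom_slot_count` is an assumption made for this example, and the result is rounded up so the filter never has fewer slots than the formula asks for.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Number of slots M for n expected items and false-positive rate p,
// computed as M = -n * ln(p) / (ln 2)^2 and rounded up to a whole slot.
// The name bloom_slot_count is hypothetical, chosen for this sketch.
std::size_t bloom_slot_count(std::size_t n, double p) {
    assert(n > 0 && p > 0.0 && p < 1.0);
    const double ln2 = std::log(2.0);
    const double m = -static_cast<double>(n) * std::log(p) / (ln2 * ln2);
    return static_cast<std::size_t>(std::ceil(m));
}
```

For example, `bloom_slot_count(1000, 0.01)` gives 9586, that is, roughly 9.6 slots per item for a 1% false-positive rate.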

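(2) Determine the number of hash functions:

Using M from the previous step, the number of hash functions is K = (M / N) * ln 2, rounded to an integer.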
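A matching sketch for the number of hash functions follows. The name `bloom_hash_count` is again hypothetical, and rounding to the nearest integer with a floor of 1 is a choice made for this example rather than something the post prescribes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Number of hash functions K = (m / n) * ln 2, rounded to the nearest
// integer and clamped to at least 1.
// The name bloom_hash_count is hypothetical, chosen for this sketch.
std::size_t bloom_hash_count(std::size_t m, std::size_t n) {
    assert(m > 0 && n > 0);
    const double k = static_cast<double>(m) / static_cast<double>(n) * std::log(2.0);
    const long rounded = std::lround(k);
    return rounded < 1 ? 1 : static_cast<std::size_t>(rounded);
}
```

With M = 9586 and N = 1000 this yields K = 7.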

(3) Design the hash functions:

This is the key step. Wikipedia states the requirement on the hash functions explicitly:

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields.

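The requirement is that the functions be independent. If K is greater than 10, coming up with that many by hand is hardly feasible, so it is time to stand on the shoulders of giants.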
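The bit-field slicing described in the quoted passage can be sketched as follows. This assumes a 64-bit hash value is already available from some wide hash function, which is not what the rest of this post uses; the name `slice_hash` and the 16-bit field width are choices made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Slice one 64-bit hash value into k 16-bit fields (k between 1 and 4),
// starting from the least significant bits. Each field can then serve
// as the output of a "different" hash function.
// The name slice_hash is hypothetical, chosen for this sketch.
std::vector<std::uint16_t> slice_hash(std::uint64_t h, std::size_t k) {
    assert(k >= 1 && k <= 4);
    std::vector<std::uint16_t> fields;
    fields.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        fields.push_back(static_cast<std::uint16_t>((h >> (16 * i)) & 0xFFFFu));
    }
    return fields;
}
```

A 16-bit field can only address 65,536 slots, so this approach runs out of bits quickly for large filters or large K.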

I found a simple implementation of MurmurHash2; the code is as follows:

//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby

// Note - This code makes a few assumptions about how your machine behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
    // 'm' and 'r' are mixing constants generated offline.
    // They're not really 'magic', they just happen to work well.

    const unsigned int m = 0x5bd1e995;
    const int r = 24;

    // Initialize the hash to a 'random' value

    unsigned int h = seed ^ len;

    // Mix 4 bytes at a time into the hash

    const unsigned char * data = (const unsigned char *)key;

    while(len >= 4)
    {
        unsigned int k = *(unsigned int *)data;

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    // Handle the last few bytes of the input array
    // (each case intentionally falls through to the next)

    switch(len)
    {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0];
            h *= m;
    };

    // Do a few final mixes of the hash to ensure the last few
    // bytes are well-incorporated.

    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return h;
}


#include <iostream>

using namespace std;

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed );

int main() {
    // Hash the 4-byte key "abcd" with seed 1 and print the result
    unsigned int result = MurmurHash2("abcd",4,1);
    cout<<result<<endl;
}

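The output is: 3376380438

So how do we use this hash function in our filter? First, understand its parameters: key is your data, len is the length of the data in bytes, and seed selects the variant you want. Assign it 1, 2, 3, 4, ... and you get different hash functions.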

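Next, how do we map the return value onto the filter? Say our slots are numbered 0 to 1023: which slot does 3376380438 correspond to?

We can hash a second time by taking the value modulo the number of slots: 3376380438 % 1024 = 534, which gives the slot number.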
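The modulo step can be written as a small helper. This is only a sketch; the name `slot_index` is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>

// Map a 32-bit hash value onto a slot number in the range
// [0, slot_count) by taking the remainder.
// The name slot_index is hypothetical, chosen for this sketch.
std::size_t slot_index(unsigned int hash, std::size_t slot_count) {
    assert(slot_count > 0);
    return hash % slot_count;
}
```

Calling `slot_index(3376380438u, 1024)` returns 534, so a filter with 1024 slots uses indices 0 through 1023.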

Strictly speaking, though, the seed is intended for the following:

The seed parameter is a means for you to randomize the hash function. You should provide the same seed value for all calls to the hashing function in the same application of the hashing function. However, each invocation of your application (assuming it is creating a new hash table) can use a different seed, e.g., a random value.

Why is it provided?

One reason is that attackers may use the properties of a hash function to construct a denial of service attack. They could do this by providing strings to your hash function that all hash to the same value destroying the performance of your hash table. But if you use a different seed for each run of your program, the set of strings the attackers must use changes.

That's it: we now have the filter array and the hash functions, and we're done!
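Putting the pieces together, a minimal Bloom filter might look like the sketch below. It is an illustration under stated assumptions rather than a definitive implementation: the class name `BloomFilter` and its methods are invented for this example, seeds 1 through K stand in for the K hash functions as described above, and the hash is MurmurHash2 with the 4-byte read done through `memcpy` (a substitution that avoids the unaligned read the original listing relies on).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// MurmurHash2 with the 4-byte load done via memcpy (a change made
// for this sketch); results still depend on machine endianness.
std::uint32_t murmur2(const void* key, std::size_t len, std::uint32_t seed) {
    const std::uint32_t m = 0x5bd1e995;
    const int r = 24;
    std::uint32_t h = seed ^ static_cast<std::uint32_t>(len);
    const unsigned char* data = static_cast<const unsigned char*>(key);
    while (len >= 4) {
        std::uint32_t k;
        std::memcpy(&k, data, 4);
        k *= m; k ^= k >> r; k *= m;
        h *= m; h ^= k;
        data += 4; len -= 4;
    }
    switch (len) {  // tail bytes; each case falls through on purpose
        case 3: h ^= static_cast<std::uint32_t>(data[2]) << 16;  // fall through
        case 2: h ^= static_cast<std::uint32_t>(data[1]) << 8;   // fall through
        case 1: h ^= data[0]; h *= m;
    }
    h ^= h >> 13; h *= m; h ^= h >> 15;
    return h;
}

// Hypothetical minimal Bloom filter: one bit per slot, seeds 1..hashes
// act as the different hash functions.
class BloomFilter {
public:
    BloomFilter(std::size_t slots, std::uint32_t hashes)
        : bits_(slots, false), hashes_(hashes) {
        assert(slots > 0 && hashes > 0);
    }

    void add(const std::string& item) {
        for (std::uint32_t seed = 1; seed <= hashes_; ++seed) {
            bits_[slot(item, seed)] = true;
        }
    }

    // false: the item was definitely never added.
    // true: the item was probably added (false positives are possible).
    bool maybe_contains(const std::string& item) const {
        for (std::uint32_t seed = 1; seed <= hashes_; ++seed) {
            if (!bits_[slot(item, seed)]) return false;
        }
        return true;
    }

private:
    std::size_t slot(const std::string& item, std::uint32_t seed) const {
        return murmur2(item.data(), item.size(), seed) % bits_.size();
    }

    std::vector<bool> bits_;
    std::uint32_t hashes_;
};
```

A filter built as `BloomFilter f(9586, 7)` matches the sizing for 1000 items at a 1% false-positive rate; after `f.add("abcd")`, `f.maybe_contains("abcd")` returns true, while a lookup on an empty filter always returns false.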

Also, I recommend a good article on Bloom filters that analyzes performance, folding, and dynamic scaling: http://www.yankay.com/%E6%9F%A5%E8%AF%A2%E5%88%A9%E5%99%A8-bloom-filter%E8%AF%A6%E8%A7%A3/
