BitSet and Bloom Filter

weixin_34406796

Original article: https://my.oschina.net/ydsakyclguozi/blog/603084

Bloom Filter

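A Bloom filter is a bit-vector data structure proposed by Burton Howard Bloom in 1970. It is very efficient in both space and time, and it is used to test whether an element is a member of a set. If the test says yes, the element is not necessarily in the set; if the test says no, the element is definitely not in the set. A Bloom filter therefore has 100% recall. Every query returns one of two answers: "in the set (possibly wrong)" or "not in the set (definitely not)". In other words, a Bloom filter gives up some accuracy in exchange for saving space.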

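Bloom filters do have a drawback, mainly false positives: for a filter of fixed size, the false-positive rate rises as more elements are inserted. One workaround is to keep a separate list of the values that are known to be misjudged.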
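To make that growth concrete, here is a small sketch based on the standard approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions, and n the number of inserted elements. The class name, method name, and parameter values are made up for this illustration.

```java
// Illustrative sketch only: BloomMath is a hypothetical helper, not part of any library.
final class BloomMath {
    /** Approximate false-positive rate (1 - e^(-k*n/m))^k for m bits, k hashes, n inserted items. */
    static double estimateFalsePositiveRate(long m, int k, long n) {
        double fractionOfBitsSet = 1.0 - Math.exp(-((double) k * n) / m);
        return Math.pow(fractionOfBitsSet, k);
    }

    public static void main(String[] args) {
        long m = 1L << 20; // 1,048,576 bits, i.e. 128 KiB (arbitrary example size)
        int k = 7;         // arbitrary example hash count
        for (long n : new long[] {50_000, 100_000, 200_000, 400_000}) {
            System.out.printf("n=%d -> p=%.6f%n", n, estimateFalsePositiveRate(m, k, n));
        }
    }
}
```

With m and k held fixed, the estimate only goes up as n grows, which is why a filter should be sized for the number of elements it is expected to hold.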

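The defining property of a Bloom filter is that it never produces a false negative: if contains() returns false, the element is definitely not in the set. It does produce some false positives: if contains() returns true, the element is only possibly in the set.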
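The behaviour described above can be sketched in a few lines on top of java.util.BitSet. This is a minimal illustration, not how Guava, Hadoop, or Elasticsearch implement it: the class name SimpleBloomFilter is hypothetical, and the hashing scheme (String.hashCode() combined with an FNV-1a-style second hash via double hashing) is an assumption chosen for brevity.

```java
import java.util.BitSet;

// Minimal sketch; SimpleBloomFilter and its hashing scheme are illustrative assumptions.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Derives numHashes bit positions from two base hashes (double hashing). */
    private int[] positions(String item) {
        int h1 = item.hashCode();
        int h2 = 0x811C9DC5; // FNV-1a-style second hash over the chars
        for (int c = 0; c < item.length(); c++) {
            h2 = (h2 ^ item.charAt(c)) * 0x01000193;
        }
        int[] result = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            // floorMod keeps the position in [0, numBits) even when the sum overflows to a negative int
            result[i] = Math.floorMod(h1 + i * h2, numBits);
        }
        return result;
    }

    /** Sets every bit position derived from the item. */
    void add(String item) {
        for (int p : positions(item)) {
            bits.set(p);
        }
    }

    /** false means "definitely not added"; true means "possibly added". */
    boolean contains(String item) {
        for (int p : positions(item)) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }
}
```

Because add() sets every bit that contains() later checks, an added element can never be reported as absent. An element that was never added can still be reported as present when other elements happen to have set all of its positions.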

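Many open-source frameworks ship a Bloom filter implementation, for example:

Elasticsearch: org.elasticsearch.common.util.BloomFilter

Guava: com.google.common.hash.BloomFilter

Hadoop: org.apache.hadoop.util.bloom.BloomFilter (built on BitSet)

The source code is worth reading if you are interested.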

How BitSet Works

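Finally, a look at how BitSet works. A BitSet is an object for bit-level operations in which every value is either 0 or 1. Internally it is a long array that initially holds a single long, so the smallest size() of a BitSet is 64. When the stored data outgrows the initial array, BitSet expands it dynamically, so the data ends up stored in N longs. This growth works much like that of the List, Set, and Map implementations, and it is transparent to the user.
1 GB of memory holds 8*1024*1024*1024 = 8589934592 bits, so it can flag roughly 8.6 billion distinct numbers.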

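BitSet uses one bit to record whether a value has appeared: 0 means it has not, 1 means it has. One element of the long array can hold 64 such flags, because a Java long occupies 8 bytes = 64 bits. The source code shows the details: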

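First, the implementation of set:

public void set(int bitIndex) {
    if (bitIndex < 0)   // the index being set must not be negative
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    // wordIndex() shifts bitIndex right by 6 bits, so every run of 64 indexes shares one slot in the long array
    int wordIndex = wordIndex(bitIndex);
    expandTo(wordIndex);

    // a long shift uses only the low 6 bits of bitIndex, so this sets bit (bitIndex % 64) of that word
    words[wordIndex] |= (1L << bitIndex); // Restores invariants
    checkInvariants();
}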
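The index arithmetic used by set can be tried in isolation. The sketch below is not JDK code: BitIndexDemo and its two methods are hypothetical helpers that only mirror the shift and mask expressions from the excerpt above.

```java
// Illustrative sketch; BitIndexDemo is a hypothetical helper, not part of the JDK.
final class BitIndexDemo {
    /** Index of the long that holds bitIndex; same as bitIndex / 64 for non-negative input. */
    static int wordIndex(int bitIndex) {
        return bitIndex >> 6;
    }

    /** Single-bit mask inside that long. Java uses only the low 6 bits of a long shift count. */
    static long bitMask(int bitIndex) {
        return 1L << bitIndex;
    }

    public static void main(String[] args) {
        for (int bitIndex : new int[] {0, 63, 64, 70, 130}) {
            System.out.println(bitIndex + " -> word " + wordIndex(bitIndex)
                    + ", mask 0x" + Long.toHexString(bitMask(bitIndex)));
        }
    }
}
```

Indexes 0 to 63 land in word 0, 64 to 127 in word 1, and so on. Since the shift count wraps at 64, bitMask(70) is the same mask as 1L << 6, so no explicit modulo is needed.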

The implementation of get:

public boolean get(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    checkInvariants();

    // same as in set: find which element of the long array holds this index
    int wordIndex = wordIndex(bitIndex);
    // an index beyond the words in use is simply false; otherwise test the bit inside that long
    return (wordIndex < wordsInUse)
        && ((words[wordIndex] & (1L << bitIndex)) != 0);
}
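Taken together, set and get give the "has this value appeared before" check described earlier. Here is a minimal sketch using the public java.util.BitSet API; the SeenBefore class and findDuplicates method are invented for this example, and the input values are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch; SeenBefore is a hypothetical helper built on java.util.BitSet.
final class SeenBefore {
    /** Returns each value that occurs more than once, in the order its first repeat is met. */
    static List<Integer> findDuplicates(int[] values) {
        BitSet seen = new BitSet();      // bit v is 1 once v has appeared
        BitSet reported = new BitSet();  // bit v is 1 once v has been added to the result
        List<Integer> duplicates = new ArrayList<>();
        for (int v : values) {           // assumption: every v is >= 0
            if (!seen.get(v)) {
                seen.set(v);
            } else if (!reported.get(v)) {
                duplicates.add(v);
                reported.set(v);
            }
        }
        return duplicates;
    }
}
```

For example, findDuplicates(new int[] {3, 70, 3, 1000, 70, 3}) yields the list [3, 70]. Each value costs one bit per BitSet regardless of how often it occurs.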

Dynamic capacity expansion in BitSet:

private void ensureCapacity(int wordsRequired) {
    if (words.length < wordsRequired) {
        // Allocate larger of doubled size or required size
        // By default the capacity doubles; if more than double is required, the required size is used instead.
        int request = Math.max(2 * words.length, wordsRequired);
        // wordsRequired = (the bit index passed in >> 6) + 1
        words = Arrays.copyOf(words, request);
        sizeIsSticky = false;
    }
}
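The effect of this growth policy can be observed from outside through size(), which reports the number of bits currently allocated. The sketch below is only a probe: BitSetGrowthDemo and sizesAfterSetting are hypothetical names, and the exact numbers depend on the JDK implementation in use.

```java
import java.util.BitSet;

// Illustrative sketch; BitSetGrowthDemo is a hypothetical helper for observing allocation.
final class BitSetGrowthDemo {
    /** Returns size() of a fresh BitSet, then size() after setting each index in turn. */
    static int[] sizesAfterSetting(int... indices) {
        BitSet bits = new BitSet();
        int[] sizes = new int[indices.length + 1];
        sizes[0] = bits.size();
        for (int i = 0; i < indices.length; i++) {
            bits.set(indices[i]);
            sizes[i + 1] = bits.size();
        }
        return sizes;
    }

    public static void main(String[] args) {
        for (int size : sizesAfterSetting(64, 1000)) {
            System.out.println(size);
        }
    }
}
```

Tracing the ensureCapacity code quoted above, sizesAfterSetting(64, 1000) should come out as 64, 128, 1024 on an implementation that matches it. Setting index 64 needs 2 words, which equals the doubled size. Setting index 1000 needs (1000 >> 6) + 1 = 16 words, which exceeds the doubled size of 4, so the required size wins.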