A Quick Introduction to Bloom Filters

Engureggg

Last modified: 2022-06-12 22:35:12

Tags: java

First published: 2021-08-31 22:05:45

Article link: https://blog.csdn.net/qq_43341057/article/details/120027164

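Bloom filter

How it works: a set of mapping (hash) functions scatters each element onto positions in a bit vector.

Checking whether an element exists:

If the filter reports that an element is not in the set, the answer is always correct (the element really is absent)
If the filter reports that an element is in the set, there is some probability that the answer is wrong (the element may in fact not be in the set)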
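
The two rules above can be sketched with a toy filter. This is only an illustration, not Guava's implementation: the class name SimpleBloomFilter, the FNV-1a-style second hash, and the way the k positions are derived from two base hashes are all assumptions made for the example.

```java
import java.util.BitSet;

/** A toy Bloom filter for strings; illustrative only, not Guava's implementation. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashFunctions;

    SimpleBloomFilter(int numBits, int numHashFunctions) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    /** Second base hash (FNV-1a style, applied per char); an arbitrary choice for this sketch. */
    private static int secondHash(String s) {
        int h = 0x811c9dc5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    /** Position chosen by the i-th mapping function: (h1 + i * h2) mod numBits. */
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = secondHash(s);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Sets the bits for this element; the element itself is never stored. */
    void put(String s) {
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(index(s, i));
        }
    }

    boolean mightContain(String s) {
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(index(s, i))) {
                return false; // a zero bit means the element was definitely never added
            }
        }
        return true; // all bits set: probably added, but other elements may have set them
    }
}
```

An element that was added always finds all of its bits set, so "absent" answers are never wrong. A "present" answer only means that every bit happens to be set, which other elements can cause.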

[Figure: schematic of a Bloom filter (image taken from the web; will be removed on request)]

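Characteristics

Advantages: space efficiency and query time are far better than those of general-purpose lookup structures
Disadvantages: a nonzero false positive rate, and elements are hard to delete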
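
Why deletion is hard can be shown with a tiny hand-built case. The bit positions below are made up for the illustration (element A maps to bits 1 and 3, element B to bits 3 and 5); a real filter would compute them with hash functions.

```java
import java.util.BitSet;

/** Shows that clearing one element's bits can break lookups for another element. */
final class NaiveDeleteDemo {
    static void setAll(BitSet bits, int[] positions) {
        for (int p : positions) bits.set(p);
    }

    static void clearAll(BitSet bits, int[] positions) {
        for (int p : positions) bits.clear(p);
    }

    static boolean mightContain(BitSet bits, int[] positions) {
        for (int p : positions) {
            if (!bits.get(p)) return false;
        }
        return true;
    }

    /** Returns whether B is still reported as present after A is naively "deleted". */
    static boolean bSurvivesDeletingA() {
        BitSet bits = new BitSet(8);
        int[] a = {1, 3}; // hypothetical positions for element A
        int[] b = {3, 5}; // hypothetical positions for element B; bit 3 is shared with A
        setAll(bits, a);
        setAll(bits, b);
        clearAll(bits, a); // naive delete: also clears the shared bit 3
        return mightContain(bits, b);
    }

    public static void main(String[] args) {
        System.out.println(bSurvivesDeletingA()); // prints false
    }
}
```

Because bit 3 is shared, deleting A turns B into a false negative, which breaks the guarantee that "absent" answers are always correct.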

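Question

Q: The filter says an element exists, but it actually does not. How do we find out whether it really exists?
A: The filter intercepts most lookups. The small remainder that it reports as present may be wrong, so those lookups need to be verified by querying the database.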
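
A minimal sketch of that two-step check follows. The class VerifiedLookup is hypothetical, and a plain Set stands in for the database; the filter is passed in as a Predicate so that any implementation can be plugged in.

```java
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical helper: asks the filter first and only then the authoritative store. */
final class VerifiedLookup {
    private final Predicate<String> mightContain; // the Bloom filter check
    private final Set<String> database;           // stand-in for a real database query
    private int databaseQueries = 0;

    VerifiedLookup(Predicate<String> mightContain, Set<String> database) {
        this.mightContain = mightContain;
        this.database = database;
    }

    boolean exists(String key) {
        if (!mightContain.test(key)) {
            return false; // "absent" is always correct, so the database is skipped
        }
        databaseQueries++;
        return database.contains(key); // "present" may be a false positive, so verify it
    }

    /** Number of lookups that actually reached the database. */
    int databaseQueries() {
        return databaseQueries;
    }
}
```

With Guava, the first constructor argument could be the method reference bloomFilter::mightContain. Only keys that the filter reports as present ever reach the database.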

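Use cases

Word processors: checking whether an English word is spelled correctly
At the FBI: checking whether a suspect's name is already on the suspect list
Web crawlers: checking whether a URL has already been visited
Email spam filtering
The Bitcoin network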

Demo program

Create a Maven project and add the Guava dependency:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>19.0</version>
</dependency>

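The demo program:

Puts 1,000,000 items into the Bloom filter
Looks up those same 1,000,000 items and checks whether any of them is reported as absent
Looks up another 1,000,000 items that were never added and measures how often they are reported as present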

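import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.Charset;
import org.junit.Test; // assumes JUnit 4; with JUnit 5 import org.junit.jupiter.api.Test instead

// The class name is arbitrary; the original post showed only the test method
public class BloomFilterDemoTest {

    @Test
    public void mmm() {
        int size = 1000_000;

        BloomFilter<String> bloomFilter = BloomFilter.create(Funnels.stringFunnel(Charset.defaultCharset()), size);

        // Put 1,000,000 integers (as strings) into the filter; the values themselves are not stored
        for (int i = 0; i < size; i++) {
            bloomFilter.put(i + "");
        }

        // Look up the elements that were added
        for (int i = 0; i < size; i++) {
            boolean exist = bloomFilter.mightContain(i + "");
            if (!exist) {
                // "Absent" is always correct, and 0 through 999,999 were all added,
                // so this line never runs
                System.out.println("Found a false negative: " + i);
            }
        }

        // Measure how often elements that were never added are reported as present
        int mistake = 0;
        for (int i = size; i < size * 2; i++) {
            boolean exist = bloomFilter.mightContain(i + "");
            if (exist) {
                // "Present" may be right or wrong; none of these were added,
                // so every hit is a false positive
                mistake++;
            }
        }
        System.out.println("False positive rate: " + (mistake * 1.0) / size); // the author's run printed 0.030094
    }
}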

Where does 0.030094 come from?

// Excerpt from the Bloom filter source code
public static <T> BloomFilter<T> create(
        Funnel<? super T> funnel,
        long expectedInsertions) {
    return create(funnel, expectedInsertions, 0.03); // the default false positive probability is 0.03
    // FYI, for 3%, we always get 5 hash functions
}

public static <T> BloomFilter<T> create(
        Funnel<? super T> funnel,
        long expectedInsertions,
        double fpp) {
    return create(funnel, expectedInsertions, fpp,
            BloomFilterStrategies.MURMUR128_MITZ_64);
}

// funnel: the funnel of T's that the constructed BloomFilter<T> will use
// expectedInsertions: the number of expected insertions to the constructed BloomFilter<T>
// fpp: the desired false positive probability (must be positive and less than 1.0)
static <T> BloomFilter<T> create(
        Funnel<? super T> funnel,
        long expectedInsertions,
        double fpp,
        Strategy strategy) {

    // Compute the length of the bit vector
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    // Compute the number of hash functions
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
        // Create the Bloom filter
        return new BloomFilter<T>(new BitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
        throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
}

Note: fpp (false positive probability) is the probability that the filter reports an element as present although it was never added.

Debugging

fpp = 0.03, expectedInsertions = 1000000

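Key steps:

numBits = optimalNumOfBits(expectedInsertions, fpp): computes the bit vector length from the false positive rate and the number of elements
numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits): computes the number of hash functions from the number of elements and the bit vector length
new BloomFilter<T>(new BitArray(numBits), numHashFunctions, funnel, strategy): returns the Bloom filter object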
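
The first two steps can be reproduced with the textbook sizing formulas m = -n * ln(p) / (ln 2)^2 and k = max(1, round(m / n * ln 2)). The sketch below assumes that Guava's helpers of the same names follow these formulas; the class BloomSizing is a stand-in written for this example, not Guava code.

```java
/** Textbook Bloom filter sizing; assumed here to mirror what Guava's helpers compute. */
final class BloomSizing {
    /** Bit vector length: m = -n * ln(p) / (ln 2)^2, truncated to a whole number of bits. */
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Number of hash functions: k = max(1, round(m / n * ln 2)). */
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // expectedInsertions used in the demo
        for (double p : new double[] {0.03, 0.0001}) {
            long m = optimalNumOfBits(n, p);
            int k = optimalNumOfHashFunctions(n, m);
            System.out.println("fpp=" + p + " numBits=" + m + " numHashFunctions=" + k);
        }
    }
}
```

For 1,000,000 expected insertions this gives roughly 7.3 million bits and 5 hash functions at fpp = 0.03, and roughly 19.2 million bits and 13 hash functions at fpp = 0.0001.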

Tuning the parameter: set the false positive rate to 0.0001

BloomFilter<String> bloomFilter =
    BloomFilter.create(Funnels.stringFunnel(Charset.defaultCharset()),
                       size,
                       0.0001);

Debug result and comparison

Parameter	fpp = 0.03	fpp = 0.0001
numBits	about 7,300,000	about 19,200,000
numHashFunctions	5	13

Conclusion: a lower false positive rate requires a larger bit vector
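
The conclusion can be cross-checked with the standard approximation p ≈ (1 - e^(-k*n/m))^k for n inserted elements, m bits and k hash functions. In the sketch below the class name is made up, and the bit counts passed in main are assumptions derived from the usual sizing formula, not values read from the author's debugger.

```java
/** Estimates the false positive probability of a Bloom filter from its parameters. */
final class FalsePositiveEstimate {
    /** Standard approximation: p = (1 - e^(-k * n / m))^k. */
    static double expectedFpp(long numBits, int numHashFunctions, long insertions) {
        double exponent = -(double) numHashFunctions * insertions / numBits;
        return Math.pow(1 - Math.exp(exponent), numHashFunctions);
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        // Assumed sizes: about 7.3 million bits with 5 hashes, about 19.2 million bits with 13
        System.out.println(expectedFpp(7_298_440L, 5, n));
        System.out.println(expectedFpp(19_170_116L, 13, n));
    }
}
```

With these inputs the estimate comes out at roughly 0.03 and roughly 0.0001, which is consistent with the 0.030094 measured by the demo: the larger bit vector is what buys the lower error rate.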
