Data Algorithms: Bloom Filter

Latest recommended article published on 2025-03-21 18:49:59

GatsbyNewton

Article link: https://blog.csdn.net/u010376788/article/details/106033006

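When you register an account on a website or platform with hundreds of millions of users, you may type a username, press Enter, and immediately be told that the username already exists. How does the system decide so quickly whether a username is taken? There are several possible approaches: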

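Linear search: both the time cost and the space cost are high.
Binary search: all usernames must be sorted first, and sorting hundreds of millions of records is resource-intensive.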

Another data structure, the Bloom Filter, can also solve this problem. Understanding a Bloom Filter requires some familiarity with hashing, which this article does not cover.

1. What Is a Bloom Filter?

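A Bloom Filter is a space-efficient probabilistic data structure, proposed by Burton Howard Bloom in 1970, for testing whether an element belongs to a set. A query may produce a false positive, but never a false negative. In other words, a query returns one of: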

possibly in the set;
definitely not in the set.

Properties of a Bloom Filter:

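Unlike a standard hash table, a fixed-size Bloom Filter can represent a set with an arbitrary number of elements.
Given a large set $S = \{x_{1}, x_{2}, \cdots, x_{n}\}$, a Bloom Filter essentially approximates the set membership operation $x \in S$ (does $x$ belong to $S$?).
A Bloom Filter may produce false positives. These are tolerated because they only cost one extra data access and do not lead to a wrong final answer. That is, for some element $x$ that is not in the set, the Bloom Filter may report that $x$ is in the set.
A Bloom Filter never produces false negatives, because those would lead to a wrong answer. In other words, if $x$ is in the set, the Bloom Filter always reports that $x$ is in the set.
Adding an element never fails. However, the false positive rate rises as elements are added, until every bit in the Bloom Filter is set to 1, at which point every query returns a positive result.
Elements cannot be removed from a Bloom Filter, because clearing the bits at the indexes produced by the $k$ hash functions in order to remove one element may also remove other elements. For example, removing bloom, whose index bits are (2, 5, 6), from Figure 1 also removes filter, because index bit 5, which filter shares, is cleared.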
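The deletion problem can be reproduced in a few lines. The sketch below is only an illustration: the class and method names are hypothetical, and the index triples for bloom (2, 5, 6) and filter (1, 5, 8) are hard-coded from the article's example instead of being computed by real hash functions.

```java
import java.util.BitSet;

class DeletionHazard {
    // Hard-coded index triples taken from the article's example (assumption:
    // a real filter would derive these from k hash functions).
    static final int[] BLOOM = {2, 5, 6};
    static final int[] FILTER = {1, 5, 8};

    // Insert: set every bit the element maps to.
    static void insert(BitSet bits, int[] indexes) {
        for (int i : indexes) {
            bits.set(i);
        }
    }

    // Query: the element may be present only if all of its bits are 1.
    static boolean mightContain(BitSet bits, int[] indexes) {
        for (int i : indexes) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    // Naive "delete": clear every bit the element maps to.
    static void naiveDelete(BitSet bits, int[] indexes) {
        for (int i : indexes) {
            bits.clear(i);
        }
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(10);
        insert(bits, BLOOM);
        insert(bits, FILTER);
        naiveDelete(bits, BLOOM);
        // Bit 5 was shared, so filter is no longer reported as present.
        System.out.println("filter still found? " + mightContain(bits, FILTER));
    }
}
```

After the naive delete, the query for filter fails even though filter was never removed, which is exactly the false negative a Bloom Filter must not produce.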

In principle, two kinds of error are possible:

False positive errors: $x \notin S$, but the answer is $x \in S$.
False negative errors: $x \in S$, but the answer is $x \notin S$.

2. False Positive Rate of a Bloom Filter

One shortcoming of a Bloom Filter is that false positives exist: an element that is not in the set may be misjudged as being in the set.

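The false positive probability is computed as follows. Suppose the Bloom Filter has $m$ bits and holds $n$ elements, and each element is hashed by $k$ hash functions; some of the $m$ bits are 1 and some are 0. First compute the probability that a given bit is 0. When one element is inserted, its first hash function sets some bit of the filter to 1. Assuming the hash is uniformly distributed, the probability that any particular bit is set to 1 is $\frac{1}{m}$, and the probability that it stays 0 is $1 - \frac{1}{m}$.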

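For a particular position in the filter, the probability that none of this element's $k$ hash functions sets it to 1 is $(1 - \frac{1}{m})^{k}$. If a second element is inserted and the position is still not set to 1, the probability is $(1 - \frac{1}{m})^{2k}$. If $n$ elements have been inserted and the position is still not set to 1, the probability is $(1 - \frac{1}{m})^{nk}$. Conversely, the probability that a bit has been set to 1 after $n$ insertions is $1 - (1 - \frac{1}{m})^{nk}$.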

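Now suppose all $n$ elements have been placed in the Bloom Filter and a new element arrives that is not in the set. Because its hash functions are random, the probability that any one of them hits a bit whose value is 1 is $1 - (1 - \frac{1}{m})^{nk}$. For an element outside the set to be misjudged as a member, the bits for all of its hash functions must be 1, so the false positive probability is:
$f = [1 - (1 - \frac{1}{m})^{kn}]^{k}$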

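Since $\lim_{m \to \infty}(1 - \frac{1}{m})^{m} = \frac{1}{e}$, we have $(1 - \frac{1}{m})^{kn} \approx e^{-\frac{kn}{m}}$ for large $m$. Letting $p = e^{-\frac{n}{m}}$ gives:
$f(k) = [1 - (1 - \frac{1}{m})^{kn}]^{k} \approx (1 - e^{-\frac{kn}{m}})^{k} = (1 - p^{k})^{k}$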
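To get a feel for how close the approximation is, the two expressions can be evaluated side by side. This is a minimal sketch; the class name and the parameter values ($m = 10000$, $n = 1000$, $k = 7$) are assumptions chosen for illustration and do not come from the article.

```java
class FalsePositiveRate {
    // Exact expression: [1 - (1 - 1/m)^(kn)]^k
    static double exact(long m, long n, int k) {
        double bitStillZero = Math.pow(1.0 - 1.0 / m, (double) k * n);
        return Math.pow(1.0 - bitStillZero, k);
    }

    // Approximation: (1 - e^(-kn/m))^k
    static double approx(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 10_000; // bits (illustrative value)
        long n = 1_000;  // inserted elements (illustrative value)
        int k = 7;       // hash functions (illustrative value)
        System.out.println("exact  = " + exact(m, n, k));
        System.out.println("approx = " + approx(m, n, k));
    }
}
```

With these values both expressions come out a little above 0.8%, and they differ only far beyond the decimal places that matter in practice.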

Taking logarithms gives $\ln f(k) = k\ln(1 - p^{k})$. Differentiating with respect to $k$ yields:
$f'(k) = [\ln(1 - p^{k}) - \frac{p^{k}\ln p^{k}}{1 - p^{k}}] \cdot (1 - p^{k})^{k}$

To find the extremum, set $f'(k) = 0$. Since $(1 - p^{k})^{k} > 0$, this gives:
$(1 - p^{k})\ln(1 - p^{k}) = p^{k}\ln p^{k}$

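Hence $p^{k} = \frac{1}{2}$. Substituting $p = e^{-\frac{n}{m}}$ shows that the false positive probability is minimized when the number of hash functions $k$, the number of bits $m$ in the Bloom Filter, and the data set size $n$ satisfy:
$k = \frac{m}{n}\ln 2$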
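Because $k$ must be an integer in practice, it can help to confirm the result numerically. The sketch below scans integer values of $k$ using the approximate formula $(1 - e^{-kn/m})^{k}$; the class name, the scan limit, and the values $m = 10000$ and $n = 1000$ are illustrative assumptions.

```java
class OptimalHashCount {
    // Approximate false positive rate: (1 - e^(-kn/m))^k
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Brute-force scan over k = 1..maxK for the smallest rate.
    static int bestK(long m, long n, int maxK) {
        int best = 1;
        for (int k = 2; k <= maxK; k++) {
            if (falsePositiveRate(m, n, k) < falsePositiveRate(m, n, best)) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long m = 10_000; // bits (illustrative value)
        long n = 1_000;  // elements (illustrative value)
        double formulaK = (double) m / n * Math.log(2);
        System.out.println("k from formula   = " + formulaK);
        System.out.println("best integer k   = " + bestK(m, n, 20));
    }
}
```

For these values the formula gives a $k$ just under 7, and the scan agrees that 7 is the best integer choice.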

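Substituting $k = \frac{m}{n}\ln 2$ into $f(k)$ gives:
$f = [1 - e^{-(\frac{m}{n}\ln 2)\frac{n}{m}}]^{\frac{m}{n}\ln 2}$

Simplifying, with $p$ now denoting the target false positive probability (as in the code later in this article) and no longer $e^{-\frac{n}{m}}$, gives:
$m = -\frac{n\ln p}{(\ln 2)^{2}}$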
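The simplification can be spelled out step by step. Here $p$ stands for the target false positive probability (not the earlier $e^{-\frac{n}{m}}$), and the values $n = 10^{6}$ and $p = 0.01$ in the last line are illustrative assumptions, not figures from the article:

$f = (1 - e^{-\ln 2})^{\frac{m}{n}\ln 2} = (\frac{1}{2})^{\frac{m}{n}\ln 2}$
$p = (\frac{1}{2})^{\frac{m}{n}\ln 2} \Rightarrow \ln p = -\frac{m}{n}(\ln 2)^{2} \Rightarrow m = -\frac{n\ln p}{(\ln 2)^{2}}$
$n = 10^{6},\ p = 0.01:\quad m \approx \frac{10^{6} \times 4.605}{0.4805} \approx 9.6 \times 10^{6}\ \text{bits},\quad k = \frac{m}{n}\ln 2 \approx 6.64$

So a filter for one million elements at a 1% false positive rate would need roughly 1.2 MB of bits and about 7 hash functions.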

A more detailed derivation can be found in the Wikipedia article on Bloom filters.

3. Bloom Filter Example

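Consider a Bloom Filter with 10 bits ($m = 10$) and three hash functions $H_{1}(x), H_{2}(x), H_{3}(x)$, where $H(x) = \left \{ H_{1}(x), H_{2}(x), H_{3}(x) \right \}$. As shown in Figure 2, we initialize a 10-bit array $B$ to 0.
Figure 2: Bit array initialization
After inserting the element bloom, with $H(\text{bloom}) = (2, 5, 6)$, the array looks like Figure 3.
Figure 3: Inserting the element bloom
After then inserting the element filter, with $H(\text{filter}) = (1, 5, 8)$, the array looks like Figure 4.
Figure 4: Inserting the element filter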
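[Experiment 1] Suppose we query the element test, with $H(\text{test}) = (5, 8, 9)$. The Bloom Filter decides that test is not in the set, because $B[9] = 0$.

[Experiment 2] Suppose we query the element hello, with $H(\text{hello}) = (2, 5, 8)$. The Bloom Filter decides that hello is in the set. Although hello is in fact not in the set, the bits at all of its hashed indexes are 1. This is a false positive.

[Experiment 3] Suppose we query the element world, with $H(\text{world}) = (1, 2, 6)$. The Bloom Filter decides that world is in the set. Likewise, although world is in fact not in the set, the bits at all of its hashed indexes are 1. This is also a false positive.

[Experiment 4] Suppose we query the element bloom, with $H(\text{bloom}) = (2, 5, 6)$. The Bloom Filter decides that bloom is in the set, because the bits at all of its hashed indexes are 1.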
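The four experiments can be replayed with a 10-bit `BitSet`. This is a sketch under stated assumptions: the class and method names are hypothetical, the index triples are hard-coded from the example in place of real hash functions, and `Map.of` requires Java 9 or later.

```java
import java.util.BitSet;
import java.util.Map;

class WorkedExample {
    // Index triples copied from the example; stand-ins for H1..H3.
    static final Map<String, int[]> H = Map.of(
        "bloom", new int[] {2, 5, 6},
        "filter", new int[] {1, 5, 8},
        "test", new int[] {5, 8, 9},
        "hello", new int[] {2, 5, 8},
        "world", new int[] {1, 2, 6});

    // Build the 10-bit array B and insert bloom and filter.
    static BitSet build() {
        BitSet b = new BitSet(10);
        for (String item : new String[] {"bloom", "filter"}) {
            for (int i : H.get(item)) {
                b.set(i);
            }
        }
        return b;
    }

    // An element is reported as present only if all of its bits are 1.
    static boolean mightContain(BitSet b, String item) {
        for (int i : H.get(item)) {
            if (!b.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BitSet b = build();
        for (String q : new String[] {"test", "hello", "world", "bloom"}) {
            String verdict = mightContain(b, q) ? "possibly in the set" : "definitely not in the set";
            System.out.println(q + " -> " + verdict);
        }
    }
}
```

Only test is rejected; hello and world are accepted even though they were never inserted, which matches the false positives in Experiments 2 and 3.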

A simple Bloom Filter implementation:

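import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.Objects;

// Note: MurmurHash is an external helper that this article does not show. It is
// assumed to expose a static hash(byte[] data, int seed) method returning an int.
public class BloomFilter {
    /* Size of the bit array, in bits */
    private int size;

    /* Number of hash functions */
    private int hashCount;

    /* Bit array backing the Bloom Filter */
    private BitSet bitArray;

    /* Target false positive probability of the Bloom Filter */
    private float falsePositiveProb;

    /**
     * @param itemCount expected number of elements stored in the Bloom Filter
     * @param falsePositiveProb target false positive probability of the Bloom Filter
     */
    public BloomFilter(int itemCount, float falsePositiveProb) {
        this.size = this.computeSize(itemCount, falsePositiveProb);
        this.hashCount = this.computeHashCount(size, itemCount);
        this.falsePositiveProb = falsePositiveProb;
        this.bitArray = new BitSet(size);
    }

    /**
     * Computes the bit array size with the formula m = -(n * ln(p)) / (ln(2)^2).
     * @param n expected number of elements stored in the Bloom Filter
     * @param p target false positive probability of the Bloom Filter
     * @return size of the bit array
     */
    private int computeSize(int n, float p) {
        double m = -(n * Math.log(p)) / (Math.pow(Math.log(2), 2));
        // Round up so the array is never smaller than the formula requires.
        return (int) Math.ceil(m);
    }

    /**
     * Computes the number of hash functions with the formula k = (m/n) * ln(2).
     * @param m size of the bit array
     * @param n expected number of elements stored in the Bloom Filter
     * @return number of hash functions
     */
    private int computeHashCount(int m, int n) {
        // Divide as doubles; the integer division m / n would truncate the ratio.
        double k = ((double) m / n) * Math.log(2);
        // Use at least one hash function, otherwise check() would always return true.
        return Math.max(1, (int) Math.round(k));
    }

    /* Maps an item to a bit index, using the loop counter as the hash seed. */
    private int indexFor(Object item, int seed) {
        byte[] bytes = Objects.toString(item).getBytes(StandardCharsets.UTF_8);
        // floorMod keeps the index non-negative; Math.abs(Integer.MIN_VALUE) is still negative.
        return Math.floorMod(MurmurHash.hash(bytes, seed), size);
    }

    public void add(Object item) {
        for (int i = 0; i < hashCount; i++) {
            bitArray.set(indexFor(item, i));
        }
    }

    public boolean check(Object item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bitArray.get(indexFor(item, i))) {
                return false;
            }
        }

        return true;
    }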

    public float getFalsePositiveProb() {
        return falsePositiveProb;
    }

    public int getSize() {
        return size;
    }

    public int getHashCount() {
        return hashCount;
    }
}