布隆过滤器（Bloom Filter）原理解释及代码详细分析-CSDN博客

本文链接：https://blog.csdn.net/2403_88929394/article/details/148010647

布隆过滤器概述

布隆过滤器（Bloom Filter）是一种空间效率极高的概率型数据结构，由 Burton Howard Bloom 在 1970 年提出。它的核心功能是判断一个元素是否存在于集合中，具有以下特点：

优点：空间效率和查询时间远超传统算法（如哈希表）
缺点：存在误判率（False Positive），但不会漏判（False Negative）
应用场景：URL 过滤（如网页爬虫）、缓存穿透防护、垃圾邮件过滤等

代码实现解析

哈希函数模块

字符串哈希函数是将任意长度的字符串映射为固定长度的整数的算法，在数据结构（如哈希表、布隆过滤器）中起着关键作用。以下详细解析代码中实现的 11 种哈希函数：

1. BKDR Hash

template <class T>
size_t BKDRHash(const T* str)
{
    size_t hash = 0;
    while (size_t ch = (size_t)*str++)
    {
        hash = hash * 131 + ch; 
    }
    return hash;
}

特点：

乘法因子 131：实验证明该值能有效降低哈希冲突
简单高效：位运算少，适合大规模数据处理
应用广泛：被广泛用于各种哈希表实现中

2. SDBM Hash

template<class T>
size_t SDBMHash(const T* str)
{
    size_t hash = 0;
    while (size_t ch = (size_t)*str++)
    {
        hash = 65599 * hash + ch;
    }
    return hash;
}

特点：

乘法因子 65599：特殊质数，能有效分散哈希值
类似数据库索引：最初用于 SDBM 数据库系统
递推关系：hash(i) = hash(i-1) * 65599 + str[i]

3. RS Hash

template<class T>
size_t RSHash(const T* str)
{
    size_t hash = 0;
    size_t magic = 63689;
    while (size_t ch = (size_t)*str++)
    {
        hash = hash * magic + ch;
        magic *= 378551;
    }
    return hash;
}

特点：

动态乘法因子：每次迭代更新 magic 值
权重递减：越前面的字符对哈希值影响越大
适用于长字符串：能有效处理长文本的哈希计算

4. AP Hash

template<class T>
size_t APHash(const T* str)
{
    size_t hash = 0;
    size_t ch;
    long i = 0;
    while ((ch = (size_t)*str++) != 0)
    {
        if ((i & 1) == 0)
        {
            hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
        }
        else
        {
            hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
        }
        i++;
    }
    return hash;
}

特点：

奇偶位不同处理：根据字符位置采用不同的位运算
复杂异或操作：通过多次异或和移位增加随机性
平衡性好：能在不同数据集上保持较低冲突率

5. JS Hash

template<class T>
size_t JSHash(const T* str)
{
    if (!*str) return 0;
    size_t hash = 1315423911;
    while (size_t ch = (size_t)*str++)
    {
        hash ^= ((hash << 5) + ch + (hash >> 2));
    }
    return hash;
}

特点：

初始值较大：使用质数 1315423911 作为初始值
混合运算：结合移位、加法和异或操作
良好的雪崩效应：输入的微小变化会导致哈希值大幅变化

6. DEK Hash

template<class T>
size_t DEKHash(const T* str)
{
    if (!*str) return 0;
    size_t hash = 1315423911;
    while (size_t ch = (size_t)*str++)
    {
        hash = ((hash << 5) ^ (hash >> 27)) ^ ch;
    }
    return hash;
}

特点

类似旋转哈希：通过左右移位结合异或操作
旋转因子 5 和 27：根据数据类型大小自动调整
适合分布式系统：在分布式哈希表中表现良好

7. FNV Hash

template<class T>
size_t FNVHash(const T* str)
{
    if (!*str) return 0;
    size_t hash = 2166136261;
    while (size_t ch = (size_t)*str++)
    {
        hash *= 16777619;
        hash ^= ch;
    }
    return hash;
}

特点：

FNV 算法：由 Glenn Fowler、Landon Curt Noll 和 Kiem-Phong Vo 共同设计
质数乘法：使用质数 16777619 进行乘法运算
快速且分布均匀：被广泛用于哈希表和校验和计算

8. DJB Hash

template<class T>
size_t DJBHash(const T* str)
{
    if (!*str) return 0;
    size_t hash = 5381;
    while (size_t ch = (size_t)*str++)
    {
        hash += (hash << 5) + ch;
    }
    return hash;
}

特点：

初始值 5381：由 Daniel J. Bernstein 选择的特殊值
位移运算优化：hash * 33被优化为hash + (hash << 5)
在短字符串上表现优异：常用于小型字符串的哈希计算

9. DJB2 Hash

template<class T>
size_t DJB2Hash(const T* str)
{
    if (!*str) return 0;
    size_t hash = 5381;
    while (size_t ch = (size_t)*str++)
    {
        hash = hash * 33 ^ ch;
    }
    return hash;
}

特点：

DJB 算法变种：与 DJB Hash 类似但使用异或操作
乘法因子 33：实验证明该值能有效降低冲突率
简单高效：被广泛用于各种哈希表实现

10. PJW Hash

template<class T>
size_t PJWHash(const T* str)
{
    static const size_t TotalBits = sizeof(size_t) * 8;
    static const size_t ThreeQuarters = (TotalBits * 3) / 4;
    static const size_t OneEighth = TotalBits / 8;
    static const size_t HighBits = ((size_t)-1) << (TotalBits - OneEighth);
    size_t hash = 0;
    size_t magic = 0;
    while (size_t ch = (size_t)*str++)
    {
        hash = (hash << OneEighth) + ch;
        if ((magic = hash & HighBits) != 0)
        {
            hash = ((hash ^ (magic >> ThreeQuarters)) & (~HighBits));
        }
    }
    return hash;
}

特点：

基于位操作：通过移位和掩码处理哈希值
高位移位处理：防止哈希值溢出并保持分布均匀
Unix 系统：最初用于 Unix 系统中的哈希表实现

11. ELF Hash

template<class T>
size_t ELFHash(const T* str)
{
    static const size_t TotalBits = sizeof(size_t) * 8;
    static const size_t ThreeQuarters = (TotalBits * 3) / 4;
    static const size_t OneEightn = TotalBits / 8;
    static const size_t HighBits = ((size_t)-1) << (TotalBits - OneEightn);
    size_t hash = 0;
    size_t magic = 0;
    while (size_t ch = (size_t)*str++)
    {
        hash = (hash << OneEightn) + ch;
        if ((magic = hash & HighBits) != 0)
        {
            hash ^= (magic >> ThreeQuarters);
            hash &= ~magic;
        }
    }
    return hash;
}

特点：

类 PJW 算法：与 PJW Hash 类似但采用不同的处理方式
异或和掩码操作：有效处理哈希冲突
在 UNIX ELF 格式：最初用于 Unix 系统的 ELF 文件格式

哈希函数性能对比

哈希函数	速度	分布均匀性	冲突率	适用场景
BKDRHash	快	中	中	通用场景
SDBMHash	快	中	中	数据库索引
RSHash	中	高	低	长字符串
APHash	中	高	低	通用场景
FNVHash	快	高	低	校验和计算
DJBHash	极快	中	中	短字符串
PJWHash	慢	高	低	UNIX 系统哈希表

布隆过滤器核心实现

布隆过滤器核心实现布隆过滤器的核心是一个位图（BitArray）和多个哈希函数：

class BloomFilter
{
private:
    int bitSize_;         // 位图大小
    vector<int> bitMap_;  // 位图数组（每个int包含32位）

public:
    BloomFilter(int size = 1471) : bitSize_(size)
    {
        bitMap_.resize(bitSize_ / 32 + 1);
    }

    // 添加元素到布隆过滤器
    void setBit(const char* str);

    // 查询元素是否可能存在
    bool getbit(const char* str);
};

添加操作（setBit）

void setBit(const char* str)
{
    // 计算3个哈希值
    int idx1 = BKDRHash(str) % bitSize_;
    int idx2 = RSHash(str) % bitSize_;
    int idx3 = APHash(str) % bitSize_;

    // 将对应位设置为1
    int index = idx1 / 32;
    int offset = idx1 % 32;
    bitMap_[index] |= (1 << offset);

    // 重复操作另外两个哈希值...
}

查询操作（getbit）

bool getbit(const char* str)
{
    // 计算3个哈希值
    int idx1 = BKDRHash(str) % bitSize_;
    int idx2 = RSHash(str) % bitSize_;
    int idx3 = APHash(str) % bitSize_;

    // 检查对应位是否都为1
    int index = idx1 / 32;
    int offset = idx1 % 32;
    if (0 == (bitMap_[index] & (1 << offset))) return false;

    // 检查另外两个位...

    return true;
}

黑名单应用层

代码中通过Blacklist类封装布隆过滤器，提供更易用的接口：

int main()
{
    Blacklist list;
    list.add("http://www.baidu.com");
    list.add("http://www.360.com");
    list.add("http://www.tmall.com");
    list.add("http://www.tencent.com");
    list.add("aba");
    
    string url = "b";
    cout << list.query(url); // 输出0（不存在）或1（可能存在）
}