An Analysis of LevelDB's Bloom Filter Implementation

saddlesad

Article link: https://blog.csdn.net/saddlesad/article/details/122520265

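LevelDB's Bloom filter implementation

A Bloom filter is a data structure that tests whether an element may already exist (in the next, slower layer of storage). It can report an absent element as present, but it never reports a present element as absent. So if the Bloom filter says an element is absent, it really is absent; if it says the element is present, the element is only possibly present.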

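Principle

Underneath, a Bloom filter is a bit array.

Each time an element is added, the Bloom filter applies k hash functions to it and gets k hash values. Each hash value modulo the bit-array length gives one position in the bit array, and the filter sets all k of these positions to 1.

Each time an element is looked up (find), the Bloom filter uses the same k hash functions to locate the element's k positions. If all k positions are 1, the element may exist; otherwise it definitely does not exist.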
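The add and find steps described above can be sketched in a few lines. This is a minimal teaching sketch, not LevelDB's code: the class name SimpleBloom and the trick of salting std::hash with the probe index to imitate k independent hash functions are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical teaching class; not part of LevelDB.
class SimpleBloom {
 public:
  SimpleBloom(std::size_t bits, std::size_t k) : bits_(bits, false), k_(k) {
    assert(bits > 0 && k > 0);
  }

  // add: set the k positions derived from the key to 1.
  void Add(const std::string& key) {
    for (std::size_t i = 0; i < k_; ++i) bits_[Position(key, i)] = true;
  }

  // find: false means "definitely absent", true means "possibly present".
  bool MayContain(const std::string& key) const {
    for (std::size_t i = 0; i < k_; ++i) {
      if (!bits_[Position(key, i)]) return false;
    }
    return true;
  }

 private:
  // The i-th "hash function": std::hash over the key prefixed with a salt
  // character that depends on i (an assumption made for brevity).
  std::size_t Position(const std::string& key, std::size_t i) const {
    std::string salted(1, static_cast<char>('A' + i));
    salted += key;
    return std::hash<std::string>{}(salted) % bits_.size();
  }

  std::vector<bool> bits_;
  std::size_t k_;
};
```

An added key is always reported as possibly present, and an empty filter reports every key as absent. A key that was never added may still be reported as present if other keys happen to have set all of its k positions.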

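Note: a mathematical result is that, for a fixed bit-array length and number of elements, the false positive rate is lowest when the number of hash functions is k = ln 2 × (bit-array length / number of elements).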
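For readers who want the numbers behind this result, here is the standard approximation worked out. The symbols m (bit-array length), n (number of elements) and p (false positive rate) are introduced here for illustration, and the formula assumes the hash functions behave as independent and uniform.

```latex
% False positive rate after inserting n elements into m bits with k hashes
p \approx \left(1 - e^{-kn/m}\right)^{k}

% Minimizing over k gives
k_{\mathrm{opt}} = \frac{m}{n}\ln 2,
\qquad
p_{\min} = \left(\tfrac{1}{2}\right)^{k_{\mathrm{opt}}} \approx 0.6185^{\,m/n}

% Example with m/n = 10 bits per key
k_{\mathrm{opt}} \approx 6.93, \qquad p_{\min} \approx 0.0082

% With k rounded down to 6
p \approx \left(1 - e^{-0.6}\right)^{6} \approx 0.0084
```

So at 10 bits per key, rounding k down to 6 costs very little accuracy while saving one probe per lookup.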

Implementation

I extracted the BloomFilterPolicy class from util/bloom.cc, so the code below can run without the rest of LevelDB.

Hash function:

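#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using namespace std;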
inline uint32_t DecodeFixed32(const char* ptr) {
  const uint8_t* const buffer = reinterpret_cast<const uint8_t*>(ptr);

  // Recent clang and gcc optimize this to a single mov / ldr instruction.
  return (static_cast<uint32_t>(buffer[0])) |
         (static_cast<uint32_t>(buffer[1]) << 8) |
         (static_cast<uint32_t>(buffer[2]) << 16) |
         (static_cast<uint32_t>(buffer[3]) << 24);
}

static uint32_t Hash(const char* data, size_t n, uint32_t seed) {
  // Similar to murmur hash
  const uint32_t m = 0xc6a4a793;
  const uint32_t r = 24;
  const char* limit = data + n;
  uint32_t h = seed ^ (n * m);

  // Pick up four bytes at a time
  while (data + 4 <= limit) {
    uint32_t w = DecodeFixed32(data);
    data += 4;
    h += w;
    h *= m;
    h ^= (h >> 16);
  }

  // Pick up remaining bytes
  switch (limit - data) {
    case 3:
      h += static_cast<uint8_t>(data[2]) << 16;
      // fall through (intentional)
    case 2:
      h += static_cast<uint8_t>(data[1]) << 8;
      // fall through (intentional)
    case 1:
      h += static_cast<uint8_t>(data[0]);
      h *= m;
      h ^= (h >> r);
      break;
  }
  return h;
}

static uint32_t BloomHash(const string &key)
{
    return Hash(key.data(), key.size(), 0xbc9f1d34);
}

Core Bloom filter implementation:

class BloomFilterPolicy
{
public:
    explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key)
    {
        // We intentionally round down to reduce probing cost a little bit
        k_ = static_cast<size_t>(bits_per_key * 0.69); // 0.69 =~ ln(2)
        if (k_ < 1)
            k_ = 1;
        if (k_ > 30)
            k_ = 30;
    }

    const char *Name() const { return "leveldb.BuiltinBloomFilter2"; }

    void CreateFilter(const string *keys, int n, std::string *dst) const
    {
        // Compute bloom filter size (in both bits and bytes)
        size_t bits = n * bits_per_key_;

        // For small n, we can see a very high false positive rate.  Fix it
        // by enforcing a minimum bloom filter length.
        if (bits < 64)
            bits = 64;

        size_t bytes = (bits + 7) / 8;
        bits = bytes * 8;

        const size_t init_size = dst->size();
        dst->resize(init_size + bytes, 0);
        dst->push_back(static_cast<char>(k_)); // Remember # of probes in filter
        char *array = &(*dst)[init_size];
        for (int i = 0; i < n; i++)
        {
            // Use double-hashing to generate a sequence of hash values.
            // See analysis in [Kirsch,Mitzenmacher 2006].
            uint32_t h = BloomHash(keys[i]);
            const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
            // The j-th position to set is (h + j * delta) % bits, for k_
            // positions in total. Stepping by delta replaces k separate
            // hash computations with a single one.
            for (size_t j = 0; j < k_; j++)
            {
                const uint32_t bitpos = h % bits; // one of this key's bit positions
                array[bitpos / 8] |= (1 << (bitpos % 8));
                h += delta;
            }
        }
    }

    bool KeyMayMatch(const string &key, const string &bloom_filter) const
    {
        const size_t len = bloom_filter.size();
        if (len < 2)
            return false;

        const char *array = bloom_filter.data();
        const size_t bits = (len - 1) * 8;

        // Use the encoded k so that we can read filters generated by
        // bloom filters created using different parameters.
        const size_t k = array[len - 1];
        if (k > 30)
        {
            // Reserved for potentially new encodings for short bloom filters.
            // Consider it a match.
            return true;
        }

        uint32_t h = BloomHash(key);
        const uint32_t delta = (h >> 17) | (h << 15); // Rotate right 17 bits
        for (size_t j = 0; j < k; j++)
        {
            const uint32_t bitpos = h % bits; // one of this key's bit positions
            if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0)
                return false;
            h += delta;
        }
        return true;
    }

private:
    size_t bits_per_key_; // bit-array length / number of elements
    size_t k_; // number of hash functions, i.e. number of bits probed per key
};
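To make the double-hashing step easier to see in isolation, the sketch below lists the bit positions that one base hash value maps to. ProbePositions is a hypothetical helper written for this article, not a LevelDB function; it only repeats the delta arithmetic used in CreateFilter and KeyMayMatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in LevelDB): return the k bit positions that a
// 32-bit base hash h maps to in a filter that is `bits` bits long.
std::vector<std::uint32_t> ProbePositions(std::uint32_t h, std::size_t k,
                                          std::size_t bits) {
  assert(bits > 0);
  const std::uint32_t delta = (h >> 17) | (h << 15);  // rotate right 17 bits
  std::vector<std::uint32_t> positions;
  positions.reserve(k);
  for (std::size_t j = 0; j < k; ++j) {
    positions.push_back(static_cast<std::uint32_t>(h % bits));
    h += delta;  // unsigned overflow wraps around, which is well defined
  }
  return positions;
}
```

For example, ProbePositions(0x12345678, 6, 64) starts with positions 56 and 18. Because every position depends only on h and delta, the filter needs one real hash computation per key regardless of k.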

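Note that LevelDB's Bloom filter interface differs from that of an ordinary Bloom filter. A typical Bloom filter has an add interface that inserts one element at a time. In LevelDB, a Bloom filter serves only the keys of a single sstable, and those keys are all available up front, so it provides a CreateFilter interface that adds n keys to the filter in one call.

In addition, CreateFilter requires the caller to supply a std::string to hold the bit array. The filter bytes are appended to that string, followed by one extra byte that records k.

Test code: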
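As a rough check on how much data CreateFilter appends to the caller's string, the size arithmetic can be pulled out on its own. FilterBytes is a hypothetical helper, not part of LevelDB; it mirrors the rounding in the listing above and adds the one trailing byte that stores k.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in LevelDB): number of bytes that CreateFilter
// appends to dst for n keys at bits_per_key bits each.
std::size_t FilterBytes(std::size_t n, std::size_t bits_per_key) {
  assert(bits_per_key > 0);
  std::size_t bits = n * bits_per_key;
  if (bits < 64) bits = 64;            // enforce the minimum filter length
  std::size_t bytes = (bits + 7) / 8;  // round up to whole bytes
  return bytes + 1;                    // plus one byte that records k
}
```

With the six keys and 10 bits per key used in the test code, 60 bits are raised to the 64-bit minimum, so the filter occupies 8 bytes plus the trailing k byte, 9 bytes in total.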

int main() {
    BloomFilterPolicy policy(10);
    cout << policy.Name() << endl;
    
    vector<string> strs{"hello", "world", "fuck", "i", "5432", "helofxx"};

    std::string dst;
    policy.CreateFilter(strs.data(), strs.size(), &dst);

    cout << boolalpha;
    cout << policy.KeyMayMatch("hello", dst) << endl;
    cout << policy.KeyMayMatch("y", dst) << endl;
    cout << policy.KeyMayMatch("234", dst) << endl;
}
