Bloom Filter Algorithm Implementation

cunchi8090

Original article: https://www.experts-exchange.com/articles/31075/Bloom-Filters.html

In general, the worst-case scenario when searching through a data set is when the datum being searched for doesn’t exist. In this case, the complete data store needs to be searched before it’s possible to conclude that the datum cannot be found. If only there were a way to eliminate unnecessary searches when we know the data won’t be found. Fortunately, there is a data structure that allows you to do just that. This data structure is known as a “Bloom Filter”.

So, how does this witchcraft work, I can hear you ask? Well, a Bloom Filter is a probabilistic data structure that can tell you if an item definitely doesn’t exist in a data set. What it can’t tell you with certainty is if an item does exist. This clever data structure was invented by a chap called Burton Howard Bloom circa 1970. The power of a Bloom Filter is that checking it for the existence of a datum is significantly faster than checking a complete data store. For those who are into Big O notation, the time complexity of a check on a Bloom Filter is O(k), where k is the number of hash functions used, and therefore the number of bits set or tested per item (see below); it does not depend on how many items have been added.

The fact that we can know for definite that an item doesn’t exist in a data store means we don’t have to waste our time searching for something we’re definitely not going to find. Unfortunately, that is the only thing we can know for sure. We can get false positive hits, which means the data might exist, and in these cases a search of the data store is still necessary. With careful usage of a Bloom Filter, we can avoid performing expensive searches when we know the data won’t be found. Neat, eh?
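
The pattern described here, asking the filter first and only paying for the real search on a "maybe", could be sketched roughly as follows. This is only an illustration under stated assumptions: the names `tiny_filter`, `guarded_store` and `slow_lookups` are hypothetical, a salted `std::hash` stands in for a set of properly independent hash functions, and an `std::unordered_set` stands in for the expensive data store.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical 1024-bit filter. std::hash over a salted key is only a
// stand-in for a set of properly independent hash functions.
class tiny_filter
{
public:
    void add(std::string const & s)
    {
        for (std::size_t salt = 0; salt < hash_count; ++salt)
            bits_.set(index(s, salt));
    }

    bool might_contain(std::string const & s) const
    {
        for (std::size_t salt = 0; salt < hash_count; ++salt)
            if (!bits_.test(index(s, salt)))
                return false;   // definitely never added
        return true;            // possibly added
    }

private:
    static constexpr std::size_t bit_count = 1024;
    static constexpr std::size_t hash_count = 3;

    static std::size_t index(std::string const & s, std::size_t salt)
    {
        return std::hash<std::string>{}(s + char('A' + salt)) % bit_count;
    }

    std::bitset<bit_count> bits_;
};

// Hypothetical wrapper: the unordered_set plays the role of the slow store.
class guarded_store
{
public:
    void insert(std::string const & s)
    {
        filter_.add(s);
        store_.insert(s);
    }

    bool contains(std::string const & s)
    {
        if (!filter_.might_contain(s))
            return false;            // skip the expensive search entirely
        ++slow_lookups;              // a "maybe": the real search is needed
        return store_.count(s) != 0;
    }

    std::size_t slow_lookups = 0;    // how often the store was searched

private:
    tiny_filter filter_;
    std::unordered_set<std::string> store_;
};
```

Note that `contains` still returns the correct answer when the filter reports a false positive; the only cost is one search that turned out to be unnecessary.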

The principle of how a Bloom Filter works is quite simple. When an item is “added” to the Bloom Filter, a “bit pattern” that is very likely to be distinctive is generated using a series of hash functions (one hash function for each bit), and the pattern is written into a single bit vector. The same bit vector is used to store all bit patterns, hence the possibility of a false positive. When checking whether an item has been added to the Bloom Filter, we check whether every bit of the same pattern is set in the bit vector. If any bit isn’t, then we know the item was never added to the filter and so won't be found in the data store. If all of them are set, the item may exist in the data store and so a full search is required.

The reason we can’t know for sure that an item was added is that, over time, as more items are added to the filter, there is a growing chance of a collision on the bit pattern for a particular item: one or more other items may have set the same bits (either singly or as a group). This means we cannot say for sure that an item does exist, only that it doesn’t: if any bit in the pattern for a particular item isn’t set, then it cannot have been added.

Let’s consider an example. Let’s assume we have a very simple Bloom Filter with a 16-bit vector (normally we'd use many more bits than this). We’re going to add three numbers, 21, 34 and 57, and each number will set 3 bits (note that these bits are just an example; the actual number and positions of the bits set in a real Bloom Filter will depend on the number and type of hash functions used):

+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| bits | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  21  |   |   | X |   |   |   |   | X | X |   |   |   |   |   |   |   | 
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  34  |   |   |   |   | X |   |   |   |   | X |   |   | X |   |   |   | 
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  57  |   |   |   |   |   |   |   |   | X |   |   | X |   |   |   | X | 
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Now, let’s assume we want to see if number 85 is in the set. This generates the following bit pattern:

+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| bits | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  85  |   |   | X |   |   |   |   |   |   |   |   |   | X |   | X |   | 
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

Bit 2 is set in the bit vector (by 21) and bit 12 (column C) is set (by 34), but bit 14 (column E) is not set, so we know for certain that 85 is not in there; we can be sure it doesn’t exist in the data set.

Now, let’s do the same for 91.

+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| bits | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  91  |   |   |   |   |   |   |   | X |   |   |   | X |   |   |   | X | 
+------+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

In this case, we have a collision of bits that just happen to match bits set by 21 and 57, which means 91 might exist in the data set and so a full search of the data store is necessary. Notice that we’ve never actually added 91, but because of the collision caused by the bits set when adding 21 (bit 7) and 57 (bits 11 and 15) we cannot rule out that 91 might exist; hence we have no choice but to search the data store to see whether this number exists.
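
The worked example can be replayed in a few lines of code. The sketch below simply hard-codes the bit positions marked X in the tables rather than computing them with hash functions; the helper names `pattern`, `might_contain` and `example_filter` are made up for this illustration.

```cpp
#include <bitset>
#include <initializer_list>

using bits16 = std::bitset<16>;

// Build a 16-bit pattern from the bit positions marked X in the tables.
inline bits16 pattern(std::initializer_list<unsigned> positions)
{
    bits16 b;
    for (auto p : positions)
        b.set(p);
    return b;
}

// An item may be present only if every bit of its pattern is set.
inline bool might_contain(bits16 const & filter, bits16 const & item)
{
    return (filter & item) == item;
}

// The filter after adding 21, 34 and 57 (positions copied from the tables).
inline bits16 example_filter()
{
    return pattern({2, 7, 8})       // 21
         | pattern({4, 9, 12})      // 34
         | pattern({8, 11, 15});    // 57
}
```

With these definitions, `might_contain(example_filter(), pattern({2, 12, 14}))` is false for 85 because bit 14 is clear, while `might_contain(example_filter(), pattern({7, 11, 15}))` is true for 91 even though 91 was never added.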

This is why a Bloom Filter is probabilistic. We can say an item definitely doesn’t exist or it might exist. This doesn’t help us in the case where data might exist, but it can save us performing unnecessary expensive searches when we know the data definitely doesn’t exist. That’s the job of a Bloom Filter, to allow us to identify those cases where data definitely doesn’t exist, hopefully saving us from performing an expensive search.

Now, for a Bloom Filter to be useful, we want to avoid as many collisions as we can, and there is a trade-off between the number of bits we set when adding a value (a Bloom Filter can work with any data, not just numbers), the size of the bit vector and the number of values we add. Up to a point, the more bits you set the less chance of a false positive; however, every bit set also fills the vector, so the smaller your bit vector (or the more bits you set per value) the quicker the space becomes polluted and the greater the chance of a false positive.
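
This trade-off can be written down under the usual idealising assumption that the hash functions are independent and uniformly distributed. The symbols m (bits in the vector), n (values added) and p (estimated false positive rate) are introduced here for illustration; k is the number of hash functions, as before.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2
```

Read this as an estimate rather than a guarantee. Increasing k lowers p only until k reaches roughly k_opt; beyond that, each value sets so many bits that the vector fills up and p rises again.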

There are standard formulas that estimate a suitable bit vector size and number of hash functions from the expected number of items and an acceptable false positive rate, but they assume ideal hash functions, so some trial and error is still necessary. Experiment with a representative sample of your data set to try and find the ideal number of hashes vs. the size of your bit vector. As a rule of thumb, the more data you inject into the filter and the more bits you set, the larger the bit field needs to be to avoid collisions.
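
As a starting point for that experimentation, the widely used textbook approximations for sizing a Bloom Filter can be computed directly. The sketch below assumes ideal (independent, uniform) hash functions, which real ones only approximate, and the function names `suggested_bit_count` and `suggested_hash_count` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Bits suggested for n items at a target false positive rate p,
// using the approximation m = -n * ln(p) / (ln 2)^2.
inline std::size_t suggested_bit_count(std::size_t n, double p)
{
    double const ln2 = std::log(2.0);
    double const m = -(static_cast<double>(n) * std::log(p)) / (ln2 * ln2);
    return static_cast<std::size_t>(std::ceil(m));
}

// Hash functions suggested for m bits and n items:
// k = (m / n) * ln 2, rounded to the nearest integer and never below 1.
inline std::size_t suggested_hash_count(std::size_t m, std::size_t n)
{
    double const k = static_cast<double>(m) / static_cast<double>(n) * std::log(2.0);
    return k < 1.0 ? 1 : static_cast<std::size_t>(std::lround(k));
}
```

For example, 1,000 items at a 1% target rate come out at a little under 9,600 bits and 7 hash functions. Treat such numbers as an initial guess to be validated against a sample of your own data.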

To set the bits when adding a new item to the Bloom Filter, it is necessary to use different hash functions, one for each bit. Each hash function generates its own hash value for the datum, and the bit position is then obtained by taking that hash modulo the number of bits in the bit vector. Because each hash function should produce a different value, each will normally set a different bit. For example, by using three different hash functions we can set up to three different bits (two hashes can occasionally land on the same bit after the modulo).
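
The step from hash value to bit position can be sketched in isolation. In the illustration below, `bit_positions` and `hash_fn` are hypothetical names, and `djb` and `fnv` are compact free-function versions of two of the simple hashes used in the full listing later in this article.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <string>
#include <vector>

using hash_fn = std::function<std::uint32_t(std::string const &)>;

// Two simple byte-wise hashes, written as free functions for brevity.
inline std::uint32_t djb(std::string const & s)
{
    std::uint32_t h = 0;
    for (unsigned char c : s) h = 33 * h + c;
    return h;
}

inline std::uint32_t fnv(std::string const & s)
{
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) h = (h * 16777619u) ^ c;
    return h;
}

// Map each hash value onto a bit position with a modulo. The set collapses
// duplicates, which occur when two hashes land on the same bit.
// bit_count must be non-zero.
inline std::set<std::size_t> bit_positions(std::string const & key,
                                           std::vector<hash_fn> const & hashers,
                                           std::size_t bit_count)
{
    std::set<std::size_t> positions;
    for (auto const & h : hashers)
        positions.insert(h(key) % bit_count);
    return positions;
}
```

Calling `bit_positions("hello", {djb, fnv}, 128)` yields one or two positions below 128. The returned set can be smaller than the number of hash functions, which is why k hash functions set at most k bits.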

Implementing a Bloom Filter in C++ is pretty simple; the tricky part is deciding which hash functions to use. This decision needs to be made while implementing the Bloom Filter, and it’s a good idea to test a number of different hash functions to ensure they give an even distribution of bits within the bit vector for the type of data you plan to add to the filter. You can filter any type of data you like; the only requirement is that the data is suitable for hashing.
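
One rough way to test a candidate hash function for even distribution is to count how often each bit position is chosen over a representative sample of keys. The sketch below is a crude check, not a rigorous statistical test, and the names `bit_histogram` and `worst_to_mean` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Count how often each bit position is chosen by one hash function
// over a sample of keys. bit_count must be non-zero.
inline std::vector<std::size_t> bit_histogram(
    std::function<std::uint32_t(std::string const &)> const & hasher,
    std::vector<std::string> const & sample,
    std::size_t bit_count)
{
    std::vector<std::size_t> counts(bit_count, 0);
    for (auto const & key : sample)
        ++counts[hasher(key) % bit_count];
    return counts;
}

// Crude evenness score: the busiest position divided by the ideal average
// load. Values near 1 suggest an even spread; large values suggest clustering.
inline double worst_to_mean(std::vector<std::size_t> const & counts,
                            std::size_t sample_size)
{
    if (counts.empty() || sample_size == 0)
        return 0.0;
    double const mean = static_cast<double>(sample_size) / counts.size();
    return *std::max_element(counts.begin(), counts.end()) / mean;
}
```

A hash that sends every key to the same position scores as high as the number of bits, whereas a well-behaved hash over a large sample should score much closer to 1.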

Once you’ve identified suitable hash functions, it’s just a case of deciding how large your bit vector needs to be and then storing the bit pattern for each datum added in the bit vector. When performing a lookup in the filter, we just need to regenerate the bit pattern and see if it exists in the bit vector. If it doesn’t, then the datum does not exist in our storage; if it does, then the datum might exist in the storage and so a full search will be necessary.

The full code for a very simple Bloom Filter can be found below. Although there is quite a lot of code, most of this is just the implementation of a few simple hash functions. The main guts of the Bloom Filter (see the bloom_filter class) is actually very straightforward. In this implementation, the hash functions are passed into the Bloom Filter’s constructor, which means the filter can use any number of hash functions. The size of the bit vector is fixed at 128, but this can be any size and can even be decided at run-time. The only hard requirement is that it’s bigger than the number of hash functions; the actual size depends on how much data you plan to add to the filter.

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>
 
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
// Original hash function implementations and descriptions can be found here:
// http://www.eternallyconfuzzled.com/tuts/algorithms/jsw_tut_hashing.aspx
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 
class hash
 {
public:
    virtual ~hash() = default;
 
    virtual uint32_t operator() (void const * key, size_t len) const = 0;
 
    uint32_t operator() (std::string const & s) const
    {
       return (*this)(s.c_str(), s.size());
    }
 };
 
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 
class djb_hash : public hash
 {
public:
    uint32_t operator() (void const * key, size_t len) const override
    {
       auto p = reinterpret_cast<unsigned char const *>(key);
       auto h = uint32_t(0);
 
       for(auto i = size_t(0); i < len; i++)
       {
          h = 33 * h + p[i];
       }
 
       return h;
    }
 
    using hash::operator();
 };
 
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 
class sax_hash : public hash
 {
public:
    uint32_t operator() (void const * key, size_t len) const override
    {
       auto p = reinterpret_cast<unsigned char const *>(key);
       auto h = uint32_t(0);
 
       for(auto i = size_t(0); i < len; i++)
       {
          h ^= (h << 5) + (h >> 2) + p[i];
       }
 
       return h;
    }
 
    using hash::operator();
 };
 
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 
class fnv_hash : public hash
 {
public:
    uint32_t operator() (void const * key, size_t len) const override
    {
       auto p = reinterpret_cast<unsigned char const *>(key);
       uint32_t h = 2166136261;
 
       for(auto i = size_t(0); i < len; i++)
       {
          h = (h * 16777619) ^ p[i];
       }
 
       return h;
    }
 
    using hash::operator();
 };
 
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 
class oat_hash : public hash
 {
public:
    uint32_t operator() (void const * key, size_t len) const override
    {
       auto p = reinterpret_cast<unsigned char const *>(key);
       auto h = uint32_t(0);
 
       for(auto i = size_t(0); i < len; i++)
       {
          h += p[i];
          h += (h << 10);
          h ^= (h >> 6);
       }
 
       h += (h << 3);
       h ^= (h >> 11);
       h += (h << 15);
 
       return h;
    }
 
    using hash::operator();
 };
 
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
// The actual Bloom Filter
class bloom_filter
 {
public:
    static size_t const bit_count = 128;
 
    bloom_filter(std::vector<std::shared_ptr<hash>> && hashers)
       : hashers_(std::move(hashers))
       , bits_(bit_count)
    {
    }
 
    void add(std::string const & s)
    {
       for(auto const & hasher : hashers_)
       {
          size_t idx = (*hasher)(s) % bit_count;
          bits_[idx] = true;
       }
    }
 
    bool exists(std::string const & s) const
    {
       for(auto const & hasher : hashers_)
       {
          size_t idx = (*hasher)(s) % bit_count;
          if(!bits_[idx])
          {
             return false;
          }
       }
 
       return true;
    }
 
private:
    std::vector<std::shared_ptr<hash>> hashers_;
    std::vector<bool> bits_;
 
 };
 
// =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 
int main()
 {
    auto && bfilt = bloom_filter{
       std::vector<std::shared_ptr<hash>>{
          std::make_shared<djb_hash>(),
          std::make_shared<sax_hash>(),
          std::make_shared<fnv_hash>(),
          std::make_shared<oat_hash>(),
       }
    };
 
    bfilt.add("hello");
    bfilt.add("world");
 
    // these should exist
    assert(bfilt.exists("hello"));
    assert(bfilt.exists("world"));
 
    // these were never added, so they should not exist (barring a false positive)
    assert(!bfilt.exists("foobar"));
    assert(!bfilt.exists("eggplant"));
 }
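
The listing above fixes `bit_count` at 128. As one possible variation, the size could be supplied at run time instead. The sketch below is not part of the original code: the class name `sized_bloom_filter` is hypothetical, and it takes `std::function` hashers rather than the `hash` class hierarchy purely to stay self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical variant of bloom_filter whose bit vector size is chosen
// when the filter is constructed.
class sized_bloom_filter
{
public:
    using hash_fn = std::function<std::uint32_t(std::string const &)>;

    sized_bloom_filter(std::size_t bit_count, std::vector<hash_fn> hashers)
        : hashers_(std::move(hashers))
        , bits_(bit_count)
    {
        if (bit_count == 0 || hashers_.empty())
            throw std::invalid_argument("need at least one bit and one hash function");
    }

    void add(std::string const & s)
    {
        for (auto const & h : hashers_)
            bits_[h(s) % bits_.size()] = true;
    }

    bool exists(std::string const & s) const
    {
        for (auto const & h : hashers_)
            if (!bits_[h(s) % bits_.size()])
                return false;   // definitely never added
        return true;            // possibly added
    }

private:
    std::vector<hash_fn> hashers_;
    std::vector<bool> bits_;
};
```

The constructor rejects an empty bit vector because the modulo in `add` and `exists` would otherwise divide by zero.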

The hash functions used in the example are some popular ones that are simple to implement, and code for them can be found on the web. The choice of hash functions is really up to you, and you should be sure to choose ones that give a good distribution for your data. You can read more on the hashes I used in this example at the URL given in the comment at the top of the code.
