Cuckoo Filter

Latest recommended article published on 2024-02-08 22:28:19

guanyue.space

Article link: https://blog.csdn.net/qq_34620855/article/details/118161746

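Cuckoo: the bird lays its eggs in other birds' nests (kicking the other eggs out).
A hash-map variant: it uses a hash table to hold the data, but instead of storing the original items it stores only their fingerprints.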

Bloom Filter:

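No false negatives (FN): an element that is in the set is always reported as present.
False positives (FP) are possible: an element that is not in the set may be reported as present.

Inherent drawbacks: elements cannot be deleted {a bit that would be cleared may be shared by several elements}; as the number of elements grows, false positives (FP) become more likely.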
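
The two properties above can be seen in a minimal Bloom filter sketch. The class name, the bit-array size, and the use of salted SHA-256 as the hash family are all assumptions made for illustration; they are not taken from the notes.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus num_hashes salted hash functions."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{k}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # True only if every probed bit is set. Bits are never cleared,
        # so an added item can never be reported as absent (no FN).
        return all(self.bits[p] for p in self._positions(item))
```

There is deliberately no `remove` method: clearing a bit could also erase evidence of other items that hash to the same position, which is the deletion problem noted above.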

Hash Map

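Data to store: (key, value)

Insertion: buckets_position = hash(key). While a bucket has few collisions, the entry is appended to that bucket's linked list; once there are too many, the list is converted to a red-black tree (Java: when the number of elements in the list reaches static final int TREEIFY_THRESHOLD = 8;)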

Cuckoo Hash Tables

Original paper

Each item has two candidate buckets determined by hash functions $h_1(x)$ and $h_2(x)$.

When both candidate buckets are already occupied: the item selects one of the candidate buckets (e.g., bucket 6), kicks out the existing item (in this case “a”) and re-inserts this victim item to its own alternate location.

Cuckoo hashing ensures high space occupancy because it refines earlier item-placement decisions when inserting new items.

Most practical implementations of cuckoo hashing extend the basic description above by using buckets that hold multiple items, as suggested in.

With proper configuration of cuckoo hash table parameters, the table space can be 95% filled with high probability.
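
The kick-out procedure quoted above can be sketched as a small cuckoo hash table with one item per bucket. This is only an illustration under stated assumptions: the class and method names, the salted SHA-256 hash functions, and the `max_kicks` limit are hypothetical, and a real table would rehash or grow on failure instead of just returning `False`.

```python
import hashlib
import random


def _h(item, salt, m):
    # Salted SHA-256 stands in for the two hash functions h1 and h2 (assumption).
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


class CuckooHashTable:
    def __init__(self, num_buckets=16, max_kicks=50, seed=0):
        self.m = num_buckets
        self.max_kicks = max_kicks
        self.slots = [None] * num_buckets  # one item per bucket
        self.rng = random.Random(seed)

    def candidates(self, item):
        return _h(item, 1, self.m), _h(item, 2, self.m)

    def insert(self, item):
        i1, i2 = self.candidates(item)
        if item in (self.slots[i1], self.slots[i2]):
            return True  # already stored
        for i in (i1, i2):
            if self.slots[i] is None:
                self.slots[i] = item
                return True
        # Both candidates are occupied: evict a victim and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            item, self.slots[i] = self.slots[i], item  # item is now the victim
            a, b = self.candidates(item)
            i = b if i == a else a  # the victim's alternate bucket
            if self.slots[i] is None:
                self.slots[i] = item
                return True
        return False  # gave up; the item still in hand is not stored

    def contains(self, item):
        i1, i2 = self.candidates(item)
        return item in (self.slots[i1], self.slots[i2])
```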

Cuckoo Filter

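Unlike cuckoo hashing, a cuckoo filter hashes the item itself only once (partial-key cuckoo hashing); the second candidate position is derived from the first one and the fingerprint, so the two positions $i_1, i_2$ satisfy:
$i_1 = i_2 \oplus hash(fingerprint)$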
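
A minimal sketch of this relation follows. The helper names, the SHA-256-based hash, and reserving fingerprint value 0 for "empty" are assumptions for illustration. The sketch also assumes the number of buckets is a power of two, so that XOR followed by the modulo stays a self-inverse operation.

```python
import hashlib


def _hash(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def fingerprint(item: str, f_bits: int = 8) -> int:
    # Map to 1 .. 2^f_bits - 1 so that 0 can denote an empty entry (assumption).
    return _hash(b"fp:" + item.encode()) % ((1 << f_bits) - 1) + 1


def alt_index(i: int, f: int, num_buckets: int) -> int:
    # i XOR hash(f); num_buckets must be a power of two for this to be self-inverse.
    return (i ^ _hash(f.to_bytes(4, "big"))) % num_buckets


def index_pair(item: str, num_buckets: int, f_bits: int = 8):
    f = fingerprint(item, f_bits)
    i1 = _hash(item.encode()) % num_buckets
    i2 = alt_index(i1, f, num_buckets)
    return f, i1, i2
```

Because `alt_index` needs only the current bucket index and the stored fingerprint, a fingerprint can be relocated without access to the original item.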

It supports adding and removing items dynamically;
It provides higher lookup performance than traditional Bloom filters, even when close to full (e.g., 95% space utilized);
It is easier to implement than alternatives such as the quotient filter;
It uses less space than Bloom filters in many practical applications, if the target false positive rate $\epsilon$ is less than 3%.

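Insertion:

Description: each element e maps to two positions $i_1, i_2$. If $buckets[i_1]$ or $buckets[i_2]$ has an empty entry, the fingerprint is stored there directly. If both are full, pick $i = random(i_1, i_2)$, kick out a fingerprint $f$ stored at $i$, and move it to its alternate position $i' = i \oplus hash(f)$. If $i'$ is also full, the kicked-out fingerprint takes over an entry there in the same way, and so on. If the number of kicks exceeds $MaxNumKicks$, the insertion fails.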

Hashing the fingerprints ensures that these items can be relocated to buckets in an entirely different part of the hash table, hence reducing hash collisions and improving the table utilization.

Lookup:

Only the two candidate positions $i_1, i_2$ are checked. Notice that this ensures no false negatives as long as bucket overflow never occurs.
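
The insertion and lookup procedures described above can be sketched as follows. This is a toy version under stated assumptions, not the paper's implementation: the class name, the SHA-256-based hash, the default parameters, and the reserved fingerprint value 0 are all hypothetical, and the bucket count must be a power of two.

```python
import hashlib
import random


def _hash(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class CuckooFilter:
    def __init__(self, num_buckets=64, bucket_size=4, f_bits=8,
                 max_kicks=500, seed=0):
        assert num_buckets & (num_buckets - 1) == 0, "power of two required"
        self.m, self.b = num_buckets, bucket_size
        self.f_bits, self.max_kicks = f_bits, max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _fingerprint(self, item):
        # Values 1 .. 2^f_bits - 1; 0 is left unused (assumption).
        return _hash(b"fp:" + item.encode()) % ((1 << self.f_bits) - 1) + 1

    def _alt(self, i, f):
        return (i ^ _hash(f.to_bytes(4, "big"))) % self.m

    def _candidates(self, item):
        f = self._fingerprint(item)
        i1 = _hash(item.encode()) % self.m
        return f, i1, self._alt(i1, f)

    def insert(self, item):
        f, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(f)
                return True
        # Both buckets are full: kick out a random entry and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.b)
            f, self.buckets[i][slot] = self.buckets[i][slot], f
            i = self._alt(i, f)  # alternate bucket of the kicked-out fingerprint
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(f)
                return True
        return False  # MaxNumKicks exceeded: the filter is considered full

    def lookup(self, item):
        f, i1, i2 = self._candidates(item)
        return f in self.buckets[i1] or f in self.buckets[i2]
```

Note that when `insert` returns `False`, the fingerprint still in hand has been dropped, which is the "bucket overflow" situation under which the no-false-negative guarantee no longer holds.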

Deletion:

It also avoids the “false deletion” problem when two items share one candidate bucket and also have the same fingerprint.

For example, suppose both items x and y reside in bucket $i_1$ and collide on fingerprint f.
When deleting x, it does not matter if the process removes the copy of f added when inserting x or y.
After x is deleted, y is still perceived as a set member because there is a corresponding fingerprint left in either bucket $i_1$ or $i_2$.

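What happens when y is inserted after x is already in the set? { $hash(x)=hash(y)$ and $fingerprint(x)=fingerprint(y)$ }
Is another copy of the fingerprint simply added? Deletion then works, but if more such items collide than the two buckets can hold (three or more when each bucket has a single entry), insertion fails. And after x is deleted, wouldn't x still be reported as being in the set?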

This is the expected false-positive behavior of an approximate set membership data structure, and its probability remains bounded by $\epsilon$ .

Note that, to delete an item x safely, it must have been previously inserted.
Otherwise, deleting a non-inserted item might unintentionally remove a real, different item that happens to share the same fingerprint.
This requirement also holds true for all other deletion-supporting filters.
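
The x/y example above can be replayed on plain lists. The function names, the bucket indices, and the fingerprint value `0x5A` are made up for illustration; the point is only that deletion removes a single copy of the fingerprint.

```python
def delete(buckets, i1, i2, f):
    """Remove one copy of fingerprint f from bucket i1 or i2."""
    for i in (i1, i2):
        if f in buckets[i]:
            buckets[i].remove(f)  # list.remove drops only the first match
            return True
    return False


def lookup(buckets, i1, i2, f):
    return f in buckets[i1] or f in buckets[i2]


# x and y share candidate buckets 1 and 3 and the fingerprint 0x5A,
# so bucket 1 holds two copies of it after both were inserted.
buckets = [[], [0x5A, 0x5A], [], []]
i1, i2, f = 1, 3, 0x5A

deleted_x = delete(buckets, i1, i2, f)           # delete x
y_still_present = lookup(buckets, i1, i2, f)     # y is still found
deleted_y = delete(buckets, i1, i2, f)           # delete y
present_after_both = lookup(buckets, i1, i2, f)  # nothing left
```

Calling `delete` for an item that was never inserted but shares `f` and a bucket with y would remove y's copy, which is why the text requires that only previously inserted items be deleted.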

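Brief Analysis

Here we show that using partial-key cuckoo hashing to store fingerprints in a cuckoo filter leads to a lower bound on fingerprint sizes that grows slowly with the filter size.

The longer the fingerprint, the better items can be told apart, which limits the impact of hash collisions; the price is lower space efficiency.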

Minimum Fingerprint Size: $\varOmega(\log n / b)$

The candidate buckets for each item are not independent.

For example, assume an item can be placed in bucket $i_1$ or $i_2$.
According to $i_1=i_2 \oplus hash(fingerprint)$ , the number of possible values of $i_2$ is at most $2^f$ when using $f$-bit fingerprints.

For a table of $m$ buckets, when $2^f < m$ (or equivalently $f < \log_2 m$ bits), the choice of $i_2$ is only a subset of all $m$ buckets of the entire hash table. (Ideally, $i_2$ should be able to range over all $m$ buckets.)

If the hash table is very large and stores relatively short fingerprints, the chance of insertion failures will increase due to hash collisions, which may reduce the table occupancy.

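Probability that $q$ items collide:
Suppose $x$ has first bucket $i_1$ and fingerprint $t_x$.

For the other $q - 1$ items to have the same two buckets as $x$, they must have the same fingerprint {probability of an identical fingerprint: $\frac{1}{2^f}$ }.
In addition, their first bucket must be either $i_1$ or $i_1 \oplus hash(t_x)$ , which happens with probability $\frac{2}{m}$ .

Hence the probability that these $q$ items share the same two buckets is $(\frac{2}{m} \times \frac{1}{2^f})^{q-1}$ .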

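Now consider inserting $n$ random items into $m$ initially empty buckets, where $m = cn$ :

$c$ is a constant and the bucket size $b$ is constant. Whenever $q = 2b + 1$ items map to the same two buckets, the insertion fails. Each item maps to only two buckets, which together hold at most $2b$ fingerprints, so once $2b + 1$ such items are inserted one of them is always kicked out; the kicks loop forever and the insertion eventually fails.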

A given group of $q = 2b + 1$ items collides in this way with probability $(\frac{2}{2^f \cdot cn})^{2b}$ , so the expected number of such colliding groups among the $n$ items inserted into the $m$ buckets is:
$\binom{n}{2b+1}\bigg(\frac{2}{2^f \cdot m}\bigg)^{2b}=\binom{n}{2b+1}\bigg(\frac{2}{2^f \cdot cn}\bigg)^{2b}=\textcolor{red}{\varOmega \bigg(\frac{n}{4^{bf}}\bigg)}$

Here $\varOmega(\cdot)$ is the asymptotic lower-bound notation.

We conclude that $4^{bf}$ must be $\varOmega(n)$ to avoid a non-trivial probability of failure, as otherwise this expectation is $\varOmega(1)$ .
Therefore, the fingerprint size must be $\textcolor{red}{f=\varOmega(\frac{\log_4 n}{b})=\varOmega(\log n / b)}$ bits.
As long as we use reasonably sized buckets( $b$ ), the fingerprint size can remain small.
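
To get a feel for the bound, the expectation $\binom{n}{2b+1}(\frac{2}{2^f \cdot cn})^{2b}$ can be evaluated numerically. The function name and the sample parameters below are assumptions for illustration; the formula itself is the one derived above.

```python
from math import comb


def expected_colliding_groups(n, b, f, c=1.0):
    """Expected number of groups of 2b+1 items that share both buckets,
    for n items, m = c*n buckets, bucket size b and f-bit fingerprints."""
    m = c * n
    return comb(n, 2 * b + 1) * (2.0 / (2 ** f * m)) ** (2 * b)


if __name__ == "__main__":
    n, b = 1_000_000, 4
    for f in (1, 2, 4, 8):
        print(f"f={f}: {expected_colliding_groups(n, b, f):.3e}")
```

Each extra fingerprint bit divides the expectation by $4^b$, and for large $n$ doubling $n$ roughly doubles it, which is the $n / 4^{bf}$ behavior stated above.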

Recall that Bloom filters use a constant (approximately $\ln(1/\epsilon)$ bits) per item.

Empirical Evaluation:
After that {i.e., once f has been increased to 8}, increasing the fingerprint size has almost no return in terms of improving the load factor (but of course it reduces the false positive rate).

Overall, short fingerprints appear to suffice for realistic sized sets of items.
Once fingerprints exceed six bits, α approaches the “optimal load factor” that is obtained in experiments using two fully independent hash functions. {At around f = 6 the filter already matches what two independent hash functions achieve.}

Optimal Bucket Size

Keeping a cuckoo filter’s total size constant but changing the bucket size leads to two consequences:
1. Larger buckets improve table occupancy (i.e., higher b → higher α)
2. Larger buckets require longer fingerprints to retain the same false positive rate (i.e., higher b → higher f).

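When each bucket holds $b$ entries, a lookup compares against up to $2b$ stored fingerprints, so the probability of a false positive {an absent item reported as present} in one query is at most $1-(1-\frac{1}{2^f})^{2b}\approx 2b/2^f$ .

Given the target upper bound $\epsilon$ on the false positive rate, we need $\frac{2b}{2^f}\le \epsilon$ , that is:
$\textcolor{red}{f\ge \lceil \log_2(2b/\epsilon)\rceil=\lceil \log_2(1/\epsilon)+\log_2(2b)\rceil}$ bits.
At load factor $\alpha$ , the average space cost per item is then $C\le \frac{f}{\alpha}=\frac{\lceil \log_2(1/\epsilon)+\log_2(2b)\rceil}{\alpha}$ bits.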
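
As a worked example of these two formulas, the sketch below computes the minimum fingerprint size and the resulting cost per item. The function names are hypothetical, and the load factor of 0.95 used in the usage note is an assumption borrowed from the 95% occupancy figure quoted earlier.

```python
from math import ceil, log2


def min_fingerprint_bits(epsilon, b):
    """Smallest f with 2b / 2^f <= epsilon, i.e. f >= ceil(log2(2b / epsilon))."""
    return ceil(log2(2 * b / epsilon))


def bits_per_item(epsilon, b, alpha):
    """Upper bound on the average space cost per item: f / alpha."""
    return min_fingerprint_bits(epsilon, b) / alpha
```

For $\epsilon = 1\%$ and $b = 4$, `min_fingerprint_bits(0.01, 4)` gives 10 bits, and at an assumed load factor of 0.95 the cost is about 10.5 bits per item.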

Optimal bucket size b=4
when $\epsilon > 0.002$ , having two entries per bucket yields slightly better results than using four entries per bucket;
when $\epsilon$ decreases to $\epsilon \le 0.002$ , four entries per bucket minimizes space.

In summary, we choose (2, 4)-cuckoo filter (i.e., each item has two candidate buckets and each bucket has up to four fingerprints) as the default configuration,
because it achieves the best or close-to-best space efficiency for the false positive rates that most practical applications may be interested in.

Semi-sorting Buckets to Save Space

The following example illustrates how the compression saves space.
Assume each bucket contains b = 4 fingerprints and each fingerprint is f = 4 bits (more general cases will be discussed later).

An uncompressed bucket occupies 4×4 = 16 bits.
However, if we sort all four 4-bit fingerprints stored in this bucket (empty entries are treated as storing fingerprints of value “0”), there are only 3876 possible outcomes in total (the number of unique combinations with replacement).

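Following the algorithm above strictly, inserting a duplicate simply stores another copy. Counting the sorted buckets by how many distinct values they contain: four distinct entries give $\binom{16}{4}$ ; three distinct values (one value appears twice) give $\binom{16}{3}\times 3$ ; two distinct values give $\binom{16}{2}\times 3$ (split 3+1 in two ways, or 2+2); a single value gives $\binom{16}{1}$ .
$3876=\binom{16}{4}+\binom{16}{3}\times 3+\binom{16}{2}\times 3+\binom{16}{1}=\binom{19}{4}$
Each entry is 4 bits and each bucket has 4 entries, so there are 3876 possible sorted buckets in total.
Idea: build an index table and store the index instead of the actual bucket contents, to reduce space usage.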

If we precompute and store these 3876 possible bucket-values in an extra table, and replace the original bucket with an index into this precomputed table, then each original bucket can be represented by a 12-bit index ( $2^{12} = 4096 > 3876$ ) rather than 16 bits, saving 1 bit per fingerprint.

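However, this affects lookup efficiency, because each lookup must first translate the index back into the actual bucket contents via the table.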

Therefore, to achieve high lookup performance it is important to make the encoding/decoding table small enough to fit in cache.
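
The b = 4, f = 4 example above can be checked by enumerating every sorted bucket. The names `TABLE`, `INDEX`, `encode`, and `decode` are hypothetical, and a Python list and dict are used here purely for illustration, not as the compact encoding table a real filter would keep in cache.

```python
from itertools import combinations_with_replacement

B, F = 4, 4  # entries per bucket, bits per fingerprint

# All sorted buckets: multisets of B values drawn from 0 .. 2^F - 1.
TABLE = list(combinations_with_replacement(range(1 << F), B))
INDEX = {bucket: k for k, bucket in enumerate(TABLE)}
INDEX_BITS = (len(TABLE) - 1).bit_length()  # bits needed to store an index


def encode(bucket):
    """Map a bucket (any order, 0 = empty entry) to its table index."""
    return INDEX[tuple(sorted(bucket))]


def decode(index):
    """Recover the sorted bucket contents from a table index."""
    return TABLE[index]
```

With these parameters `len(TABLE)` is 3876 and `INDEX_BITS` is 12, versus 16 bits for the uncompressed bucket. Sorting discards the order of entries inside a bucket, which is harmless because lookups only test membership.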

COMPARISON WITH BLOOM FILTER

Cuckoo filter: $(\log_2(1/\epsilon)+\log_2(2b))/\alpha$ bits per item; with bucket size $b = 4$ this is $(\log_2(1/\epsilon)+3)/\alpha$ .

Number of Memory Accesses:
For Bloom filters with k hash functions, a positive query must read k bits from the bit array.

For space-optimized Bloom filters that require $k = \log_2(1/\epsilon)$ hash functions, as $\epsilon$ gets smaller, positive queries must probe more bits and are likely to incur more cache line misses when reading each bit.
For example, k = 2 when $\epsilon$ = 25%, but k is 7 when $\epsilon$ = 1%, which is more commonly seen in practice.

A negative query to a space-optimized Bloom filter reads two bits on average before it returns, because half of the bits are set.
Any query to a cuckoo filter, positive or negative, always reads a fixed number of buckets, resulting in (at most) two cache line misses.
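
The figures quoted above can be reproduced with a little arithmetic. The function names are hypothetical, rounding $k$ up to an integer is an assumption, and the negative-query estimate assumes each probed bit is set independently with probability 1/2.

```python
from math import ceil, log2


def bloom_num_hashes(epsilon):
    """k = log2(1/epsilon), rounded up; also the bits read by a positive query."""
    return ceil(log2(1 / epsilon))


def bloom_negative_query_expected_reads(k):
    """A negative query reads bit j+1 only if the first j probed bits were all
    set, which happens with probability (1/2)^j, so the expected number of
    reads is sum_{j=0}^{k-1} (1/2)^j = 2 - 2^(1-k)."""
    return sum(0.5 ** j for j in range(k))


CUCKOO_BUCKETS_READ_PER_QUERY = 2  # both candidate buckets, positive or negative
```

For $\epsilon = 1\%$ this gives $k = 7$ bit reads for a positive Bloom filter query and just under two reads on average for a negative one, while the cuckoo filter reads two buckets in either case.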