A First Look at Sketches: the Quotient Filter (QF)

1. What is a quotient filter?

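A quotient filter is a space-efficient probabilistic data structure (PDS) used to test whether an element is a member of a set S; that is, it is an approximate member query (AMQ) filter. A query either reports that the element is definitely not in S (there are no false negatives) or that it is possibly in S, because the structure has a nonzero false positive rate. There is a trade-off between the false positive rate and storage: allocating more space lowers the false positive rate, while adding more elements to S raises it. AMQ operations include insert and, optionally, delete.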
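To make the trade-off concrete, here is a small sketch that evaluates the estimate usually quoted for quotient filters, $1 - e^{-\alpha/2^r} \le 2^{-r}$, where $\alpha = n/m$ is the load factor and $r$ is the remainder size in bits. The formula is not derived in this post, so treat it as an assumption brought in from the literature; the function name is made up for illustration.

```python
import math

def qf_false_positive_rate(n, q, r):
    """Estimated false positive rate for n fingerprints stored in 2**q slots
    with r-bit remainders (assumed formula: 1 - exp(-alpha / 2**r))."""
    alpha = n / 2 ** q            # load factor
    return 1.0 - math.exp(-alpha / 2 ** r)

# More remainder bits (more space) lower the rate; more elements raise it.
for r in (4, 8, 16):
    print(r, qf_false_positive_rate(n=768, q=10, r=r))
```

With the table three-quarters full, each extra remainder bit roughly halves the estimated rate.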

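The most typical use of quotient filters, and of AMQ filters in general, is as a proxy for the keys of a database stored on disk. Whenever a key is added to or removed from the database, the filter is updated to reflect the change. Every lookup first consults the quotient filter: only if the filter reports that the key may be in the set is the (presumably much slower) database query executed; if the filter reports that the key is not in S, the disk lookup is skipped entirely.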

An approximate member query (AMQ) filter used to speed up answers in a key-value storage system. Key-value pairs are stored on a disk which has slow access times. AMQ filter decisions are much faster. However some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the AMQ filter than without it. Use of an AMQ filter for this purpose, however, does increase memory usage.
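The lookup path described above can be sketched as follows. Everything here is hypothetical scaffolding: `ExactSetFilter` is a stand-in with the same interface an AMQ filter would offer (a real one may also answer "maybe" for absent keys), and a plain dict plays the role of the slow on-disk store.

```python
class ExactSetFilter:
    """Hypothetical stand-in for an AMQ filter (never gives false positives)."""
    def __init__(self):
        self._keys = set()

    def add(self, key):
        self._keys.add(key)

    def might_contain(self, key):
        return key in self._keys


class FilteredStore:
    """Consults the filter first and touches the 'disk' only on a positive."""
    def __init__(self, amq, disk):
        self.amq = amq
        self.disk = disk          # dict standing in for slow storage
        self.disk_reads = 0

    def put(self, key, value):
        self.disk[key] = value
        self.amq.add(key)

    def get(self, key):
        if not self.amq.might_contain(key):
            return None           # definite miss: no disk access needed
        self.disk_reads += 1
        return self.disk.get(key) # still None if the positive was false


store = FilteredStore(ExactSetFilter(), {})
store.put("k1", "v1")
print(store.get("k1"), store.get("missing"), store.disk_reads)
```

The point of the pattern is that a negative answer from the filter costs no disk read at all, while a positive answer costs exactly one.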

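Besides the usual AMQ insert and query operations, a quotient filter also supports merging and resizing without re-hashing the existing keys, which avoids having to access secondary storage. This property benefits certain kinds of log-structured merge-trees [1].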
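Why no re-hash is needed: the quotient (slot index) and the stored remainder together are the whole fingerprint, so a larger table can be filled by moving one bit from the remainder to the quotient. A minimal sketch, assuming the quotient of the stored remainder is already known (in a real filter it has to be recovered from the metadata bits, because a remainder may sit to the right of its canonical slot); `grow_slot` is an illustrative name.

```python
def grow_slot(fq, fr, r):
    """Re-split one fingerprint for a table twice as large.

    fq, fr: quotient and r-bit remainder under the old layout.
    Returns (quotient, remainder) with r - 1 remainder bits.
    """
    f = (fq << r) | fr                    # fingerprint rebuilt, no hashing
    new_r = r - 1
    return f >> new_r, f & ((1 << new_r) - 1)

# Fingerprint 0b101101 with r = 3 is (0b101, 0b101); with r = 2 it is (0b1011, 0b01).
print(grow_slot(0b101, 0b101, 3))
```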

1. Algorithm description [2]

Filter components: a quotient filter does not store the element itself; only a $p$-bit fingerprint is stored.

  • $p$: size (in bits) of the fingerprints
  • 1 hash function that generates such fingerprints

A quotient filter is represented as a compact open hash table with $m=2^q$ buckets.

  • the fingerprint $f$ is partitioned into:
    • the $r$ least significant bits ($f_r = f \bmod 2^r$, the remainder)
    • the $q = p - r$ most significant bits ($f_q = \lfloor f / 2^r \rfloor$, the quotient)

$f$: the fingerprint; $f_q$: the quotient; $f_r$: the remainder
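The split can be sketched in a few lines. The hash here is an assumption: BLAKE2b from the standard library, truncated to 32 bits, is used only because it ships with Python, so the numbers it produces will not match the MurmurHash3 values in the example further down.

```python
import hashlib

P = 32          # fingerprint size in bits
Q = 3           # quotient bits, so the table has 2**Q buckets
R = P - Q       # remainder bits

def fingerprint(element):
    """P-bit unsigned fingerprint of a string (BLAKE2b is a stand-in hash)."""
    digest = hashlib.blake2b(element.encode("utf-8"), digest_size=P // 8).digest()
    return int.from_bytes(digest, "big")

def split_fingerprint(f, r=R):
    """Return (f_q, f_r): the high bits and the r low bits of f."""
    return f >> r, f % (1 << r)

f = fingerprint("amsterdam")
fq, fr = split_fingerprint(f)
print(fq, fr, (fq << R) | fr == f)
```

Shifting and masking are just the integer forms of $\lfloor f / 2^r \rfloor$ and $f \bmod 2^r$, and gluing the two parts back together recovers $f$.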

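The remainder is stored in a bucket indexed by the quotient. Besides the remainder, each bucket holds 3 metadata bits, all 0 at the beginning: is_occupied, is_continuation, is_shifted. In other words, the quotient determines the canonical position, and the remainder is then stored there or at an offset to the right, depending on whether there are collisions.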

If two fingerprints $f$ and $f'$ have the same quotient ($f_q = f'_q$), it is a soft collision. All remainders of fingerprints with the same quotient are stored contiguously in a run.

Status bits:


is_occupied: is set when bucket $j$ is the canonical bucket ($f_q=j$) for some fingerprint $f$ stored (somewhere) in the filter. It depends only on the quotient.

is_continuation: is set when the bucket is in use, but not by the first remainder in a run. It marks the remainders that follow the first one among those sharing a quotient.

is_shifted: is set when the remainder in the bucket is not in its canonical bucket, that is, its quotient is not this slot's index because earlier remainders pushed it to the right.
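Since the three bits are easy to mix up, here is a small lookup table of what each combination means. It follows directly from the definitions above (a continuation is always shifted, so the two combinations with is_continuation set and is_shifted clear cannot occur); the wording of the descriptions is mine.

```python
# Keys are (is_occupied, is_continuation, is_shifted).
SLOT_STATES = {
    (0, 0, 0): "empty slot",
    (1, 0, 0): "start of a run in its canonical slot (start of a cluster)",
    (0, 0, 1): "start of a run, shifted from its canonical slot",
    (0, 1, 1): "continuation of a run, shifted",
    (1, 0, 1): "start of a shifted run; slot is also canonical for another run",
    (1, 1, 1): "continuation of a shifted run; slot is also canonical for another run",
}

def describe(is_occupied, is_continuation, is_shifted):
    key = (is_occupied, is_continuation, is_shifted)
    return SLOT_STATES.get(key, "unused combination")

for bits in ("100", "111", "011", "101", "001", "010"):
    print(bits, "->", describe(*(int(b) for b in bits)))
```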


To test an element

  • Apply the hash function to the element and calculate the fingerprint $f$.
  • Split $f$ into the quotient $f_q$ and the remainder $f_r$.
  • If bucket $f_q$ is not occupied, the element is definitely not in the set.
  • If bucket $f_q$ is occupied:
    • Starting with bucket $f_q$, scan left to locate the first bucket whose is_shifted bit is not set (the start of the cluster), counting the buckets to the left of $f_q$ that have is_occupied set. This count is the number of runs that precede the run of $f_q$ in the cluster.
    • Scan right from the start of the cluster, decrementing the running count at each subsequent bucket whose is_continuation bit is clear (the start of a new run), until the count reaches 0; the scan is then at the start of the quotient's run.
    • Compare the stored remainder in each bucket of the quotient's run with $f_r$.
    • If found, then the element is (probably) in the filter; otherwise it is definitely not in the filter.
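The lookup steps above can be sketched as below. Two caveats: instead of keeping an explicit running count, this version walks run by run until it reaches the run of $f_q$, which amounts to the same thing; and the table is hard-coded to the final state of the worked example in section 2, with at least one empty slot assumed so that the scans terminate. `Slot`, `make_slot` and `may_contain` are illustrative names.

```python
from typing import NamedTuple, Optional

class Slot(NamedTuple):
    occupied: bool
    continuation: bool
    shifted: bool
    remainder: Optional[int]

def make_slot(bits, remainder=None):
    o, c, s = (ch == "1" for ch in bits)
    return Slot(o, c, s, remainder)

def may_contain(slots, fq, fr):
    m = len(slots)
    if not slots[fq].occupied:
        return False                      # no run for this quotient
    b = fq
    while slots[b].shifted:               # scan left to the cluster start
        b = (b - 1) % m
    s = b                                 # s tracks run starts, b tracks quotients
    while b != fq:
        s = (s + 1) % m
        while slots[s].continuation:      # skip the rest of the current run
            s = (s + 1) % m
        b = (b + 1) % m
        while not slots[b].occupied:      # next quotient that owns a run
            b = (b + 1) % m
    while True:                           # s is now the start of fq's run
        if slots[s].remainder == fr:
            return True
        s = (s + 1) % m
        if not slots[s].continuation:
            return False

# State after inserting amsterdam, berlin, london, madrid, ankara, abu dhabi.
EXAMPLE = [
    make_slot("000"),
    make_slot("100", -265307463),   # abu dhabi
    make_slot("111", 164894540),    # amsterdam
    make_slot("011", 249059682),    # madrid
    make_slot("101", 62147742),     # ankara
    make_slot("001", -89622902),    # berlin
    make_slot("000"),
    make_slot("100", 232552816),    # london
]

print(may_contain(EXAMPLE, 4, -89622902))   # berlin
print(may_contain(EXAMPLE, 1, 122150710))   # ferret
```

For berlin the walk ends at slot 5 and finds the remainder; for ferret the run of quotient 1 (slots 1-3) is exhausted without a match.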

To add an element

  • Apply the hash function to the element and calculate the fingerprint $f$.
  • Calculate the quotient $f_q$ and the remainder $f_r$ of the fingerprint $f$.
  • Locate the run that belongs to bucket $f_q$, choose the position within it that keeps the remainders in sorted order, and insert $f_r$ there, shifting later remainders to the right if necessary (and set the is_occupied bit of bucket $f_q$).
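Insertion is the fiddly part, so here is a compact sketch of it under a few assumptions: the caller has already split the fingerprint into $(f_q, f_r)$, the metadata bits live in parallel Python lists rather than packed words, and the table is never allowed to fill up completely. The class and method names are made up for illustration.

```python
class QuotientFilter:
    def __init__(self, q):
        self.m = 1 << q
        self.occ = [False] * self.m     # is_occupied (stays with the slot)
        self.cont = [False] * self.m    # is_continuation (moves with the remainder)
        self.shift = [False] * self.m   # is_shifted (moves with the remainder)
        self.rem = [None] * self.m      # stored remainder, None if empty
        self.count = 0

    def bits(self, i):
        return f"{int(self.occ[i])}{int(self.cont[i])}{int(self.shift[i])}"

    def _run_start(self, fq):
        b = fq
        while self.shift[b]:                    # left to the cluster start
            b = (b - 1) % self.m
        s = b
        while b != fq:                          # skip one run per earlier quotient
            s = (s + 1) % self.m
            while self.cont[s]:
                s = (s + 1) % self.m
            b = (b + 1) % self.m
            while not self.occ[b]:
                b = (b + 1) % self.m
        return s

    def insert(self, fq, fr):
        if self.count == self.m:
            raise OverflowError("filter is full")
        if self.rem[fq] is None:                # canonical slot is free
            self.occ[fq], self.rem[fq] = True, fr
            self.count += 1
            return
        run_existed = self.occ[fq]
        self.occ[fq] = True
        s = start = self._run_start(fq)
        is_cont = False
        if run_existed:
            while True:                         # find the sorted position
                if self.rem[s] == fr:
                    return                      # already present
                if self.rem[s] > fr:
                    break
                s = (s + 1) % self.m
                if not self.cont[s]:
                    break
            if s == start:
                self.cont[start] = True         # old head becomes a continuation
            else:
                is_cont = True
        cur = (fr, is_cont, s != fq)
        while True:                             # shift displaced entries right
            prev = (self.rem[s], self.cont[s], True)
            self.rem[s], self.cont[s], self.shift[s] = cur
            if prev[0] is None:
                break
            cur = prev
            s = (s + 1) % self.m
        self.count += 1


qf = QuotientFilter(q=3)
for fq, fr in [(1, 164894540), (4, -89622902), (7, 232552816),
               (1, 249059682), (2, 62147742), (1, -265307463)]:
    qf.insert(fq, fr)
print([qf.bits(i) for i in range(1, 6)])  # ['100', '111', '011', '101', '001']
```

The usage lines replay the six insertions of the example in section 2; the is_occupied bits never move during the shift because they belong to the slot, not to the remainder stored in it.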

Run and Cluster

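A run is a contiguous sequence of remainders that share the same quotient $f_q$.

Why introduce runs?

A run tells us where the storage region for the elements of a given quotient begins.

A cluster is a sequence of slots that begins with a slot holding a remainder in its canonical position (bits 100) and continues with consecutive slots whose is_shifted bit is set (bits XX1).

A cluster tells us how far elements have moved from their canonical slots: every slot in the cluster after the first has is_shifted set.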
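Both notions can be read straight off the metadata bits. The sketch below groups slot indices into clusters and runs, assuming each slot is given as a 3-character string in the order is_occupied, is_continuation, is_shifted and that no cluster wraps around the end of the table; the function name is illustrative.

```python
def clusters_and_runs(bits):
    """Return a list of clusters; each cluster is a list of runs;
    each run is a list of slot indices."""
    clusters = []
    for i, b in enumerate(bits):
        if b == "000":
            continue                       # empty slot, belongs to nothing
        continuation, shifted = b[1] == "1", b[2] == "1"
        if not shifted:
            clusters.append([[i]])         # unshifted entry opens a new cluster
        elif continuation:
            clusters[-1][-1].append(i)     # extends the current run
        else:
            clusters[-1].append([i])       # shifted run start: new run, same cluster
    return clusters

table = ["000", "100", "111", "011", "101", "001", "000", "100"]
print(clusters_and_runs(table))  # [[[1, 2, 3], [4], [5]], [[7]]]
```

In this table slots 1-5 form one cluster made of three runs, and slot 7 is a cluster of its own.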

2. Example for Quotient filter

Consider a quotient filter with quotient size $q=3$ and 32-bit signed MurmurHash3 as $h$.


  1. Add elements amsterdam, berlin, london:

    • $f_q$(amsterdam) = 1, $f_r$(amsterdam) = 164894540
    • $f_q$(berlin) = 4, $f_r$(berlin) = -89622902
    • $f_q$(london) = 7, $f_r$(london) = 232552816

    Insertion at this stage is easy since none of the canonical slots is occupied. We just store the remainders in their canonical slots.


  2. Add element madrid: $f_q$(madrid) = 1, $f_r$(madrid) = 249059682.

    The canonical slot 1 is already occupied. Its shifted and continuation bits are not set, so we are at the beginning of a cluster, which is also the start of the run.

    The remainder $f_r$(madrid) is strictly greater than the existing remainder, so it is stored in the next available slot, slot 2, and the shifted and continuation bits of that slot are set (but not the occupied bit, because it pertains to the slot, not to the remainder stored in it). The run and the cluster both grow by one slot.


  3. Add element ankara: $f_q$(ankara) = 2, $f_r$(ankara) = 62147742.

    The canonical slot 2 is not marked occupied, but it is already in use. So we first flag slot 2 as occupied by setting its is_occupied bit, and $f_r$(ankara) is shifted right into the nearest available slot, slot 3, whose is_shifted bit is set; its is_continuation bit stays clear.

    This starts a new run. The cluster that begins at slot 1 also keeps growing, because the shifted slots follow it without a gap.


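  4. Add element abu dhabi: $f_q$(abu dhabi) = 1, $f_r$(abu dhabi) = -265307463.

    The quotient is 1 and the remainder is the smallest in its run, so it takes slot 1 and the elements stored from slot 1 onward are each shifted one slot to the right. The run for quotient 1 grows and the cluster changes considerably, because more elements now have is_shifted set.

    Status bits per slot (is_occupied, is_continuation, is_shifted):

    Before, slots 1-4: 100 111 001 100

    After inserting the new element, slots 1-5: 100 111 011 101 001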


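  1. Test element ferret: $f_q$(ferret) = 1, $f_r$(ferret) = 122150710.

    The quotient leads to slot 1, which is occupied and is the start of its run, so we only need to compare the remainders in that run.

    $f_r$(ferret) is greater than the remainder in slot 1 and smaller than the one in slot 2; since the run is sorted, the element is definitely not in the set.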

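  2. Test element berlin: $f_q$(berlin) = 4, $f_r$(berlin) = -89622902.

    The quotient leads to slot 4, whose is_occupied bit is set, so we continue.

    Its is_shifted bit is also set, which means the run for quotient 4 has been shifted right, by an as yet unknown number of slots.

    Scanning left from slot 4, slots 2 and 1 have is_occupied set, and slot 1 has bits 100, which marks the start of the cluster. The run for quotient 4 is therefore the **3rd run** of the cluster.

    Scanning right from slot 1, the first run (quotient 1) covers slots 1-3 and the second run (quotient 2) is slot 4, so the third run starts at slot 5. The remainder stored there equals $f_r$(berlin), so berlin is (probably) in the set.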

3. QF vs. Bloom filter (BF)
  • Quotient filters are about 20% bigger than Bloom filters, but faster because each access requires evaluating only a single hash function.
  • Results of comparison of in-RAM performance (M. Bender et al.):
    • inserts. BF: 690 000 inserts per second, QF: 2 400 000 inserts per second
    • lookups. BF: 1 900 000 lookups per second, QF: 2 000 000 lookups per second
  • Lookups in Quotient filters incur a single cache miss, as opposed to at least two in expectation for a Bloom filter.
  • Two Quotient Filters can be efficiently merged without affecting their false positive rates. This is not possible with Bloom filters.
  • Quotient filters support deletion.
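The merge property follows from the layout: scanning the slots from left to right yields the stored fingerprints in ascending order, so two filters can be combined like two sorted lists, without going back to the original keys. A rough sketch, assuming both filters use the same hash and the same $q$ and $r$, remainders are unsigned, no cluster wraps around the end of the table, and each slot is given as a `(bits, remainder)` pair; all names are illustrative.

```python
import heapq
from collections import deque

def stored_fingerprints(slots, r):
    """Yield the fingerprints held in a filter, in ascending order."""
    pending = deque()        # quotients whose run has not been reached yet
    quotient = None
    for index, (bits, remainder) in enumerate(slots):
        if bits[0] == "1":   # is_occupied: some run belongs to this slot index
            pending.append(index)
        if bits == "000":
            continue
        if bits[1] == "0":   # is_continuation clear: a new run starts here
            quotient = pending.popleft()
        yield (quotient << r) | remainder

def merged_fingerprints(slots_a, slots_b, r):
    """Merge two filters' fingerprints into one sorted, duplicate-free list."""
    out = []
    for f in heapq.merge(stored_fingerprints(slots_a, r),
                         stored_fingerprints(slots_b, r)):
        if not out or out[-1] != f:
            out.append(f)
    return out

EMPTY = ("000", None)
a = [EMPTY, ("100", 3), ("111", 9), ("001", 5), EMPTY, EMPTY, EMPTY, EMPTY]
b = [EMPTY, ("100", 9), EMPTY, EMPTY, EMPTY, EMPTY, ("100", 1), EMPTY]
print(merged_fingerprints(a, b, r=4))  # [19, 25, 37, 97]
```

The merged list could then be split again into quotients and remainders and written into a (possibly larger) table in a single left-to-right pass.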

Reference links

  • Quotient filter explained: gakhov - Articles - Probabilistic data structures. Quotient filter (//www.gakhov.com/articles/quotient-filters.html)
  • Book and GitHub project with a detailed introduction to PDS: gakhov/gakhov: Andrii Gakhov (github.com)


  1. Log-structured merge-tree - Wikipedia

  2. gakhov - Articles - Probabilistic data structures. Quotient filter
