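Cyclic kick-outs during insertion. Standard Cuckoo hashing cannot predict which item has an empty alternative bucket, so it has to decide by BFS or at random. This wastes time and may also lead to an endless loop. A good strategy should find a solution fast if such a solution exists.
Multiple bucket checks per lookup: looking up an item requires visiting all of its candidate buckets, which hurts lookup performance, especially when the table is large and has to be kept in external memory. The goal is to narrow down the subset of buckets that may contain the item beforehand and optimize the access pattern.
Keep the stash on-chip (?) to reduce its performance impact. When the stash itself is full, the items stored in it retry insertion into the main table until some space is freed. The stash comes from Cuckoo hashing with a stash (CHS) [22]; a small stash of size 4 is regarded as enough to achieve a rather high load (for example 95% in [24]) with high probability. That is already very high, so I am curious how it can be improved further.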
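Multi-bucket lookups
When the target platform has to handle a very large table using only a small, fast on-chip memory (for example the ASIC/FPGA/SoC based packet processing devices), having to check multiple locations on every lookup becomes a serious problem.
Multiple copies of an item: put the item into all of its available candidate buckets at once, instead of picking one available bucket at random. The redundancy makes it clear which copy is the best one to replace on a collision, which speeds up insertion and avoids endless loops. Keeping copies in all the available candidate buckets will maintain the flexibility and avoid entering the sub-optimal situation early: the optimal placement will come out naturally later on when the other occupied buckets are appropriately given away as per request to new items, who turn out to be the better owners of these buckets in an overall optimal arrangement.
During a lookup, all buckets holding the same item should have the same counter value, and this property can be used to rule out impossible cases. For example, if one of the candidate buckets has counter value 0, the item cannot be in the table. Furthermore, because an item can always overwrite a redundant copy to settle down, if a lookup fails with any candidate bucket having a counter value larger than 1, we know the item has never been inserted and can skip checking the stash. These observations make it possible to skip some buckets, or to skip unnecessary checks of the stash.
McCuckoo is particularly suited to the case where the main table can only be kept in slower secondary storage, and the three problems above can then be solved in a unified way. To maximize the benefit of the counters, they should be kept in on-chip embedded memory as a compact on-chip counter array. The operating logic is also simple.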
As other answers have pointed out, it’s true that the simplest cuckoo hashtable requires that the table be half empty (about 51% of the table has to stay free to keep insertion performance reasonable). However, the concept has been generalized to d-ary cuckoo hashing, in which each key has d possible places to nest, as opposed to 2 places in the simple version. The acceptable load factor increases quickly as d is increased. For only d=3, you can already use around a 75% full table. The downside is that you need d independent hash functions. I’m a fan of Bob Jenkins’ hash functions for this purpose (see http://burtleburtle.net/bob/c/lookup3.c), which you might find useful in a cuckoo hashing implementation. (From Stack Overflow)
Let’s say you have h(x)=x mod m. Then an adversary can choose keys: 1, 1+m, 1+2m, 1+3m etc. They will all cause collisions. But if the function h is chosen randomly from a universal family, the adversary cannot construct such a “bad” set of keys in advance.
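The contrast can be sketched in Python. This is only an illustration: the family h(x) = ((a*x + b) mod p) mod m is the usual Carter-Wegman style construction, and the prime `P`, the seed, and the function names are assumptions made for this example, not something taken from the quoted answer.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1, larger than every key used below

def fixed_hash(x, m):
    """The fixed function from the text: h(x) = x mod m."""
    return x % m

def make_universal_hash(m, rng):
    """Draw one function h(x) = ((a*x + b) mod P) mod m at random."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    def h(x):
        return ((a * x + b) % P) % m
    return h

m = 16
# Keys 1, 1+m, 1+2m, ... picked by an adversary who knows h(x) = x mod m.
adversarial_keys = [1 + i * m for i in range(100)]

fixed_buckets = {fixed_hash(k, m) for k in adversarial_keys}

rng = random.Random(0)
h = make_universal_hash(m, rng)
universal_buckets = {h(k) for k in adversarial_keys}

print(len(fixed_buckets), len(universal_buckets))
```

With the fixed function every adversarial key lands in bucket 1, while the randomly drawn function spreads the same keys over several buckets, because the adversary had to commit to the keys before a and b were chosen.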
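Occupying all the candidate buckets avoids troublesome kick-outs on a collision. Earlier algorithms are regarded as reactive, since they focus on the kick-out rather than on resolving the collision (?). McCuckoo takes a proactive approach by keeping the decision on placement open until a more suitable item later on claims one of the buckets and replaces the copy in it. One idea behind McCuckoo is to keep redundancy for all items for as long as possible. Sooner or later every item is down to a single copy, and at that point an existing collision resolution scheme such as random-walk or MinCounter is used to start relocating items. Counters keep track of the number of copies. The counters are only a few bits wide: d is small, so log d is even smaller (with d=3 the load ratio already exceeds 90%). The counters serve the following purposes:
Decide which bucket can be taken with simple logic, without accessing the bucket
Rule out impossible buckets during a lookup, using the fact that valid candidate buckets should share the same counter value
Mark deletions without actually moving the item out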
Design Principles
insertion
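Overall, let each item occupy as many buckets as possible. A real collision occurs only when every candidate bucket holds the only copy of some other item. We therefore want the number of copies of any item in the table to drop to 1 as slowly as possible. For example, 1 1 1 means three buckets, each occupied by an item that owns only that one bucket. The insertion rules: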
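Occupy all empty candidate buckets
Never touch a bucket whose counter value is 1
Take over occupied candidate buckets in descending order of counter value, until the overwriting results in more copies for the inserted item than the overwritten one. Example: counters 3 2 1 with occupants B C A become counters 2 2 2 with occupants A C A (all buckets held by B are updated to 2). If A then also took the bucket held by C, the result would be counters 3 3 3 with occupants A A A (all buckets held by C are updated to 1).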
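A second example: counters 3 2 2 with occupants B C A become counters 3 2 3 with occupants A C A (all buckets held by B are updated to 2). Turning 2:3 into 3:2 makes no real difference globally.
Theorem 1. The insertion rules above achieve globally optimal redundancy balance. Proof: omitted (by me, not by the authors).
Theorem 2. The total number of redundant writes in the table does not exceed $1+\sum_{t=3}^{d} 1/t$ times the table size. Proof: omitted. When d=3, the number of redundant writes does not exceed 5/6 of the table size. This theorem also tells us that most of the redundancy occurs while the table is being built up.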
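The insertion rules can be sketched in Python. This is a minimal in-memory sketch under stated assumptions, not the authors' implementation: `insert`, the `table` and `counters` lists, and the explicit `cands` map are hypothetical names, the candidate buckets of one item are assumed to be distinct, and the fallback to random-walk or MinCounter on a real collision is left out. The demo state reproduces the 3 2 1 example from these notes.

```python
def insert(item, table, counters, candidates):
    """Write copies of `item`; return how many were written (0 = real collision)."""
    own = candidates(item)
    copies = 0
    # Rule 1: take every empty candidate bucket.
    for b in own:
        if counters[b] == 0:
            table[b] = item
            copies += 1
    # Rules 2 and 3: visit occupied buckets in descending counter order.
    for b in sorted(own, key=lambda i: counters[i], reverse=True):
        if table[b] == item:
            continue  # already taken in the first pass
        victim_copies = counters[b]  # re-read, it may have dropped meanwhile
        # Never touch a counter of 1, and stop once overwriting would leave
        # the new item with more copies than the overwritten one.
        if victim_copies <= 1 or copies + 1 > victim_copies - 1:
            continue
        victim = table[b]
        table[b] = item
        copies += 1
        # The victim lost one copy: update the counters of its other copies.
        for vb in candidates(victim):
            if table[vb] == victim:
                counters[vb] = victim_copies - 1
    # All buckets now holding the new item share its copy count.
    for b in own:
        if table[b] == item:
            counters[b] = copies
    return copies

# Demo: B holds 3 copies, C holds 2, bucket 2 is empty, D holds a single copy.
cands = {"A": [0, 1, 2], "B": [0, 3, 4], "C": [1, 5, 6]}
table = ["B", "C", None, "B", "B", "C", "D"]
counters = [3, 2, 0, 3, 3, 2, 1]
written = insert("A", table, counters, cands.__getitem__)
print(written, table[:3], counters[:3])
```

In the demo A first takes the empty bucket, then overwrites one copy of B so that both end up with two copies, and leaves C alone because taking its bucket would give A three copies against one for C.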
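Lookup
Lookups may have some false positive errors but no false negatives (???). The on-chip counters are used to reduce lookups in the main table.
Reducing memory accesses when looking up existing items. All buckets occupied by the same item share the same counter value, so the candidate buckets are grouped by counter value. Let s be the number of buckets in a group and c their counter value (how many buckets hold the same item). If s<c, none of the buckets in that group needs to be checked. In addition, for the buckets that share a common matching value, checking just one of them is sufficient to return the right answer (? if I know the item exists, is that always enough?). If 3 buckets all have c=2, none of them can be ruled out. The item may really have two copies, and in that case checking any two of them tells whether it is present. Observations like these reduce unnecessary disk accesses. In practice a lookup can take 0 or 1 accesses, especially when the table is moderately loaded.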
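The grouping argument can be sketched in Python. This is a hedged illustration, not the paper's lookup procedure: `probes_needed` is a hypothetical helper, it assumes no deletions have taken place (so a zero counter among the candidates means the item was never inserted), and it reads s - c + 1 buckets from a group, which by the pigeonhole principle is enough to hit a copy if the item really holds c of the s buckets.

```python
from collections import defaultdict

def probes_needed(counter_by_bucket):
    """Return the candidate buckets worth reading, given {bucket: counter}."""
    # An insertion takes every empty candidate, so a zero counter rules the item out.
    if any(c == 0 for c in counter_by_bucket.values()):
        return []
    groups = defaultdict(list)
    for bucket, c in counter_by_bucket.items():
        groups[c].append(bucket)
    probes = []
    for c, members in groups.items():
        s = len(members)
        if s < c:
            continue  # an item with c copies needs c candidates with counter c
        # If the item holds c of these s buckets, any s - c + 1 of them hit a copy.
        probes.extend(members[: s - c + 1])
    return probes

print(probes_needed({0: 3, 1: 2, 2: 1}))  # only the counter-1 bucket is plausible
print(probes_needed({0: 2, 1: 2, 2: 2}))  # three buckets with c=2: read any two
print(probes_needed({0: 3, 1: 3, 2: 3}))  # s == c: one read is enough
```

The counter lookups happen on-chip, so the returned list is an upper bound on the off-chip reads for that lookup; a real lookup can stop at the first bucket that matches.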
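Blocked version of Cuckoo hashing: each bucket is divided into l slots, which can raise the table load to very close to 100%. Is each table on the right one third of its original size? To accommodate the same number of n items, now the length of each table is reduced to roughly one third of the size as before, which is m/3. (?) There is now one counter per slot.
The problem now is that item placement can no longer be tracked by the counters alone. Previously, to update the buckets occupied by an item, knowing the bucket was enough to find the counter. Now, knowing which bucket holds a copy is not enough to locate the counter, because the copy may sit in any slot, and the current on-chip structure cannot track that. For correctness, each copy has to be read back from its bucket to see which slot it is in. Put simply, whenever an insertion, a lookup, or any other operation needs the c value of a candidate bucket, the other buckets have to be scanned to find the slot before the c value can be obtained.
One way to reduce off-chip accesses is shown in Figure 5: the main table also records the slot numbers of the other copies. When we want to update the copies of an item, we then only need to read from the main table. For a d-hash l-slot McCuckoo, the extra memory overhead is (d-1)log(l) bits per slot (d-1 positions have to be stored, each ranging from 1 to l, so log(l) bits are used for each).
Porting the multi-slot design raises many issues; for efficiency only the simple ones are handled and the complex ones are not 😂. Only three important changes of the multi-slot extension are mentioned.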
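The per-slot overhead is easy to work out for concrete parameters. The helper below is a small sketch; `slot_pointer_bits` is a hypothetical name, and it assumes the logarithm is base 2 and rounded up to whole bits.

```python
def slot_pointer_bits(d, l):
    """Extra bits per slot: (d - 1) slot numbers of ceil(log2(l)) bits each."""
    bits_per_slot_number = (l - 1).bit_length()  # equals ceil(log2(l)) for l >= 1
    return (d - 1) * bits_per_slot_number

print(slot_pointer_bits(3, 4))  # d=3 hash functions, l=4 slots per bucket
```

For d=3 and l=4 this gives 2 slot numbers of 2 bits each, so 4 extra bits per slot.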
?? McCuckoo can also support multiset, but not by distributing items of the same key among that key's multiple copies, because those redundant copies should always be identical; instead it can act as an indexing structure pointing to the address where all those items are actually stored.
All the simulations were carried out on a machine with 4 cores (8 threads, Intel Core i7 @ 2.60 GHz) and with 8 GB of DRAM memory. The target platform is an Altera Stratix V GX FPGA from Intel with 4.5 MB of on-chip SRAM and support for external DDR3 SDRAM memory at 800 MHz. In our FPGA implementation, the logic runs at 333 MHz and the DDR3 memory controller runs at 200 MHz.
DataSet and Implementation
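DocWords: This dataset includes five text collections in the form of bag-of-words and we choose the one collected from NYTimes news articles. It contains approximately 70 million items in total. The DocID and WordID are combined to form the key of each item and inserted into the hash tables. A closer look at the BoW (bag-of-words) model: the bag-of-words model (BoW model) first appeared in natural language processing (NLP) and information retrieval (IR). The model ignores the grammar, word order, and similar elements of a text and treats it simply as a collection of words, with each word occurrence in the document treated as independent. BoW represents a passage or a document as an unordered set of words. A simple example of a text-based BoW model follows:
First, take the following two simple text documents: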
John likes to watch movies. Mary likes too.
John also likes to watch football games.
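The two documents can be turned into bag-of-words count vectors with a few lines of Python. This is a minimal sketch: the tokenizer (runs of letters, case kept) and the vocabulary order (first appearance) are assumptions made for this example.

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.",
]

def tokenize(text):
    """Split a document into words, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

# Vocabulary in order of first appearance across both documents.
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

# One count vector per document; word order inside a document is ignored.
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocab])

print(vocab)
print(vectors)
```

Each (document, word) pair with a non-zero count corresponds to one item of the kind used in the experiments, where DocID and WordID are combined into the key.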
The experiments compare the number of kick-outs per insertion under different load ratios, along with the memory reads and writes involved (how quickly an insertable position is found, and at what cost). They also record the load ratio at the first kick-out and at the first insertion failure, because when a hash table is filling up with items, increased availability can defer the occurrence of collisions and insertion failure.
Why does Figure 9a not extend further to the right? Did performance collapse? Reads can drop to 0, because the on-chip counters already tell which buckets are empty. Furthermore, the multiplier effect is more severe for the single-copy schemes if we compare Fig. 10a with Fig. 9, because they need to read back each candidate bucket to know if they are empty or not during the kick-outs, while the multi-copy schemes can figure out the empty buckets with the on-chip counters.
As for writes, the McCuckoo schemes write more than the single-copy schemes at low load ratios. At high load ratios, the writes caused by kick-outs increase. For the single-copy schemes the turning point is at 50%; for McCuckoo it is slightly higher. The cross-over happens at about half load for single-slot schemes and at a bit higher load for the multi-slot schemes, which means for the most likely working conditions of Cuckoo with the table moderately to heavily loaded, the number of writes is also lower with the multi-copy schemes. Since more reads will take place than writes during an insertion, the total number of accesses at higher load ratio is much reduced in the multi-copy schemes. McCuckoo stays collision-free for longer. The first insertion failure is critical, because from that point on more insertions will stop at maxloop, which is very costly.
Raising maxloop from 50 to 500 shows that a higher maxloop achieves a higher load ratio, but it also induces a heavier penalty if an insertion still fails after the lengthy trial.
From the figure we can see that with multi-copy we can reach higher load ratio free of insertion failures with the same maxloop, or reach the same load ratio with smaller maxloop values than the single-copy schemes.
Lookups for Existing and Non-existing Items
Once B-Cuckoo is heavily loaded it has to look items up the ordinary way. The results show superior insertion performance.
Deletion
Deletion performance is not great, but deletions are rare, so the impact is small.
Stash at High Load Ratio
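Because this is the first proposal to put the stash off-chip, there is nothing to compare against. The experiment is meant to show the necessity of a bigger stash and the feasibility of putting one off-chip. It simulates a McCuckoo table and a blocked McCuckoo table very close to maximum load, with a very crowded main table. Five parameters are reported for each table: the current load ratio, maxloop, the number of items in the stash, their share of the total insertions, and the fraction of lookups for non-existing items that go to the stash. The results show that a large stash is necessary. An item inserted into a crowded table is very likely to end up in the stash, unless a rehash is acceptable. The pre-screening mechanism also avoids most unnecessary lookups.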
Latency and Throughput
In hardware, things seem a bit different from software. The evaluation on latency and throughput is based on an Altera FPGA development board that can run the McCuckoo logic and access the on-chip SRAM at 333 MHz. Hash calculation and the logic are implemented in hardware and can be performed in 1 CLK. The on-chip SRAM can be read in 3 CLK and written in 1 CLK. For the off-chip DDR3 SDRAM, the controller is clocked at 200 MHz, and a read costs about 18 CLK on average and a write costs 1 CLK. Read latency is larger than write latency: a write can return as soon as it is issued and the next instruction can proceed, whereas a read has to wait for the data to come back from external memory. Also, when the record size is small, skipping some buckets makes no noticeable difference in read latency, so the benefit of McCuckoo is less significant (while the time to access the on-chip counters becomes relatively large because they need to be checked all the time).
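The cycle counts can be converted to time to see the gap between on-chip and off-chip accesses. This is back-of-the-envelope arithmetic only; it assumes the off-chip cycle counts are measured in the 200 MHz controller clock and the on-chip ones in the 333 MHz logic clock, which the text does not state explicitly. `cycles_to_ns` is a hypothetical helper.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count at the given clock frequency to nanoseconds."""
    return cycles * 1000.0 / clock_mhz

ON_CHIP_MHZ = 333    # logic and on-chip SRAM clock, from the text
DDR3_CTRL_MHZ = 200  # DDR3 controller clock, from the text

on_chip_read_ns = cycles_to_ns(3, ON_CHIP_MHZ)
on_chip_write_ns = cycles_to_ns(1, ON_CHIP_MHZ)
off_chip_read_ns = cycles_to_ns(18, DDR3_CTRL_MHZ)
off_chip_write_ns = cycles_to_ns(1, DDR3_CTRL_MHZ)

print(round(on_chip_read_ns, 2), off_chip_read_ns, off_chip_write_ns)
```

Under these assumptions one off-chip read (about 90 ns) costs roughly as much as ten on-chip counter reads (about 9 ns each), which is why skipping even a single bucket read pays for checking the counters on every operation.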
Random walk. The key question is which item to move if the d potential locations for a newly inserted item x are occupied. A natural approach in practice is to pick one of the d locations randomly, replace the item y at that location with x, and then try to place y in one of its other d − 1 location choices [6]. If all of the locations for y are full, choose one of the other d−1 locations (other than the one that now contains x, to avoid the obvious cycle) randomly, replace the item there with y, and continue in the same fashion. At each step (after the first), place the item if possible, and if not randomly exchange the item with one of d − 1 choices. We refer to this as the random-walk insertion method for cuckoo hashing. In other words, just pick one at random and kick it out; quite a fancy name for that orz. (From: An Analysis of Random-Walk Cuckoo Hashing)
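The random-walk insertion described above can be sketched in Python. This is a minimal sketch under stated assumptions, not code from the cited paper: keys are integers, the d hash functions are simulated with Python's built-in `hash` on `(i, key)` tuples, each hash function owns its own sub-table of `size` buckets so the candidates are distinct, and `maxloop` bounds the walk. All names are hypothetical.

```python
import random

def candidates(key, d, size):
    """One candidate bucket per sub-table; sub-table i covers [i*size, (i+1)*size)."""
    return [i * size + hash((i, key)) % size for i in range(d)]

def random_walk_insert(table, key, d, size, maxloop, rng):
    """Insert `key`; return None on success, else the item left without a bucket."""
    current = key
    avoid = None  # bucket the current item was just evicted from
    for _ in range(maxloop):
        locs = candidates(current, d, size)
        # Place the item if any of its candidate buckets is free.
        for b in locs:
            if table[b] is None:
                table[b] = current
                return None
        # Otherwise evict a random occupant, but not from the bucket just left.
        b = rng.choice([loc for loc in locs if loc != avoid])
        table[b], current = current, table[b]
        avoid = b
    return current  # gave up after maxloop steps; a stash could hold this item

rng = random.Random(1)
d, size = 3, 64
table = [None] * (d * size)
homeless = [random_walk_insert(table, k, d, size, 500, rng) for k in range(100)]
print(sum(x is not None for x in table), "items stored")
```

The newly inserted item picks among all d locations, and every evicted item picks among its other d − 1 locations, matching the description above. An item returned by the function is the one that would go to a stash or trigger a rehash.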