[Paper Notes] Coloring Embedder

Latest recommended article published on 2022-12-06 12:30:00

iroy33

Article link: https://blog.csdn.net/iroy33/article/details/103423026

A somewhat irresponsible disclaimer: this blog contains a lot of messy material that mainly helps me sort out my own thinking.

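Problem background

Multi-set query: Given k sets S1, S2 … Sk with no intersection and an element e from one of those sets, multi-set query is to query which set e belongs to.
$f(e) = i$ if $e \in S_{i}$; i is also called the set ID of e.
This problem has many applications: indexing in data centers, distributed file systems, database indexing, data duplication, network packet processing [11,43,41], and network traffic measurement [12,39].
Use case 1: distributed caching. The classic distributed cache is Summary Cache: there are multiple proxy caches, and each proxy keeps a compact summary of the cache contents of every other proxy. When a cache miss occurs, the proxy first checks all summaries to see whether the request might hit in another cache, then sends the query to those proxies whose summaries show positive results [44,46]. In other words, it needs to know which proxy cache holds the requested item.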

Use case 2: MAC table query. In a data center, every incoming packet needs to know which output port to be forwarded to. A query on a MAC table can be seen as a multi-set query. Each MAC table entry includes a key (MAC address) and a value (port). In a typical MAC table [2], there are around ten thousand entries and tens of ports, while a switch often has limited memory, so it is challenging to support queries at high line-rate [43]. In other words, we need to know which port set an entry belongs to.

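The three key metrics are query speed, error rate, and memory efficiency; the paper claims to achieve all three at once.

Reducing collisions: first map the elements into a high-dimensional space to reduce hash collisions.
Reducing memory usage: a dimensional reduction representation, which closely resembles graph coloring.
Theory and experiments show that, compared with the state of the art, the error rate is about one thousandth, the memory usage is smaller, and the query speed is 2 times faster.
Code: the link was not preserved in these notes.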

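Prior art and limitations

When the multi-set is large, hash-table methods need a lot of memory and their queries are not fast enough.
The problem with Bloom filters is that they cannot reach high accuracy when memory is tight.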

exact-match data structure

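Hash table based solution: treat the element as the key and the set ID as the value, and build a hash table. It is exact but uses too much memory, and because of hash collisions the query time is unbounded.
Perfect hashing has little memory redundancy and trades insertion speed for query speed (it can hardly support insertions). Cuckoo hashing queries very fast and uses relatively little memory, but it only supports slow updating, and updates fail when the load factor is too high.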

Probabilistic Data Structure

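Bloom filter based solution: a compact data structure for the membership query problem. It is fast and memory-efficient at the cost of accuracy (hash collisions cause a high error rate). One Bloom filter represents one set and can answer approximately whether an element belongs to it, but the answer may be wrong. Querying multiple sets takes multiple Bloom filters, which makes memory utilization poor and queries slow, because several filters must be accessed.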
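To make the "one Bloom filter per set" baseline concrete, here is a minimal sketch. The class name, the SHA-256 based hashing, and the sizes are my own illustrative assumptions, not anything from the paper.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over a list of booleans (illustrative, not optimized)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Assumption: derive k positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def multi_bf_query(filters, item):
    """Return every set ID whose filter reports the item (more than one = ambiguous)."""
    return [set_id for set_id, bf in enumerate(filters) if item in bf]


sets = [["a", "b"], ["c"], ["d", "e", "f"]]
filters = []
for members in sets:
    bf = BloomFilter(num_bits=64, num_hashes=3)
    for x in members:
        bf.add(x)
    filters.append(bf)

# The true set ID (1) is always reported; false positives may add other IDs.
candidates = multi_bf_query(filters, "c")
```

Every query touches all k filters, which is exactly the speed and memory problem described above.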

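**Bloom Tree [42], Coded Bloom filter [12], Sparsely coded filter [29]** and similar designs try to reduce the number of Bloom filters by letting each BF represent one part of the encoded set ID. Since the optimal length of a BF presumably depends on the number of elements it holds, the memory usage of these methods is affected by the distribution of set sizes, even when the elements are known in advance.

Combinatorial Bloom filter [22], iSet [33], and the Shifting Bloom filter [40] are BF variants that record the elements of different sets in a single filter, so they are not affected by the distribution of set sizes.

BFs suit cases where a fairly high error rate is acceptable. To get the error rate below 1E-4 they take too much memory (19.13 bits per element using 13 hash functions).

The algorithm in this paper is almost error-free at a much smaller memory threshold (2.2 bits per element using 2 hash functions).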
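As a sanity check on the Bloom filter figure, the textbook false-positive approximation $(1-e^{-k/b})^k$ can be inverted for the bits per element $b$. This is a sketch using that standard approximation; the function name is mine, and the result lands near (not exactly on) the 19.13 quoted above.

```python
import math


def bloom_bits_per_element(error_rate, num_hashes):
    """Bits per element b solving (1 - exp(-k / b)) ** k = error_rate."""
    return -num_hashes / math.log(1.0 - error_rate ** (1.0 / num_hashes))


bits = bloom_bits_per_element(1e-4, 13)  # roughly 19 bits per element
print(round(bits, 2))
```

Against that, 2.2 bits per element is close to a ninefold saving.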

Random Graph and Sharp Threshold

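A random graph is a graph whose edges are placed at random. Such graphs have some remarkable mathematical properties: certain phenomena appear once a critical point is passed. For example, when the probability p exceeds a critical value pc(n), the generated random graph is almost surely connected (with probability tending to 1). In other words, for n buttons scattered on the floor, if you tie a thread between each pair of buttons with probability p, then picking up one button lifts almost all of them. The paper finds that once the memory exceeds a certain threshold, the probability of successfully building a coloring embedder becomes very high. This property is also used to set the initial memory size.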
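The button picture is easy to simulate. The sketch below samples G(n, p) graphs on either side of the classical connectivity threshold $\ln n / n$ (a standard result, not something from the paper) and measures how often they come out connected. The graph size, multipliers, and trial count are arbitrary choices of mine.

```python
import math
import random


def is_connected(n, p, rng):
    """Sample G(n, p) and check connectivity with union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv
                    components -= 1
    return components == 1


def connected_fraction(n, p, trials, seed=0):
    rng = random.Random(seed)
    return sum(is_connected(n, p, rng) for _ in range(trials)) / trials


n = 200
threshold = math.log(n) / n  # classical connectivity threshold for G(n, p)
below = connected_fraction(n, 0.3 * threshold, trials=20)
above = connected_fraction(n, 3.0 * threshold, trials=20)
print(below, above)
```

The jump from "almost never connected" to "almost always connected" over a small range of p is the kind of sharp threshold the paper exploits for memory sizing.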

Details

There are m elements, mapped into n = cm buckets (a bucket is a storage unit that can hold one element).
When c = 1 there are many collisions, and reducing them normally requires a large c. This scheme eliminates collisions without increasing memory consumption.
There are two key questions:

How to map the elements so that collisions are eliminated
How to store the mapping in very little memory
There are two key techniques:
hyper mapping
coloring embedding
We first map all elements to a high dimensional space to almost eliminate hashing collisions, and then we perform dimensional reduction to embed the high dimension space into a low dimension space
Suppose there are m elements, we first map them to an empty graph with cm nodes and $(cm/2)^2$ edge slots, where c is recommended to be 2.2. Each element is mapped to an edge slot to build an edge, and the set ID is recorded on the edge. Then we embed the graph with $(cm/2)^2$ edge slots into a node vector with cm nodes, while keeping the recorded set IDs of all elements accurate.

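First, to be clear: a coloring embedder consists of an adjacency list plus a node array. Is a coloring embedder the same thing as a coloring classifier?

m is the number of elements (edges); n is the number of buckets (nodes).

Query

Only the node array is consulted: if the two buckets obtained by hashing have different colors, the element belongs to $S^+$; otherwise it belongs to $S^-$.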
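The query rule is short enough to write down directly. In this sketch the hash functions (salted SHA-256) and the offset trick that keeps the two buckets distinct are my own assumptions; the paper's actual hash choice is not recorded in these notes.

```python
import hashlib


def bucket(item, seed, num_buckets):
    """Hypothetical hash: SHA-256 salted with a seed, reduced to a bucket index."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def endpoints(item, num_buckets):
    """Map an item to an edge (u, v) with u != v (assumed offset trick)."""
    u = bucket(item, 1, num_buckets)
    v = (u + 1 + bucket(item, 2, num_buckets - 1)) % num_buckets
    return u, v


def query(node_array, item):
    """Different colors at the two buckets means S+, equal colors means S-."""
    u, v = endpoints(item, len(node_array))
    return "S+" if node_array[u] != node_array[v] else "S-"


node_array = [0] * 8              # every node has color 0
before = query(node_array, "e")   # equal colors, so "S-"
u, v = endpoints("e", len(node_array))
node_array[u] = 1                 # recolor one endpoint by hand
after = query(node_array, "e")    # colors now differ, so "S+"
```

Note that the adjacency list is never touched here; it is only needed for construction and updates.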

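Algorithm implementation

RDG coloring algorithm

RDG relies on the fact that a graph without a k-core can always be colored with no more than k colors.
[Figure: examples of a 2-core and a 3-core]
A 2-core like this needs 3 colors

A 3-core like this needs 4 colors

Intuitively, the more edges there are, the more colors are needed. I suppose this conclusion counts as a heuristic.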

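Do we need to find the connected subgraphs first?
CSG denotes a connected subgraph, and RDG stands for Recursively Delete or Give up coloring.
Step 1: contract nodes first. Merge the two endpoints of every negative edge into a single node, so that the graph contains only nodes that must receive different colors. It then becomes a graph coloring problem.
Step 2: if no CSG is left, go to step 5. If the size of the CSG is smaller than the preset value $\theta=16$, go to step 3; otherwise go to step 4.
Step 3: the CSG is small, so color it directly with depth-first search. If coloring succeeds, delete this CSG and return to step 2; if it fails, report that the graph cannot be colored with 4 colors.
Step 4: if no node has degree less than 4, declare the CSG a 4-core and terminate. Otherwise push all nodes of degree less than 4 onto the stack, delete them from the CSG, and return to step 2. (How exactly are they deleted? What if the graph becomes disconnected after the deletion? The edges get deleted too, so nodes whose degree was >= 4 will see it decrease.)
Step 5: pop the nodes off the stack one by one and color them in that order.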
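Step 1 is a natural fit for union-find. The sketch below is my own reading of the contraction step, with hypothetical names: negative edges merge nodes, positive edges become the adjacency of the contracted graph, and a positive edge that ends up inside a merged node is reported as an error.

```python
def contract_negative_edges(num_nodes, positive_edges, negative_edges):
    """Merge endpoints of negative edges; return (find, adjacency of merged graph)."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in negative_edges:
        parent[find(u)] = find(v)

    # One adjacency entry per merged node (keyed by its representative).
    adjacency = {find(x): set() for x in range(num_nodes)}
    for u, v in positive_edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            # Both endpoints must share a color and differ in color: impossible.
            raise ValueError(f"positive edge ({u}, {v}) lies inside a merged node")
        adjacency[ru].add(rv)
        adjacency[rv].add(ru)
    return find, adjacency


find, adjacency = contract_negative_edges(
    num_nodes=5,
    positive_edges=[(2, 3), (3, 4)],
    negative_edges=[(0, 1), (1, 2)],
)
print(adjacency)
```

After coloring the contracted graph, an original node x takes the color of `find(x)`.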

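Correctness: once step 5 is reached, the graph can certainly be colored with four colors. For a conflict to arise in step 5, a node would need at least four already-colored neighbors, i.e. its degree would have to be at least 4, but nodes of degree 4 or more never enter the stack.

Complexity: the time to process a node is proportional to its number of edges, and each edge is processed at most twice (once per endpoint), so the total time complexity is O(n+m). All nodes and edges must be stored, plus a stack of at most n elements, so the total space complexity is O(n+m).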
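The peel-then-color core of steps 4 and 5 can be sketched as follows. This is a simplified reading on my part: it skips the depth-first shortcut for small CSGs (step 3) and treats the whole graph at once, and the function name is hypothetical. It also shows one answer to the question raised in step 4: deleting a node lowers its neighbors' degrees, so they may become removable later.

```python
def peel_and_color(adjacency, num_colors=4):
    """adjacency: dict node -> set of neighbors.

    Returns dict node -> color, or None if a num_colors-core remains.
    """
    degree = {u: len(nbrs) for u, nbrs in adjacency.items()}
    removed = set()
    stack = []
    queue = [u for u, d in degree.items() if d < num_colors]
    while queue:
        u = queue.pop()
        if u in removed:
            continue
        removed.add(u)
        stack.append(u)
        for w in adjacency[u]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] == num_colors - 1:
                    queue.append(w)  # w just became removable
    if len(removed) < len(adjacency):
        return None  # every remaining node has degree >= num_colors
    colors = {}
    while stack:
        u = stack.pop()  # reverse removal order
        used = {colors[w] for w in adjacency[u] if w in colors}
        colors[u] = next(c for c in range(num_colors) if c not in used)
    return colors


triangle_plus_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coloring = peel_and_color(triangle_plus_tail)
k5 = {u: {v for v in range(5) if v != u} for u in range(5)}
stuck = peel_and_color(k5)  # K5 is a 4-core, so this gives up
```

When a node is popped, its already-colored neighbors are exactly those that were still present when it was removed, and there were fewer than four of them, so a free color always exists.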

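RDG update algorithm

The algorithm is named RDN (Recursively Delete Neighbor):
When a node needs to change color and no spare color is available for it, gather all of its neighbors into a new subgraph and run the RDG coloring algorithm on it. If RDG fails, add the neighbors' neighbors to the subgraph and run it again; repeat until coloring succeeds. If the subgraph cannot be extended any further and coloring still has not succeeded, then a 4-core is found and the RDG updating algorithm fails.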
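The grow-the-neighborhood idea can be sketched like this. To keep it short I swap RDG for plain backtracking inside the region, so this illustrates the expansion loop only, not the paper's update procedure; all names are hypothetical.

```python
def _backtrack(adjacency, trial, order, i, num_colors):
    """Try to color order[i:] given the colors already fixed in trial."""
    if i == len(order):
        return True
    u = order[i]
    used = {trial[w] for w in adjacency[u] if w in trial}
    for c in range(num_colors):
        if c not in used:
            trial[u] = c
            if _backtrack(adjacency, trial, order, i + 1, num_colors):
                return True
            del trial[u]
    return False


def recolor_region(adjacency, colors, start, num_colors=4):
    """Recolor a growing neighborhood of start; colors is updated on success."""
    region = {start}
    while True:
        # Colors outside the region stay fixed; the region is recolored.
        trial = {u: c for u, c in colors.items() if u not in region}
        if _backtrack(adjacency, trial, sorted(region), 0, num_colors):
            colors.update({u: trial[u] for u in region})
            return True
        grown = region | {w for u in region for w in adjacency[u]}
        if grown == region:
            return False  # nothing left to add: give up
        region = grown


# A path 0-1-2 colored 0,1,0 gets a new positive edge 0-2, creating a conflict.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
colors = {0: 0, 1: 1, 2: 0}
ok = recolor_region(adjacency, colors, start=0)
```

In the common case the first iteration (the node alone) already finds a spare color, so the expensive expansions are rare.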

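Update algorithm for more than two sets

Two schemes are proposed:

Scheme 1

It uses an encoding method plus a one-memory-access layout to organize multiple coloring embedders.

Ugh, I don't feel like typing out the formulas, orz.

Constructing the coded coloring embedder

Use log[s] bits (I suspect this is a typo for ceiling(log s)) to encode the s sets. One coloring embedder records one bit of the binary set ID, so only log[s] coloring embedders are needed. If the i-th bit is 1, the i-th coloring embedder records the element with a positive edge; otherwise it uses a negative edge. These log[s] coloring embedders together are called a coded coloring embedder.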

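Query

iRoy's thinking-out-loud time: suppose e belongs to set 5, encoded as 101. Then draw a positive edge in the first coloring embedder, a negative edge in the second, and a positive edge in the third CE (coloring embedder). How do we query? Go to the first coloring embedder, where $hash_1(e)$ and $hash_2(e)$ give two buckets; if their colors differ, the first bit of the set ID is 1, otherwise it is 0, and so on. But this way we need to access 3 CEs, each time only to look up the two buckets given by $hash_1(e)$ and $hash_2(e)$, which obviously slows things down.
Let node array[][] be indexed so that the first dimension is the bit position of the set ID and the second dimension is the bucket.
Say $hash_1(e)=0$ and $hash_2(e)=2$. Then I check buckets 0 and 2 in every coloring embedder: the first bit of the set ID depends on whether node array[0][0] and node array[0][2] are the same, the second bit on node array[1][0] and node array[1][2], the third bit on node array[2][0] and node array[2][2], and so on.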
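The set-5 walk-through above translates into a few lines. The helper names and the hand-picked colors are mine; bit 0 is taken as the most significant bit to match the 101 example.

```python
def set_id_to_signs(set_id, num_bits):
    """Sign per embedder: 1 = positive edge, 0 = negative edge (MSB first)."""
    return [(set_id >> (num_bits - 1 - i)) & 1 for i in range(num_bits)]


def decode_set_id(node_arrays, u, v):
    """node_arrays[i][b] is the color of bucket b in coloring embedder i."""
    set_id = 0
    for colors in node_arrays:
        bit = 1 if colors[u] != colors[v] else 0
        set_id = (set_id << 1) | bit
    return set_id


signs = set_id_to_signs(5, 3)  # [1, 0, 1]

# Hand-built colors for buckets 0..2 in three embedders, consistent with signs:
node_arrays = [
    [0, 1, 1],  # embedder 0: buckets 0 and 2 differ  -> bit 1
    [2, 0, 2],  # embedder 1: buckets 0 and 2 match   -> bit 0
    [3, 2, 0],  # embedder 2: buckets 0 and 2 differ  -> bit 1
]
decoded = decode_set_id(node_arrays, u=0, v=2)  # 5
```

Each loop iteration reads from a different embedder, which is the extra memory traffic the next section removes.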

One Memory Access Technique improvement

Next, the One Memory Access Technique further speeds up queries. In the implementation above, log[s] CEs must be accessed, which slows down insertion, query, and deletion. The solution is to reorganize the layout.
Word[0]: node array[0][0], node array[1][0], node array[2][0], and so on
Word[1]: node array[0][1], node array[1][1], node array[2][1]
Word[2]: node array[0][2], node array[1][2], node array[2][2]
Each such Word only needs to hold log[s] entries, and a query touches just two Words. With $hash_1(e)=0$ and $hash_2(e)=2$, two reads suffice: Word[0] and Word[2].
The above coded coloring classifier with one memory access technique can represent more than 2 sets and reduce the number of memory accesses to 2. However, because many coloring classifiers are used, they ***suffer from load balancing problem.*** ????
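A sketch of the Word layout, assuming each color takes 2 bits (four colors) and that all colors of one bucket fit in a single integer. The packing order and names are my own choices.

```python
BITS_PER_COLOR = 2  # four colors fit in two bits


def pack_words(node_arrays):
    """Word[b] concatenates the color of bucket b from every embedder."""
    num_buckets = len(node_arrays[0])
    words = []
    for b in range(num_buckets):
        word = 0
        for colors in node_arrays:  # embedder 0 ends up in the top field
            word = (word << BITS_PER_COLOR) | colors[b]
        words.append(word)
    return words


def query_packed(words, u, v, num_embedders):
    """Decode the set ID from two word reads: words[u] and words[v]."""
    diff = words[u] ^ words[v]  # a nonzero 2-bit field means the colors differ
    set_id = 0
    for i in range(num_embedders):
        shift = BITS_PER_COLOR * (num_embedders - 1 - i)
        field = (diff >> shift) & 0b11
        set_id = (set_id << 1) | (1 if field else 0)
    return set_id


node_arrays = [
    [0, 1, 1],
    [2, 0, 2],
    [3, 2, 0],
]
words = pack_words(node_arrays)
set_id = query_packed(words, u=0, v=2, num_embedders=3)  # 5
```

The XOR compares all log[s] color pairs at once, so the number of memory reads no longer depends on the number of sets.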

Scheme 2

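There is only one graph, and each element inserts $\log_2 s$ edges into it. For e ∈ S_5 (101): connect a positive edge (1) between positions $hash_1(e)$ and $hash_2(e)$, a negative edge (0) between $hash_1(e)+1$ and $hash_2(e)+1$, and a positive edge (1) between $hash_1(e)+2$ and $hash_2(e)+2$.

At query time, check whether the colors at $hash_1(e)$ and $hash_2(e)$ are the same, whether those at $hash_1(e)+1$ and $hash_2(e)+1$ are the same, and whether those at $hash_1(e)+2$ and $hash_2(e)+2$ are the same. When log2 s is smaller than the length of a machine word, we can answer multi-set query with only two memory accesses. So this is also two accesses, and there is no load balancing problem.

But done this way, won't there be cycles? And what about edges that map onto each other, so that two nodes in the node array are joined by both a positive and a negative edge?
That leads to the collision analysis below.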
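Scheme 2 in code form. The wrap-around with a modulo at the end of the node array is an assumption of mine (the notes do not say what happens at the boundary), and the function names are hypothetical.

```python
def edges_for_element(h1, h2, set_id, num_bits, num_buckets):
    """Return (u, v, sign) triples; sign 1 = positive edge, 0 = negative edge."""
    edges = []
    for i in range(num_bits):
        bit = (set_id >> (num_bits - 1 - i)) & 1  # MSB first, as in 101
        edges.append(((h1 + i) % num_buckets, (h2 + i) % num_buckets, bit))
    return edges


def query_shifted(node_array, h1, h2, num_bits):
    """Read num_bits consecutive buckets starting at h1 and at h2."""
    n = len(node_array)
    set_id = 0
    for i in range(num_bits):
        differ = node_array[(h1 + i) % n] != node_array[(h2 + i) % n]
        set_id = (set_id << 1) | int(differ)
    return set_id


edges = edges_for_element(h1=0, h2=4, set_id=5, num_bits=3, num_buckets=8)
# A hand-built coloring that satisfies those three edges:
node_array = [0, 1, 2, 0, 1, 1, 0, 0]
set_id = query_shifted(node_array, h1=0, h2=4, num_bits=3)  # 5
```

Because the buckets read for one element are contiguous, the two runs starting at $hash_1(e)$ and $hash_2(e)$ can each be fetched in one access when they fit in a machine word.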

analysis

collision error

Two edges have different attributes but overlap.
All nodes in $N^-$ should share the same color, yet a positive edge appears among them.

Simple case:a positive edge overlaps with a negative edge.

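Let there be $m_+$ non-overlapping positive edges and $m_-$ non-overlapping negative edges. The probability that a given negative edge collides with any of the positive edges is $m_{+} /\binom{n}{2}=2 m_{+} /[n(n-1)]$ (the probability of hitting one particular edge slot is $1/\binom{n}{2}=2 /[n(n-1)]$, and there are $m_{+}$ such slots, hence the product). When computing the upper bound on the expected number of collision errors, we can directly sum up the probability of collisions for all negative edges because they follow a binomial distribution: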
$E(\text {collision}) \leqslant \frac{m_{+} m_{-}}{\left(\begin{array}{c}{n} \\ {2}\end{array}\right)}=\frac{2 m_{+} m_{-}}{n(n-1)}$
A lower bound on the probability that no collision happens is
$P(\text {no collision}) \geqslant\left(1-\frac{m_{+}}{\binom{n}{2}}\right)^{m_{-}} \approx\left(1-\frac{2 m_{+}}{n^{2}}\right)^{\frac{n^{2}}{2 m_{+}} \times \frac{2 m_{+}}{n^{2}} m_{-}} \approx \mathrm{e}^{-\frac{2 m_{+} m_{-}}{n^{2}}}$
Note the first $\approx$: the formula as printed has $=$ there, but $\binom{n}{2}$ has been replaced by $n^{2}/2$, so one $\approx$ was missing.
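Plugging numbers into the simple-case formulas shows how close the exponential approximation is. The values of n, $m_+$ and $m_-$ below are arbitrary illustrative choices, and the function name is mine.

```python
import math


def simple_collision_bounds(n, m_pos, m_neg):
    """Evaluate the simple-case bounds for given n, m_+ and m_-."""
    pairs = n * (n - 1) / 2                         # edge slots, C(n, 2)
    expected = m_pos * m_neg / pairs                # upper bound on E(collision)
    p_none = (1 - m_pos / pairs) ** m_neg           # lower bound on P(no collision)
    approx = math.exp(-2 * m_pos * m_neg / n ** 2)  # exponential approximation
    return expected, p_none, approx


expected, p_none, approx = simple_collision_bounds(n=2200, m_pos=500, m_neg=500)
print(expected, p_none, approx)
```

For these inputs the exact product and the exponential form agree to about three decimal places.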

general cases

Consider this case: two nodes are indirectly connected by a list of continuous negative edges, and are at the same time directly connected by a positive edge.

For convenience, we name the list of continuous negative edges as an equivalent negative edge. We denote the number of equivalent negative edges by $m_{-}^{\prime}$. Given three nodes, the probability that two of them are directly connected by a negative edge is $\frac{3}{n} \times \frac{2}{n} m_{-}$. I don't immediately see why: perhaps the first endpoint can be any of the 3 nodes out of n, giving $\frac{3}{n}$, but then why is the second denominator still n? The probability that another pair of these nodes is also directly connected by a negative edge is $\frac{2}{n} \times \frac{1}{n}\left(m_{-}-1\right)$, so the probability that the three nodes are connected by two negative edges is

$\frac{12 m_{-}\left(m_{-}-1\right)}{n^{4}}$

Over all n nodes, the expected number of equivalent negative edges formed by three nodes is:
$m_{-(3)}^{\prime}=\frac{12 m_{-}^{2}}{n^{4}}\binom{n}{3} \approx \frac{2 m_{-}^{2}}{n}$
The expected number of equivalent negative edges formed by more nodes follows by analogy. Given v nodes, the probability that they are connected by v-1 negative edges is (v nodes need v-1 edges to form an equivalent chain)
$\frac{v !}{2}\left(\frac{2 m_{-}}{n^{2}}\right)^{v-1}$
So the number of equivalent negative edges formed by v of the n nodes is
$m_{-(v)}^{\prime}=\frac{2^{v-2} v ! m_{-}^{v-1}}{n^{2 v-2}}\binom{n}{v} \approx \frac{2^{v-2} m_{-}^{v-1}}{n^{v-2}}$
From equation 4, we can find that the number of equivalent negative edges is approximate to a geometric progression when the value of v increases. Only the result for $n>2m_{-}$ is shown.

$m_{-}^{\prime}=\sum_{v=3}^{n} m_{-(v)}^{\prime} \approx \sum_{v=3}^{n} \frac{2^{v-2} m_{-}^{v-1}}{n^{v-2}} \approx \frac{2 m_{-}^{2}}{n-2 m_{-}}, \quad n>2 m_{-}$
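As far as I can tell, the closed form comes from a geometric series with ratio $r = 2 m_{-}/n < 1$, with the upper limit pushed to infinity (that extension is my assumption about how the last $\approx$ is obtained):

$\sum_{v=3}^{\infty} \frac{2^{v-2} m_{-}^{v-1}}{n^{v-2}} = m_{-} \sum_{j=1}^{\infty} r^{j} = m_{-} \cdot \frac{r}{1-r} = m_{-} \cdot \frac{2 m_{-}/n}{\left(n-2 m_{-}\right)/n} = \frac{2 m_{-}^{2}}{n-2 m_{-}}$

The condition $n>2m_{-}$ is exactly $r<1$, which the series needs in order to converge.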
Replace $m_{-}$ in the simple-case derivation with $m_{-}+m_{-}^{\prime}$:
$E(\text {collision}) \leqslant \frac{2 m_{+}\left(m_{-}+m_{-}^{\prime}\right)}{n(n-1)}=\frac{2 m_{+} m_{-}}{(n-1)\left(n-2 m_{-}\right)}$
$P(\text {no collision}) \approx \mathrm{e}^{-\frac{2}{n(n-1)} m_{+}\left(m_{-}+m_{-}^{\prime}\right)}=\mathrm{e}^{-\frac{2 m_{+} m_{-}}{(n-1)\left(n-2 m_{-}\right)}}$
If $n<2m_{-}$, the graph can hardly be colored successfully.
The last two formulas show that when the n/m ratio is fixed, the expectation of the number of collision errors and the probability that no collision error happens are not influenced by the graph size.

When n/m ratio is larger than 1.1, which means each element uses more than 1.1 × 2 = 2.2 bits, the expectation of the number of collision errors is less than 5, and the probability that no collision error happens is larger than 50%.
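A quick numeric check of the general-case expectation, under the assumption (mine) of an even split $m_+ = m_- = m/2$ at $n/m = 1.1$. It illustrates two points from the text: the bound stays below 5, and it barely moves when the graph grows by a factor of 100.

```python
def general_collision_expectation(n, m_pos, m_neg):
    """Upper bound 2 m+ m- / ((n - 1)(n - 2 m-)), valid only for n > 2 m-."""
    if n <= 2 * m_neg:
        raise ValueError("the formula requires n > 2 * m_neg")
    return 2 * m_pos * m_neg / ((n - 1) * (n - 2 * m_neg))


results = {}
for m in (10_000, 1_000_000):
    n = m * 11 // 10  # n/m = 1.1
    results[m] = general_collision_expectation(n, m // 2, m // 2)
print(results)
```

Both sizes give a value of roughly 4.5, consistent with the "less than 5" claim.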

color error

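The chromatic number of the graph is greater than 4, so coloring fails.
In the RDG coloring algorithm, coloring is abandoned as soon as a 4-core is found in the graph. Theories about k-cores in random graphs are established in [32]
$c_{k}=k+\sqrt{k \log k}+O(\log k)$. When k>=3 and n is large, if $m_+$ is greater than $c_kn/2$, a giant k-core exists with high probability; if $m_+$ is less than $c_kn/2$, there is no k-core.
When there are no negative edges, for the 4-core the n/m ratio threshold is 0.389. With negative edges the graph is no longer a random graph, so the results of [32] do not apply: the endpoints of negative edges are contracted into a single node, this merged node has many neighbors, the subgraph around it becomes dense, and a 4-core is then likely to appear. So when the proportion of negative edges is large, the n/m ratio should be larger as well. In the worst case, when the negative edges account for 50% of all edges, the n/m ratio threshold is 1.10 according to our experiments. In conclusion, we need no more than 1.10 × 2 = 2.20 bits per element to build a coloring embedder to ensure that no color error happens.

This is really a discussion of how many nodes the node array should have. Fascinating.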
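The 0.389 figure can be reproduced numerically. The sketch below uses the exact characterization of the k-core threshold that I remember from the random-graph literature, $c_k = \min_{\lambda>0} \lambda / P(\mathrm{Poisson}(\lambda) \geq k-1)$; treat that formula as an assumption on my part, since the notes only quote the asymptotic form. With all edges positive (m = $m_+$), the condition $m < c_k n/2$ becomes $n/m > 2/c_k$.

```python
import math


def poisson_tail(lam, j):
    """P(Poisson(lam) >= j)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(j))
    return 1.0 - below


def k_core_threshold(k, lo=0.1, hi=50.0, steps=50_000):
    """Grid-search the minimum of lam / P(Poisson(lam) >= k - 1) over [lo, hi]."""
    best = float("inf")
    for s in range(steps + 1):
        lam = lo + (hi - lo) * s / steps
        best = min(best, lam / poisson_tail(lam, k - 1))
    return best


c4 = k_core_threshold(4)
ratio_threshold = 2 / c4
print(round(c4, 3), round(ratio_threshold, 3))
```

The result is a ratio of about 0.388, matching the quoted 0.389 up to rounding, and far below the 1.10 needed once half of the edges are negative.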

experimental results

experimental setup

dataset

MACTable: the entry number is used as the key; use the type field (static or dynamic) to determine the set.
MachineLearning: the training set is used as the dataset.
DBLP: We use the key attribute as our key. We use the records of articles as S+ and the records of inproceedings as S-. DBLP (DataBase systems and Logic Programming) is an author-centered bibliographic database of English-language computer science research.
Synthetic data: random strings are generated as keys. The synthetic dataset is built to examine how the data structure's performance changes with the proportion of S-. We argue that for data structures using hash functions, including the coloring embedder, real datasets and synthetic datasets have no difference.

the state-of-art implementation

Baselines compared:
Multiple Bloom filter (MultiBF): assemble the BFs, each one representing one set.
Coded Bloom filter (CodedBF): a commonly used BF variant using multiple filters; set IDs are converted to binary codes and stored in the BFs.
Shifting Bloom filter（ShiftBF）：