KDD 2018 data stream paper "HeavyGuardian: Separate and Guard Hot Items in Data Streams"

Latest recommended article published on 2022-05-25 12:23:13

玛卡巴卡米卡巴卡

Article link: https://blog.csdn.net/travalscx/article/details/84673653

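I recently went through the basics of data streams and wanted to read a paper on the topic. Among the papers accepted at KDD 2018 I spotted one from PKU, so I picked it without hesitation.
"HeavyGuardian: Separate and Guard Hot Items in Data Streams"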

Contributions of the Paper

The paper builds a new data structure that solves five kinds of problems faster and more accurately:

frequency estimation
heavy hitter detection
heavy change detection
frequency distribution estimation
entropy estimation

frequency estimation

The paper explains this problem as follows:
Frequency estimation is to estimate the number of appearances of any item in a data stream.
That is, estimate how many times a given item appears in the stream.

heavy hitter detection

Heavy hitter detection is to find those items whose frequencies are larger than a predefined threshold.
That is, find the items whose frequency exceeds a preset threshold.
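
As a point of reference, the two definitions so far can be written down exactly when memory is not a concern. This is only a minimal baseline sketch that stores every item, which is precisely what a streaming structure tries to avoid; the function name `heavy_hitters` is made up for illustration.

```python
from collections import Counter

def heavy_hitters(stream, threshold):
    """Return {item: frequency} for items whose exact frequency exceeds threshold."""
    # Exact frequency of every item: the quantity that frequency estimation approximates.
    freq = Counter(stream)
    return {item: count for item, count in freq.items() if count > threshold}

stream = ["a", "b", "a", "c", "a", "b", "d"]
print(heavy_hitters(stream, threshold=1))  # {'a': 3, 'b': 2}
```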

heavy change detection

Heavy change detection is to find those items whose frequencies change drastically in two adjacent time windows
That is, find the items whose frequency changes sharply between two adjacent windows of the stream.
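
An exact version of this definition might look like the sketch below. It assumes that "change" means the absolute difference of an item's counts in the two windows and that an item qualifies when the difference reaches a threshold; both the function name and that convention are illustrative choices.

```python
from collections import Counter

def heavy_changes(window_a, window_b, threshold):
    """Return {item: |count in A - count in B|} for items whose change is >= threshold."""
    freq_a, freq_b = Counter(window_a), Counter(window_b)
    changes = {}
    # An item may appear in only one window; Counter returns 0 for missing keys.
    for item in freq_a.keys() | freq_b.keys():
        delta = abs(freq_a[item] - freq_b[item])
        if delta >= threshold:
            changes[item] = delta
    return changes

window_a = ["a"] * 5 + ["b"] * 2
window_b = ["a"] * 1 + ["b"] * 2 + ["c"] * 4
print(sorted(heavy_changes(window_a, window_b, threshold=3).items()))  # [('a', 4), ('c', 4)]
```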

frequency distribution estimation

Frequency distribution estimation is to estimate the number of distinct items whose frequencies are equal to any given value.
That is, estimate how many distinct items have a frequency equal to a given value.
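
Computed exactly, the frequency distribution is just a "count of counts". The helper below is an illustrative baseline, not the paper's estimator.

```python
from collections import Counter

def frequency_distribution(stream):
    """Map each frequency value to the number of distinct items with that frequency."""
    freq = Counter(stream)          # item -> frequency
    return Counter(freq.values())   # frequency -> number of distinct items

dist = frequency_distribution(["a", "b", "a", "c", "a", "b", "d"])
print(dist[1])  # 2  (items "c" and "d" each appear once)
```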

entropy estimation

Entropy estimation is to estimate the entropy of a data stream in real time or within fixed-size time windows
That is, estimate the entropy of the items in the stream, either in real time or within a given window.
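
For intuition, here is an exact computation under the assumption that "entropy" means the empirical Shannon entropy of the item frequencies, in bits. The text above does not fix the logarithm base, so base 2 is an assumption of this sketch.

```python
import math
from collections import Counter

def stream_entropy(stream):
    """Empirical Shannon entropy (base 2) of the items in a window."""
    freq = Counter(stream)
    n = sum(freq.values())
    # Sum over distinct items of -(f/n) * log2(f/n); an empty window yields 0.
    return -sum((c / n) * math.log2(c / n) for c in freq.values())

print(stream_entropy(["a", "b", "a", "b"]))  # 1.0
```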

INTRODUCTION

Background

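The authors argue that the five problems above, like other data stream problems, all focus on hot items. Hot items are the minority in a data stream and cold items are the majority. Because memory is limited, recording a large number of cold items is wasteful, so the key question is how to decide whether an arriving item is hot or cold.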

Existing data structures follow a record-all-evict-cold strategy (first record frequencies of all items, and then evict cold items).
Several methods use this strategy. In the Background section the authors describe two of them, Space-Saving and the Augmented sketch, and summarize their key points.
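
To make the strategy concrete, here is a minimal sketch of Space-Saving in its textbook form. It uses a plain dict with a linear scan for the minimum rather than the Stream-Summary structure normally used for speed, and the function name and parameter `k` are illustrative.

```python
def space_saving(stream, k):
    """Track at most k items; a newcomer evicts the minimum and inherits its count + 1."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the item with the smallest count. The newcomer starts from
            # that count plus one, which is why reported counts can be too high.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

print(space_saving(list("aabcad"), k=2))  # {'a': 3, 'd': 3}
```

In this run "d" appears only once but is reported with count 3, a small instance of the overestimation that record-all-evict-cold methods can exhibit.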

Motivation

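The authors then explain the drawbacks of the record-all-evict-cold strategy:

Memory is used inefficiently: memory holds items that are not hot but have not yet been evicted, and the frequency of every item is recorded.
Some information about hot items is not recorded well: the recorded frequencies of hot items are inaccurate and tend to be overestimated.
The paper therefore proposes a new data structure: HeavyGuardian.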

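HeavyGuardian Overview

separate-and-guard-hot
The structure separates hot items from cold items. It spends most of its memory on information about hot items and only a little on the frequencies of cold items.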

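Describing the Data Structure

The task is framed as follows: given a data stream, how can a small number of cells record the hottest item and its frequency?
Two cells store the two hottest items and their frequencies. These two cells are called: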

king (the hottest item)
guardian (the second hottest item)

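Update rules:

If the arriving item is a supporter (i.e., it is the same as the king or the guardian), increase the frequency of the corresponding king/guardian.
If the arriving item is a rebel (i.e., it is neither the king nor the guardian), decrease the guardian's frequency f with a biased probability.
As the guardian's frequency f grows linearly, this biased probability drops exponentially.
If the guardian is "killed" (i.e., its frequency reaches 0), it is removed and the newly arrived item becomes the guardian. If a guardian gains many supporters, it may become the king.
Furthermore, when multiple cells are used, a king can have more than one guardian, for example two.
There are also two improvements:
(1) Split the data stream into small sub-streams and elect a king and some guardians for each sub-stream.
(2) Use small counters to record the frequencies of rebels.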
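
The basic two-cell rules can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the class name `KingAndGuardian`, the decay base `b = 1.08`, the injectable `rng`, and the "swap when the guardian overtakes the king" promotion rule are all choices made here to get something runnable.

```python
import random

class KingAndGuardian:
    """Two cells, each stored as [item_id, count]; an item_id of None means empty."""

    def __init__(self, b=1.08, rng=None):
        self.b = b                          # decay base (assumed value, must be > 1)
        self.rng = rng or random.Random()
        self.king = [None, 0]
        self.guardian = [None, 0]

    def insert(self, item):
        if self.king[0] is None:            # first item ever seen becomes the king
            self.king = [item, 1]
        elif item == self.king[0]:          # supporter of the king
            self.king[1] += 1
        elif self.guardian[0] is None:      # empty guardian cell
            self.guardian = [item, 1]
        elif item == self.guardian[0]:      # supporter of the guardian
            self.guardian[1] += 1
            if self.guardian[1] > self.king[1]:
                # Enough supporters: the guardian becomes the king.
                self.king, self.guardian = self.guardian, self.king
        else:                               # rebel
            # Decay with probability b^(-f), exponentially small for large f.
            if self.rng.random() < self.b ** (-self.guardian[1]):
                self.guardian[1] -= 1
                if self.guardian[1] == 0:   # guardian "killed": the rebel replaces it
                    self.guardian = [item, 1]

kg = KingAndGuardian(rng=random.Random(0))
for item in ["x", "y", "y"]:
    kg.insert(item)
print(kg.king, kg.guardian)  # ['y', 2] ['x', 1]
```

Because the decay probability shrinks quickly as f grows, a guardian with many supporters is very hard for occasional rebels to dislodge, while a guardian with a count of 1 or 2 is replaced easily.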

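The HeavyGuardian Data Structure

Uses hashing

A hash table A has w buckets; the i-th bucket is A[i].
Each bucket A[i] is divided into a heavy part and a light part with $\lambda_h + \lambda_l$ cells in total ($\lambda_h$ in the heavy part and $\lambda_l$ in the light part); the j-th cell is A[i][j].
Each cell is a key-value pair: the key is the item ID and the value is a counter.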
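
One possible in-memory layout for this description is sketched below. The names `Cell`, `Bucket`, and `make_table` are hypothetical, and every cell is modeled as an (ID, counter) pair exactly as stated above.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    item_id: object = None   # key: the item ID (None means the cell is empty)
    count: int = 0           # value: the counter

@dataclass
class Bucket:
    heavy: list = field(default_factory=list)   # lambda_h cells for hot items
    light: list = field(default_factory=list)   # lambda_l cells for cold items

def make_table(w, lambda_h, lambda_l):
    """Build the hash table A with w buckets, all cells empty and all counters 0."""
    return [
        Bucket(heavy=[Cell() for _ in range(lambda_h)],
               light=[Cell() for _ in range(lambda_l)])
        for _ in range(w)
    ]

A = make_table(w=4, lambda_h=2, lambda_l=3)
print(len(A), len(A[0].heavy), len(A[0].light))  # 4 2 3
```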

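Update Process

Initialization

All fields of every bucket are set to 0.

Insertion

A hash function h maps the arriving item e to the bucket A[h(e)].

When e is already in the heavy part of A[h(e)] (as the king or a guardian): increment its frequency (the count field).
When e is not in the heavy part of A[h(e)] and there is an empty cell: place e in the empty cell, setting the ID to e and count = 1.
When e is not in the heavy part of A[h(e)] and there is no empty cell: apply the Exponential Decay technique:
With probability $p = b^{-c}$, decrement the count of the weakest guardian by 1. Here b is a predefined constant greater than 1 and c is the current count of the weakest guardian. This is the rule described earlier: as the guardian's frequency f (its count) grows linearly, the biased probability drops exponentially.
Once the weakest guardian's count reaches 0, its cell is taken over by e, with the ID set to e and count = 1.
When e maps to a cell in the light part, increment that cell's count by 1.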
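
Putting the insertion cases together, a compact sketch could look like this. Several details are assumptions made only to get runnable code and should not be read as the paper's implementation: the class name `HeavyGuardianSketch`, the default `b = 1.08`, the use of `zlib.crc32` as the hash function, treating the minimum-count heavy cell as the weakest guardian, and modeling the light part as plain counters indexed by a second hash that is updated whenever e fails to enter the heavy part.

```python
import random
import zlib

class HeavyGuardianSketch:
    def __init__(self, w, lambda_h, lambda_l, b=1.08, seed=None):
        if lambda_h < 1:
            raise ValueError("lambda_h must be at least 1")
        self.w, self.b = w, b
        self.rng = random.Random(seed)
        # Heavy part: w buckets of lambda_h cells, each cell is [item_id, count].
        self.heavy = [[[None, 0] for _ in range(lambda_h)] for _ in range(w)]
        # Light part (assumption): lambda_l plain counters per bucket.
        self.light = [[0] * lambda_l for _ in range(w)]

    @staticmethod
    def _hash(item, salt):
        # Deterministic stand-in hash; any reasonably uniform hash would do.
        return zlib.crc32(f"{salt}:{item}".encode())

    def insert(self, e):
        i = self._hash(e, "bucket") % self.w
        cells = self.heavy[i]
        for cell in cells:                  # e already in the heavy part
            if cell[0] == e:
                cell[1] += 1
                return
        for cell in cells:                  # not present, but an empty cell exists
            if cell[0] is None:
                cell[0], cell[1] = e, 1
                return
        # No empty cell: exponential decay on the weakest cell.
        weakest = min(cells, key=lambda cell: cell[1])
        if self.rng.random() < self.b ** (-weakest[1]):
            weakest[1] -= 1
            if weakest[1] == 0:             # e takes over the freed cell
                weakest[0], weakest[1] = e, 1
                return
        if self.light[i]:                   # otherwise e is counted in the light part
            j = self._hash(e, "light") % len(self.light[i])
            self.light[i][j] += 1

    def query(self, e):
        i = self._hash(e, "bucket") % self.w
        for item_id, count in self.heavy[i]:
            if item_id == e:
                return count
        if self.light[i]:
            return self.light[i][self._hash(e, "light") % len(self.light[i])]
        return 0

hg = HeavyGuardianSketch(w=1, lambda_h=2, lambda_l=4, seed=0)
for item in ["a"] * 5 + ["b"] * 3 + ["c"]:
    hg.insert(item)
print(hg.query("a"), hg.query("c"))  # 5 1
```

With a single bucket and two heavy cells, "a" and "b" occupy the heavy part. The lone "c" can at most decay the count of "b" from 3 to 2, so it ends up in the light part with a count of 1.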