[大数据文章之其四] 概率数据结构简介

日常前言

  • 五月六月,又陷入反反复复的项目 Bug 中了。讲道理,分析日志是越来越熟练了,代码水平其实没有很大提高,毕竟改 Bug 嘛,大多只是在原有代码的基础上,添加或者修改一些业务逻辑。虽然改原生代码的时候能学到很多东西,但是那些部分很少出现问题,绝大部分还是我们自己人加入、修改的逻辑挖出来的坑。填坑的过程真是漫长又令人心烦。
  • 时间有限,这次的翻译也只选了五个短篇,尽量提高内容质量同时也节省出一些业余时间学习一些其它知识。
  • 不过这次翻译对我来说,收获颇丰。在翻译其中两篇文章的时候,我做了详细的笔记,并且在阅读过程中查阅了不少相关资料,学到了很多东西。
  • 其中一篇是数据可视化的艺术,虽然只是以网页性能分析为例,对各种常用图表作了简单的适用场景的介绍,但是这正是我最近需要了解的内容 —— 由于业务原因,我需要经常接入第三方算法,并评测其性能。然而组内一直都是用打印 Log 的方式去分析性能,很不方便,而且经常会忽略掉一些异常变化。我正需要一些方法提高我们的性能分析效率,而这篇文章则给我指明了方向。
  • 另一篇则是关于概率数据结构的介绍。选择翻译这篇文章是因为看到了 Bloom Filter,这让我想起了大学时给老师打工写爬虫的时光……这次顺势重温了 Bloom Filter,并了解了 HyperLogLog 与 Min-Count Sketch 这两个算法。我总觉得在不久的将来我就会用上它们。
  • 这一期文章依旧采纳了四篇:
  • 说到版权问题,我其实不太清楚我这样中英翻译的方式发 Blog 是不是侵了英文原版的版权了。但是如果不是中英翻译的话,发这些 Blog 就是多此一举了。如果侵权了的话,以后再删掉吧~

版权相关

翻译人:StoneDemo,该成员来自云+社区翻译社
原文链接:Introduction to Probabilistic Data Structures
原文作者:Yin Niu


Introduction to Probabilistic Data Structures

题目:(概率数据结构简介)

When processing large data sets, we often want to do some simple checks, such as number of unique items, most frequent items, and whether some items exist in the data set. The common approach is to use some kind of deterministic data structure like HashSet or Hashtable for such purposes. But when the data set we are dealing with becomes very large, such data structures are simply not feasible because the data is too big to fit in the memory. It becomes even more difficult for streaming applications which typically require data to be processed in one pass and perform incremental updates.

在处理大型数据集时,我们常常要做一些简单的检查,如统计不同项(Unique items)的数量、找出最常见的项,以及判断数据集中是否存在某些指定的项。通常的做法是使用某种确定性的数据结构,如 HashSet(哈希集)或 Hashtable(哈希表)来达此目的。但是当我们所处理的数据集十分巨大时,这样的数据结构完全不可行,因为数据量太大,无法完全装入内存。对于通常需要单遍(One pass)处理数据并执行增量更新的流式应用(Streaming application)来说,这就变得更加困难。

Probabilistic data structures are a group of data structures that are extremely useful for big data and streaming applications. Generally speaking, these data structures use hash functions to randomize and compactly represent a set of items. Collisions are ignored but errors can be well-controlled under certain threshold. Comparing with error-free approaches, these algorithms use much less memory and have constant query time. They usually support union and intersection operations and therefore can be easily parallelized.

概率数据结构(Probabilistic data structures)是一类对大数据和流式应用非常有用的数据结构。一般而言,这类数据结构使用哈希函数(Hash function)来随机化并紧凑地表示一个项的集合。碰撞(Collision)会被忽略,但误差可以很好地控制在一定的阈值之下。与无错方法相比,这些算法使用的内存少得多,并且具有常数级的查询时间。它们通常支持并集(Union)和交集(Intersection)操作,因此可以很容易地并行化。

This article will introduce three commonly used probabilistic data structures: Bloom filter, HyperLogLog, and Count-Min sketch.

本文将介绍三种常用的概率数据结构:Bloom filter(布隆过滤器),HyperLogLog(基数估计法),以及 Count-Min sketch(最小略图)。

Membership Query - Bloom filter

(成员关系查询 —— Bloom filter)

A Bloom filter is a bit array of m bits initialized to 0. To add an element, feed it to k hash functions to get k array position and set the bits at these positions to 1. To query an element, feed it to k hash functions to obtain k array positions. If any of the bits at these positions is 0, then the element is definitely not in the set. If the bits are all 1, then the element might be in the set. A Bloom filter with 1% false positive rate only requires 9.6 bits per element regardless of the size of the elements.

Bloom filter 是一个由 m 个位组成的位阵列(Bit array),每一位都被初始化为 0。要添加一个元素,则将其输入 k 个哈希函数以获取 k 个阵列位置,并将这些位置上的位设置为 1。查询某元素时,同样将其输入 k 个哈希函数来获得 k 个阵列位置。如果这些位置中有任何一位为 0,则该元素必定不在集合中;如果这些位全部为 1,则该元素可能在集合中。无论元素本身的大小如何,每个元素仅需 9.6 个位,就能使 Bloom filter 的误报率(False positive rate)低至 1%。

(图:将 x、y、z 插入 k = 3 的 Bloom filter 并查询 w 的示意图)

For example, if we have inserted x, y, z into the bloom filter, with k=3 hash functions like the picture above. Each of these three elements has three bits each set to 1 in the bit array. When we look up for w in the set, because one of the bits is not set to 1, the bloom filter will tell us that it is not in the set.

例如,如上图所示,我们使用 3 个哈希函数(即 k = 3)将 x、y、z 添加到 Bloom filter 中。这三个元素各自在位阵列中对应三个被置为 1 的位。当我们在集合中查找 w 时,由于其对应的位中有一个未被设置为 1,Bloom filter 会告诉我们它不在集合中。

Bloom filter has the following properties:

  • False positive is possible when the queried positions are already set to 1. But false negative is impossible.
  • Query time is O(k).
  • Union and intersection of bloom filters with same size and hash functions can be implemented with bitwise OR and AND operations.
  • Cannot remove an element from the set.

Bloom filter 具有以下特性:

  • 当查询到的位置都已经设置为 1 时,可能出现误报(False positive)。但漏报(False negative,即把集合中的元素误判为 “不在集合中”)是不可能的。
  • 查询时间是 O(k)。
  • 具有相同大小和散列函数的 Bloom filter 的并集和交集操作,可以通过按位 OR 和 AND 操作来实现。
  • 无法从集合中删除元素。
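上述的插入、查询与并集操作,可以用下面这个极简的 Python 示意实现来演示。需要说明的是:BloomFilter 这个类名是演示用的假设;为保持示例自足,这里用加盐的 SHA-256 模拟 k 个独立哈希函数,而非下文提到的 Murmur 哈希,且并非一个可用于生产环境的实现:

```python
import hashlib


class BloomFilter:
    """极简的 Bloom filter 示意实现(仅用于演示原理)。"""

    def __init__(self, m: int, k: int):
        self.m = m            # 位阵列大小
        self.k = k            # 哈希函数个数
        self.bits = [0] * m   # 位阵列,初始化为 0

    def _positions(self, item):
        # 用加盐的 SHA-256 模拟 k 个独立的哈希函数
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item) -> bool:
        # 任一位置为 0 则必定不在集合中;全部为 1 则可能在集合中
        return all(self.bits[pos] for pos in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        # 相同大小与哈希函数的两个过滤器,其并集即逐位 OR
        out = BloomFilter(self.m, self.k)
        out.bits = [a | b for a, b in zip(self.bits, other.bits)]
        return out


bf = BloomFilter(m=64, k=3)
for x in ("x", "y", "z"):
    bf.add(x)
print(bf.might_contain("x"))  # True(不存在漏报)
print(bf.might_contain("w"))  # 大概率为 False,但存在小概率误报
```

注意 `might_contain` 返回 True 只表示 “可能存在”,这正是误报的来源;而返回 False 则是确定性的结论。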

Bloom filter requires the following inputs:

m: size of the bit array

n: estimated number of insertions

p: false positive probability

布隆过滤器需要以下几种输入:

m:位阵列的大小
n:预计要插入的元素数量(插入次数)
p:误报率

The optimum number of hash functions k can be determined using the formula:

使用以下公式可以确定哈希函数的最佳数量 k:

k = (m / n) × ln 2

Given false positive probability p and the estimated number of insertions n, the length of the bit array can be calculated as:

给定误报率 p 和预计的插入次数 n,位阵列的长度可以通过下式计算:

m = −(n × ln p) / (ln 2)²
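根据这两个公式,可以直接算出给定 n 与 p 时的参数。下面的 Python 小片段(`bloom_params` 是演示用的假设函数名)顺便验证了正文 “1% 误报率下每元素约 9.6 位” 的说法:

```python
import math


def bloom_params(n: int, p: float):
    """由预计插入数 n 和目标误报率 p,计算位阵列长度 m 与最佳哈希函数个数 k。"""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # m = -n·ln p / (ln 2)^2
    k = round((m / n) * math.log(2))                      # k = (m/n)·ln 2
    return m, k


m, k = bloom_params(n=1_000_000, p=0.01)
print(m / 1_000_000, k)  # 每元素约 9.59 位,k = 7
```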

The hash functions used for bloom filter should generally be faster than cryptographic hash algorithms with good distribution and collision resistance. Commonly used hash functions for bloom filter include Murmur hash, fnv series of hashes and Jenkins hashes. Murmur hash is the fastest among them. MurmurHash3 is used by Google Guava library’s bloom filter implementation.

Bloom filter 所使用的哈希函数一般应当比加密哈希算法更快,同时具有良好的分布性与抗碰撞性。Bloom filter 常用的哈希函数包括 Murmur 哈希、fnv 系列哈希,以及 Jenkins 哈希。其中 Murmur 哈希最快。Google Guava 库的 Bloom filter 实现使用的就是 MurmurHash3。

Cardinality - HyperLogLog

(基数 —— HyperLogLog)

HyperLogLog is a streaming algorithm used for estimating the number of distinct elements (the cardinality) of very large data sets. HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the bit pattern observation that for a stream of randomly distributed numbers, if there is a number x with the maximum of leading 0 bits k, the cardinality of the stream is very likely equal to 2^k.

HyperLogLog 是一种流式算法,用于估算极大型数据集中不同元素的个数(即基数)。HyperLogLog 计数器仅用 1.5 KB 内存即可对多达 10 亿个不同项计数,精度约为 2%。它基于这样的位模式观察(Bit pattern observation):对于一个由随机分布的数字组成的流,若其中前导 0 位数最多的数字 x 有 k 个前导 0,则该流的基数很可能等于 2^k。

For each element si in the stream, hash function h(si) transforms si into string of random bits (0 or 1 with probability of 1/2):

对于流中的每个元素 si,哈希函数 h(si) 会将 si 转换为随机比特串(每一位取 0 或 1 的概率各为 1/2):

(图:哈希函数 h(si) 将流中元素映射为随机比特串的示意图)

The probability P of the bit patterns:

0xxxx… → P = 1/2

01xxx… → P = 1/4

001xx… → P = 1/8

位模式的概率 P 则如下:

0xxxx… → P = 1/2

01xxx… → P = 1/4

001xx… → P = 1/8

The intuition is that when we are seeing prefix 0^k 1…, it’s likely there are n ≥ 2^(k+1) different strings. By keeping track of prefixes 0^k 1… that have appeared in the data stream, we can estimate the cardinality to be 2^p, where p is the length of the largest prefix.

直觉上,当我们看到前缀 0^k 1… 时,流中很可能有 n ≥ 2^(k+1) 个不同的字符串。通过记录数据流中出现过的前缀 0^k 1…,我们可以将基数估计为 2^p,其中 p 是出现过的最长前缀的长度。
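这一直觉可以用几行 Python 来示意(其中 `leading_zeros` 与 `naive_estimate` 是为演示而虚构的辅助函数,对应单计数器版本的粗糙估计):

```python
def leading_zeros(h: int, nbits: int = 8) -> int:
    """返回 nbits 位哈希值的前导 0 个数,即前缀 0^k 1 中的 k。"""
    bits = format(h, f"0{nbits}b")
    return len(bits) - len(bits.lstrip("0"))


def naive_estimate(hashes, nbits: int = 8) -> int:
    """单计数器估计:基数 ≈ 2^p,p 为出现过的最长 0^k 1 前缀的长度(即 k + 1)。"""
    p = max(leading_zeros(h, nbits) for h in hashes) + 1
    return 2 ** p


# 假设流中元素的哈希值如下(8 位):最长前缀为 0001(k = 3,长度 p = 4)
hashes = [0b01100100, 0b00010110, 0b10111001]
print(naive_estimate(hashes))  # 2^4 = 16
```

正如下文所说,这种单计数器估计的方差非常高,这正是引入随机平均的动机。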

Because the variance is very high when using single counter, in order to get a better estimation, data is split into m sub-streams using the first few bits of the hash. The counters are maintained by m registers each has memory space of multiple of 4 bytes. If the standard deviation for each sub-stream is σ, then the standard deviation for the averaged value is only σ/√m. This is called stochastic averaging.

由于使用单个计数器时方差非常高,为了获得更好的估计,我们使用哈希值的前几位将数据拆分为 m 个子流。这些计数由 m 个寄存器分别维护,每个寄存器占用 4 字节的倍数大小的存储空间。如果每个子流的标准差为 σ,则平均后估计值的标准差仅为 σ/√m。这被称为随机平均(Stochastic averaging)。

For instance for m=4,

例如,对于 m = 4:

(图:m = 4 时,按哈希值前 2 位将元素分配到 4 个寄存器的示意图)

The elements are split into m streams using the first 2 bits (00, 01, 10, 11), which are then discarded. Each register stores the rest of the hash bits that contain the largest 0^k 1 prefix. The values in the m registers are then averaged to obtain the cardinality estimate.

元素按哈希值的前两位(00、01、10、11)被分入 m 个子流,这两位随后即被丢弃。每个寄存器存储剩余哈希位中包含最长 0^k 1 前缀的那部分信息。然后对 m 个寄存器中的值取平均,得到基数估计。

HyperLogLog algorithm uses harmonic mean to normalize result. The algorithm also makes adjustment for small and very large values. The resulting error is equal to 1.04/√m.

HyperLogLog 算法使用调和均值(Harmonic mean)对结果进行归一化。该算法还会针对很小和非常大的基数值作出修正。最终误差为 1.04/√m。

Each of the m registers uses at most log2 log2 n + O(1) bits when cardinalities ≤ n need to be estimated.

当需要估计的基数小于等于 n 时,m 个寄存器中的任一个最多使用 log2(log2(n)) + O(1) 个比特位。

Union of two HyperLogLog counters can be calculated by first taking the maximum value of the two counters for each of the m registers, and then calculate the estimated cardinality.

要计算两个 HyperLogLog 计数器的并集,可以先对 m 个寄存器逐一取两个计数器中的最大值,然后再据此计算估计基数。
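综合以上几段,下面给出一个简化版 HyperLogLog 的 Python 示意实现。注意:类名 ToyHLL 是演示用的假设;这里省略了原算法对极小/极大基数的修正,α 采用论文中的近似常数式,哈希函数用 SHA-256 代替,仅作原理演示:

```python
import hashlib


class ToyHLL:
    """简化的 HyperLogLog:m = 2^b 个寄存器 + 随机平均 + 调和均值归一化。"""

    def __init__(self, b: int = 8):
        self.b = b
        self.m = 1 << b               # 子流(寄存器)个数
        self.registers = [0] * self.m

    def add(self, item) -> None:
        # 取 64 位哈希:前 b 位选择寄存器,其余位用于统计前导 0
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.b)
        rest = format(h & ((1 << (64 - self.b)) - 1), f"0{64 - self.b}b")
        rho = len(rest) - len(rest.lstrip("0")) + 1   # 最左侧 1 的位置
        self.registers[idx] = max(self.registers[idx], rho)

    def estimate(self) -> float:
        # 调和均值归一化;此处未作小/大基数修正
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / z

    def union(self, other: "ToyHLL") -> "ToyHLL":
        # 并集:逐寄存器取最大值
        out = ToyHLL(self.b)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out


hll = ToyHLL(b=8)                # m = 256,理论误差约 1.04/√256 ≈ 6.5%
for i in range(10_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))     # 约 10000(带有约 6.5% 的随机误差)
```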

Frequency - Count-Min Sketch

(频率 —— Count-Min Sketch)

Count-Min sketch is a probabilistic sub-linear space streaming algorithm. It is somewhat similar to bloom filter. The main difference is that bloom filter represents a set as a bitmap, while Count-Min sketch represents a multi-set which keeps a frequency distribution summary.

Count-Min Sketch 是一种亚线性空间(Sub-linear space)的概率流式算法。它与 Bloom filter 有些相似。两者的主要区别在于:Bloom filter 用位图来表示一个集合,而 Count-Min Sketch 表示的则是一个多重集(Multi-set),它保存着各个项的频率分布概要。

The basic data structure is a two-dimensional d × w array of counters with d pairwise independent hash functions h1 … hd of range w. Given parameters (ε, δ), set w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉. ε is the accuracy we want to have and δ is the certainty with which we reach the accuracy. The two-dimensional array consists of w·d counts. To increment the counts, calculate the hash positions with the d hash functions and update the counts at those positions.

其基本数据结构是一个 d × w 的二维计数器阵列,配有 d 个两两独立的哈希函数 h1 … hd,其值域大小均为 w。给定参数 (ε, δ),令 w = ⌈e/ε⌉,d = ⌈ln(1/δ)⌉。其中 ε 是我们期望达到的精度,δ 是我们达到该精度的确定性(Certainty)。该二维阵列共包含 w·d 个计数器。要增加某个项的计数,则用 d 个哈希函数计算出对应的哈希位置,并更新这些位置上的计数。

(图:Count-Min Sketch 的 d × w 计数器阵列示意图)

The estimate of the counts for an item is the minimum value of the counts at the array positions determined by the d hash functions.

项的计数估计值是由 d 个哈希函数所确定的阵列位置处的最小计数值。

The space used by Count-Min sketch is the array of w*d counters. By choosing appropriate values for d and w, very small error and high probability can be achieved.

Count-Min Sketch 使用的空间是 w * d 个计数器的数组。通过选择合适的 d 和 w 值,可以实现非常小的误差和高概率。
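按照上述定义,可以写出一个 Count-Min Sketch 的简单示意实现。类名 CountMinSketch 与加盐 SHA-256 哈希均为演示用的假设(正式实现中应使用 d 个两两独立的哈希函数,如 Murmur 系列):

```python
import hashlib
import math


class CountMinSketch:
    """示意实现:d × w 计数器阵列,更新 d 个位置,查询取最小值。"""

    def __init__(self, epsilon: float, delta: float):
        self.w = math.ceil(math.e / epsilon)      # w = ⌈e/ε⌉
        self.d = math.ceil(math.log(1 / delta))   # d = ⌈ln(1/δ)⌉
        self.table = [[0] * self.w for _ in range(self.d)]

    def _pos(self, item, row: int) -> int:
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.w

    def add(self, item, count: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._pos(item, row)] += count

    def estimate(self, item) -> int:
        # 取 d 个位置上计数的最小值:只会高估,不会低估
        return min(self.table[row][self._pos(item, row)] for row in range(self.d))


cms = CountMinSketch(epsilon=0.01, delta=0.01)   # w = 272, d = 5
for word in ["a"] * 100 + ["b"] * 10 + ["c"]:
    cms.add(word)
print(cms.estimate("a"))   # ≥ 100(大概率恰为 100)
```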

Example of Count-Min sketch sizes for different error and probability combination:

不同错误和概率组合的 Count-Min Sketch 尺寸示例:

ε        1 − δ     w        d     w × d
0.1      0.9       28       3     84
0.1      0.99      28       5     140
0.1      0.999     28       7     196
0.01     0.9       272      3     816
0.01     0.99      272      5     1360
0.01     0.999     272      7     1940
0.001    0.999     2719     7     19033
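表中的 w、d 取值可以直接由 w = ⌈e/ε⌉ 与 d = ⌈ln(1/δ)⌉ 复算出来(`cms_size` 为演示用的假设函数名):

```python
import math


def cms_size(epsilon: float, delta: float):
    """由 (ε, δ) 计算 Count-Min Sketch 的尺寸 (w, d, w×d)。"""
    w = math.ceil(math.e / epsilon)       # w = ⌈e/ε⌉
    d = math.ceil(math.log(1 / delta))    # d = ⌈ln(1/δ)⌉
    return w, d, w * d


print(cms_size(0.1, 1 - 0.9))      # (28, 3, 84)
print(cms_size(0.01, 1 - 0.99))    # (272, 5, 1360)
print(cms_size(0.001, 1 - 0.999))  # (2719, 7, 19033)
```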

Count-Min sketch has the following properties:

  • Union can be performed by cell-wise ADD operation
  • O(k) query time
  • Better accuracy for higher frequency items (heavy hitters)
  • Can only cause over-counting but not under-counting

Count-Min Sketch 具有以下特性:

  • 并集可以通过逐单元相加(Cell-wise ADD)操作来实现
  • O(k) 的查询时间
  • 频率越高的项(如 Heavy hitters,大流量对象),准确度越高
  • 只会高估(Over-counting),而不会低估(Under-counting)

Count-Min sketch can be used for querying single item count or “heavy hitters” which can be obtained by keeping a heap structure of all the counts.

Count-Min Sketch 可用于查询单个项的计数或 “Heavy hitters”(可通过保留所有计数的堆结构来获得)。
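结合前文的思路,下面的 Python 片段示意如何在 Count-Min Sketch 之上跟踪 heavy hitters。为保持示例自足,这里内联了一个压缩版的 sketch,并用字典(而非正文所说的堆结构)来维护 top-k;函数名 `estimate_heavy_hitters` 为演示而虚构:

```python
import hashlib
import math


def estimate_heavy_hitters(stream, k=2, epsilon=0.01, delta=0.01):
    """单遍扫描流,返回估计频率最高的 k 个项(频率估计只高不低)。"""
    w, d = math.ceil(math.e / epsilon), math.ceil(math.log(1 / delta))
    table = [[0] * w for _ in range(d)]

    def pos(item, row):
        return int(hashlib.sha256(f"{row}:{item}".encode()).hexdigest(), 16) % w

    top = {}                                  # item -> 估计频率
    for item in stream:
        for row in range(d):
            table[row][pos(item, row)] += 1
        est = min(table[row][pos(item, row)] for row in range(d))
        if item in top or len(top) < k:
            top[item] = est
        elif est > min(top.values()):
            # 淘汰当前估计频率最低的项
            del top[min(top, key=top.get)]
            top[item] = est
    return sorted(top, key=top.get, reverse=True)


stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 5 + ["d"] * 2
print(estimate_heavy_hitters(stream, k=2))    # 大概率为 ['a', 'b']
```

由于 Count-Min Sketch 只会高估频率,这种做法可能把低频项误判为 heavy hitter,但不会漏掉真正的高频项。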

Summary

(总结)

Probabilistic data structures have many applications in modern web and data applications where the data arrives in a streaming fashion and needs to be processed on the fly using limited memory. Bloom filter, HyperLogLog, and Count-Min sketch are the most commonly used probabilistic data structures. There are a lot of research on various streaming algorithms, synopsis data structures and optimization techniques that are worth investigating and studying.

概率数据结构在现代 Web 与数据应用中有许多用武之地,这些应用中的数据以流的方式到达,且需要在有限的内存中进行即时处理。Bloom filter、HyperLogLog 和 Count-Min Sketch 是其中最常用的概率数据结构。关于各种流式算法、概要数据结构(Synopsis data structure)与优化技术已有大量研究,值得我们深入探究与学习。

If you haven’t tried these data structures, you will be amazed how powerful they can be once you start using them. It may be a little bit intimidating to understand the concept initially, but the implementation is actually quite simple. Google Guava has Bloom filter implementation using murmur hash. Clearspring’s Java library stream-lib and Twitter’s Scala library Algebird have implementation for all three data structures and other useful data structures that you can play with. I have included the links below.

如果您还没有尝试过这些数据结构,那么一旦开始使用,您会惊讶于它们的强大。初看之下,理解其概念可能有些吓人,但其实现其实相当简单。Google Guava 基于 Murmur 哈希实现了 Bloom filter。Clearspring 的 Java 库 stream-lib 与 Twitter 的 Scala 库 Algebird 则实现了这三种数据结构以及其他一些实用的数据结构,您可以拿来试用。我已将相关链接列在下文。

相关链接:
