Randomized Algorithms: Hashing: A Randomized Dictionary

清幽小路

Article link: https://blog.csdn.net/weixin_43192983/article/details/108156721


The Problem

There is a universe $U$ of possible elements that is extremely large.
The data structure keeps track of a set $S \subseteq U$ whose size is generally a negligible fraction of $|U|$.
The goal is to be able to insert and delete elements of $S$ and to quickly determine whether a given element belongs to $S$.

We call such a data structure a dictionary. It supports the following operations:

MakeDictionary: initializes a dictionary that can maintain a subset $S$ of $U$ ; the dictionary starts out empty.
Insert( $u$ ): adds element $u \in U$ to the set $S$ .
Delete( $u$ ): removes element $u$ from the set $S$ , if it is currently present.
Lookup( $u$ ): determines whether $u$ currently belongs to $S$ ; if it does, it also retrieves any additional information stored with $u$ .

If $U$ is small, we can use an array of length $|U|$ with one entry per element: the entry for $u$ is set to 1 if $u \in S$ and to 0 if $u \notin S$.
However, we are considering the setting in which the universe $U$ is enormous, so we cannot use an array whose size is anywhere near that of $U$.
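The small-universe case can be sketched as follows. This is a minimal illustration, assuming the elements of $U$ are the integers $0, \ldots, |U|-1$; the class name `DirectAddressSet` and its method names are hypothetical and not part of the original notes.

```python
class DirectAddressSet:
    """Membership array for a small universe U = {0, 1, ..., universe_size - 1}."""

    def __init__(self, universe_size):
        # One entry per element of U; 0 means "not in S", 1 means "in S".
        self.present = [0] * universe_size

    def insert(self, u):
        self.present[u] = 1

    def delete(self, u):
        self.present[u] = 0

    def lookup(self, u):
        return self.present[u] == 1


s = DirectAddressSet(universe_size=16)
s.insert(3)
s.insert(7)
s.delete(3)
print(s.lookup(3), s.lookup(7))  # False True
```

Every operation takes constant time, but the memory used is proportional to $|U|$ rather than $|S|$, which is exactly what rules this approach out for a huge universe.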

Designing the Data Structure

Hash Functions

Suppose we want to be able to store a set $S$ of size up to $n$ .

Set up an array $H$ of size $n$ to store the information.
Use a function $h: U \to \{0, 1, \ldots, n-1\}$ that maps elements of $U$ to array positions.
Store each $u \in S$ at position $h(u)$ of the array $H$.
Here $n \sim |S|$: the table has about as many positions as the set has elements.

We call $h$ a hash function and $H$ a hash table.

Ideally, for all distinct $u$ and $v$ in the set $S$, $h(u) \neq h(v)$.

In that case we could look up $u$ in constant time: check array position $H[h(u)]$, which is either empty or contains only $u$.

In practice, there can be distinct elements $u, v \in S$ for which $h(u) = h(v)$. We say that these two elements collide.

There are several ways to handle collisions. Here we assume that each position $H[i]$ of the hash table stores a linked list of all elements $u \in S$ with $h(u) = i$.

The operation Lookup( $u$ ) now works as follows:

Compute the hash value $h(u)$.
Scan the linked list at position $H[h(u)]$ to see whether $u$ is present in this list.

Time required for Lookup( $u$ ) $\propto$ time to compute $h(u)$ + the length of the linked list at $H[h(u)]$.

Insert( $u$ ): adds $u$ to the linked list at position $H[h(u)]$.
Delete( $u$ ): scans the list at $H[h(u)]$ and removes $u$ if it is present.
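The three operations can be sketched as below. This is only an illustration under stated assumptions: Python lists stand in for the linked lists, the hash function `h` is passed in by the caller, and the class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    """Hash table H of size n; each position holds a chain of colliding elements."""

    def __init__(self, n, h):
        self.h = h                         # assumed: maps any element into 0..n-1
        self.H = [[] for _ in range(n)]    # a Python list plays the role of each linked list

    def lookup(self, u):
        # Cost: one hash evaluation plus a scan of the chain at H[h(u)].
        return u in self.H[self.h(u)]

    def insert(self, u):
        chain = self.H[self.h(u)]
        if u not in chain:                 # S is a set, so do not store duplicates
            chain.append(u)

    def delete(self, u):
        chain = self.H[self.h(u)]
        if u in chain:
            chain.remove(u)


# A deliberately simple (non-random) hash function, chosen only to force a collision.
table = ChainedHashTable(n=5, h=lambda u: u % 5)
for u in (3, 8, 12):
    table.insert(u)                        # 3 and 8 collide at position 3
table.delete(8)
print(table.lookup(3), table.lookup(8), table.H[3], table.H[2])  # True False [3] [12]
```

The running time of each operation is governed by the chain length, which is why the rest of the notes focus on keeping chains short.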

Main Goal: to show that randomization can “spread out” the elements being added so that collisions are rare, even though they cannot be avoided completely.

Choosing A Good Hash Function

Basic Idea: for every element $u \in U$, when we go to insert $u$ into $S$, we select a value $h(u)$ uniformly at random from the set $\{0, 1, \ldots, n-1\}$, independently of all previous choices.

Theorem (13.22): With this uniform random hashing scheme, the probability that two randomly selected values $h(u)$ and $h(v)$ collide, that is, that $h(u) = h(v)$, is exactly $1/n$.

Proof:
There are $n^2$ possible pairs of values $(h(u), h(v))$, all equally likely.
Exactly $n$ of these pairs are collisions.
$P(\text{collide}) = \frac{n}{n^2} = \frac{1}{n}$
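The counting argument can be checked by brute force for a small table. The sketch below enumerates all pairs of hash values; the function name `collision_probability` and the choice $n = 7$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def collision_probability(n):
    """Exact probability that two independent uniform values in 0..n-1 are equal."""
    pairs = list(product(range(n), repeat=2))            # all n^2 pairs (h(u), h(v))
    collisions = sum(1 for hu, hv in pairs if hu == hv)  # the n pairs on the diagonal
    return Fraction(collisions, len(pairs))


print(collision_probability(7))  # 1/7
```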

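However, using a hash function with independently chosen random values is not feasible: when we later call Delete( $u$ ) or Lookup( $u$ ), we have no way of knowing the value $h(u)$ unless it was recorded. Writing down the value for every element would take too much space.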

Universal Classes of Hash Functions

Choose a hash function at random from a carefully selected class of functions. Each function $h$ in our class of functions $\mathcal{H}$ will map the universe $U$ into the set $\{0, 1, \ldots, n-1\}$.

Required properties:

For any pair of distinct elements $u, v \in U$, the probability that a randomly chosen $h \in \mathcal{H}$ satisfies $h(u) = h(v)$ is at most $1/n$.
Each $h \in \mathcal{H}$ can be compactly represented and, for a given $h \in \mathcal{H}$ and $u \in U$, we can compute the value $h(u)$ efficiently.

Definition: a class $\mathcal{H}$ of functions is universal if it satisfies the first property.

Theorem (13.23): Let $\mathcal{H}$ be a universal class of hash functions mapping a universe $U$ to the set $\{0, 1, \ldots, n-1\}$, let $S$ be an arbitrary subset of $U$ of size at most $n$, and let $u$ be any element in $U$. We define $X$ to be a random variable equal to the number of elements $s \in S$ (other than $u$ itself) for which $h(s) = h(u)$, for a random choice of hash function $h \in \mathcal{H}$. (Here $S$ and $u$ are fixed, and the randomness is in the choice of $h \in \mathcal{H}$.) Then $E[X] \leq 1$.
(In other words: pick a random hash function $h \in \mathcal{H}$ and any $u \in U$; $X$ counts the elements $s \in S$ that land in the same position as $u$.)

证明：
For each element $s \in S$ with $s \neq u$, we define a random variable $X_s$ that is equal to $1$ if $h(s) = h(u)$, and equal to $0$ otherwise.
$\because$ the class of functions is universal
$\therefore$ $E[X_s] = \Pr(X_s = 1) \leq \frac{1}{n}$
Since $X = \sum_{s \neq u, s \in S} X_s$ and this sum has at most $|S| \leq n$ terms, linearity of expectation gives
$E[X] = \sum_{s \neq u, s \in S} E[X_s] \leq |S| \cdot \frac{1}{n} \leq 1$

Designing a Universal Class of Hash Functions

We will use a prime number $p > |S| = n$ as the size of the hash table $H$ .

Example
Let $U = \{\text{words using} \leq 45 \text{ characters}\}$
$\text{characters} \to$ ASCII numbers in $[0, 127]$

$a \to 97, b \to 98$

$\text{words: } x = (x_0, x_1, \ldots, x_{m-1})$ with $m = 45$ and $x_i \in [0, 127]$
e.g. $aba \to (97, 98, 97)$

As in the example above, to make integer arithmetic convenient we represent each element of the universe $U$ as a vector of the form $x = (x_1, x_2, \ldots, x_r)$ for some integer $r$, where $0 \leq x_i < p$ for each $i$.

First identify $U$ with the integers in the range $[0, N-1]$ for some $N$. [That is, label the $N$ elements of $U$ as $0, 1, \ldots, N-1$.]
Then use consecutive blocks of $\lfloor \log p \rfloor$ bits of $u$ to define the corresponding coordinates $x_i$. [A block of $\lfloor \log p \rfloor$ bits encodes a value below $2^{\lfloor \log p \rfloor} \leq p$, which guarantees $0 \leq x_i < p$. Note that $x_i$ is a coordinate of the key, not a position in the hash table.]
If $U \subseteq [0, N-1]$, then we will need about $r \approx \log N / \log n$ coordinates. [An element of $U$ takes about $\log N$ bits, and each coordinate holds about $\log p \approx \log n$ bits.]
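The block decomposition can be sketched as below. The function name `to_coordinates` is hypothetical, and taking the least significant block first is an arbitrary choice made for this illustration; any fixed order works as long as it is used consistently.

```python
def to_coordinates(u, p, r):
    """Split the non-negative integer u into r blocks of floor(log2 p) bits each."""
    b = p.bit_length() - 1          # floor(log2 p)
    if b < 1:
        raise ValueError("p must be at least 2")
    mask = (1 << b) - 1             # selects the lowest b bits
    coords = []
    for _ in range(r):
        coords.append(u & mask)     # each block is < 2^b <= p
        u >>= b
    if u != 0:
        raise ValueError("u does not fit into r blocks")
    return coords


# With p = 101 each block has 6 bits, so 1000 = 40 + 15 * 64.
print(to_coordinates(1000, p=101, r=3))  # [40, 15, 0]
```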

Let $\mathcal{A}$ be the set of all vectors of the form $a = (a_1, a_2, ..., a_r)$ , where $a_i \in [0, p-1]$ for each $i = 1, . . ., r$ .
For each $a \in \mathcal{A}$, we define
$h_a(x) = \Bigg( \sum_{i=1}^{r} a_i x_i \Bigg) \bmod p$

Definition: $\mathcal{H} = \{h_a : a \in \mathcal{A}\}$

Example: $r = 3$, table size $p = 101$
$aba \to (97, 98, 97)$
Suppose the random choice is $(a_1, a_2, a_3) = (0, 1, 2)$
$\sum_{i=1}^3 a_i x_i = 0 \times 97 + 1 \times 98 + 2 \times 97 = 292$, and $292 \bmod 101 = 90$
Hence, $h_a(aba) = 90$
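The worked example can be reproduced with a few lines of code. This is a minimal sketch: the helper names `random_a` and `h_a` are illustrative, and mapping characters with `ord` assumes ASCII input as in the example.

```python
import random


def random_a(p, r):
    """Pick a uniformly at random from A: r coordinates, each in 0..p-1."""
    return [random.randrange(p) for _ in range(r)]


def h_a(a, x, p):
    """Evaluate h_a(x) = (sum of a_i * x_i) mod p."""
    return sum(ai * xi for ai, xi in zip(a, x)) % p


x = [ord(c) for c in "aba"]        # [97, 98, 97]
print(h_a([0, 1, 2], x, 101))      # 90, matching the example above

a = random_a(p=101, r=3)           # in practice a is drawn once and remembered
print(0 <= h_a(a, x, 101) < 101)   # True
```

Only the vector $a$ has to be stored, which is what makes the function compact to represent.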

Properties:
(1). $h_a : U \to \{0, 1, \ldots, p-1\}$.
(2). A random function $h_a$ is specified by choosing and remembering a random $a \in \mathcal{A}$. We can then compute $h_a(u)$ for any element $u \in U$ using $r$ multiplications of numbers with about $\log p$ bits.

Analyzing the Data Structure

Theorem (13.24): For any prime $p$ and any integer $z \neq 0 \mod p$, and any two integers $\alpha, \beta$, if $\alpha z = \beta z \mod p$, then $\alpha = \beta \mod p$.

Proof:
Suppose $\alpha z = \beta z \mod p$.
Then $(\alpha - \beta) z = 0 \mod p$, that is, $(\alpha - \beta) z$ is divisible by $p$.
$\because$ $z$ is not divisible by $p$, and $p$ is prime
$\therefore$ $\alpha - \beta$ is divisible by $p$, that is, $\alpha = \beta \mod p$
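A brute-force check makes the role of primality visible. The sketch below tests, for every nonzero residue $z$, whether multiplication by $z$ maps distinct residues to distinct residues; the function name `cancellation_holds` and the moduli 101 and 100 are illustrative choices.

```python
def cancellation_holds(p):
    """Check that alpha*z = beta*z (mod p) forces alpha = beta (mod p) for every z != 0."""
    for z in range(1, p):
        products = {(alpha * z) % p for alpha in range(p)}
        if len(products) != p:      # two different residues gave the same product
            return False
    return True


print(cancellation_holds(101), cancellation_holds(100))  # True False
```

For the composite modulus 100 the property fails, for instance with $z = 2$, since $2 \cdot 0 = 2 \cdot 50 \mod 100$.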

Theorem (13.25): The class of linear functions $\mathcal{H}$ defined above is universal.

Proof:
Let $x = (x_1, x_2, \ldots, x_r)$ and $y = (y_1, y_2, \ldots, y_r)$ be two distinct elements of $U$.
We need to show that for a randomly chosen $a \in \mathcal{A}$, $P(h_a(x) = h_a(y)) \leq \frac{1}{p}$.
Since $x \neq y$, there must be an index $j$ such that $x_j \neq y_j$.
Consider a vector $a = (a_1, a_2, \ldots, a_r)$ in which every coordinate except $a_j$ has already been fixed. We ask which choices of $a_j$ give $h_a(x) = h_a(y)$.
$h_a(x) = \sum_{i=1}^r a_i x_i \mod p$ and $h_a(y) = \sum_{i=1}^r a_i y_i \mod p$
The event $h_a(x) = h_a(y)$ is therefore equivalent to
$a_j (x_j - y_j) = \sum_{i \neq j} a_i (y_i - x_i) \mod p$. Let $m$ denote the right-hand side and let $z = x_j - y_j$; the equation becomes
$a_j z = m \mod p$, where $z \neq 0 \mod p$ because $x_j \neq y_j$ and $0 \leq x_j, y_j < p$.
Claim: there is exactly one value $0 \leq a_j < p$ that satisfies the above equation.
To see this, suppose $a_j z = a_j' z = m \mod p$.
Then, by Theorem (13.24), $a_j = a_j' \mod p$.
But $0 \leq a_j, a_j' < p$, so $a_j = a_j'$. Hence the $p$ possible values of $a_j$ give $p$ distinct values of $a_j z \mod p$, and exactly one of them equals $m$.
Since $a_j$ is chosen uniformly from $p$ values, the probability of choosing $a_j$ so that $h_a(x) = h_a(y)$ is $\frac{1}{p}$.
We have shown that $\mathcal{H}$ is a universal class of hash functions.
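For a tiny instance the universality property can be verified exhaustively. The sketch below enumerates every $a \in \mathcal{A}$ for every pair of distinct keys; the parameters $p = 5$, $r = 2$ and the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def h_a(a, x, p):
    """Evaluate h_a(x) = (sum of a_i * x_i) mod p."""
    return sum(ai * xi for ai, xi in zip(a, x)) % p


def collision_probability(x, y, p):
    """Exact P(h_a(x) = h_a(y)) when a is uniform over all vectors in {0..p-1}^r."""
    all_a = list(product(range(p), repeat=len(x)))
    hits = sum(1 for a in all_a if h_a(a, x, p) == h_a(a, y, p))
    return Fraction(hits, len(all_a))


p, r = 5, 2
keys = list(product(range(p), repeat=r))      # every vector x with 0 <= x_i < p
probs = {collision_probability(x, y, p) for x in keys for y in keys if x != y}
print(probs)  # {Fraction(1, 5)}
```

Every pair of distinct keys collides with probability exactly $1/p$ in this instance, consistent with the bound in the theorem.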

Next we show why the values $a_i$, $i \neq j$, do not affect the probability that $h_a(x) = h_a(y)$.

Let $E$ be the event that $h_a(x)=h_a(y)$
Let $\mathcal{F}_b$ be the event that all coordinates $a_i$ (for $i\neq j$ ) receive a sequence of values $b$ .

Above, we showed that $P\big( E | \mathcal{F}_b \big) = 1/p$ for every $b$.
By the law of total probability:
$P(E)=\sum_b P\big( E|\mathcal{F}_b\big)\cdot P(\mathcal{F}_b)=\frac{1}{p}\sum_b P(\mathcal{F}_b)=\frac{1}{p}$