Randomized Algorithms: Hashing: A Randomized Dictionary

The Problem

  1. There is a universe $U$ of possible elements, and it is extremely large.
  2. The data structure keeps track of a set $S \subseteq U$ whose size is generally a negligible fraction of $|U|$.
  3. The goal is to insert and delete elements of $S$ and to determine quickly whether a given element belongs to $S$.

We call such a data structure a dictionary. It supports the following operations:

  • MakeDictionary: initializes a dictionary that can maintain a subset $S$ of $U$; the dictionary starts out empty.
  • Insert($u$): adds element $u \in U$ to the set $S$.
  • Delete($u$): removes element $u$ from the set $S$, if it is currently present.
  • Lookup($u$): determines whether $u$ currently belongs to $S$; if it does, it also retrieves any additional information stored with $u$.

If $U$ is small, we can use an array of length $|U|$ with one entry per element: the entry for $u$ is set to 1 if $u \in S$ and to 0 if $u \notin S$.
Here, however, we are considering the setting in which the universe $U$ is enormous, so we cannot use an array whose size is anywhere near $|U|$.
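The small-universe array can be sketched as follows. This is only an illustration under the assumption that $U = \{0, 1, \ldots, |U|-1\}$; the class name `DirectAddressSet` and its method names are hypothetical, chosen to mirror the operations listed above.

```python
class DirectAddressSet:
    """Dictionary for a small universe U = {0, 1, ..., universe_size - 1}."""

    def __init__(self, universe_size: int) -> None:
        # One flag per element of U, so the space used is proportional to |U|.
        self.present = [0] * universe_size

    def insert(self, u: int) -> None:
        self.present[u] = 1

    def delete(self, u: int) -> None:
        self.present[u] = 0

    def lookup(self, u: int) -> bool:
        return self.present[u] == 1


small = DirectAddressSet(universe_size=16)
small.insert(3)
small.insert(11)
```

Every operation takes constant time, but the array has one entry per element of $U$ regardless of how small $S$ is, which is exactly what becomes impossible when $U$ is enormous.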

Designing the Data Structure

Hash Functions

Suppose we want to be able to store a set $S$ of size up to $n$.

  1. Set up an array $H$ of size $n$ to store the information.
  2. Use a function $h: U \to \{0, 1, \ldots, n-1\}$ that maps elements of $U$ to array positions.
  3. Store $u \in S$ at position $h(u)$ of the array $H$.
  4. The table size is comparable to the set size: $n \sim |S|$.

We call $h$ a hash function and $H$ a hash table.

Ideally, for all distinct $u$ and $v$ in $S$, we would have $h(u) \neq h(v)$.

  • In that case we could look up $u$ in constant time: check array position $H[h(u)]$, which is either empty or contains just $u$.

In practice, there can be distinct elements $u, v \in S$ for which $h(u) = h(v)$. We say that these two elements collide.

  • There are several ways to handle collisions. Here we assume that each position $H[i]$ of the hash table stores a linked list of all elements $u \in S$ with $h(u) = i$.

The operation Lookup($u$) now works as follows:

  • Compute the hash value $h(u)$.
  • Scan the linked list at position $H[h(u)]$ to see whether $u$ is present in this list.

The time required for Lookup($u$) is proportional to the time to compute $h(u)$ plus the length of the linked list at $H[h(u)]$.

Insert($u$): adds $u$ to the linked list at position $H[h(u)]$.
Delete($u$): scans the list at $H[h(u)]$ and removes $u$ if it is present.
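The three operations can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the class name `ChainedDictionary` is hypothetical, Python lists stand in for the linked lists, Insert is assumed to overwrite the stored information when $u$ is already present, and the hash function `u % 7` is only a placeholder (it is not a randomized or universal choice).

```python
class ChainedDictionary:
    """Hash table H of size n; each slot holds a chain of (element, info) pairs."""

    def __init__(self, n, h):
        self.h = h                              # h maps elements to 0..n-1
        self.table = [[] for _ in range(n)]     # one (initially empty) chain per slot

    def insert(self, u, info=None):
        chain = self.table[self.h(u)]
        for i, (key, _) in enumerate(chain):
            if key == u:                        # already present: update its info
                chain[i] = (u, info)
                return
        chain.append((u, info))

    def delete(self, u):
        chain = self.table[self.h(u)]
        for i, (key, _) in enumerate(chain):
            if key == u:                        # remove u if it is present
                del chain[i]
                return

    def lookup(self, u):
        # Cost: one evaluation of h plus a scan of the chain at H[h(u)].
        for key, info in self.table[self.h(u)]:
            if key == u:
                return True, info
        return False, None


d = ChainedDictionary(7, lambda u: u % 7)       # placeholder hash function
d.insert(10, "ten")
d.insert(17, "seventeen")                       # 10 and 17 collide: both map to slot 3
found_17 = d.lookup(17)
d.delete(10)
found_10 = d.lookup(10)
```

The two colliding elements share one chain, so looking up either of them scans a list of length two. Keeping such chains short is what the choice of $h$ has to achieve.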

Main Goal: to show that randomization can “spread out” the elements being added, even though collisions cannot be completely avoided.

Choosing A Good Hash Function

Basic Idea: for every element $u \in U$, when we go to insert $u$ into $S$, we select a value $h(u)$ uniformly at random from the set $\{0, 1, \ldots, n-1\}$, independently of all previous choices.

Theorem (13.22): With this uniform random hashing scheme, the probability that two randomly selected values $h(u)$ and $h(v)$ collide (that is, that $h(u) = h(v)$) is exactly $1/n$.

Proof:
There are $n^2$ possible pairs of values $(h(u), h(v))$, all equally likely.
Exactly $n$ of these pairs are collisions (those with $h(u) = h(v)$).
Hence $P(\text{collide}) = \frac{n}{n^2} = \frac{1}{n}$.
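The counting argument can be checked by brute force for a small table. The sketch below simply enumerates all $n^2$ equally likely pairs; the function name `collision_probability` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product


def collision_probability(n: int) -> Fraction:
    """Exact probability that two independent uniform values in 0..n-1 coincide."""
    pairs = list(product(range(n), repeat=2))          # all n^2 pairs (h(u), h(v))
    colliding = [pair for pair in pairs if pair[0] == pair[1]]
    return Fraction(len(colliding), len(pairs))


prob = collision_probability(7)
```

For $n = 7$ there are 49 pairs, of which 7 are collisions, so `prob` is the fraction $1/7$.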

However, using a hash function whose values are chosen independently at random is not workable: on a later Lookup($u$) or Delete($u$) we would no longer know the value $h(u)$ that was chosen when $u$ was inserted, and writing down the value for every element would take too much space.
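To make the difficulty concrete, here is a sketch of what "writing the values down" would look like. The class `MemoizedRandomHash` is hypothetical and exists only to illustrate the problem; it is not a proposed solution.

```python
import random


class MemoizedRandomHash:
    """Truly random h: each value is drawn once, then must be remembered."""

    def __init__(self, n: int, seed: int = 0) -> None:
        self.n = n
        self.rng = random.Random(seed)
        self.values = {}          # stored pairs (u, h(u)); grows with every new u

    def __call__(self, u):
        if u not in self.values:
            # First time we see u: pick h(u) uniformly from 0..n-1.
            self.values[u] = self.rng.randrange(self.n)
        return self.values[u]


h = MemoizedRandomHash(n=10)
first = h("apple")
second = h("apple")               # same value only because it was stored
```

Recovering $h(u)$ later requires finding $u$ in the stored table, which is itself a dictionary lookup, and the table needs one entry per element ever hashed. This is why the next subsection restricts attention to functions that can be described compactly.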

Universal Classes of Hash Functions

Choose a hash function at random from a carefully selected class of functions. Each function $h$ in our class $\mathcal{H}$ maps the universe $U$ into the set $\{0, 1, \ldots, n-1\}$.

Properties required of $\mathcal{H}$:

  1. For any pair of distinct elements $u, v \in U$, the probability that a randomly chosen $h \in \mathcal{H}$ satisfies $h(u) = h(v)$ is at most $1/n$.
  2. Each $h \in \mathcal{H}$ can be compactly represented and, for a given $h \in \mathcal{H}$ and $u \in U$, we can compute the value $h(u)$ efficiently.

Definition: a class $\mathcal{H}$ of functions is universal if it satisfies the first property.

Theorem (13.23): Let $\mathcal{H}$ be a universal class of hash functions mapping a universe $U$ to the set $\{0, 1, \ldots, n-1\}$, let $S$ be an arbitrary subset of $U$ of size at most $n$, and let $u$ be any element in $U$. We define $X$ to be a random variable equal to the number of elements $s \in S$ for which $h(s) = h(u)$, for a random choice of hash function $h \in \mathcal{H}$. (Here $S$ and $u$ are fixed, and the randomness is in the choice of $h \in \mathcal{H}$.) Then $E[X] \leq 1$.
(In other words: for a hash function $h$ drawn at random from $\mathcal{H}$ and any fixed $u \in U$, $X$ counts the elements $s \in S$ that land in the same slot as $u$.)

Proof:
For an element $s \in S$, define a random variable $X_s$ that equals $1$ if $h(s) = h(u)$ and $0$ otherwise.
Because the class of functions is universal, for every $s \neq u$ we have
$$E[X_s] = \Pr(X_s = 1) \leq \frac{1}{n}$$
We count the elements of $S$ other than $u$ itself that collide with $u$, so $X = \sum_{s \in S,\, s \neq u} X_s$. This sum has at most $|S|$ terms, so by linearity of expectation
$$E[X] = \sum_{s \in S,\, s \neq u} E[X_s] \leq |S| \cdot \frac{1}{n} \leq 1$$
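The bound can be checked exactly on a toy instance. The sketch below borrows the linear family $h_a(x) = \sum_i a_i x_i \bmod p$ that these notes construct in the next subsection, with the assumed small parameters $p = n = 5$ and $r = 2$; the set `S` and the element `u` are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import product

p = 5                                           # prime table size (here n = p)
S = [(0, 1), (1, 0), (2, 3), (4, 4), (3, 1)]    # |S| = 5 <= n
u = (1, 1)                                      # fixed element; not in S here


def h(a, x):
    """Linear hash h_a(x) = (a_1 x_1 + a_2 x_2) mod p."""
    return sum(ai * xi for ai, xi in zip(a, x)) % p


def collisions_with_u(a):
    """Value of X for the hash function h_a: elements of S sharing u's slot."""
    return sum(1 for s in S if s != u and h(a, s) == h(a, u))


all_a = list(product(range(p), repeat=2))       # every function in the family
expected_X = Fraction(sum(collisions_with_u(a) for a in all_a), len(all_a))
```

Averaging $X$ over all 25 functions gives `expected_X` equal to exactly 1 for this instance, so the bound $E[X] \leq 1$ is met with equality when $|S| = n$ and each pair collides with probability exactly $1/n$.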

Designing a Universal Class of Hash Functions

We will use a prime number $p > |S| = n$ as the size of the hash table $H$.

Example:
$U = \{\text{words using} \leq 45 \text{ characters}\}$
Each character maps to its ASCII code, an integer in $[0, 127]$.

  • $a \to 97$, $b \to 98$

A word becomes a vector $x = (x_0, x_1, \ldots, x_{m-1})$ with $m = 45$ and $x_i \in [0, 127]$,
e.g. $aba \to (97, 98, 97)$.

As in the example above, to make integer arithmetic convenient we represent each element of the universe $U$ as a vector $x = (x_1, x_2, \ldots, x_r)$ for some integer $r$, where $0 \leq x_i < p$ for each $i$.

  • First identify $U$ with integers in the range $[0, N-1]$ for some $N$. [That is, label the $N$ elements of $U$ as $0, 1, \ldots, N-1$.]
  • Then use consecutive blocks of $\lfloor \log p \rfloor$ bits of $u$ to define the corresponding coordinates $x_i$. [A block of $\lfloor \log p \rfloor$ bits encodes a value below $2^{\lfloor \log p \rfloor} \leq p$, which guarantees $0 \leq x_i < p$. Note that $x_i$ is a coordinate of the element, not a position in the hash table.]
  • If $U \subseteq [0, N-1]$, then we need a number of coordinates $r \approx \log N / \log n$. [An element takes about $\log N$ bits, and each coordinate holds about $\log p$ bits, which is roughly $\log n$ when $p$ is chosen close to $n$.]
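The block decomposition described above can be sketched as follows. The function name `to_coordinates` is hypothetical, logarithms are taken base 2, and the blocks are emitted least significant first (the notes do not fix an order, so this is an assumption).

```python
def to_coordinates(u: int, N: int, p: int) -> list:
    """Split an integer 0 <= u < N into blocks of floor(log2 p) bits each."""
    if not 0 <= u < N:
        raise ValueError("u must lie in [0, N - 1]")
    block_bits = p.bit_length() - 1             # floor(log2 p); needs p >= 2
    total_bits = max(1, (N - 1).bit_length())   # bits needed to write any u < N
    r = -(-total_bits // block_bits)            # ceiling division: number of blocks
    mask = (1 << block_bits) - 1
    # Every block is below 2**block_bits <= p, so 0 <= x_i < p holds.
    return [(u >> (i * block_bits)) & mask for i in range(r)]


coords = to_coordinates(1_000_000, N=2**20, p=101)
```

With $p = 101$ each block has 6 bits, and a 20-bit universe needs $r = \lceil 20/6 \rceil = 4$ coordinates, in line with $r \approx \log N / \log n$.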

Let $\mathcal{A}$ be the set of all vectors of the form $a = (a_1, a_2, \ldots, a_r)$, where $a_i \in [0, p-1]$ for each $i = 1, \ldots, r$.
For each $a \in \mathcal{A}$, we define
$$h_a(x) = \Bigg( \sum_{i=1}^{r} a_i x_i \Bigg) \bmod p$$

Definition: $\mathcal{H} = \{h_a : a \in \mathcal{A}\}$

Example: $r = 3$, table size $p = 101$.
$aba \to (97, 98, 97)$
Suppose the random choice is $(a_1, a_2, a_3) = (0, 1, 2)$.
$\sum_{i=1}^{3} a_i x_i = 0 \times 97 + 1 \times 98 + 2 \times 97 = 292$, and $292 \bmod 101 = 90$.
Hence, $h_a(aba) = 90$.
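The worked example can be reproduced directly. The helper name `linear_hash` is hypothetical; the encoding via `ord` assumes plain ASCII input, as in the example.

```python
def linear_hash(a, x, p):
    """h_a(x) = (sum of a_i * x_i) mod p for equal-length vectors a and x."""
    if len(a) != len(x):
        raise ValueError("a and x must have the same number of coordinates")
    return sum(ai * xi for ai, xi in zip(a, x)) % p


x = tuple(ord(ch) for ch in "aba")      # ASCII codes: (97, 98, 97)
value = linear_hash((0, 1, 2), x, 101)  # (0*97 + 1*98 + 2*97) mod 101
```

Here `value` is 90, matching the hand computation. Note that all coordinates (97 and 98) are below $p = 101$, as the construction requires.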

Properties:
(0). $h_a: U \to \{0, 1, \ldots, p-1\}$.
(2). $h_a$ is compactly represented: it is enough to choose and remember a random $a \in \mathcal{A}$. For any element $u \in U$, we can compute $h_a(u)$ with $r$ multiplications of numbers of $\log p$ bits each.

Analyzing the Data Structure

Theorem (13.24): For any prime $p$, any integer $z \neq 0 \bmod p$, and any two integers $\alpha, \beta$: if $\alpha z = \beta z \bmod p$, then $\alpha = \beta \bmod p$.

Proof:
Suppose $\alpha z = \beta z \bmod p$.
Then $(\alpha - \beta) z = 0 \bmod p$, i.e. $(\alpha - \beta) z$ is divisible by $p$.
Since $z$ is not divisible by $p$ and $p$ is prime,
$\alpha - \beta$ must be divisible by $p$, i.e. $\alpha = \beta \bmod p$.
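The role of primality in this cancellation rule can be seen by exhaustive search over small moduli. The function name `cancellation_holds` is a hypothetical helper.

```python
def cancellation_holds(modulus: int) -> bool:
    """Check: for every z != 0 (mod modulus), alpha*z = beta*z implies alpha = beta."""
    for z in range(1, modulus):
        for alpha in range(modulus):
            for beta in range(modulus):
                same_product = (alpha * z - beta * z) % modulus == 0
                if same_product and alpha != beta:
                    return False        # found a counterexample
    return True


prime_ok = cancellation_holds(7)
composite_ok = cancellation_holds(6)
```

For the prime modulus 7 no counterexample exists. For the composite modulus 6 the rule fails, for instance $0 \cdot 2 = 3 \cdot 2 \bmod 6$ although $0 \neq 3 \bmod 6$, which is why the table size is taken to be prime.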

Theorem (13.25): The class of linear functions $\mathcal{H}$ defined above is universal.

Proof:
Let $x = (x_1, x_2, \ldots, x_r)$ and $y = (y_1, y_2, \ldots, y_r)$ be two distinct elements of $U$.
We need to show that, for a randomly chosen $a \in \mathcal{A}$, $P(h_a(x) = h_a(y)) \leq \frac{1}{p}$.
Since $x \neq y$, there must be an index $j$ such that $x_j \neq y_j$.
Consider a vector $a = (a_1, a_2, \ldots, a_r)$. [Fix every coordinate except $a_j$; we then ask which choices of $a_j$ give $h_a(x) = h_a(y)$.]
$$h_a(x) = \sum_{i=1}^{r} a_i x_i \bmod p, \qquad h_a(y) = \sum_{i=1}^{r} a_i y_i \bmod p$$
We want $P\big(h_a(x) = h_a(y)\big)$. The event $h_a(x) = h_a(y)$ is equivalent to
$$a_j (x_j - y_j) = \sum_{i \neq j} a_i (y_i - x_i) \bmod p$$
Write $m$ for the right-hand side and let $z = x_j - y_j$, so that the equation becomes
$$a_j z = m \bmod p$$
where $z \neq 0 \bmod p$, because $x_j \neq y_j$ and $0 \leq x_j, y_j < p$.
Claim: there is exactly one value $0 \leq a_j < p$ that satisfies the above equation.
To see this, suppose $a_j z = a_j' z = m \bmod p$.
Then Theorem (13.24) gives $a_j = a_j' \bmod p$.
But $0 \leq a_j, a_j' < p$, so $a_j = a_j'$. Thus distinct values of $a_j$ give distinct values of $a_j z \bmod p$; since there are $p$ choices of $a_j$ and $p$ possible residues, every residue, including $m$, is hit exactly once.
Hence, the probability of choosing $a_j$ so that $h_a(x) = h_a(y)$ is $\frac{1}{p}$.
We have shown that $\mathcal{H}$ is a universal class of hash functions.
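The claim that exactly one $a_j$ works can be illustrated numerically. The sketch below finds that value with a modular inverse and compares it with a brute-force search; the helper name `solve_for_aj` and the sample numbers $p = 101$, $z = 3$, $m = 7$ are assumptions made for illustration, and `pow(z, -1, p)` requires Python 3.8 or later.

```python
def solve_for_aj(z: int, m: int, p: int) -> int:
    """Return the unique a_j in 0..p-1 with a_j * z = m (mod p), for prime p."""
    z %= p
    if z == 0:
        raise ValueError("z must be nonzero mod p")
    return (m * pow(z, -1, p)) % p      # multiply m by the inverse of z mod p


p, z, m = 101, 3, 7
a_j = solve_for_aj(z, m, p)
# Brute force: list every candidate in 0..p-1 that satisfies the equation.
solutions = [c for c in range(p) if (c * z - m) % p == 0]
```

The brute-force list contains a single element, and it agrees with the value obtained from the inverse ($a_j = 36$, since $36 \cdot 3 = 108 = 7 \bmod 101$). One good value out of $p$ equally likely choices gives the probability $1/p$.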
It remains to justify why the values $a_i$, $i \neq j$, do not affect the probability that $h_a(x) = h_a(y)$.

  • Let $E$ be the event that $h_a(x) = h_a(y)$.
  • Let $\mathcal{F}_b$ be the event that the coordinates $a_i$ (for $i \neq j$) receive the sequence of values $b$.

Above, we showed that $P\big(E \mid \mathcal{F}_b\big) = 1/p$ for every $b$.
By the law of total probability:
$$P(E) = \sum_b P\big(E \mid \mathcal{F}_b\big) \cdot P(\mathcal{F}_b) = \frac{1}{p} \sum_b P(\mathcal{F}_b) = \frac{1}{p}$$
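The whole theorem can be verified exhaustively for tiny parameters. The sketch below enumerates every pair of distinct vectors and every $a \in \mathcal{A}$; the function name `collision_probabilities` and the parameter choices ($p = 3$, $r = 2$) are assumptions for illustration, and the universe is taken to be all vectors with coordinates in $[0, p-1]$.

```python
from fractions import Fraction
from itertools import product


def collision_probabilities(modulus: int, r: int) -> set:
    """Set of exact collision probabilities over all pairs of distinct vectors."""
    vectors = list(product(range(modulus), repeat=r))   # serves as U and as A

    def h(a, x):
        return sum(ai * xi for ai, xi in zip(a, x)) % modulus

    probabilities = set()
    for x in vectors:
        for y in vectors:
            if x == y:
                continue
            # Count the vectors a for which h_a(x) = h_a(y).
            collisions = sum(1 for a in vectors if h(a, x) == h(a, y))
            probabilities.add(Fraction(collisions, len(vectors)))
    return probabilities


probs = collision_probabilities(3, 2)
```

For $p = 3$ and $r = 2$ the result is the single value $1/3$: every pair of distinct elements collides for exactly $p^{r-1}$ of the $p^r$ vectors $a$, as the total-probability argument predicts.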
