The Problem
- There is a universe $U$ of possible elements that is extremely large.
- The data structure keeps track of a set $S \subseteq U$ whose size is generally a negligible fraction of $|U|$.
- The goal is to insert and delete elements of $S$ and to quickly determine whether a given element belongs to $S$.
We call such a data structure a dictionary. It supports the following operations:
- MakeDictionary: initializes a dictionary that can maintain a subset $S$ of $U$; the dictionary starts out empty.
- Insert($u$): adds element $u \in U$ to the set $S$.
- Delete($u$): removes element $u$ from the set $S$, if it is currently present.
- Lookup($u$): determines whether $u$ currently belongs to $S$; if it does, it also retrieves any additional information stored with $u$.
If $U$ is small, we can use an array of length $|U|$ with one entry per element: the entry for $u$ is set to 1 if $u \in S$ and to 0 if $u \notin S$.
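The small-universe scheme just described can be sketched as follows. This is a minimal illustration, assuming the universe is the integers $0, \dots, |U|-1$; the class name `DirectAddressSet` and its method names are hypothetical, chosen only for this example.

```python
class DirectAddressSet:
    """Set over a small universe {0, ..., universe_size - 1}, one flag per element."""

    def __init__(self, universe_size):
        # Entry u is 1 if u is in S and 0 otherwise.
        self.present = [0] * universe_size

    def insert(self, u):
        self.present[u] = 1

    def delete(self, u):
        self.present[u] = 0

    def lookup(self, u):
        return self.present[u] == 1


s = DirectAddressSet(10)
s.insert(3)
print(s.lookup(3), s.lookup(4))  # True False
```

Every operation takes constant time, but the array needs $|U|$ entries no matter how small $S$ is.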
However, we are considering the setting in which the universe $U$ is enormous, so we cannot use an array whose size is anywhere near $|U|$.
Designing the Data Structure
Hash Functions
Suppose we want to be able to store a set $S$ of size up to $n$.
- Set up an array $H$ of size $n$ to store the information.
- Use a function $h: U \to \{0, 1, \dots, n-1\}$ that maps elements of $U$ to array positions.
- Store each $u \in S$ at position $h(u)$ of the array $H$.
- The table size is comparable to the set size: $n \sim |S|$.
We call $h$ a hash function and $H$ a hash table.
Ideally, $h(u) \neq h(v)$ for all distinct $u$ and $v$ in the set $S$.
- In that case we could look up $u$ in constant time: array position $H[h(u)]$ is either empty or contains only $u$.
In practice, there can be distinct elements $u, v \in S$ for which $h(u) = h(v)$. We say that these two elements collide.
- There are several ways to handle collisions; here we assume that each position $H[i]$ of the hash table stores a linked list of all elements $u \in S$ with $h(u) = i$.
The operation Lookup($u$) now works as follows:
- Compute the hash value $h(u)$.
- Scan the linked list at position $H[h(u)]$ to see whether $u$ is present in this list.
The time required for Lookup($u$) is proportional to the time to compute $h(u)$ plus the length of the linked list at $H[h(u)]$.
Insert($u$): adds $u$ to the linked list at position $H[h(u)]$.
Delete($u$): scans the linked list at $H[h(u)]$ and removes $u$ if it is present.
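The three operations can be sketched together as below. This is only an illustration under stated assumptions: Python lists stand in for the linked lists, the class name `ChainedDictionary` is hypothetical, and the hash function `u % 5` in the usage lines is a placeholder, not a universal hash function.

```python
class ChainedDictionary:
    """Dictionary over a large universe, using a hash table with chaining."""

    def __init__(self, n, h):
        self.n = n
        self.h = h                           # hash function U -> {0, ..., n-1}
        self.table = [[] for _ in range(n)]  # table[i] holds all (u, info) with h(u) == i

    def insert(self, u, info=None):
        chain = self.table[self.h(u)]
        for k, (v, _) in enumerate(chain):
            if v == u:                       # already present: update the stored info
                chain[k] = (u, info)
                return
        chain.append((u, info))

    def delete(self, u):
        i = self.h(u)
        self.table[i] = [(v, info) for (v, info) in self.table[i] if v != u]

    def lookup(self, u):
        """Return (True, info) if u is in S, else (False, None)."""
        for v, info in self.table[self.h(u)]:  # cost grows with the chain length
            if v == u:
                return True, info
        return False, None


d = ChainedDictionary(5, lambda u: u % 5)    # placeholder hash, not universal
d.insert(7, "seven")
d.insert(12, "twelve")                       # collides with 7: both hash to 2
print(d.lookup(12))                          # (True, 'twelve')
d.delete(7)
print(d.lookup(7))                           # (False, None)
```

Each operation computes $h(u)$ once and then scans a single chain, which is why the chain lengths determine the running time.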
Main goal: to show that randomization can “spread out” the elements being added, even though collisions cannot be completely avoided.
Choosing A Good Hash Function
Basic idea: for every element $u \in U$, when we go to insert $u$ into $S$, we select a value $h(u)$ uniformly at random from the set $\{0, 1, \dots, n-1\}$, independently of all previous choices.
Theorem (13.22): With this uniform random hashing scheme, the probability that two randomly selected values $h(u)$ and $h(v)$ collide, that is, that $h(u) = h(v)$, is exactly $1/n$.
Proof:
There are $n^2$ possible pairs of values $(h(u), h(v))$, all equally likely.
Exactly $n$ of these pairs are collisions (those with $h(u) = h(v)$).
$$P(\text{collide}) = \frac{n}{n^2} = \frac{1}{n}$$
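The counting argument in this proof can be checked by direct enumeration. The sketch below assumes nothing beyond the statement itself; the function name `collision_probability` is made up for the example.

```python
from fractions import Fraction
from itertools import product


def collision_probability(n):
    """Exact P(h(u) == h(v)) for independent uniform values on {0, ..., n-1}."""
    pairs = list(product(range(n), repeat=2))            # all n^2 equally likely pairs
    colliding = sum(1 for hu, hv in pairs if hu == hv)   # the n pairs with equal values
    return Fraction(colliding, len(pairs))


print(collision_probability(8))  # 1/8
```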
However, using a hash function with independently chosen random values is not feasible: when we later call Delete($u$) or Lookup($u$), we need to know the value $h(u)$ that was chosen when $u$ was inserted. Writing down every such value would take too much space.
Universal Classes of Hash Functions
Instead, we choose a hash function at random from a carefully selected class of functions. Each function $h$ in our class $\mathcal{H}$ maps the universe $U$ into the set $\{0, 1, \dots, n-1\}$.
Properties required of $\mathcal{H}$:
- For any pair of distinct elements $u, v \in U$, the probability that a randomly chosen $h \in \mathcal{H}$ satisfies $h(u) = h(v)$ is at most $1/n$.
- Each $h \in \mathcal{H}$ can be compactly represented and, for a given $h \in \mathcal{H}$ and $u \in U$, we can compute the value $h(u)$ efficiently.
Definition: a class $\mathcal{H}$ of functions is universal if it satisfies the first property.
Theorem (13.23): Let $\mathcal{H}$ be a universal class of hash functions mapping a universe $U$ to the set $\{0, 1, \dots, n-1\}$, let $S$ be an arbitrary subset of $U$ of size at most $n$, and let $u$ be any element in $U$. We define $X$ to be a random variable equal to the number of elements $s \in S$ for which $h(s) = h(u)$, for a random choice of hash function $h \in \mathcal{H}$. (Here $S$ and $u$ are fixed, and the randomness is in the choice of $h \in \mathcal{H}$.) Then $E[X] \le 1$.
(In other words: for a hash function $h$ drawn at random from $\mathcal{H}$ and any fixed $u \in U$, $X$ is the number of elements $s \in S$, other than $u$ itself, that satisfy $h(s) = h(u)$.)
Proof:
For an element $s \in S$ with $s \neq u$, we define a random variable $X_s$ that is equal to $1$ if $h(s) = h(u)$, and equal to $0$ otherwise.
$\because$ the class of functions is universal
$\therefore$ $E[X_s] = \Pr(X_s = 1) \le \frac{1}{n}$
Since $X = \sum_{s \in S,\, s \neq u} X_s$, linearity of expectation gives
$$E[X] = \sum_{s \in S,\, s \neq u} E[X_s] \le |S| \cdot \frac{1}{n} \le 1$$
(The sum has $|S| - 1$ terms when $u \in S$ and $|S|$ terms otherwise, so the bound holds in both cases.)
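As a sanity check of this bound, the sketch below computes $E[X]$ exactly for one particular universal class: the class of all functions from a tiny universe to $\{0, \dots, n-1\}$, in which any two distinct elements collide with probability exactly $1/n$. That class is not compactly representable, so it is used here only for checking; the function name `expected_collisions` and the concrete sets are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product


def expected_collisions(universe, n, S, u):
    """Exact E[X] when h is uniform over ALL functions universe -> {0, ..., n-1}.

    X counts the elements s in S, s != u, with h(s) == h(u).
    """
    total = 0
    count = 0
    for values in product(range(n), repeat=len(universe)):
        h = dict(zip(universe, values))     # one function from the class
        total += sum(1 for s in S if s != u and h[s] == h[u])
        count += 1
    return Fraction(total, count)


U = [0, 1, 2, 3, 4]
print(expected_collisions(U, 3, S=[0, 1, 2], u=4))  # 1
print(expected_collisions(U, 3, S=[0, 1, 2], u=0))  # 2/3
```

With $|S| = n = 3$, the expectation is $3 \cdot \frac{1}{3} = 1$ when $u \notin S$ and $2 \cdot \frac{1}{3}$ when $u \in S$, so the bound $E[X] \le 1$ is tight in the first case.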
Designing a Universal Class of Hash Functions
We will use a prime number $p > n = |S|$ as the size of the hash table $H$.
Example
Let $U = \{\text{words of at most 45 characters}\}$.
Each character maps to its ASCII code in $[0, 127]$:
- $\text{a} \to 97$, $\text{b} \to 98$
A word is a vector $x = (x_0, x_1, \dots, x_{m-1})$ with $m \le 45$ and $x_i \in [0, 127]$,
e.g. $\text{aba} \to (97, 98, 97)$.
As in the example above, to make integer arithmetic convenient we represent each element of the universe $U$ as a vector $x = (x_1, x_2, \dots, x_r)$ for some integer $r$, where $0 \le x_i < p$ for each $i$.
- First identify $U$ with the integers in the range $[0, N-1]$ for some $N$. [That is, label the $N$ elements of $U$ as $0, 1, \dots, N-1$.]
- Then use consecutive blocks of $\lfloor \log p \rfloor$ bits of $u$ to define the corresponding coordinates $x_i$. [A block of $\lfloor \log p \rfloor$ bits encodes a value below $2^{\lfloor \log p \rfloor} \le p$, which guarantees $0 \le x_i < p$. Note that $x_i$ is a coordinate of the element, not a position in the hash table $H$.]
- If $U \subseteq [0, N-1]$, then we need $r \approx \log N / \log n$ coordinates. [An element of $U$ takes about $\log N$ bits, and each coordinate holds about $\log p \approx \log n$ bits.]
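The block-splitting step can be sketched as follows, assuming base-2 logarithms, a non-negative integer $u$, and $p \ge 2$. The function name `to_coordinates` and the choice to emit the lowest block first are illustrative assumptions, not something the text fixes.

```python
def to_coordinates(u, p, r):
    """Split the non-negative integer u into r blocks of floor(log2 p) bits each."""
    bits = p.bit_length() - 1           # floor(log2 p) for p >= 2
    mask = (1 << bits) - 1
    if u >= 1 << (bits * r):
        raise ValueError("u does not fit in r coordinates")
    coords = []
    for _ in range(r):
        coords.append(u & mask)         # lowest block first; each block is < 2**bits <= p
        u >>= bits
    return tuple(coords)


print(to_coordinates(1000, 101, 3))     # (40, 15, 0)
```

For $p = 101$ each block has 6 bits, and $1000 = 40 + 15 \cdot 64$, so every coordinate is below $64 < p$.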
Let $\mathcal{A}$ be the set of all vectors of the form $a = (a_1, a_2, \dots, a_r)$, where $a_i$ is an integer in $[0, p-1]$ for each $i = 1, \dots, r$.
For each $a \in \mathcal{A}$, we define
$$h_a(x) = \Bigg( \sum_{i=1}^{r} a_i x_i \Bigg) \bmod p$$
Definition: $\mathcal{H} = \{h_a : a \in \mathcal{A}\}$
Example: $r = 3$, table size $p = 101$.
$\text{aba} \to (97, 98, 97)$
Suppose the randomly chosen coefficients are $(a_1, a_2, a_3) = (0, 1, 2)$.
$$\sum_{i=1}^{3} a_i x_i = 0 \times 97 + 1 \times 98 + 2 \times 97 = 292, \qquad 292 \bmod 101 = 90$$
Hence $h_a(\text{aba}) = 90$.
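The definition of $h_a$ and this worked example translate directly into code. The sketch below is one possible rendering; the names `hash_with` and `make_hash` are hypothetical, and drawing $a$ with Python's `random` module is an assumption about how the random choice might be made.

```python
import random


def hash_with(a, x, p):
    """h_a(x) = (sum_i a_i * x_i) mod p for coordinate vectors a and x."""
    if len(a) != len(x):
        raise ValueError("a and x must have the same length")
    return sum(ai * xi for ai, xi in zip(a, x)) % p


def make_hash(p, r, rng=random):
    """Draw a uniformly from [0, p-1]^r and return (a, h_a)."""
    a = tuple(rng.randrange(p) for _ in range(r))
    return a, (lambda x: hash_with(a, x, p))


x = tuple(ord(c) for c in "aba")        # (97, 98, 97)
print(hash_with((0, 1, 2), x, 101))     # 90
```

Only the vector $a$ has to be stored to evaluate $h_a$ later, which is what makes the class compactly representable.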
Properties:
(0). $h_a: U \to \{0, 1, \dots, p-1\}$.
(2). $h_a$ can be represented compactly by choosing and remembering a random $a \in \mathcal{A}$. For any element $u \in U$, we can compute $h_a(u)$ with $r$ multiplications of numbers of about $\log p$ bits each.
Analyzing the Data Structure
Theorem (13.24): For any prime $p$ and any integer $z \neq 0 \bmod p$, and any two integers $\alpha, \beta$, if $\alpha z = \beta z \bmod p$, then $\alpha = \beta \bmod p$.
Proof:
Suppose $\alpha z = \beta z \bmod p$.
Then $(\alpha - \beta) z = 0 \bmod p$, that is, $(\alpha - \beta) z$ is divisible by $p$.
$\because$ $z$ is not divisible by $p$, and $p$ is prime
$\therefore$ $\alpha - \beta$ is divisible by $p$, that is, $\alpha = \beta \bmod p$
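A brute-force check makes the role of primality in this cancellation rule concrete. The sketch below tests every residue for a small modulus; the function name `cancellation_holds` is invented for the example, and checking residues $0, \dots, p-1$ suffices because only values mod $p$ matter.

```python
def cancellation_holds(p):
    """True iff alpha*z == beta*z (mod p) forces alpha == beta (mod p) for all z != 0 (mod p)."""
    for z in range(1, p):
        for alpha in range(p):
            for beta in range(p):
                if (alpha * z) % p == (beta * z) % p and alpha != beta:
                    return False
    return True


print(cancellation_holds(7))   # True: 7 is prime
print(cancellation_holds(6))   # False: 0*2 == 3*2 (mod 6) but 0 != 3
```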
Theorem (13.25): The class of linear functions $\mathcal{H}$ defined above is universal.
Proof:
Let $x = (x_1, x_2, \dots, x_r)$ and $y = (y_1, y_2, \dots, y_r)$ be two distinct elements of $U$.
We need to show that, for a randomly chosen $a \in \mathcal{A}$, $P(h_a(x) = h_a(y)) \le \frac{1}{p}$.
Since $x \neq y$, there must be an index $j$ such that $x_j \neq y_j$.
Consider the hash function given by $a = (a_1, a_2, \dots, a_r)$. [Fix all coordinates except $a_j$; we now ask which choices of $a_j$ give $h_a(x) = h_a(y)$.]
$$h_a(x) = \sum_{i=1}^{r} a_i x_i \bmod p, \qquad h_a(y) = \sum_{i=1}^{r} a_i y_i \bmod p$$
We want $P\big(h_a(x) = h_a(y)\big)$. The event $h_a(x) = h_a(y)$ is equivalent to
$$a_j (x_j - y_j) = \sum_{i \neq j} a_i (y_i - x_i) \bmod p$$
Let $m$ denote the right-hand side and let $z = x_j - y_j$, so that the equation becomes
$$a_j z = m \bmod p$$
where $z \neq 0 \bmod p$, because $x_j \neq y_j$ and $0 \le x_j, y_j < p$.
Claim: there is exactly one value $0 \le a_j < p$ that satisfies the above equation.
Indeed, suppose $a_j z = a_j' z = m \bmod p$.
Then, by Theorem (13.24), $a_j = a_j' \bmod p$.
But $0 \le a_j, a_j' < p$, so $a_j = a_j'$. Hence at most one value works. Since the $p$ candidate values of $a_j$ therefore produce $p$ distinct residues $a_j z \bmod p$, exactly one of them equals $m \bmod p$.
Hence the probability of choosing $a_j$ so that $h_a(x) = h_a(y)$ is $\frac{1}{p}$.
We have shown that $\mathcal{H}$ is a universal class of hash functions.
The following shows why the values $a_i$, $i \neq j$, do not affect the probability that $h_a(x) = h_a(y)$.
- Let $E$ be the event that $h_a(x) = h_a(y)$.
- Let $\mathcal{F}_b$ be the event that the coordinates $a_i$ (for $i \neq j$) receive the sequence of values $b$.
Above, we showed that $P\big(E \mid \mathcal{F}_b\big) = 1/p$ for every $b$.
By the law of total probability:
$$P(E) = \sum_b P\big(E \mid \mathcal{F}_b\big) \cdot P(\mathcal{F}_b) = \frac{1}{p} \sum_b P(\mathcal{F}_b) = \frac{1}{p}$$
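For small parameters, the conclusion $P(E) = 1/p$ can be confirmed by exhaustive enumeration over all $a \in \mathcal{A}$. The sketch below does this for every pair of distinct vectors; the function name `collision_probabilities` and the parameters $p = 5$, $r = 2$ are arbitrary choices made for illustration.

```python
from fractions import Fraction
from itertools import product


def collision_probabilities(p, r):
    """Set of exact values P_a(h_a(x) == h_a(y)) over all distinct x, y in [0, p-1]^r."""
    vectors = list(product(range(p), repeat=r))

    def h(a, x):
        return sum(ai * xi for ai, xi in zip(a, x)) % p

    probs = set()
    for x in vectors:
        for y in vectors:
            if x != y:
                hits = sum(1 for a in vectors if h(a, x) == h(a, y))
                probs.add(Fraction(hits, len(vectors)))
    return probs


print(collision_probabilities(5, 2))   # {Fraction(1, 5)}
```

Every distinct pair collides for exactly $p^{r-1}$ of the $p^r$ vectors $a$. With a composite modulus such as 4 and $r = 1$, the pair $x = 0$, $y = 2$ collides with probability $\frac{1}{2}$, which shows that the primality of $p$ is doing real work.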