Hash Table: Supported Operations
Purpose: maintain a (possibly evolving) set of stuff
(transactions, people + associated data, IP addresses, etc.).
Insert: add new record.
Delete: delete existing record.
Lookup: check for a particular record (a “dictionary”)
Applications:
1. Application: De-Duplication
Given: a “stream” of objects.
(Linear scan through a huge file. Or objects arriving in real time)
Goal: remove duplicates (keep track of unique objects)
⋅ report unique visitors to a web site
⋅ avoid duplicates in search results.
Solution: when new object x arrives
⋅ Lookup x in hash table H
⋅ if not found, Insert x into H.
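The de-duplication loop above can be sketched in Python as follows. This is only an illustration: Python's built-in `set` (which is backed by a hash table) stands in for H, and the function name `deduplicate` and the sample visitor list are made up for the example.

```python
def deduplicate(stream):
    """Yield each object of `stream` the first time it is seen."""
    seen = set()              # plays the role of the hash table H
    for x in stream:
        if x not in seen:     # Lookup x in H
            seen.add(x)       # not found: Insert x into H
            yield x


# Hypothetical stream of visitor IP addresses.
visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
unique_visitors = list(deduplicate(visitors))
```

Each arriving object costs one Lookup and at most one Insert, so the whole stream is processed in time linear in its length (assuming constant-time hash table operations).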
2. The 2-SUM Problem
Input: unsorted array A of n integers. Target sum t.
Goal: determine whether or not there are two numbers x, y in A with $x + y = t$.
Naive Solution: $\Theta(n^2)$ time via exhaustive search.
Better:
1.) sort A ($\Theta(n \log n)$ time)
2.) for each x in A, look for t-x in A via binary search.
Amazing:
1.) insert elements of A into hash table H.
2.) for each x in A, Lookup t-x in H: $\Theta(n)$ time.
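A minimal sketch of the hash-table approach to 2-SUM, again using Python's `set` as the hash table. Note that this is a one-pass variant (an assumption of this sketch, not the two-step procedure listed above): it looks up t-x only among elements seen earlier, so an element is never paired with itself. The function name `two_sum_exists` is hypothetical.

```python
def two_sum_exists(A, t):
    """Return True if two entries of A (at different positions) sum to t."""
    seen = set()              # hash table H of elements seen so far
    for x in A:
        if t - x in seen:     # Lookup t-x in H
            return True
        seen.add(x)           # Insert x into H
    return False


found = two_sum_exists([3, 8, 1, 5], 9)       # 8 + 1 = 9
not_found = two_sum_exists([3, 8, 1, 5], 10)  # no pair sums to 10
```

With constant-time Insert and Lookup, the loop does a constant amount of work per element, which is where the linear running time comes from.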
3. Further Immediate Applications
⋅ Historical application: symbol tables in compilers.
⋅ Blocking network traffic.
⋅ Search algorithms (game tree exploration)
− Use hash table to avoid exploring any configuration (arrangement of chess pieces) more than once.
4. High-Level Idea.
Setup: universe U [all IP addresses, all names, all chessboard configurations, etc.] [generally really big]
Goal: want to maintain an evolving set $S \subseteq U$ [generally of reasonable size].
Solution:
1.) pick n = number of buckets.
2.) choose a hash function $h : U \to \{0, 1, 2, \ldots, n-1\}$: it takes a key as input and returns a position between 0 and n-1.
3.) use array A of length n, store x in A[h(x)].
For comparison, the naive solutions:
1. Array-based solution [indexed by u]
⋅ $O(1)$ operations but $\Theta(|U|)$ space.
2. List-based solution.
⋅ $\Theta(|S|)$ space but $\Theta(|S|)$ Lookup.
5. Resolving Collisions.
Collision: distinct $x, y \in U$ such that $h(x) = h(y)$, i.e., the hash function returns the same position for two different keys.
1.) Solution #1: (separate) chaining.
⋅ keep a linked list in each bucket.
⋅ given a key/object x, perform Insert/Delete/Lookup in the list in A[h(x)] (h(x): bucket for x; A[h(x)]: linked list for x).
2.) Solution #2: open addressing (only one object per bucket).
⋅ Hash function now specifies a probe sequence $h_1(x), h_2(x), \ldots$
⋅ Examples: linear probing (look consecutively: bucket 17, then 18, 19, ...); double hashing (the first hash function specifies the initial bucket to probe, the second specifies the offset for each subsequent probe).
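Separate chaining, the first of the two solutions above, might look like the following sketch. It makes several simplifying assumptions that are not part of the notes: keys are integers, the default hash function is h(x) = x mod n, and Python lists stand in for the linked lists. The class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    """Hash table with separate chaining (Python lists stand in for linked lists)."""

    def __init__(self, n, h=None):
        self.n = n
        # Assumption: integer keys and h(x) = x mod n unless another h is supplied.
        self.h = h if h is not None else (lambda x: x % n)
        self.buckets = [[] for _ in range(n)]

    def insert(self, x):
        chain = self.buckets[self.h(x)]   # the list stored in A[h(x)]
        if x not in chain:
            chain.append(x)

    def lookup(self, x):
        return x in self.buckets[self.h(x)]

    def delete(self, x):
        chain = self.buckets[self.h(x)]
        if x in chain:
            chain.remove(x)


H = ChainedHashTable(n=4)
for key in [17, 21, 25, 6]:   # 17, 21 and 25 all collide in bucket 1
    H.insert(key)
H.delete(21)
```

Every operation only touches the chain in one bucket, so its cost is proportional to that chain's length rather than to the size of the whole table.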
Definition: the load factor of a hash table is:
$$\alpha = \frac{\#\text{ of objects in hash table}}{\#\text{ of buckets of hash table}}$$
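To make the load factor concrete, here is a rough sketch of open addressing with linear probing that also reports its own load factor. The assumptions are loud ones: integer keys, initial bucket h(x) = x mod n, and no Delete (deletion under open addressing needs extra care and is left out). The class name `LinearProbingTable` is hypothetical.

```python
class LinearProbingTable:
    """Open addressing with linear probing; at most one object per bucket."""

    def __init__(self, n):
        self.slots = [None] * n
        self.count = 0

    def load_factor(self):
        # alpha = (# of objects in the table) / (# of buckets)
        return self.count / len(self.slots)

    def insert(self, x):
        n = len(self.slots)
        for step in range(n):
            i = (x + step) % n            # probe h(x), h(x)+1, h(x)+2, ...
            if self.slots[i] is None:
                self.slots[i] = x
                self.count += 1
                return i
            if self.slots[i] == x:        # already present
                return i
        raise RuntimeError("table is full (load factor 1)")

    def lookup(self, x):
        n = len(self.slots)
        for step in range(n):
            i = (x + step) % n
            if self.slots[i] is None:     # empty bucket ends the probe sequence
                return False
            if self.slots[i] == x:
                return True
        return False


T = LinearProbingTable(8)
positions = [T.insert(k) for k in (17, 25, 33, 4)]   # 17, 25, 33 all start at bucket 1
alpha = T.load_factor()
```

Here the three colliding keys end up in consecutive buckets, and the table cannot hold more objects than it has buckets, so the load factor can never exceed 1.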
Note:
1.) $\alpha = O(1)$ is a necessary condition for operations to run in constant time.
2.) with open addressing, need $\alpha \ll 1$ (only one object per bucket).
6. Pathological Data Sets
Upshot #2: for good hash table performance, need a good hash function.
Ideal: use a super-clever hash function guaranteed to spread every data set out evenly.
Problem: DOES NOT EXIST! (for every hash function, there is a pathological data set)
Reason: fix a hash function $h : U \to \{0, 1, \ldots, n-1\}$.
⇒ by the Pigeonhole Principle, there exists a bucket i such that at least $|U|/n$ elements of U hash to i under h.
⇒ if the data set is drawn only from these, everything collides!
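The pigeonhole argument can be played out on a toy example. Everything concrete here is an assumption made for illustration: a universe of the integers 0 to 999, n = 10 buckets, and the simple hash function h(x) = x mod n.

```python
n = 10


def h(x):
    """Hypothetical simple hash function: x mod n."""
    return x % n


universe = range(1000)

# Group the universe by bucket; some bucket must receive at least |U|/n elements.
buckets = {}
for u in universe:
    buckets.setdefault(h(u), []).append(u)
fullest = max(buckets.values(), key=len)

# A data set drawn only from that bucket: every element collides.
pathological = fullest[:50]
buckets_used = {h(x) for x in pathological}
```

All 50 keys of the pathological data set land in a single bucket, so with chaining every operation degrades to a scan of one long list.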
7. Pathological Data in the Real World.
Main Point: can paralyze several real-world systems by exploiting badly designed hash functions.
− open source.
− overly simplistic hash function.
(easy to reverse engineer a pathological data set)
Solutions
1. Use a cryptographic hash function (e.g., SHA-2)
− infeasible to reverse engineer a pathological data set.
2. Use randomization.
− design a family H of hash functions such that for all data sets S, “almost all” functions $h \in H$ spread S out “pretty evenly”.
Universal Hash Functions
Definition: Let H be a set of hash functions from U to $\{0, 1, 2, \ldots, n-1\}$.
H is universal if and only if:
for all x, y in U (with $x \ne y$),
$$\Pr_{h \in H}[x, y \text{ collide}] \le \frac{1}{n}$$
(collide: $h(x) = h(y)$), when h is chosen uniformly at random from H.
I.e., the collision probability is as small as with the “gold standard” of perfectly random hashing.
Example: Hashing IP Addresses.
Let U = IP addresses (of the form $(x_1, x_2, x_3, x_4)$), with each $x_i \in \{0, 1, 2, \ldots, 255\}$.
Let n = a prime (small multiple of # of objects in HT).
Construction: define one hash function $h_a$ per 4-tuple $a = (a_1, a_2, a_3, a_4)$ with each $a_i \in \{0, 1, 2, 3, \ldots, n-1\}$.
Define: $h_a$ : IP addrs → buckets by
$$h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \bmod n$$
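The construction above can be sketched in Python. The helper names `random_coefficients` and `ip_hash`, the dotted-quad string format for addresses, and the choice n = 997 (a prime) are all assumptions made for this illustration.

```python
import random


def random_coefficients(n, rng=random):
    """Pick a = (a1, a2, a3, a4) uniformly at random, each ai in {0, ..., n-1}."""
    return tuple(rng.randrange(n) for _ in range(4))


def ip_hash(a, ip, n):
    """h_a(x1, x2, x3, x4) = (a1*x1 + a2*x2 + a3*x3 + a4*x4) mod n."""
    x = tuple(int(part) for part in ip.split("."))
    return sum(ai * xi for ai, xi in zip(a, x)) % n


n = 997                                        # hypothetical prime number of buckets
a = random_coefficients(n, random.Random(0))   # seeded only to make the run repeatable
bucket = ip_hash(a, "192.168.0.1", n)
```

Choosing the tuple a at random once, when the table is created, is what selects one function from the family; after that, the same a must be used for every Insert and Lookup.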
A Universal Hash Function
Define: $H = \{ h_a \mid a_1, a_2, a_3, a_4 \in \{0, 1, 2, \ldots, n-1\} \}$, where
$$h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \bmod n$$
Theorem: This family is universal.
Proof (Part I)
Consider distinct IP addresses $(x_1, x_2, x_3, x_4)$, $(y_1, y_2, y_3, y_4)$.
Assume: $x_4 \ne y_4$ (the addresses differ in at least one coordinate; the other cases are analogous).
Note: collision
⇔ $a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 \equiv a_1 y_1 + a_2 y_2 + a_3 y_3 + a_4 y_4 \pmod{n}$
⇔ $a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3} a_i (y_i - x_i) \pmod{n}$
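Proof (Part II)
The story so far: with $a_1, a_2, a_3$ fixed arbitrarily, how many choices of $a_4$ satisfy
$$a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3} a_i (y_i - x_i) \pmod{n} \; ?$$
Key Claim: the left-hand side is equally likely to be any of {0, 1, 2, …, n-1}.
Reason: $x_4 \ne y_4$, so $x_4 - y_4 \not\equiv 0 \pmod{n}$ (assuming n is larger than 255). Since n is prime and $a_4$ is uniform over {0, 1, …, n-1}, multiplying by $x_4 - y_4$ permutes the residues mod n. Hence exactly one choice of $a_4$ satisfies the equation, and the collision probability is $1/n$.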
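The counting step of the proof can be checked by brute force on a small instance. The specific values are assumptions chosen for the example: n = 257 (a prime larger than 255), two made-up addresses that differ in the last coordinate, and arbitrary fixed values for the first three coefficients.

```python
def h(a, x, n):
    """h_a(x) = (a1*x1 + a2*x2 + a3*x3 + a4*x4) mod n."""
    return sum(ai * xi for ai, xi in zip(a, x)) % n


n = 257                    # prime, larger than 255
x = (192, 168, 0, 1)
y = (192, 168, 7, 42)      # differs from x in the last coordinate
a123 = (5, 100, 200)       # a1, a2, a3 fixed arbitrarily

# Try every possible a4 and keep the ones that make x and y collide.
colliding_a4 = [
    a4 for a4 in range(n)
    if h(a123 + (a4,), x, n) == h(a123 + (a4,), y, n)
]
```

Exactly one of the n candidate values of a4 produces a collision, matching the 1/n bound in the definition of a universal family.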
Bloom Filter: Supported Operations.
Fast Inserts and Lookups.
Comparison to Hash Tables.
Pros: more space efficient
Cons:
1) can’t store an associated object.
2) No deletions.
3) Small false positive probability.
(might say x has been inserted even though it hasn’t been)
Applications:
Original: early spellcheckers.
Canonical: list of forbidden passwords.
Modern: network routers,
− limited memory, need to be super-fast.
Bloom Filter: Under the Hood:
Ingredients:
1) array A of n bits.
(So $n/|S|$ = # of bits per object in the data set S)
2) k hash functions $h_1, \ldots, h_k$ (k = small constant)
Insert(x):
⋅ for i = 1, 2, …, k
− set A[$h_i(x)$] = 1
Lookup(x): return TRUE ⇔ A[$h_i(x)$] = 1 for every i = 1, 2, …, k.
Note: no false negatives
(if x was inserted, Lookup(x) is guaranteed to succeed).
But: false positive if all k $h_i(x)$'s were already set to 1 by other insertions.
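Insert and Lookup as described above can be sketched in a few lines of Python. How the k hash functions are obtained is an assumption of this sketch, not something the notes specify: here each $h_i$ is SHA-256 applied to the key prefixed with the index i, reduced mod n. The class name `BloomFilter` and the sample passwords are also made up.

```python
import hashlib


class BloomFilter:
    """Array of n bits plus k hash functions; supports Insert and Lookup only."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.bits = [0] * n

    def _positions(self, x):
        # Assumption: h_i(x) = SHA-256("i:x") mod n, for i = 0, ..., k-1.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.n

    def insert(self, x):
        for p in self._positions(x):
            self.bits[p] = 1

    def lookup(self, x):
        # TRUE only if every one of the k bits is set.
        return all(self.bits[p] for p in self._positions(x))


forbidden = BloomFilter(n=1024, k=5)
for password in ["123456", "password", "qwerty"]:
    forbidden.insert(password)
```

A Lookup of any inserted password always returns TRUE; a Lookup of some other string usually returns FALSE but may occasionally return TRUE, which is the false positive discussed above.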
Heuristic Analysis
Intuition: should be a trade-off between space and error (false positive) probability.
Assume: all $h_i(x)$'s are uniformly random and independent.
Setup: n bits, insert data set S into Bloom filter.
Note: for each bit of A, the probability it’s been set to 1 is (under the above assumption):
$$1 - \left(1 - \frac{1}{n}\right)^{k|S|} \approx 1 - e^{-k|S|/n} = 1 - e^{-k/b}$$
where b = # of bits per object (n/|S|); the approximation $(1 - 1/n)^{k|S|} \approx e^{-k|S|/n}$ is accurate for large n.
Story so far: the probability that a given bit is 1 is $\approx 1 - e^{-k/b}$.
So: under the assumption, for x not in S, the false positive probability is
$$\epsilon \approx \left[1 - e^{-k/b}\right]^{k}$$
(the error rate $\epsilon$), where b = # of bits per object.
How to set k? For fixed b, $\epsilon$ is minimized by setting $k \approx (\ln 2) \cdot b$.
Plugging back in: $\epsilon \approx \left(\frac{1}{2}\right)^{(\ln 2) b}$, or $b \approx 1.44 \log_2 \frac{1}{\epsilon}$.
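The formulas of the heuristic analysis are easy to evaluate numerically. The sketch below assumes the same idealized model (uniformly random, independent hash functions); the function names and the sample value b = 8 are chosen only for illustration.

```python
import math


def optimal_k(b):
    """Number of hash functions minimizing the error rate: k = (ln 2) * b."""
    return math.log(2) * b


def false_positive_rate(b, k):
    """Heuristic error rate: [1 - e^(-k/b)]^k."""
    return (1 - math.exp(-k / b)) ** k


def bits_per_object(eps):
    """Bits per object needed for error rate eps: b = log2(1/eps) / ln 2."""
    return math.log2(1 / eps) / math.log(2)


b = 8
k = round(optimal_k(b))            # k must be a whole number of hash functions
eps = false_positive_rate(b, k)
b_for_one_percent = bits_per_object(0.01)
```

With 8 bits per object the error rate comes out at roughly 2%, and a 1% error rate needs a little under 10 bits per object, regardless of how large the objects themselves are.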