Algorithms2-week4-hashtable

Hash Table: Supported Operations
Purpose: maintain a (possibly evolving) set of stuff
(transactions, people + associated data, IP addresses, etc.).
Insert: add new record.
Delete: delete existing record.
Lookup: check for a particular record (a “dictionary”)
Applications:
1. Application: De-Duplication
Given: a “stream” of objects.
(Linear scan through a huge file. Or objects arriving in real time)
Goal: remove duplicates (keep track of unique objects).
-- e.g., report unique visitors to a web site.
-- e.g., avoid duplicates in search results.
Solution: when new object x arrives
lookup x in hash table H
if not found, Insert x into H.
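The Lookup-then-Insert loop above can be sketched in Python as follows. This is only an illustration: the built-in `set` stands in for the hash table H, and the function name `deduplicate` and the sample visitor list are made up for the example.

```python
def deduplicate(stream):
    """Yield each distinct object the first time it appears in the stream."""
    seen = set()              # plays the role of the hash table H
    for x in stream:
        if x not in seen:     # Lookup x in H
            seen.add(x)       # not found: Insert x into H
            yield x


# Hypothetical stream of visitor IP addresses.
visitors = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"]
unique_visitors = list(deduplicate(visitors))
```

Because the function is a generator, it handles objects arriving one at a time as well as a linear scan through a file.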
2. The 2-SUM Problem
Input: unsorted array A of n integers. Target sum t.
Goal: determine whether or not there are two numbers x, y in A with
$x + y = t$
Naive Solution: $\Theta(n^2)$ time via exhaustive search
Better:
1.) sort A ($\Theta(n \log n)$ time)
2.) for each x in A, look for t-x in A via binary search.
Amazing:
1.) insert elements of A into hash table H.
2.) for each x in A, Lookup t-x in H. $\Theta(n)$ time.
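A minimal Python sketch of the hash-table solution. Two things here are assumptions of the sketch, not part of the notes: `collections.Counter` (a dictionary, hence a hash table) is used for H, and x and y are required to sit at two different positions of A, so a single 3 does not count as a solution for t = 6.

```python
from collections import Counter


def two_sum_exists(A, t):
    """Return True if two entries of A (at different positions) sum to t."""
    counts = Counter(A)       # step 1: insert all elements of A into H
    for x in A:               # step 2: one Lookup of t - x per element
        y = t - x
        if y in counts and (y != x or counts[x] >= 2):
            return True
    return False


found = two_sum_exists([3, 9, 1, 7], 10)      # 3 + 7 and 9 + 1 both work
not_found = two_sum_exists([3, 9, 1, 7], 6)   # would need two copies of 3
```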
3. Further Immediate Applications
Historical application : symbol tables in compilers.
Blocking network traffic.
Search algorithms (game tree exploration)
-- Use a hash table to avoid exploring any configuration
(arrangement of chess pieces) more than once.
4. High-Level Idea.
Setup: universe U [all IP addresses, all names, all chessboard configurations, etc.] [generally really big]
Goal: want to maintain an evolving set $S \subseteq U$
[generally of reasonable size].
Solution:
1.) pick n = number of buckets.
2.) choose a hash function $h: U \to \{0, 1, 2, \ldots, n-1\}$: it takes a key as input and returns a position between 0 and n-1.
3.) use array A of length n, store x in A[h(x)].
Naive Solutions (for comparison):
1. Array-based solution [indexed by u]:
$O(1)$ operations but $\Theta(|U|)$ space.
2. List-based solution: $\Theta(|S|)$ space but $\Theta(|S|)$ Lookup.
5. Resolving Collisions.
Collision: distinct $x, y \in U$ such that $h(x) = h(y)$, i.e., the hash function returns the same position for different keys.
1.) Solution #1: (separate) chaining.
Keep a linked list in each bucket.
Given a key/object x, perform Insert/Delete/Lookup in the list in A[h(x)] (h(x): bucket for x; A[h(x)]: linked list for x).
2.) Solution #2: open addressing (only one object per bucket).
Hash function now specifies a probe sequence $h_1(x), h_2(x), \ldots$
Examples: linear probing (look consecutively: 17, then 18, 19, ...);
double hashing (the first hash function specifies the initial bucket to probe, the second specifies the offset for each subsequent probe).
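To make Solution #1 concrete, here is a small sketch of separate chaining. The class name `ChainedHashTable` and its methods are invented for this example, Python's built-in `hash()` plays the role of h, and a Python list per bucket stands in for the linked list.

```python
class ChainedHashTable:
    """Hash table with separate chaining: one list of (key, value) pairs per bucket."""

    def __init__(self, n=8):
        self.n = n                                  # number of buckets
        self.buckets = [[] for _ in range(n)]       # the array A

    def _bucket(self, key):
        return self.buckets[hash(key) % self.n]     # A[h(key)]

    def insert(self, key, value=None):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                            # key already stored: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


table = ChainedHashTable(n=4)
for i in range(10):                                 # 10 keys, 4 buckets: collisions are certain
    table.insert(i, i * i)
longest_chain = max(len(b) for b in table.buckets)
```

Every operation only touches the one list in A[h(x)], so its cost is proportional to the length of that chain.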
Definition: the load factor of a hash table is:
$\alpha = \frac{\text{\# of objects in hash table}}{\text{\# of buckets of hash table}}$
Note:
1.) $\alpha = O(1)$ is a necessary condition for operations to run in constant time.
2.) with open addressing, need $\alpha \ll 1$ (only one object per bucket).
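A sketch of open addressing with linear probing that also reports the load factor. Everything named here (`LinearProbingTable`, `load_factor`, the use of `None` as the empty-slot marker) is an assumption of the sketch; it supports only Insert and Lookup, since deletion under open addressing needs extra care that the notes do not cover.

```python
class LinearProbingTable:
    """Open addressing with linear probing: at most one key per bucket."""

    def __init__(self, n=16):
        self.n = n
        self.slots = [None] * n      # None marks an empty bucket, so None cannot be a key
        self.size = 0

    def load_factor(self):
        return self.size / self.n    # alpha = (# of objects) / (# of buckets)

    def _probe(self, key):
        start = hash(key) % self.n   # initial bucket
        for step in range(self.n):   # then look at consecutive buckets, wrapping around
            yield (start + step) % self.n

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                self.slots[i] = key
                self.size += 1
                return True
            if self.slots[i] == key:
                return True          # already present
        return False                 # every bucket is occupied (alpha = 1)

    def lookup(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return False         # reached an empty bucket: key was never inserted
            if self.slots[i] == key:
                return True
        return False


probing = LinearProbingTable(n=8)
probing.insert(17)                   # hash(17) % 8 == 1
probing.insert(25)                   # also bucket 1, so it moves on to bucket 2
```

As the table fills up, probe sequences get longer, which is why the load factor has to stay well below 1.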
6. Pathological Data Sets
Upshot #2: for good HT performance, need a good hash function.
Ideal: use a super-clever hash function guaranteed to spread every data set out evenly.
Problem: DOES NOT EXIST! (For every hash function, there is a pathological data set.)
Reason: fix a hash function $h: U \to \{0, 1, \ldots, n-1\}$.
By the Pigeonhole Principle, there exists a bucket i such that at least $|U|/n$ elements of U hash to i under h.
If the data set is drawn only from these, everything collides!
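The pigeonhole argument is constructive, which the following sketch tries to show: given any fixed hash function and a (small, enumerable) universe, group the keys by bucket and keep the fullest bucket. The function `simple_hash` and the 1000-element universe are hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict


def pathological_data_set(h, universe):
    """Return the keys of the fullest bucket: a data set on which h always collides."""
    by_bucket = defaultdict(list)
    for x in universe:
        by_bucket[h(x)].append(x)
    return max(by_bucket.values(), key=len)


n = 10                                   # number of buckets


def simple_hash(x):
    """A hypothetical, overly simplistic fixed hash function."""
    return (31 * x + 7) % n


universe = range(1000)
bad_keys = pathological_data_set(simple_hash, universe)
buckets_hit = {simple_hash(x) for x in bad_keys}
```

Here `bad_keys` contains at least |U|/n = 100 keys and all of them land in the same bucket.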
7. Pathological Data in the Real World.
Main Point: can paralyze several real-world systems by exploiting badly designed hash functions.
-- open source.
-- overly simplistic hash function.
(easy to reverse engineer a pathological data set)
Solutions
1. Use a cryptographic hash function (e.g., SHA-2)
-- infeasible to reverse engineer a pathological data set.
2. Use randomization.
-- design a family H of hash functions such that for all data sets S, “almost all” functions $h \in H$ spread S out “pretty evenly”.
Universal Hash Functions
Definition: Let H be a set of hash functions from U to
$\{0, 1, 2, \ldots, n-1\}$.
H is universal if and only if:
for all x, y in U (with $x \neq y$)
$\Pr_{h \in H}[x, y \text{ collide}] \leq \frac{1}{n}$ (collide: $h(x) = h(y)$),
when h is chosen uniformly at random from H,
i.e., collision probability as small as with the "gold standard" of perfectly random hashing.
Example: Hashing IP Addresses.
Let U = IP addresses (of the form $(x_1, x_2, x_3, x_4)$), with each $x_i \in \{0, 1, 2, \ldots, 255\}$.
Let n = a prime (a small multiple of the # of objects in the HT).
Construction: define one hash function $h_a$ per 4-tuple $a = (a_1, a_2, a_3, a_4)$ with each $a_i \in \{0, 1, 2, \ldots, n-1\}$.
Define $h_a$: IP addrs → buckets by
$h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \bmod n$
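A short sketch of this construction in Python. The helper names (`make_ip_hash`, `random_ip_hash`, `parse_ip`) and the choice n = 257 are assumptions made for the example; only the formula for h_a comes from the notes.

```python
import random


def make_ip_hash(a, n):
    """Return h_a for the 4-tuple a = (a1, a2, a3, a4)."""
    def h(ip):
        return sum(ai * xi for ai, xi in zip(a, ip)) % n
    return h


def random_ip_hash(n, rng=random):
    """Choose h_a uniformly at random: each a_i is drawn from {0, ..., n-1}."""
    a = tuple(rng.randrange(n) for _ in range(4))
    return make_ip_hash(a, n)


def parse_ip(text):
    """Turn '192.168.0.1' into the tuple (192, 168, 0, 1)."""
    return tuple(int(part) for part in text.split("."))


n = 257                                      # a prime, picked arbitrarily for the example
h = random_ip_hash(n, random.Random(0))      # seeded only to make the example repeatable
bucket = h(parse_ip("192.168.0.1"))
fixed = make_ip_hash((1, 2, 3, 4), n)((10, 20, 30, 40))   # (10 + 40 + 90 + 160) mod 257
```

The random choice of a happens once, when the table is created; after that the same h_a is used for every operation.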
A Universal Hash Function
Define: $H = \{ h_a \mid a_1, a_2, a_3, a_4 \in \{0, 1, 2, \ldots, n-1\} \}$, where
$h_a(x_1, x_2, x_3, x_4) = (a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4) \bmod n$
Theorem: This family is universal.
Proof (Part I)
Consider distinct IP addresses $(x_1, x_2, x_3, x_4)$ and $(y_1, y_2, y_3, y_4)$.
Assume: $x_4 \neq y_4$ (the addresses differ in at least one coordinate; the argument is the same for any other one).
Note: collision
$\iff a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 \equiv a_1 y_1 + a_2 y_2 + a_3 y_3 + a_4 y_4 \pmod{n}$
$\iff a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3} a_i (y_i - x_i) \pmod{n}$
Proof (Part II)
The story so far: with $a_1, a_2, a_3$ fixed arbitrarily, how many choices of $a_4$ satisfy
$a_4 (x_4 - y_4) \equiv \sum_{i=1}^{3} a_i (y_i - x_i) \pmod{n}$?
Key Claim: the left-hand side is equally likely to be any of $\{0, 1, 2, \ldots, n-1\}$.
Reason: $x_4 \neq y_4$, so $x_4 - y_4 \not\equiv 0 \pmod{n}$ provided n > 255; n is prime; and $a_4$ is chosen uniformly at random. Hence exactly one of the n choices of $a_4$ causes a collision, so the collision probability is $\frac{1}{n}$.
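The Key Claim can be checked by brute force for a concrete pair of addresses: fix a1, a2, a3, try every a4, and count how many values make the two addresses collide. The sketch assumes a prime n larger than 255 (here 257) so that x4 - y4 is never a multiple of n; the addresses and the helper name `colliding_a4_values` are made up for the example.

```python
import random


def colliding_a4_values(x, y, a123, n):
    """List every a4 in {0, ..., n-1} with h_a(x) == h_a(y), for fixed a1, a2, a3."""
    a1, a2, a3 = a123

    def h(a4, ip):
        return (a1 * ip[0] + a2 * ip[1] + a3 * ip[2] + a4 * ip[3]) % n

    return [a4 for a4 in range(n) if h(a4, x) == h(a4, y)]


n = 257                       # assumed: a prime larger than 255
x = (192, 168, 0, 1)
y = (192, 168, 7, 254)        # differs from x in the fourth coordinate
rng = random.Random(1)        # seeded only to make the example repeatable

# For 20 arbitrary choices of (a1, a2, a3), count the a4 values that collide.
counts = [
    len(colliding_a4_values(x, y, [rng.randrange(n) for _ in range(3)], n))
    for _ in range(20)
]
```

Every entry of `counts` is 1: whatever a1, a2, a3 are, exactly one of the n values of a4 produces a collision, which is the 1/n bound in the definition of universality.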
Bloom Filter: Supported Operations.
Fast Inserts and Lookups.
Comparison to Hash Tables.
Pros: more space efficient
Cons:
1) can’t store an associated object.
2) No deletions.
3) Small false positive probability.
(might say x has been inserted even though it hasn't been)
Applications:
Original: early spellcheckers.
Canonical: list of forbidden passwords.
Modern: network routers.
-- Limited memory, need to be super-fast.
Bloom Filter: Under the Hood
Ingredients:
1) array A of n bits.
(So $n / |S|$ = # of bits per object in the data set S)
2) k hash functions $h_1, \ldots, h_k$ (k = small constant)
Insert(x):
for i = 1, 2, …, k
-- set $A[h_i(x)] = 1$
Lookup(x): return TRUE if and only if
$A[h_i(x)] = 1$ for every i = 1, 2, …, k.
Note: no false negatives
(if x was inserted, Lookup(x) is guaranteed to succeed).
But: false positive if all k $h_i(x)$'s were already set to 1 by other insertions.
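A compact sketch of Insert and Lookup. The notes do not say where the k hash functions come from; deriving them by prefixing the index i to the key and hashing with SHA-256 (via `hashlib`) is an assumption of this example, as are the class name and the sample passwords. One byte per bit is used for readability rather than compactness.

```python
import hashlib


class BloomFilter:
    """Bloom filter over n bits with k hash functions (no deletions)."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.bits = bytearray(n)          # the array A, all zeros initially

    def _positions(self, x):
        """Yield h_1(x), ..., h_k(x); here h_i is SHA-256 of 'i:x' reduced mod n."""
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.n

    def insert(self, x):
        for p in self._positions(x):
            self.bits[p] = 1              # set A[h_i(x)] = 1

    def lookup(self, x):
        # TRUE only if every A[h_i(x)] is 1; may be a false positive.
        return all(self.bits[p] for p in self._positions(x))


forbidden = ["123456", "password", "qwerty"]      # hypothetical forbidden passwords
bloom = BloomFilter(n=64, k=3)
for pw in forbidden:
    bloom.insert(pw)
all_found = all(bloom.lookup(pw) for pw in forbidden)
```

`all_found` is always True (no false negatives); a Lookup for a string that was never inserted usually returns False but can return True.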
Heuristic Analysis
Intuition: there should be a trade-off between space and error (false positive) probability.
Assume: all $h_i(x)$'s are uniformly random and independent.
Setup: n bits, insert data set S into the Bloom filter.
Note: for each bit of A, the probability it's been set to 1 is (under the above assumption):
$1 - \left(1 - \frac{1}{n}\right)^{k|S|} \approx 1 - e^{-k|S|/n} = 1 - e^{-k/b}$
where b = # of bits per object (n/|S|).

Story so far: the probability a given bit is 1 is $\approx 1 - e^{-k/b}$.
So: under the assumption, for x not in S, the false positive probability is
$\approx \left[1 - e^{-k/b}\right]^{k}$, the error rate $\epsilon$,
where b = # of bits per object.
How to set k? For fixed b, $\epsilon$ is minimized by setting
$k \approx (\ln 2) \cdot b$
Plugging back in:
$\epsilon \approx \left(\frac{1}{2}\right)^{(\ln 2) b}$ or $b \approx 1.44 \log_2 \frac{1}{\epsilon}$
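The trade-off can be evaluated numerically. The sketch below just codes up the heuristic formulas, error rate (1 - e^(-k/b))^k with k ≈ (ln 2)·b, under the same independence assumption; the function names are invented for the example.

```python
import math


def false_positive_rate(b, k):
    """Heuristic error rate for b bits per object and k hash functions."""
    return (1 - math.exp(-k / b)) ** k


def optimal_k(b):
    """Value of k that minimizes the error rate for fixed b (not rounded to an integer)."""
    return math.log(2) * b


def bits_per_object(eps):
    """Bits per object needed for error rate eps: about 1.44 * log2(1 / eps)."""
    return math.log2(1 / eps) / math.log(2)


b = 8
k = optimal_k(b)                      # between 5 and 6
eps = false_positive_rate(b, k)       # roughly 2%
needed = bits_per_object(eps)         # recovers b = 8
```

So 8 bits per object with 5 or 6 hash functions already gives an error rate of roughly 2%, and each extra bit per object shrinks it by a constant factor.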
