How large should the table be?
- want m = Θ(n) at all times
Idea
Start small (constant) and grow (or shrink) as necessary
Rehashing
To grow or shrink table hash function must change
- must rebuild hash table from scratch
- Θ(n + m) time = Θ(n), if m = Θ(n)
How fast to grow
When n reaches m, say
- m += 1, rebuild every step, n inserts cost Θ(n^2)
- m *= 2, rebuild at insertion 2^i, n inserts cost Θ(n)
- a few inserts cost linear time, but Θ(1) “on average”
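The two growth rules above can be compared by counting how many items get rehashed in total. This is only a rough sketch: the helper name total_rebuild_cost, the starting size of 1, and the choice n = 1024 are assumptions made for illustration, and only rebuild work is counted.

```python
def total_rebuild_cost(n, grow):
    """Count items rehashed by rebuilds while inserting n items.

    `grow` maps the current table size m to the new size. A rebuild is
    triggered whenever the number of stored items reaches m.
    """
    m = 1        # current table size (assumed starting size)
    copied = 0   # total number of items moved during rebuilds
    for count in range(1, n + 1):  # count = items stored after this insert
        if count == m:             # table is full: rebuild into a larger one
            copied += count        # every stored item is rehashed
            m = grow(m)
    return copied


n = 1024
additive = total_rebuild_cost(n, lambda m: m + 1)  # m += 1
doubling = total_rebuild_cost(n, lambda m: 2 * m)  # m *= 2

print(additive)  # 524800 = n(n+1)/2, i.e. Θ(n^2)
print(doubling)  # 2047 = 2n - 1, i.e. Θ(n)
```

With doubling, the rebuild sizes form the geometric series 1 + 2 + 4 + ... + n, which sums to less than 2n, so the rebuild work averages out to a constant per insert.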
Amortized Analysis
This is a common technique in data structures
- an operation has amortized cost T(n) if k operations cost ≤ k · T(n)
- “T(n) amortized” roughly means T(n) “on average”, but averaged over all ops.
- e.g. inserting into a hash table takes O(1) amortized time.
Back to hashing
Maintain m = Θ(n) ⇒ α = Θ(1) ⇒ support search in O(1) expected time (assuming simple uniform or universal hashing)
Deletion
Also, O(1) expected as is.
- space can get big with respect to n e.g. n× insert, n× delete
- solution: when n decreases to m/4, shrink to half the size ⇒ O(1) amortized cost for both insert and delete
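The grow and shrink rules above could be put together roughly as follows. This is a minimal sketch, not a reference implementation: it assumes chaining, a hypothetical minimum size MIN_SIZE = 8, and uses Python's built-in hash reduced mod m as a stand-in for a hash function that depends on the table size.

```python
class ResizingHashTable:
    """Chaining hash table that keeps m = Θ(n) by doubling and halving."""

    MIN_SIZE = 8  # assumed constant starting (and minimum) size

    def __init__(self):
        self.m = self.MIN_SIZE  # number of slots
        self.n = 0              # number of stored items
        self.buckets = [[] for _ in range(self.m)]

    def _index(self, key, m):
        # Stand-in hash function; it depends on m, so resizing forces a rebuild.
        return hash(key) % m

    def _rebuild(self, new_m):
        # Rehash every item from scratch: Θ(n + m) time.
        new_buckets = [[] for _ in range(new_m)]
        for bucket in self.buckets:
            for key, value in bucket:
                new_buckets[self._index(key, new_m)].append((key, value))
        self.m, self.buckets = new_m, new_buckets

    def insert(self, key, value):
        bucket = self.buckets[self._index(key, self.m)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.n += 1
        if self.n == self.m:             # n reached m: double
            self._rebuild(2 * self.m)

    def search(self, key):
        for k, v in self.buckets[self._index(key, self.m)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self.buckets[self._index(key, self.m)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.n -= 1
                # n dropped to m/4: halve, but never go below MIN_SIZE
                if self.m > self.MIN_SIZE and self.n <= self.m // 4:
                    self._rebuild(self.m // 2)
                return True
        return False
```

Shrinking at m/4 rather than m/2 leaves the table half full right after a shrink, so a sequence that alternates inserts and deletes near a boundary cannot trigger a rebuild on every operation.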
Resizable Arrays
list.append and list.pop in O(1) amortized
String Matching
Given two strings s and t: does s occur as a substring of t?
Simple Algorithm:
any(s == t[i : i + len(s)] for i in range(len(t) - len(s) + 1))
O(|s|) time for each substring comparison
O(|s| · (|t| − |s| + 1)) time = O(|s| · |t|), potentially quadratic
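As a runnable version of the simple algorithm, here is a small sketch wrapped in a function. The name naive_find is an assumption for illustration. There are len(t) − len(s) + 1 candidate starting positions, so the range includes the last alignment.

```python
def naive_find(s, t):
    """Return True if s occurs as a substring of t, checking every alignment."""
    # If s is longer than t the range is empty and the result is False.
    return any(s == t[i:i + len(s)] for i in range(len(t) - len(s) + 1))


print(naive_find("abc", "xxabc"))  # True: the match is at the last alignment
print(naive_find("abd", "xxabc"))  # False
```

Each slice comparison costs O(|s|), which is where the quadratic worst case comes from.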
Karp-Rabin Algorithm
Rolling Hash ADT:
Maintain string x subject to
- r(): reasonable hash function h(x) on string x
- r.append(c): add letter c to end of string x
- r.skip(c): remove the front letter from string x, assuming it is c
Karp-Rabin Application:
for c in s:
    rs.append(c)
for c in t[:len(s)]:
    rt.append(c)
if rs() == rt(): ...
# O(|s|)
for i in range(len(s), len(t)):
    rt.skip(t[i-len(s)])
    rt.append(t[i])
    if rs() == rt(): ...
# O(|t|) + O(#matches*|s|)
Data Structure:
Treat string x as a multi-digit number u in base a where a denotes the alphabet size, e.g., 256
- r() = u mod p for (ideally random) prime p ≈ |s| or |t| (division method)
- r stores u mod p and |x| (really a^|x| mod p), not u ⇒ smaller and faster to work with (u mod p fits in one machine word)
- r.append(c): (u·a + ord(c)) mod p = [(u mod p) · a + ord(c)] mod p
- r.skip(c): [u − ord(c) · (a^(|x|−1) mod p)] mod p = [(u mod p) − ord(c) · (a^(|x|−1) mod p)] mod p
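The rolling hash and its use in Karp-Rabin might be sketched as below. Several details are assumptions rather than part of the notes: the fixed prime 1,000,000,007 (the notes suggest an ideally random prime), the attribute names, storing a^|x| mod p and dividing out one factor of a via a modular inverse in skip, and verifying every hash match with a direct string comparison before reporting it.

```python
class RollingHash:
    """Maintains u mod p for a string x read as a number u in base a."""

    def __init__(self, base=256, p=1_000_000_007):
        self.a = base
        self.p = p                          # assumed fixed prime, not random
        self.u = 0                          # u mod p
        self.power = 1                      # a^|x| mod p
        self.inv_a = pow(base, p - 2, p)    # a^(-1) mod p (Fermat, p prime)

    def __call__(self):
        return self.u

    def append(self, c):
        # (u * a + ord(c)) mod p
        self.u = (self.u * self.a + ord(c)) % self.p
        self.power = (self.power * self.a) % self.p

    def skip(self, c):
        # After this line, power = a^(|x|-1) mod p, the weight of the front letter.
        self.power = (self.power * self.inv_a) % self.p
        # (u - ord(c) * a^(|x|-1)) mod p; Python's % keeps the result non-negative.
        self.u = (self.u - ord(c) * self.power) % self.p


def karp_rabin(s, t):
    """Return the list of indices where s occurs in t."""
    if len(s) > len(t):
        return []
    rs, rt = RollingHash(), RollingHash()
    for c in s:
        rs.append(c)
    for c in t[:len(s)]:
        rt.append(c)
    matches = []
    # Equal hashes only suggest a match, so confirm by comparing the strings.
    if rs() == rt() and s == t[:len(s)]:
        matches.append(0)
    for i in range(len(s), len(t)):
        rt.skip(t[i - len(s)])   # drop the letter leaving the window
        rt.append(t[i])          # add the letter entering the window
        start = i - len(s) + 1
        if rs() == rt() and s == t[start:i + 1]:
            matches.append(start)
    return matches


print(karp_rabin("ab", "abcabab"))  # [0, 3, 5]
```

The explicit comparison on a hash match costs O(|s|), which corresponds to the O(#matches*|s|) term, plus the cost of any false positives from hash collisions.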