Dictionary:
Maintain a set of items, each with a key, subject to:
- insert(item): add item to the set
- delete(item): remove item from the set
- search(key): return the item with the given key, or report that it doesn't exist
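As a point of reference, Python's built-in dict already supports these three operations. The sketch below wraps it in a hypothetical `Dictionary` class; the class name and the `Item` tuple are invented here for illustration and are not part of any library.

```python
from collections import namedtuple

# Hypothetical item type: anything with a .key attribute would do.
Item = namedtuple("Item", ["key", "value"])

class Dictionary:
    """Dictionary ADT sketch backed by Python's built-in dict."""

    def __init__(self):
        self._items = {}

    def insert(self, item):
        # Overwrites any existing item with the same key.
        self._items[item.key] = item

    def delete(self, item):
        # Raises KeyError if no item with this key is stored.
        del self._items[item.key]

    def search(self, key):
        # Returns the item, or None to report that the key does not exist.
        return self._items.get(key)
```

Returning None from search is one possible way to report a missing key; raising an exception would be an equally reasonable choice.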
Motivation
Dictionaries are perhaps the most popular data structure in CS
- built into most modern programming languages (Python, Perl, Ruby, JavaScript, Java, C++, C#, ...)
- e.g. best docdist code: word counts & inner product
- implement databases (DB_HASH in Berkeley DB)
- English word → definition (literal dict.)
- English words: for spelling correction
- word → all web pages containing that word
- username → account object
- compilers & interpreters: names → variables
- network routers: IP address → wire
- network server: port number → socket/app.
- virtual memory: virtual address → physical
Less obvious, using hashing techniques:
- substring search (grep, Google)
- string commonalities (DNA)
- file or directory synchronization
- cryptography: file transfer & identification
How do we solve the dictionary problem?
Simple approach: direct-access table
- store items in an array indexed by key
Problems:
1. keys must be non-negative integers (or, using two arrays, integers)
2. large key range ⇒ large space, e.g. one key of 2^256 is bad news.
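A direct-access table can be sketched in a few lines, which also makes the cost visible: the array has one slot per possible key. The class name `DirectAccessTable` and its `universe_size` parameter are illustrative assumptions, not names from the notes.

```python
class DirectAccessTable:
    """Direct-access table sketch: slot k holds the item whose key is k."""

    def __init__(self, universe_size):
        # One slot per possible key, so space grows with the key range,
        # not with the number of items actually stored.
        self.slots = [None] * universe_size

    def _check(self, key):
        # Keys must be integers in {0, ..., universe_size - 1}.
        if not 0 <= key < len(self.slots):
            raise KeyError(key)

    def insert(self, key, value):
        self._check(key)
        self.slots[key] = (key, value)

    def delete(self, key):
        self._check(key)
        self.slots[key] = None

    def search(self, key):
        self._check(key)
        return self.slots[key]  # None reports that the key is absent
```

Every operation is a single array access, but a table for 256-bit keys would need 2^256 slots, which is why the key range has to be reduced first.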
Solution to 1: “prehash” keys to integers
Solution to 2: hashing
- reduce universe U of all keys (integers) down to reasonable size m for table
- idea: m = Θ(n), where n = number of keys in the dictionary
- hash function h: U → {0, 1, ..., m − 1}
- two keys ki, kj ∈ K collide if h(ki) = h(kj)
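The two steps, prehashing and hashing, can be illustrated together. In this sketch `prehash` turns a string into a non-negative integer by reading its bytes as one big number, and `h` reduces that integer to a slot with k mod m. Both functions, the word list, and the choice m = 7 are assumptions made for the example, not choices prescribed by the notes.

```python
def prehash(s):
    """Map a string to a non-negative integer (illustrative prehash)."""
    return int.from_bytes(s.encode("utf-8"), "big")

def h(k, m):
    """Reduce an integer key to a slot in {0, ..., m - 1}."""
    return k % m

m = 7
words = ["cat", "dog", "emu", "fox", "owl", "yak", "bat", "hen", "ant"]

# Group the words by the slot they hash to.
slots = {}
for w in words:
    slots.setdefault(h(prehash(w), m), []).append(w)

# With 9 keys and only 7 slots, at least one slot must hold
# two or more keys (pigeonhole), so collisions are unavoidable.
collisions = {slot: ws for slot, ws in slots.items() if len(ws) > 1}
print(collisions)
```

Because the universe is larger than the table, no choice of h can avoid collisions for every key set; the table needs a policy for handling them.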
Chaining:
each slot of the table holds a linked list of the items that hash to it
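A minimal sketch of chaining follows. It assumes Python's built-in `hash` as the prehash and `hash(key) % m` as the hash function, and it uses Python lists in place of linked lists; the class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    """Hash table with chaining: each slot holds the items that hash to it."""

    def __init__(self, m=8):
        self.m = m
        # Python lists stand in for the linked lists here.
        self.table = [[] for _ in range(m)]

    def _slot(self, key):
        # Python's % always returns a value in {0, ..., m - 1} for m > 0.
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.table[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # replace an existing item
                return
        chain.append((key, value))

    def search(self, key):
        # Scan only the chain for this key's slot.
        for k, v in self.table[self._slot(key)]:
            if k == key:
                return v
        return None  # reports that the key does not exist

    def delete(self, key):
        s = self._slot(key)
        self.table[s] = [(k, v) for k, v in self.table[s] if k != key]
```

Search and delete touch a single chain, so their cost depends on how long that chain is rather than on the total number of items.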
Simple uniform hashing:
- each key is equally likely to be hashed to any slot of the table
- independent of where other keys hash
Analysis
- expected length of chain for n keys, m slots = n/m = α = load factor
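- expected running time of search = O(1 + α): O(1) to compute the hash and access the slot, O(α) to scan the chain; this is O(1) when α = O(1)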
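The expected chain length can be derived with indicator variables and linearity of expectation. The symbol X_{ij} is introduced here for the derivation only: it is 1 if key k_i hashes to slot j and 0 otherwise.

```latex
% X_{ij} = 1 if key k_i hashes to slot j, and 0 otherwise.
% Under simple uniform hashing, Pr{h(k_i) = j} = 1/m for every slot j.
\begin{align*}
E[\text{length of chain } j]
  &= E\Big[\sum_{i=1}^{n} X_{ij}\Big]
   = \sum_{i=1}^{n} \Pr\{h(k_i) = j\}
   = \sum_{i=1}^{n} \frac{1}{m}
   = \frac{n}{m} = \alpha
\end{align*}
```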
Hash functions
- division method: h(k) = k mod m
- multiplication method:
- universal hashing: h(k) = [(a·k + b) mod p] mod m, where a and b are random ∈ {0, 1, ..., p − 1} and p is a large prime (> |U|).
for worst-case keys k1 ≠ k2, over the random choice of a and b:
Pr{h(k1) = h(k2)} = 1/m
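The three methods can be sketched as small Python functions. The multiplication variant shown is one common textbook form, h(k) = ((a·k) mod 2^w) >> (w − r), and is an assumption here, as are the function names and the default prime p = 2^61 − 1 (a Mersenne prime, assumed to be larger than every key).

```python
import random

def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m

def multiplication_hash(k, a, w, r):
    """One common multiplication method: h(k) = ((a*k) mod 2^w) >> (w - r).

    Assumes w-bit keys, an odd w-bit multiplier a, and a table of m = 2^r slots.
    """
    return ((a * k) % (1 << w)) >> (w - r)

def make_universal_hash(m, p=(1 << 61) - 1, rng=random):
    """Draw h(k) = ((a*k + b) mod p) mod m with a, b random in {0, ..., p-1}.

    The default p = 2^61 - 1 is prime; it is assumed to exceed every key.
    """
    a = rng.randrange(p)
    b = rng.randrange(p)
    return lambda k: ((a * k + b) % p) % m
```

Calling `make_universal_hash(m)` once per table fixes a and b for the lifetime of that table. Drawing many independent functions and counting how often two fixed keys land in the same slot should give a collision rate close to 1/m.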