Advanced ADT:
BBST: AVL tree, red-black tree, B-tree, B+ tree
Hashing: unordered dictionary
"In an interview, always ask CAN I USE HASH? "
In C++, the hash table is implemented as std::unordered_map
In Python, it is dict()
How to implement
Keys: for an abstract object, we can use the binary data representing the object as the key and convert it to either a string or a number (e.g., a hex string or base64 encoding)
So we can assume keys are strings
Try to map the keys into some integer number in a certain integer range, say [0, 65535]
This mapping f should be fast to compute, e.g., linear (or at worst quadratic) in the length of the key
Ideally, each key maps to a unique number; then, in the RAM model, we can find/delete/insert an item in O(1) time
To store key string S with value V, we just put V at array position f(S)
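As a toy sketch of this direct-addressing idea (TABLE_SIZE and the placeholder mapping f below are illustrative, not a real hash function):

```python
# Direct addressing: store value V at array position f(S),
# pretending f maps each key to a unique index in [0, 65535].
TABLE_SIZE = 65536

def f(key: str) -> int:
    # Placeholder mapping for illustration only; a real hash
    # function is discussed in the next section.
    return sum(key.encode()) % TABLE_SIZE

table = [None] * TABLE_SIZE

def insert(key, value):
    table[f(key)] = (key, value)   # O(1) array write in the RAM model

def find(key):
    slot = table[f(key)]
    return slot[1] if slot is not None and slot[0] == key else None
```

If f were truly injective on the keys we store, every operation is a single array access; collisions are exactly what breaks this picture.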
Hash function
If F is a function that maps from strings to integers with fixed range, then F is a string hash function
A good hash function should have as few COLLISIONS as possible
Consider mapping a string to an integer: $\left(\sum_j P^j\, s[j]\right) \bmod Q$.
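A minimal sketch of this polynomial hash; the choices P = 131 and Q = 65521 (a prime) are illustrative assumptions, not fixed by the notes:

```python
def poly_hash(s: str, P: int = 131, Q: int = 65521) -> int:
    """Polynomial string hash: (sum over j of P^j * s[j]) mod Q."""
    h, power = 0, 1
    for ch in s:
        h = (h + power * ord(ch)) % Q   # add P^j * s[j], reduced mod Q
        power = (power * P) % Q         # advance to P^(j+1)
    return h
```

Each character contributes one multiply-add, so the hash is computed in time linear in the key length, as desired.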
The best hash function would be a one-to-one (injective) mapping.
Separate Chaining
If a table slot is already occupied, chain the colliding items in a linked list. The average chain length n/m is the load factor. We hope this factor is a constant
Desired property for hash function:
- The hashed keys are nicely spread out so that we do not have too many collisions, since collisions affect the time to perform lookups and deletes
- Table size M = O(N)
- The hash function h is fast to compute
Actually, we want f to be "random enough": if the deterministic function f encodes each input to a nearly random (but deterministic) number, it is good. Functions with this property are called pseudo-random.
For example: MD5
“Almost random function” properties
The function behaves like throwing darts at the target range, i.e. uniformly distributed
If the hash table size is N = the number of keys (N balls thrown into N bins)
- The longest chain is O(log N) in the worst case, but the average chain length is O(1)
Birthday paradox
- When there are n or more people in a room, what is the chance that two people have the same birthday?
- It turns out that for a table of size 365 you need only 23 keys for a 50% chance of a collision, and as few as 57 keys for a 99% chance.
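The paradox can be checked directly by multiplying out the probability that all keys land in distinct slots (a sketch; m = 365 models birthdays):

```python
def collision_prob(n: int, m: int = 365) -> float:
    # P(at least one collision) when n keys are thrown uniformly
    # into m slots: 1 - (m/m) * ((m-1)/m) * ... * ((m-n+1)/m).
    p_no_collision = 1.0
    for i in range(n):
        p_no_collision *= (m - i) / m
    return 1.0 - p_no_collision
```

Evaluating it shows the probability crosses 50% between 22 and 23 keys, and exceeds 99% by 57 keys.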
Open addressing: maintain an array that is some constant factor larger than the number of keys and store all keys directly in this array. Every cell in the array is either empty or contains a key
Load factor λ = n/m, where m is the size of the table and n is the number of stored keys.
Probe sequence: map a key into a sequence instead of a number.
Linear probing: probe sequence = [ hash(key) mod m, (hash(key) + 1) mod m, (hash(key) + 2) mod m, … ]
Best case (keys evenly spread): expected displacement ≈ 0.5. Worst case (all n keys form one contiguous cluster in a table of size m = 2n): expected displacement = (n/(2n))·0 + (1/(2n))·n + (1/(2n))·(n−1) + … + (1/(2n))·1 = (n+1)/4 ≈ n/4
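A linear-probing table, sketched under the assumption that the load factor stays below 1 (deletion is omitted because it requires tombstones; Python's built-in hash stands in for the hash function):

```python
class LinearProbingTable:
    # Open addressing with linear probing: on a collision, try the
    # next cell (mod m) until an empty cell or the key is found.
    def __init__(self, m=16):
        self.m = m
        self.slots = [None] * m   # None marks an empty cell

    def insert(self, key, value):
        i = hash(key) % self.m
        for _ in range(self.m):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + 1) % self.m   # linear probe: step to next cell
        raise RuntimeError("table full")

    def find(self, key):
        i = hash(key) % self.m
        for _ in range(self.m):
            if self.slots[i] is None:
                return None        # an empty cell ends the probe chain
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.m
        return None
```

Because keys pile up into contiguous runs, a lookup may have to walk an entire cluster, which is exactly the clustering cost analyzed above.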
Quadratic probing: probe sequence = [ hash(key) mod m, (hash(key) + 1) mod m, (hash(key) + 4) mod m, (hash(key) + 9) mod m, … ]
So quadratic probing can jump over large clusters
But one question remains: can the probe sequence reach enough of the table to find an empty cell?
Claim: if m is prime and the table is at least half empty, then quadratic probing will always find an empty location. Furthermore, no locations are checked twice.
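The reason no location is checked twice: if i² ≡ j² (mod m) with 0 ≤ j < i < m/2, then m divides (i − j)(i + j), which is impossible for prime m since both factors lie strictly between 0 and m. This can be sanity-checked numerically (a sketch):

```python
def quadratic_probe_offsets_distinct(m: int) -> bool:
    # For the claim to hold, the first ceil(m/2) quadratic probe
    # offsets i^2 mod m must be pairwise distinct: with more than
    # m/2 empty cells, one of these distinct cells must be empty.
    offsets = [(i * i) % m for i in range((m + 1) // 2)]
    return len(set(offsets)) == len(offsets)
```

For non-prime m the guarantee fails; e.g. m = 8 already repeats an offset (1² ≡ 3² mod 8).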
Implementations of dictionary with comparable keys: BBST
An AVL tree is a binary search tree in which:
For every node in the tree, the height of the left and right subtrees differ by at most 1.
Rotations are used to maintain the property.
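As a sketch of one such rotation (a right rotation, which restores balance after a left-left imbalance; the minimal Node class here is illustrative):

```python
class Node:
    # Minimal AVL-style node; heights are stored explicitly.
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.height = 1 + max(height(left), height(right))

def height(n):
    return n.height if n else 0

def rotate_right(y):
    # Right rotation around y: its left child x becomes the new
    # subtree root, and x's old right subtree moves under y.
    x = y.left
    y.left, x.right = x.right, y
    y.height = 1 + max(height(y.left), height(y.right))
    x.height = 1 + max(height(x.left), height(x.right))
    return x
```

Rotating the left-leaning chain 3 → 2 → 1 to the right yields a balanced tree rooted at 2, and only O(1) pointers and heights are updated.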