文章目录
Hash Tables and Symbol Tables Application
Hash Tables
- Save items in a key-indexed table (index is a function of the key)
- Issues.
- Computing the hash function
- Equality test: Method for checking whether two keys are equal.
- Collision resolution: Algorithm and data structure to handle two keys that hash to the same array index
Hash functions
Idealistic goal: Scramble the keys uniformly to produce a table index
-
Java’s hash code conventions
- All Java classes inherit a method
hashCode()
, which returns a 32-bitint
, signed integer. - Requirement. If
x.equals(y)
, thenx.hashCode() == y.hashCode()
. - Highly desirable. If
!x.equals(y)
, thenx.hashCode() != y.hashCode()
. - Default implementation. Memory address of
x
. - Customized implementation.
Integer
,Double
,String
,File
,URL
,Data
, … - User defined
- All Java classes inherit a method
-
Implementing hash code
Integer
public final class Integer { private final int value; ... public int hashCode() { return value; } }
Boolean
public final class Boolean { private final boolean value; ... public int hashCode() { if (value) return 1231; else return 1237; } }
Double
public final class Double { private final double value; ... public int hashCode() { long bits = doubleToLongBits(Value); return (int) (bits ^ (bits >>> 32)); } }
String
public final class String { private final char[] s; ... public int hashCode() { int hash = 0; for (int i = 0; i < s.length(); i++) hash = s[i] + (31 * hash); return hash; } }
String
is immutable, when string contents change, assign a new string to the reference. See Java basic.
-
User-defined types
public final class Transaction implements Comparable<Transaction> { private final String who; private final Date when; private final double amount; ... public int hashCode() { int hash = 17; // no zero constant hash = 31*hash + who.hashCode(); // for reference types, use hashCode(). hash = 31*hash + when.hashCode(); hash = 31*hash + ((Double) amount).hashCode(); // for primitive types, use hashCode() of wrapper type. return hash; } }
-
Standard recipe for user-defined types
- Combine each significant field using the
31x+y
rule. - If field is primitive type, use wrapper type hashCode().
- If field if null, return 0.
- If field is a reference type, use hashCode().
- If field is a array, apply to each entry, or use
Arrays.deepHashCode()
- Combine each significant field using the
-
In practice. This recipe works reasonably well; used in Java libraries.
-
Modular hashing, the correct code is
private in hash(Key key) { // 0x7fffffff means all 32 bits are 1 except the first bit (signed bit) is 0. // key.hashCode() & 0x7fffffff, change the signed bit of negative number to 0. return (key.hashCode() & 0x7fffffff)) % M; }
-
Integer Literals Details
- Decimal, as usual, positive or negative (-)
- Octal number, using a leading
0
(zero) digit and one or more additional octal digits (digits between0
and7
), eg.077
- Hexadecimal, using the form
0x
(or0X
) followed by one or more hexadecimal digits (digits from0
to9
,a
tof
orA
toF
). For example,0xCAFEBABEL
is the long integer3405691582
. Like octal numbers, hexadecimal literals may represent negative numbers. - For example.
int i1 = 0x7fffffff; Integer.toBinaryString(i1)
, output0111 1111 1111 1111 1111 1111 1111 1111
without space.
Separate Chaining
- Use an array of M < N linked lists. N is the total of keys, M is number of hash buckets.
- Implementation SeparateChainingST.java
public class SeparateChainingHashST<Key, Value> { private int M = 97; // number of chains private Node[] st = new Node[M]; // array of chains private static class Node { private Object key; // no generic array creation private Object val; // (declare key and value of type Object) private Node next; } private int hash(Key key) { return (key.hashCode() & 0x7fffffff)) % M; } public Value get(Key key) { int i = hash(key); for (Node x = st[i]; x != null; x = x.next) { if (key.equals(x.key)) return (Value) x.val; // cast to Value type } return null; } public void put(Key key, Value val) { int i = hash(key); for (Node x = st[i]; x != null; x = x.next) { if (key.equals(x.key)) { x.val = val; return; } } st[i] = new Node(key, val, st[i]); // add the k-v pair to st[i] chain. } }
Linear Probing (Open addressing)
- Open addressing. When a new key collides, find next empty slot, and put it there.
- Linear Probing process:
- Hash. Map key to integer
i
between0
andM-1
- Insert. Put at table index
i
if free; if not tryi+1
,i+2
etc. - Search. Search table index
i
; if occupied but no match, tyri+1
,i+2
etc. - Note: Array size
M
must be greater than number of key-value pairsN
.
- Hash. Map key to integer
- Implementation LinearProbingHashST
public class LinearProbingHashST { private int M = 30001; private Value[] vals = (Value[]) new Object[M]; // array doubling and halving code omitted private Key[] keys = (Key[]) new Object[M]; private int hash(Key key) { /* as before */ } public void put(Key key, Value val) { int i; for (int i = hash(key); keys[i] != null; i = (i+1) % M) if (keys[i].equals(key)) break; keys[i] = key; vals[i] = val; } public Value get(Key key) { for (int i = hash(key); keys[i] != null; i = (i+1) % M) if (keys[i].equals(key)) return vals[i]; return null; } }
- Clustering: New keys likely to hash into middle of big clusters.
- Knuth’s parking problem
- Analysis of linear probing
- M too large --> to many empty array entries
- M too small --> search time blows up
- Typical choice:
alpha = N / M ~ (1/2)
, # probes for search hit is about3/2
, # probes for search miss is about5/2
. – Knuth
Hash Table Context
- War story: algorithmic complexity attacks
- Surprising situations: denial-of-service attacks. malicious adversary learns your hash function and causes a big pile-up in single slot that grinds performance to a halt
- Algorithmic complexity attack on Java
- Goal. Find family of strings with the same hash code
- solution. The base 31 hash code is part of Java’s string API.
- Diversion: one-way hash functions
- One-way hash function. “Hard” to find a key that will hash to a desired value (or two keys that hash to same value).
- Applications. Digital fingerprint, message digest, storing password.
- Caveat, Too expensive for use in ST implementations.
- Hashing: variations on the theme
- Two-probe hashing. (separate-chaining variant), hash to two positions, insert key in shorter of the two chains.
- Double hashing. (linear-probing variant), use linear probing, but skip a variable amount, not just 1 each time.
- Cuckoo hashing (linear-probing variant)
Hash tables vs.balanced search trees (in Java)
- Hash tables
- Simpler to code
- Keys are unordered, no effective alternative for unordered keys.
- Faster for simple keys (a few arithmetic ops versus
logN
compares). - Better system support in Java for strings (e.g., cached hash code).
- Balanced search trees
- Stronger performance guarantee.
- Support for ordered ST operations.
- Easier to implement
compareTo()
correctly thanequals()
andhashCode()
.
- Java system includes both
- Red-black BSTs:
java.util.TreeMap, java.util.TreeSet
. - Hash tables:
java.util.HashMap, java.util.HashSet, java.util.IdentityHashMap, java.util.LinkedHashSet
- Red-black BSTs:
Symbol Table Applications
Set
API
-
Mathematical set: A collection of distinct keys.
-
Exception filter, WhiteList or BlackList…
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-MsSZcSrc-1586029264255)(…/…/.gitbook/assets/st-application.png)]
Dictionary lookup
- Command-line arguments
- A comma-separated value (CSV) file.
- Key field, Value field
- Pick any field as the key, and any field as the value.
- Ex 1. DNS lookup
- Ex 2. Amino acids
- Ex 3. Class list
LookupCSV.java
public class LookupCSV {
public static void main(String[] args) {
// process input file
In in = new In(args[0]);
int keyField = Integer.parseInt(args[1]);
int valField = Integer.parseInt(args[2]);
// build symbol tables
ST<String, String> st = new ST<>();
while (!in.isEmpty) {
String line = in.readLine();
String[] tokens = line.split(",");
String key = tokens[keyField];
String val = tokens[valField];
st.put(key, val);
}
while(!StdIn.isEmpty()) {
String s = StdIn.readString();
if (!st.contains(s)) StdOut.println("Not found");
else StdOut.println(st.get(s));
}
}
}
Indexing Clients
- File indexing
-
Goal. Index a PC (or the web). Given a list of files specified, create an index so that you can efficiently find all files containing a given query string.
-
Solution, Key = query string; value = set of files containing that string.
-
Book index
-
Implementation FileIndex.java
-
/*****************************************************************************
Compilation: javac FileIndex.java
Execution: java FileIndex file1.txt file2.txt file3.txt ...
Dependencies: ST.java SET.java In.java StdIn.java StdOut.java
Data files: https://algs4.cs.princeton.edu/35applications/ex1.txt
https://algs4.cs.princeton.edu/35applications/ex2.txt
https://algs4.cs.princeton.edu/35applications/ex3.txt
https://algs4.cs.princeton.edu/35applications/ex4.txt
% java FileIndex ex*.txt
age
ex3.txt
ex4.txt
best
ex1.txt
was
ex1.txt
ex2.txt
ex3.txt
ex4.txt
% java FileIndex *.txt
% java FileIndex *.java
*****************************************************************************/
import edu.princeton.cs.algs4.*;
import java.io.File;
public class FileIndex {
public static void main(String[] args) {
ST<String, SET<File>> st = new ST<String, SET<File>>();
for (String filename : args) {
// list of file names from command line
File file = new File(filename);
In in = new In(file);
while (!in.isEmpty()) {
String key = in.readString();
if (!st.contains(key))
st.put(key, new SET<File>());
SET<File> set = st.get(key);
set.add(file);
}
}
while (!StdIn.isEmpty()) {
String query = StdIn.readString();
StdOut.println(st.get(query));
}
}
}
- Concordance, similar to File index, see slides.
- Index and inverted index. LookupIndex.java
Sparse vector
- Vector representations of Symbol table
- key = index, value = entry.
- Efficient iterator
- Space proportional to number of nonzeros.
- Implementation SparseVector.java
import edu.princeton.cs.algs4.StdOut;
import java.util.HashMap;
public class SparseVector {
private HashMap<Integer, Double> v; // Use Hash Symbol Tables (HashST), because order not important
public SparseVector() { v = new HashMap<>(); }
public void put(int i, double x) { v.put(i, x); }
public double get(int i) {
if (!v.containsKey(i)) return 0.0;
else return v.get(i);
}
public Iterable<Integer> indices() {
// return all nonzero indices
return v.keySet();
}
public double dot(double[] that) {
double sum = 0.0;
for (int i : indices())
sum += that[i]*this.get(i);
return sum;
}
public static void main(String[] args) {
double[] a = {1, 0, 0, 0, 0, 1, 0, 2};
SparseVector sv = new SparseVector();
for (int i = 0; i < a.length; i++) {
if (a[i] != 0) sv.put(i, a[i]);
}
double[] b = {1, 1, 1, 1, 1, 1, 1 ,1};
StdOut.println(sv.dot(b));
}
}