Hashing
reading: 12.1, 12.2, 12.3, 12.4, 12.5, Algs 458-468, 478-479, 468-475 (extra)
Lecture 19: Hashing
- Set Implementations, DataIndexedIntegerSet
- Integer Representations of Strings, Integer Overflow
- Hash Tables and Handling Collisions
- Hash Table Performance and Resizing
- Hash Tables in Java
Why do we need hash tables?
So far, we’ve looked at a few data structures for efficiently searching for the existence of items within the data structure. We looked at Binary Search Trees, then made them balanced using 2-3 Trees.
However, there are some limitations that these structures impose (yes, even 2-3 trees!)
- They require that items be comparable. How do you decide where a new item goes in a BST? You have to answer the question “are you smaller than or bigger than the root”? For some objects, this question may make no sense.
- They give a complexity of Θ(logN). Is this good? Absolutely. But maybe we can do better.
A first attempt: DataIndexedIntegerSet
For now, we’re only going to try to improve issue #2 above (improve complexity from Θ(logN) to Θ(1)).
One extreme approach:
Create an array of booleans indexed by data:
- Initially all values are false.
- When an item is added, set appropriate index to true.
public class DataIndexedIntegerSet {
    private boolean[] present;
    public DataIndexedIntegerSet() {
        present = new boolean[2000000000];
    }
    public void add(int i) {
        present[i] = true;
    }
    public boolean contains(int i) {
        return present[i];
    }
}
The Performance of this approach
Potential issues with this approach:
- Extremely wasteful. If we assume that a boolean takes 1 byte to store, the above needs 2GB of space per new DataIndexedIntegerSet().
- Moreover, the user may only insert a handful of items. (For now, the set can only look up integers.)
- What do we do if someone wants to insert a String? Let’s look at this next. Of course, we may want to insert other things, like Dogs. That’ll come soon!
Solving the word-insertion problem
We create a unique number for every word by using a formula.
For example:
For the word “abcd”, where a, b, c and d stand for decimal digits, we can write $a \cdot 10^3 + b \cdot 10^2 + c \cdot 10^1 + d \cdot 10^0$, and that gives us a unique 4-digit number for this word.
Similarly, there are 26 unique characters in the English lowercase alphabet. Why not give each one a number: a = 1, b = 2, …, z = 26. Now, we can write any lowercase string as a unique number in base 26. (Note that base 26 simply means that we will use 26 as the multiplier, much like we used 10 above.)
- "cat" $= \text{'c'} \cdot 26^2 + \text{'a'} \cdot 26^1 + \text{'t'} \cdot 26^0 = 3 \cdot 26^2 + 1 \cdot 26^1 + 20 \cdot 26^0 = 2074$
This representation gives a unique integer to every English word containing lowercase letters, much like using base 10 gives a unique representation to every number. We are guaranteed to not have collisions.
Implementing englishToInt (optional): a solution to the word-insertion problem
public class DataIndexedEnglishWordSet {
    private boolean[] present;
    public DataIndexedEnglishWordSet() {
        present = new boolean[2000000000];
    }
    public void add(String s) {
        present[englishToInt(s)] = true;
    }
    /** Takes the word itself (not an int) so that englishToInt can convert it. */
    public boolean contains(String s) {
        return present[englishToInt(s)];
    }
}
/** Converts ith character of String to a letter number.
 * e.g. 'a' -> 1, 'b' -> 2, 'z' -> 26 */
public static int letterNum(String s, int i) {
    int ithChar = s.charAt(i);
    if ((ithChar < 'a') || (ithChar > 'z')) {
        throw new IllegalArgumentException();
    }
    return ithChar - 'a' + 1;
}
/** Reads s as a base-26 number with digits 1..26, so "cat" -> 2074. */
public static int englishToInt(String s) {
    int intRep = 0;
    for (int i = 0; i < s.length(); i += 1) {
        intRep = intRep * 26;
        intRep = intRep + letterNum(s, i);
    }
    return intRep;
}
Where are we?
Recall, we started with wanting to
(a) Be better than Θ(logN). We’ve now done this for integers and for single english words.
(b) Allow for non-comparable items. We haven’t touched this yet, although we are getting there. So far, we’ve only learnt how to add integers and english words, both of which are comparable, but, have we ever used the fact that they are comparable? I.e., have we ever tried to compare them (like we did in BSTs)? No. So we’re getting there, but we haven’t actually inserted anything non-comparable yet.
(c) We have data structures that insert integers and English words. Let’s make a quick visit to inserting arbitrary String objects, with spaces and all that. And maybe even insert other languages and emojis!
(d) Further recall that our approach is still very wasteful of memory. We haven’t solved that issue yet!
Inserting Strings and Overflow
Using only lowercase English characters is too restrictive.
- What if we want to store strings like “2pac” or “eGg!”?
- To understand what value we need to use for our base, let’s briefly discuss the ASCII standard.
ASCII Characters
The most basic character set used by most computers is ASCII format.
- Each possible character is assigned a value between 0 and 127.
- Characters 33 - 126 are “printable”, and are shown below.
- For example, char c = ’D’ is equivalent to char c = 68.
Examples:
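$\text{bee}_{126} = (98 \times 126^2) + (101 \times 126^1) + (101 \times 126^0) = 1{,}568{,}675$
$\text{2pac}_{126} = (50 \times 126^3) + (112 \times 126^2) + (97 \times 126^1) + (99 \times 126^0) = 101{,}809{,}233$
$\text{eGg!}_{126} = (101 \times 126^3) + (71 \times 126^2) + (103 \times 126^1) + (33 \times 126^0) = 203{,}178{,}183$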
Implementing asciiToInt
The corresponding integer conversion function is actually even simpler than englishToInt! Using the raw character value means we avoid the need for a helper method.
public static int asciiToInt(String s) {
    int intRep = 0;
    for (int i = 0; i < s.length(); i += 1) {
        intRep = intRep * 126;
        intRep = intRep + s.charAt(i);
    }
    return intRep;
}
If we want to use characters beyond ASCII, we need Unicode, a character table that covers a much wider range than ASCII.
Integer Overflow and Hash Codes
Major Problem: Integer Overflow
In Java, the largest possible integer is 2,147,483,647.
- If you go over this limit, you overflow, starting back over at the smallest integer, which is -2,147,483,648.
- In other words, the next number after 2,147,483,647 is -2,147,483,648.
Bottom line: Collisions are inevitable. Because of overflow, different strings can end up with the same integer, so the best we can do is minimize collisions.
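To make the overflow concrete, here is a minimal sketch that compares the int-based conversion with a long-based variant. The class name OverflowDemo and the helper asciiToLong are hypothetical and exist only for this comparison. For a five-character string such as "omens", the base-126 value no longer fits in an int, so the int result wraps around.

```java
class OverflowDemo {
    /** Same algorithm as asciiToInt: base-126 value, computed in int arithmetic. */
    static int asciiToInt(String s) {
        int intRep = 0;
        for (int i = 0; i < s.length(); i += 1) {
            intRep = intRep * 126 + s.charAt(i);
        }
        return intRep;
    }

    /** Hypothetical helper: the same value in long arithmetic (exact for short strings). */
    static long asciiToLong(String s) {
        long rep = 0;
        for (int i = 0; i < s.length(); i += 1) {
            rep = rep * 126 + s.charAt(i);
        }
        return rep;
    }

    public static void main(String[] args) {
        // The next int after the largest int is the smallest int.
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE); // true
        String word = "omens";
        long exact = asciiToLong(word);
        int wrapped = asciiToInt(word);
        System.out.println(exact > Integer.MAX_VALUE); // true: too big for an int
        System.out.println(wrapped == (int) exact);    // true: only the low 32 bits survive
    }
}
```

Since only the low 32 bits survive, any two strings whose exact values differ by a multiple of 2^32 receive the same int.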
Hash Codes
In computer science, taking an object and converting it into some integer is called “computing the hash code of the object”. For instance, the hashcode of “melt banana” is 839099497.
We looked at how to compute this hash code for Strings. For other Objects, we do one of two things:
- Every Object in Java has a default .hashCode() method, which we can use. Java computes this by figuring out where the Object sits in memory (every section of the memory in your computer has an address!), and uses that memory address to do something similar to what we did with Strings. This method typically gives distinct objects distinct hash codes, although uniqueness is not guaranteed.
- Sometimes, we write our own hashCode method. For example, given a Dog, we may use a combination of its name, age and breed to generate a hash code.
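As a rough illustration of a hand-written hash code, the sketch below combines a Dog's name, age and breed using java.util.Objects.hash. The field types and the constructor are assumptions, since the notes do not define the Dog class. equals is overridden alongside hashCode so that two dogs with the same fields are equal and share a hash code.

```java
import java.util.Objects;

class Dog {
    // Field types are assumed for illustration.
    private final String name;
    private final int age;
    private final String breed;

    Dog(String name, int age, String breed) {
        this.name = name;
        this.age = age;
        this.breed = breed;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) { return true; }
        if (!(o instanceof Dog)) { return false; }
        Dog other = (Dog) o;
        return age == other.age
            && Objects.equals(name, other.name)
            && Objects.equals(breed, other.breed);
    }

    @Override
    public int hashCode() {
        // Combines exactly the fields that equals compares.
        return Objects.hash(name, age, breed);
    }
}
```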
Properties of HashCodes
Hash codes have three necessary properties; a hash code must have all of them in order to be valid:
- It must be an integer.
- If we run .hashCode() on an object twice, it should return the same number.
- Two objects that are considered equal by .equals() must have the same hash code.
Not all hash codes are created equal, however. If you want your hash code to be considered a good hash code, it should:
- Distribute items evenly.
Hash Tables: Handling Collisions (The Separate Chaining Data Indexed Array)
The big idea is to change our array ever so slightly to not contain just items, but instead contain a LinkedList (or any other List) of items. So… Everything in the array is originally empty.
If we get a new item, and its hash code is h:
- If there is nothing at index h at the moment, we’ll create a new LinkedList for index h, place it there, and then add the new item to the newly created LinkedList.
- If there is already something at index h, then there is already a LinkedList there. We simply add our new item to that LinkedList. Note: Our data structure is not allowed to have any duplicate items / keys. Therefore, we must first check whether the item we are trying to insert is already in this LinkedList. If it is, we do nothing! This also means that we will insert to the END of the linked list, since we need to check all of the elements anyway.
Concrete workflow
- add item: Get hashcode (i.e., index) of item. If index has no item, create new List, and place item there. If index has a List already, check the List to see if item is already in there. If not, add item to List.
- contains item: Get hashcode (i.e., index) of item. If index is empty, return false. Otherwise, check all items in the List at that index, and return true if the item exists and false otherwise.
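The add and contains workflow can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name ChainedSet, the fixed bucket count passed to the constructor, and the use of Math.floorMod to turn a hash code into a valid index are all assumptions made to keep the example small and runnable.

```java
import java.util.LinkedList;

class ChainedSet<T> {
    private final LinkedList<T>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedSet(int numBuckets) {
        // Every slot starts out empty (null).
        buckets = (LinkedList<T>[]) new LinkedList[numBuckets];
    }

    /** Assumed reduction step: maps the hash code to an index in [0, buckets.length). */
    private int indexOf(T item) {
        return Math.floorMod(item.hashCode(), buckets.length);
    }

    void add(T item) {
        int h = indexOf(item);
        if (buckets[h] == null) {
            buckets[h] = new LinkedList<>();
        }
        // No duplicates: scan the list first, then append at the end.
        if (!buckets[h].contains(item)) {
            buckets[h].addLast(item);
        }
    }

    boolean contains(T item) {
        int h = indexOf(item);
        return buckets[h] != null && buckets[h].contains(item);
    }
}
```

Both operations scan a single list, so their cost is governed by the length of that list.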
Runtime Complexity
Why is contains Θ(Q), where Q is the length of the longest list?
Because we need to look at all the items in the LinkedList at the hashcode (i.e., index).
Why is add Θ(Q)?
Can’t we just add to the beginning of the LinkedList, which takes Θ(1) time? No! Because we have to check to make sure the item isn’t already in the linked list.
Saving Memory Using Separate Chaining and Modulus
Why keep an ArrayList of size 4 billion around?
Recall that we did that to avoid collisions, because we wanted to be able to add every integer / word / String to our data structure. But now that we allow for collisions anyway, we can relax this a bit.
An idea: modulo.
Let’s just create an ArrayList of size, say, 100. Let’s not change how the hashCode function behaves (let it return a crazy large integer). But after we get the hash code, we’ll take it modulo 100 to get an index within the 0…99 range that we want.
Our Final Data Structure: HashTable
What we’ve created now is called a HashTable.
Dealing with Runtime
The only issue left to solve is the issue of runtime. If we have 100 items, and our ArrayList is of size 5, then
- In the best case, all items get sent to the different indices evenly. That is, we have 5 linkedLists, and each one contains 20 of the items.
- In the worst case, all items get sent to the same index! That is, we have just 1 LinkedList, and it has all 100 items.
There are two ways to try to fix this:
- Dynamically growing our hashtable.
- Improving our Hashcodes
Dynamically growing the hash table
Suppose we have N items and M buckets. Assuming items are evenly distributed (as above), lists will be approximately N/M items long, resulting in Θ(N/M) runtimes.
- If we double M whenever N/M exceeds some constant, this doubling strategy ensures that N/M = O(1).
- Thus, under the even-distribution assumption, the worst case runtime for all operations is Θ(N/M) = Θ(1).
… unless that operation causes a resize:
- Resizing takes Θ(N) time. Have to redistribute all items!
- Most add operations will be Θ(1). Some will be Θ(N) time (to resize).
- Similar to our ALists, as long as we resize by a multiplicative factor, the average runtime will still be Θ(1).
- Note: We will eventually analyze this in more detail.
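The resizing strategy can be sketched as below. The class name ResizingHashSet, the initial size of 4 buckets, and the load-factor threshold of 1.5 are assumptions chosen for illustration; the notes only require that the table grow by a multiplicative factor.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ResizingHashSet<T> {
    private static final double MAX_LOAD = 1.5; // assumed threshold for N/M
    private List<LinkedList<T>> buckets;        // M = buckets.size()
    private int size;                           // N

    ResizingHashSet() {
        buckets = makeBuckets(4);
    }

    private List<LinkedList<T>> makeBuckets(int m) {
        List<LinkedList<T>> fresh = new ArrayList<>(m);
        for (int i = 0; i < m; i += 1) {
            fresh.add(new LinkedList<>());
        }
        return fresh;
    }

    private int indexOf(T item, int m) {
        return Math.floorMod(item.hashCode(), m);
    }

    void add(T item) {
        LinkedList<T> bucket = buckets.get(indexOf(item, buckets.size()));
        if (bucket.contains(item)) { return; }
        bucket.addLast(item);
        size += 1;
        if ((double) size / buckets.size() > MAX_LOAD) {
            resize(buckets.size() * 2); // multiplicative growth
        }
    }

    boolean contains(T item) {
        return buckets.get(indexOf(item, buckets.size())).contains(item);
    }

    int size() { return size; }

    int numBuckets() { return buckets.size(); }

    /** Theta(N): every item is rehashed into the new, larger table. */
    private void resize(int newM) {
        List<LinkedList<T>> bigger = makeBuckets(newM);
        for (LinkedList<T> bucket : buckets) {
            for (T item : bucket) {
                bigger.get(indexOf(item, newM)).addLast(item);
            }
        }
        buckets = bigger;
    }
}
```

Items must be redistributed on resize because an item's index depends on M: the same hash code generally lands in a different bucket once M changes.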
Hash Tables in Java
In Java, hash tables are implemented as java.util.HashMap and java.util.HashSet.
- How does a HashMap know how to compute each object’s hash code?
- Good news: It’s not “implements Hashable”.
- Instead, all objects in Java must implement a .hashCode() method.
Using Negative hash codes in Java
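One detail worth illustrating under this heading: a hashCode() may be negative, and Java's % operator keeps the sign of the dividend, so hashCode % M can be a negative (invalid) array index. A minimal sketch of one common fix, Math.floorMod, follows; the names NegativeHashDemo and bucketIndex are hypothetical.

```java
class NegativeHashDemo {
    /** Maps any hash code, including negative ones, to an index in [0, numBuckets). */
    static int bucketIndex(int hashCode, int numBuckets) {
        return Math.floorMod(hashCode, numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(-1 % 4);                            // -1: not a valid index
        System.out.println(bucketIndex(-1, 4));                // 3
        System.out.println(bucketIndex(Integer.MIN_VALUE, 4)); // 0
    }
}
```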
Two Important Warnings When Using HashMaps/HashSets
Warning #1: Never store objects that can change in a HashSet or HashMap!
- If an object’s variables change, then its hashCode may change, which can result in items getting lost.
Warning #2: Never override equals without also overriding hashCode.
- Can also lead to items getting lost and generally weird behavior.
- HashMaps and HashSets use equals to determine if an item exists in a particular bucket.
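Warning #1 can be reproduced in a few lines. The sketch below (class and method names are hypothetical) stores a mutable ArrayList in a HashSet and then modifies it. The list's hash code changes, so the set can no longer find the very object it contains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MutableKeyDemo {
    /** Returns whether the set still finds the list after the list was mutated. */
    static boolean foundAfterMutation() {
        Set<List<Integer>> set = new HashSet<>();
        List<Integer> key = new ArrayList<>(Arrays.asList(1, 2));
        set.add(key);
        key.add(3); // changes key.hashCode() while key sits inside the set
        return set.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(foundAfterMutation()); // false: the item is "lost"
    }
}
```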
Example hashCode Function
1. The Java 8 hash code for strings. Two major differences from our hash codes:
- Represents strings as a base 31 number.
- Why such a small base? Real hash codes don’t care about uniqueness.
- Stores (caches) calculated hash code so future hashCode calls are faster.
@Override
public int hashCode() {
    int h = cachedHashValue;
    if (h == 0 && this.length() > 0) {
        for (int i = 0; i < this.length(); i++) {
            h = 31 * h + this.charAt(i);
        }
        cachedHashValue = h;
    }
    return h;
}
2. Hashing a Collection
Lists are a lot like strings: each is a sequence of items, and each item has its own hashCode, so the same base-31 scheme applies.
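A minimal sketch of hashing a list by combining the hash codes of its items, in the same base-31 style as the string hash. The names ListHashDemo and hashOf are hypothetical. Starting from 1 and treating null as 0 follows the formula documented for java.util.List.hashCode, so the result should match the built-in value.

```java
import java.util.Arrays;
import java.util.List;

class ListHashDemo {
    /** Combines item hash codes the way a base-31 string hash combines characters. */
    static int hashOf(List<?> items) {
        int h = 1;
        for (Object o : items) {
            h = 31 * h + (o == null ? 0 : o.hashCode());
        }
        return h;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c");
        System.out.println(hashOf(words) == words.hashCode()); // true
    }
}
```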
3. Hashing a Recursive Data Structure
Computation of the hashCode of a recursive data structure involves recursive computation.
- For example, binary tree hashCode (assuming sentinel leaves):
@Override
public int hashCode() {
    if (this.value == null) {
        return 0;
    }
    return this.value.hashCode() +
        31 * this.left.hashCode() +
        31 * 31 * this.right.hashCode();
}