Hashing | CS 61B Data Structures, Spring 2019

Hashing

reading: 12.1, 12.2, 12.3, 12.4, 12.5, Algs 458-468, 478-479, 468-475 (extra)
Lecture 19: Hashing

  • Set Implementations, DataIndexedIntegerSet
  • Integer Representations of Strings, Integer Overflow
  • Hash Tables and Handling Collisions
  • Hash Table Performance and Resizing
  • Hash Tables in Java

Why do we need hash tables?

So far, we’ve looked at a few data structures for efficiently searching for the existence of items within the data structure. We looked at Binary Search Trees, then made them balanced using 2-3 Trees.
However, there are some limitations that these structures impose (yes, even 2-3 trees!)

  1. They require that items be comparable. How do you decide where a new item goes in a BST? You have to answer the question “are you smaller than or bigger than the root”? For some objects, this question may make no sense.
  2. They give a complexity of Θ(logN). Is this good? Absolutely. But maybe we can do better.


A first attempt: DataIndexedIntegerSet

For now, we’re only going to try to improve issue #2 above (improve complexity from Θ(logN) to Θ(1)).

One extreme approach:

Create an array of booleans indexed by data:

  • Initially all values are false.
  • When an item is added, set appropriate index to true.
public class DataIndexedIntegerSet {
   private boolean[] present;
 
   public DataIndexedIntegerSet() {
       present = new boolean[2000000000];
   }
 
   public void add(int i) {
       present[i] = true;
   }
 
   public boolean contains(int i) {
	   return present[i];
   }
}

The Performance of this approach

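Both add and contains index straight into the array, so each takes Θ(1) time regardless of how many items are stored.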

Potential issues with this approach

  • Extremely wasteful. If we assume that a boolean takes 1 byte to store, the above needs 2GB of space per new DataIndexedIntegerSet().
  • Moreover, the user may only insert a handful of items…
  • For now it can only look up integers. What do we do if someone wants to insert a String? Let’s look at this next. Of course, we may want to insert other things, like Dogs. That’ll come soon!

Solving the word-insertion problem

We create a unique number for every word by using a formula.

For example:

Think of how base 10 works: the digit string “abcd” stands for a⋅10^3 + b⋅10^2 + c⋅10^1 + d⋅10^0, which gives a unique number for each four-digit string abcd.

Similarly, there are 26 unique characters in the English lowercase alphabet. Why not give each one a number: a = 1, b = 2, …, z = 26. Now, we can write any lowercase string in base 26. (Note that base 26 simply means that we will use 26 as the multiplier, much like we used 10 in the example above.)

  • "cat" = ‘c’⋅26^2 + ‘a’⋅26^1 + ‘t’⋅26^0 = 3⋅26^2 + 1⋅26^1 + 20⋅26^0 = 2074

This representation gives a unique integer to every English word containing only lowercase letters, much like using base 10 gives a unique representation to every number. We are guaranteed to not have collisions.
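As a quick check of the arithmetic above, here is a minimal sketch that reads a lowercase word as a base-26 number. The class and method names (Base26Demo, wordToInt) are made up for illustration and are not part of the lecture code.

```java
class Base26Demo {
    /** Maps 'a'..'z' to 1..26 and rejects any other character. */
    static int letterNum(char c) {
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("not a lowercase letter: " + c);
        }
        return c - 'a' + 1;
    }

    /** Reads s as a base-26 number whose digits are the letter numbers. */
    static int wordToInt(String s) {
        int result = 0;
        for (int i = 0; i < s.length(); i += 1) {
            result = result * 26 + letterNum(s.charAt(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // 3*26^2 + 1*26^1 + 20*26^0 = 2028 + 26 + 20
        System.out.println(wordToInt("cat")); // prints 2074
    }
}
```

Multiplying the running total by 26 before adding each letter is the same as summing letter⋅26^position, just without computing the powers explicitly.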

Implementing englishToInt (optional): a solution to the word-insertion problem

public class DataIndexedEnglishWordSet {
    private boolean[] present;

    public DataIndexedEnglishWordSet() {
        present = new boolean[2000000000];
    }

    public void add(String s) {
        present[englishToInt(s)] = true;
    }

    public boolean contains(String s) {
        return present[englishToInt(s)];
    }

    /** Converts ith character of String to a letter number.
     * e.g. 'a' -> 1, 'b' -> 2, 'z' -> 26 */
    public static int letterNum(String s, int i) {
        int ithChar = s.charAt(i);
        if ((ithChar < 'a') || (ithChar > 'z')) {
            throw new IllegalArgumentException();
        }
        return ithChar - 'a' + 1;
    }

    /** Reads s as a number whose digits are its letter numbers.
     * The multiplier here is 27, while the "cat" example above uses 26;
     * both give distinct integers for distinct words (until the int overflows). */
    public static int englishToInt(String s) {
        int intRep = 0;
        for (int i = 0; i < s.length(); i += 1) {
            intRep = intRep * 27;
            intRep = intRep + letterNum(s, i);
        }
        return intRep;
    }
}

Where are we?

Recall, we started with wanting to
(a) Be better than Θ(logN). We’ve now done this for integers and for single English words.

(b) Allow for non-comparable items. We haven’t touched this yet, although we are getting there. So far, we’ve only learnt how to add integers and English words, both of which are comparable, but have we ever used the fact that they are comparable? I.e., have we ever tried to compare them (like we did in BSTs)? No. So we’re getting there, but we haven’t actually inserted anything non-comparable yet.

(c) We have data structures that insert integers and English words. Let’s make a quick visit to inserting arbitrary String objects, with spaces and all that. And maybe even insert other languages and emojis!

(d) Further recall that our approach is still very wasteful of memory. We haven’t solved that issue yet!

Inserting Strings and Overflow

Using only lowercase English characters is too restrictive.

  • What if we want to store strings like “2pac” or “eGg!”?
  • To understand what value we need to use for our base, let’s briefly discuss the ASCII standard.

ASCII Characters

The most basic character set used by most computers is ASCII format.

  • Each possible character is assigned a value between 0 and 127.
  • Characters 33 - 126 are “printable”, and are shown below.
  • For example, char c = ’D’ is equivalent to char c = 68.


Examples (each string read as a base-126 number):
bee = (98 × 126^2) + (101 × 126^1) + (101 × 126^0) = 1,568,675
2pac = (50 × 126^3) + (112 × 126^2) + (97 × 126^1) + (99 × 126^0) = 101,809,233
eGg! = (101 × 126^3) + (71 × 126^2) + (103 × 126^1) + (33 × 126^0) = 203,178,183

Implementing asciiToInt

The corresponding integer conversion function is actually even simpler than englishToInt! Using the raw character value means we avoid the need for a helper method.

public static int asciiToInt(String s) {
	int intRep = 0;
	for (int i = 0; i < s.length(); i += 1) {       	
        intRep = intRep * 126;
        intRep = intRep + s.charAt(i);
	}
	return intRep;
}
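The base-126 examples can be checked by running the same loop on those strings. This is a small sketch; the wrapper class name AsciiToIntDemo is an assumption made so the snippet compiles on its own.

```java
class AsciiToIntDemo {
    /** Reads s as a base-126 number whose digits are the raw character values. */
    static int asciiToInt(String s) {
        int intRep = 0;
        for (int i = 0; i < s.length(); i += 1) {
            intRep = intRep * 126 + s.charAt(i);
        }
        return intRep;
    }

    public static void main(String[] args) {
        System.out.println(asciiToInt("bee"));  // prints 1568675
        System.out.println(asciiToInt("2pac")); // prints 101809233
        System.out.println(asciiToInt("eGg!")); // prints 203178183
    }
}
```

All three results still fit in an int; longer strings will not.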

If we want to use characters beyond ASCII, we need help from Unicode, a character table that covers a much wider range than ASCII.

Integer Overflow and Hash Codes

Major Problem: Integer Overflow

In Java, the largest possible integer is 2,147,483,647.

  • If you go over this limit, you overflow, starting back over at the smallest integer, which is -2,147,483,648.
  • In other words, the next number after 2,147,483,647 is -2,147,483,648.

Bottom line: collisions are inevitable. What we can do is minimize the number of collisions.
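To see the wraparound concretely, the sketch below computes the base-126 value of a five-letter word twice: once in a long, which is wide enough to hold it, and once in an int, which is not. The word "omens" and the names OverflowDemo and asciiToLong are illustrative choices, not part of the lecture code.

```java
class OverflowDemo {
    /** Base-126 value computed in 32-bit int arithmetic (may wrap around). */
    static int asciiToInt(String s) {
        int intRep = 0;
        for (int i = 0; i < s.length(); i += 1) {
            intRep = intRep * 126 + s.charAt(i);
        }
        return intRep;
    }

    /** Same computation in 64-bit long arithmetic, for comparison. */
    static long asciiToLong(String s) {
        long rep = 0;
        for (int i = 0; i < s.length(); i += 1) {
            rep = rep * 126 + s.charAt(i);
        }
        return rep;
    }

    public static void main(String[] args) {
        int max = Integer.MAX_VALUE;
        System.out.println(max + 1 == Integer.MIN_VALUE); // prints true
        System.out.println(asciiToLong("omens")); // prints 28196917171
        System.out.println(asciiToInt("omens"));  // prints -1867853901
    }
}
```

Since an int has only about 4.3 billion possible values and there are far more possible strings, some pair of different strings must end up with the same int.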

Hash Codes

In computer science, taking an object and converting it into some integer is called “computing the hash code of the object”. For instance, the hashcode of “melt banana” is 839099497.

  • We looked at how to compute this hash code for Strings. For other Objects, we do one of two things:
    • Every Object in Java has a default .hashCode() method, which we can use. Java typically derives it from where the Object sits in memory (every section of the memory in your computer has an address!), and uses that address to do something similar to what we did with Strings. Distinct objects usually get distinct default hash codes, although this is not strictly guaranteed.
    • Sometimes, we write our own hashCode method. For example, given a Dog, we may use a combination of its name, age and breed to generate a hash code.

Properties of HashCodes

Hash codes have three necessary properties; a hash code must have all of them to be valid:

  1. It must be an integer (an int in Java).
  2. If we run .hashCode() on an object twice, it should return the same number.
  3. Two objects that are considered equal by .equals() must have the same hash code.

Not all hash codes are created equal, however. If you want your hash code to be considered a good hash code, it should:

  • Distribute items evenly.
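As an illustration of these properties, here is a hedged sketch of a Dog class whose hash code combines name, age and breed. The field layout and the use of java.util.Objects.hash are assumptions for the example; the lecture does not prescribe a particular formula.

```java
import java.util.Objects;

class Dog {
    private final String name;
    private final int age;
    private final String breed;

    Dog(String name, int age, String breed) {
        this.name = name;
        this.age = age;
        this.breed = breed;
    }

    /** Two dogs are equal when all three fields match. */
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof Dog)) {
            return false;
        }
        Dog other = (Dog) o;
        return age == other.age
                && name.equals(other.name)
                && breed.equals(other.breed);
    }

    /** Built from exactly the fields that equals compares, so equal dogs hash alike. */
    @Override
    public int hashCode() {
        return Objects.hash(name, age, breed);
    }
}
```

Because the fields are final, calling hashCode() twice on the same Dog always returns the same int, which covers property 2; using the same fields in equals and hashCode covers property 3.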

Hash Tables: Handling Collisions (The Separate Chaining Data Indexed Array)

The big idea is to change our array ever so slightly to not contain just items, but instead contain a LinkedList (or any other List) of items. So… Everything in the array is originally empty.

  • If we get a new item, and its hashcode is h:
    • If there is nothing at index h at the moment, we’ll create a new LinkedList for index h, place it there, and then add the new item to the newly created LinkedList.
    • If there is already something at index h, then there is already a LinkedList there. We simply add our new item to that LinkedList. Note: Our data structure is not allowed to have any duplicate items / keys. Therefore, we must first check whether the item we are trying to insert is already in this LinkedList. If it is, we do nothing! This also means that we will insert to the END of the linked list, since we need to check all of the elements anyway.

Concrete workflow

  • add item :

    • Get hashcode (i.e., index) of item. If index has no item, create new List, and place item there. If index has a List already, check the List to see if item is already in there. If not, add item to List.
  • contains item :

    • Get hashcode (i.e., index) of item. If index is empty, return false . Otherwise, check all items in the List at that index, and if the item exists, return true .


Runtime Complexity

In the worst case, both contains and add take Θ(Q) time, where Q is the length of the longest list.

Why is contains Θ(Q)?
Because we need to look at all the items in the LinkedList at the hashcode (i.e., index).

Why is add Θ(Q)?
Can’t we just add to the beginning of the LinkedList, which takes Θ(1) time? No! Because we have to check to make sure the item isn’t already in the linked list.

Saving Memory Using Separate Chaining and Modulus

  • Why keep an ArrayList of size 4 billion around?
    Recall that we did that to avoid collisions, because we wanted to be able to add every integer / word / String to our data structure. But now that we allow for collisions anyway, we can relax this a bit.

  • An idea: modulo.
    Let’s just create an ArrayList of size, say, 100. Let’s not change how the hashcode function behaves (let it return a crazy large integer). But after we get the hashcode, we’ll take it modulo 100 to get an index within the 0…99 range that we want.
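Putting separate chaining and the modulus together, a minimal sketch with a fixed number of buckets might look like this. The class name ChainedSet and its methods are hypothetical, and it uses Math.floorMod rather than % so that a negative hash code still lands inside the array.

```java
import java.util.LinkedList;

class ChainedSet<T> {
    private final LinkedList<T>[] buckets;
    private int size;

    @SuppressWarnings("unchecked")
    ChainedSet(int numBuckets) {
        // Every slot starts out empty (null) until an item hashes to it.
        buckets = (LinkedList<T>[]) new LinkedList[numBuckets];
    }

    /** Reduces the (possibly huge or negative) hash code to 0..numBuckets-1. */
    private int indexOf(T item) {
        return Math.floorMod(item.hashCode(), buckets.length);
    }

    void add(T item) {
        int i = indexOf(item);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        // Scan the whole list first so duplicates are never stored.
        if (!buckets[i].contains(item)) {
            buckets[i].addLast(item);
            size += 1;
        }
    }

    boolean contains(T item) {
        LinkedList<T> list = buckets[indexOf(item)];
        return list != null && list.contains(item);
    }

    int size() {
        return size;
    }
}
```

For example, new ChainedSet<String>(100) stores every string in one of 100 lists, no matter how large its hash code is.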

Our Final Data Structure: HashTable

What we’ve created now is called a HashTable .

Dealing with Runtime

The only issue left to solve is the issue of runtime. If we have 100 items, and our ArrayList is of size 5, then

  • In the best case, all items get sent to the different indices evenly. That is, we have 5 LinkedLists, and each one contains 20 of the items.
  • In the worst case, all items get sent to the same index! That is, we have just 1 LinkedList, and it has all 100 items.

There are two ways to try to fix this:

  • Dynamically growing our hashtable.
  • Improving our hashcodes.

Dynamically growing the hash table

Suppose we have M buckets (lists) and N items, and we double M whenever N/M grows past some constant threshold. Assuming items are evenly distributed, lists will be approximately N/M items long, resulting in Θ(N/M) runtimes.

  • Our doubling strategy ensures that N/M = O(1).
  • Thus, worst case runtime for all operations is Θ(N/M) = Θ(1).

… unless that operation causes a resize:

  • Resizing takes Θ(N) time. Have to redistribute all items!
  • Most add operations will be Θ(1). Some will be Θ(N) time (to resize).
    • Similar to our ALists, as long as we resize by a multiplicative factor, the average runtime will still be Θ(1).
    • Note: We will eventually analyze this in more detail.
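A rough sketch of the doubling strategy follows. The class name, the starting size of 4 buckets and the load-factor threshold of 1.5 are all assumptions chosen for illustration; any constant threshold combined with multiplicative growth gives the same asymptotic behaviour.

```java
import java.util.ArrayList;
import java.util.List;

class ResizingHashSet<T> {
    private static final double MAX_LOAD = 1.5; // assumed threshold for N/M
    private List<List<T>> buckets = newBuckets(4);
    private int size;

    private static <E> List<List<E>> newBuckets(int m) {
        List<List<E>> b = new ArrayList<>(m);
        for (int i = 0; i < m; i += 1) {
            b.add(new ArrayList<>());
        }
        return b;
    }

    private static <E> List<E> bucketFor(List<List<E>> b, E item) {
        return b.get(Math.floorMod(item.hashCode(), b.size()));
    }

    void add(T item) {
        List<T> bucket = bucketFor(buckets, item);
        if (bucket.contains(item)) {
            return;
        }
        bucket.add(item);
        size += 1;
        if ((double) size / buckets.size() > MAX_LOAD) {
            resize(buckets.size() * 2);
        }
    }

    boolean contains(T item) {
        return bucketFor(buckets, item).contains(item);
    }

    /** Theta(N): every item is rehashed into the larger table. */
    private void resize(int m) {
        List<List<T>> bigger = newBuckets(m);
        for (List<T> bucket : buckets) {
            for (T item : bucket) {
                bucketFor(bigger, item).add(item);
            }
        }
        buckets = bigger;
    }

    int size() {
        return size;
    }

    int numBuckets() {
        return buckets.size();
    }
}
```

Items must be redistributed on resize because an item's index depends on the number of buckets: the same hash code taken modulo 8 and modulo 16 can give different results.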


Hash Tables in Java

In Java, implemented as java.util.HashMap and java.util.HashSet.

  • How does a HashMap know how to compute each object’s hash code?
    • Good news: It’s not “implements Hashable”.
    • Instead, all objects in Java must implement a .hashCode() method.


Using Negative hash codes in Java

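In Java, the % operator can return a negative result when the hash code is negative: for example, -1 % 4 is -1, which is not a valid array index. Math.floorMod(-1, 4) returns 3 instead, so it is the safer way to turn a hash code into a bucket index.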
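A short sketch of the difference between % and Math.floorMod when the hash code is negative. The helper name bucketIndex is made up for this example.

```java
class FloorModDemo {
    /** Maps any int hash code, including negative ones, into 0..numBuckets-1. */
    static int bucketIndex(int hashCode, int numBuckets) {
        return Math.floorMod(hashCode, numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(-1 % 4);             // prints -1, not a valid index
        System.out.println(bucketIndex(-1, 4)); // prints 3
        System.out.println(bucketIndex(10, 4)); // prints 2, same as 10 % 4
    }
}
```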


Two Important Warnings When Using HashMaps/HashSets

Warning #1: Never store objects that can change in a HashSet or HashMap!

  • If an object’s variables change, then its hashCode changes. This may result in items getting lost.

Warning #2: Never override equals without also overriding hashCode.

  • Can also lead to items getting lost and generally weird behavior.
  • HashMaps and HashSets use equals to determine if an item exists in a particular bucket.
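Warning #1 can be reproduced with a mutable list used as a set element. This is a sketch of behaviour seen on typical JDK implementations; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MutableKeyDemo {
    /** Returns true if the set can no longer find the list after it is mutated. */
    static boolean lostAfterMutation() {
        List<Integer> key = new ArrayList<>(Arrays.asList(1, 2));
        Set<List<Integer>> set = new HashSet<>();
        set.add(key);   // stored under the hash code of [1, 2]
        key.add(3);     // the list's hash code is now that of [1, 2, 3]
        return !set.contains(key);
    }

    public static void main(String[] args) {
        System.out.println(lostAfterMutation());
    }
}
```

The list is still physically inside the set, but the lookup uses the new hash code and so searches in the wrong place.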

Example hashCode Function

1. The Java 8 hash code for strings. Two major differences from our hash codes:

  • Represents strings as a base 31 number.
    • Why such a small base? Real hash codes don’t care about uniqueness.
  • Stores (caches) calculated hash code so future hashCode calls are faster.
@Override
public int hashCode() {
    int h = cachedHashValue;
    if (h == 0 && this.length() > 0) {
        for (int i = 0; i < this.length(); i++) {
            h = 31 * h + this.charAt(i);
        }
        cachedHashValue = h;
    }
    return h;
}
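The base-31 loop can be compared against the built-in String.hashCode() with a standalone sketch like the one below. It leaves out the caching, and the class and method names are assumptions for the example.

```java
class StringHashDemo {
    /** Base-31 hash of s, computed the same way as the loop above but without caching. */
    static int base31Hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i += 1) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(base31Hash("cat"));  // prints 98262
        System.out.println("cat".hashCode());   // prints 98262
    }
}
```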

2. Hashing a Collection
Lists are a lot like strings: a collection of items, each with its own hashCode.
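One way to hash a list, sketched here under the assumption that we follow the formula documented for java.util.List.hashCode(): start at 1, and fold in each element's hash code with a multiplier of 31. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

class ListHashDemo {
    /** Combines the elements' hash codes in order, treating null as 0. */
    static int listHash(List<?> items) {
        int hashCode = 1;
        for (Object o : items) {
            hashCode = 31 * hashCode + (o == null ? 0 : o.hashCode());
        }
        return hashCode;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c");
        System.out.println(listHash(words) == words.hashCode()); // prints true
    }
}
```

Because each step multiplies by 31 before adding, the order of the items matters: a list holding "a" then "b" hashes differently from one holding "b" then "a".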

3. Hashing a Recursive Data Structure
Computation of the hashCode of a recursive data structure involves recursive computation.

  • For example, binary tree hashCode (assuming sentinel leaves):
@Override
public int hashCode() {
   if (this.value == null) {
       return 0;
   }
   return  this.value.hashCode() +
   	31 * this.left.hashCode() +
   	31 * 31 * this.right.hashCode();
}

