【Data Structures】 10. Hashing—Mission Possible

Hashing converts words (keys) to numbers, more specifically integers that can serve as array indices.
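As a rough illustration of turning a word into an integer, here is a minimal sketch. The function name `hash_word`, the multiplier 31 (the same constant Java's `String.hashCode` uses) and the table length 101 are assumptions chosen for the example, not something these notes prescribe.

```python
def hash_word(word, table_size):
    """Map a word to an integer in the range 0 .. table_size - 1."""
    h = 0
    for ch in word:
        # Treat the word as digits of a base-31 number, reducing modulo the
        # table length at each step so the value stays small.
        h = (h * 31 + ord(ch)) % table_size
    return h

index = hash_word("cats", 101)  # some index between 0 and 100
```

Two different words can map to the same index, which is the collision problem the workarounds below deal with.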


Workaround 1: Open Addressing (mainly linear probing)

Step size:

In linear probing, the step size is always 1, which means the probe goes to x, x+1, x+2, x+3, and so on, wrapping around to the start of the array when it reaches the end.
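The probe order can be sketched as follows. This is only an illustration: the function name `linear_probe_insert`, the use of integer keys, and the `key % size` hash function are assumptions made for the example.

```python
def linear_probe_insert(table, key):
    """Insert an integer key with linear probing; return the index used."""
    size = len(table)
    start = key % size  # hypothetical hash function: key modulo table length
    for step in range(size):
        index = (start + step) % size  # step size is always 1, with wrap-around
        if table[index] is None:       # None marks an empty cell
            table[index] = key
            return index
    raise RuntimeError("hash table is full")

table = [None] * 7
# 10, 17 and 24 all hash to index 3, so they end up in neighbouring cells.
indices = [linear_probe_insert(table, key) for key in (10, 17, 24)]
```

The three colliding keys occupy consecutive cells, which is exactly how a cluster starts to form.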


Clustering:

A cluster is a long sequence of consecutive filled cells in a hash table.

As a hash table becomes more and more full, the clusters grow larger and larger.

- When the hash table is half full, the performance is still not bad.

- However, it has been shown that once the table is more than two-thirds full, the performance degrades seriously.

It is critical to ensure that a hash table never becomes full (ideally, never more than two-thirds full).


Load factor

It is the ratio of the number of data items in a hash table to the length of the array.

In linear probing, search time becomes really slow as the load factor approaches 1.
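To put rough numbers on this, the sketch below computes the load factor and evaluates the classic approximations from Knuth's analysis of linear probing. The function names are made up for the example, and the formulas are estimates that only make sense for a load factor strictly below 1.

```python
def load_factor(table):
    """Ratio of filled cells to the array length (None marks an empty cell)."""
    filled = sum(1 for cell in table if cell is not None)
    return filled / len(table)

def expected_probes_successful(alpha):
    """Approximate probes for a successful search: (1 + 1/(1 - alpha)) / 2."""
    return (1 + 1 / (1 - alpha)) / 2

def expected_probes_unsuccessful(alpha):
    """Approximate probes for an unsuccessful search: (1 + 1/(1 - alpha)^2) / 2."""
    return (1 + 1 / (1 - alpha) ** 2) / 2

table = [10, None, 24, 17, None, None]
alpha = load_factor(table)  # 3 filled cells out of 6

for a in (0.5, 2 / 3, 0.9):
    print(a, expected_probes_successful(a), expected_probes_unsuccessful(a))
```

Under these approximations an unsuccessful search costs about 2.5 probes at half full, about 5 at two-thirds full, and about 50 at a load factor of 0.9.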


How to solve it?

Rehashing

First, create a new array, typically about twice the length of the old one, although the exact size depends on the load factor you provide.

The hash method then calculates the location of a given data item based on the new array length.

Second, go through the old array cell by cell and insert each item into the new array, calling the hash function again for every item.

It's a time-consuming process.
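A minimal sketch of these two steps, assuming integer keys, a `key % size` hash function, linear probing, and a new array of exactly twice the old length (real implementations may pick a different growth rule):

```python
def insert(table, key):
    """Insert an integer key using linear probing."""
    size = len(table)
    start = key % size  # the index depends on the current array length
    for step in range(size):
        index = (start + step) % size
        if table[index] is None:
            table[index] = key
            return
    raise RuntimeError("hash table is full")

def rehash(old_table):
    """Build a table twice as long and re-insert every item from the old one."""
    new_table = [None] * (2 * len(old_table))
    for key in old_table:            # walk the old array cell by cell
        if key is not None:
            insert(new_table, key)   # hash again against the new length
    return new_table

old_table = [None] * 4
for key in (1, 5, 9):                # all three collide at index 1
    insert(old_table, key)
new_table = rehash(old_table)
```

Items cannot simply be copied to the same positions, because each index depends on the array length. Every item is hashed again, which is why the process is slow.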


In Open Addressing, there are two other major collision resolution mechanisms: Quadratic Probing and Double Hashing.


Workaround 2: Separate Chaining

In open addressing, collisions are resolved by looking for an open cell in the hash table.

Another approach is to put a linked list at each index in the hash table.

In separate chaining, it is normal to put n or more items in an array of length n.

Finding the initial cell takes O(1), whereas searching through a linked list takes O(k) when there are k elements in the list.

Thus, we do not want the linked lists to become too long either, especially if the hash function is not good.

However, the load factor in separate chaining can rise above 1 without hurting performance too much, assuming the hashCode method is good.
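Here is a small sketch of separate chaining. The class name `ChainedHashTable` and the `key % size` hash are assumptions for the example, and a Python list stands in for the linked list at each index.

```python
class ChainedHashTable:
    """Separate chaining with integer keys; each bucket holds a chain of keys."""

    def __init__(self, size):
        self.buckets = [[] for _ in range(size)]  # one chain per index
        self.count = 0

    def _index(self, key):
        return key % len(self.buckets)  # hypothetical hash function

    def insert(self, key):
        chain = self.buckets[self._index(key)]
        if key not in chain:            # O(k) scan of the chain
            chain.append(key)
            self.count += 1

    def contains(self, key):
        return key in self.buckets[self._index(key)]

    def load_factor(self):
        return self.count / len(self.buckets)

table = ChainedHashTable(3)
for key in range(6):                    # 6 items in an array of length 3
    table.insert(key)
```

Six items fit into an array of length 3, so the load factor is 2, yet every lookup scans a chain of only two items because the keys are spread evenly.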


When in doubt, consider using separate chaining, especially if the number of items that will be inserted into the hash table is unknown. In other words, separate chaining is the better choice when you expect a high load factor.


In summary, there are a few ways to deal with collisions.

Linear Probing:

When there is a collision, we look for an empty cell sequentially and put the value into the nearest one. However, this approach tends to form primary clusters, and the performance can get really bad. It is necessary to rehash from time to time to keep the load factor low.

There is also Quadratic Probing, which has a subtler clustering issue called secondary clustering: all keys that hash to the same index follow the same fixed sequence of probe steps.
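The sketch below shows one common form of quadratic probing, where the i-th probe lands i squared cells past the initial index. The function name and the `key % size` hash are assumptions for the example.

```python
def quadratic_probe_sequence(key, size, count):
    """Return the first `count` indices visited by quadratic probing."""
    start = key % size  # hypothetical hash function
    # Offsets from the start are 0, 1, 4, 9, ... regardless of the key.
    return [(start + i * i) % size for i in range(count)]

seq_a = quadratic_probe_sequence(3, 11, 4)
seq_b = quadratic_probe_sequence(14, 11, 4)  # 14 also hashes to index 3
```

Both keys start at index 3 and then visit exactly the same cells, which is the secondary clustering described above.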

To solve this clustering issue, there is another workaround, called Double Hashing, which uses two hash functions. One calculates the hash value, and the other decides the step size of the probe.
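As a sketch of the idea, the second hash function below derives a step size from the key itself. The specific formula `5 - (key % 5)` and the prime table length 11 are assumptions picked for the example. What matters is that the step is never zero and that a prime table length lets every step size reach every cell.

```python
def double_hash_sequence(key, size, count):
    """Return the first `count` indices visited by double hashing."""
    start = key % size      # first hash function: the initial index
    step = 5 - (key % 5)    # second hash function: a step size from 1 to 5
    return [(start + i * step) % size for i in range(count)]

seq_a = double_hash_sequence(3, 11, 4)   # step is 5 - 3 = 2
seq_b = double_hash_sequence(14, 11, 4)  # same start, but step is 5 - 4 = 1
```

Keys 3 and 14 still collide at index 3, but they move away from it with different step sizes, so they no longer compete for the same cells afterwards.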

Separate Chaining: The other workaround is to have a linked list at each index. This allows us not to worry too much about the load factor. However, we do not want the linked lists to become too long either.
