Selecting a Hash Function
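A good hash function approximates uniform hashing: it spreads elements across the hash table as uniformly and randomly as possible.
Most hashing methods assume that the key $k$ is an integer.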
- ### Division method
$h(k) = k \bmod m$
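Avoid choosing $m$ to be a power of 2: if $m = 2^p$, the hash function simply returns the $p$ lowest-order bits of $k$. A typical choice for $m$ is a prime number that is not too close to a power of 2.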
For example, if we expect to insert around n = 4500 elements into a chained hash
table, we might choose m = 1699, a good prime number between $2^{10}$ and $2^{11}$.
This results in a load factor of α = 4500/1699 ≈ 2.6, which indicates that generally
two or three elements will reside in each bucket, assuming uniform hashing.
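The division method can be sketched in a few lines of Python. This is only an illustration: the function name `division_hash` and the sample keys are assumptions made for the example, not part of the original text. It contrasts a power-of-two table size with the prime m = 1699 used above.

```python
def division_hash(k: int, m: int) -> int:
    """Division method: h(k) = k mod m."""
    return k % m


M_PRIME = 1699  # prime between 2**10 and 2**11, as in the example above
M_POW2 = 1024   # power of two: only the low 10 bits of k are used

# Hypothetical keys that share the same low 10 bits.
keys = [5, 5 + 1024, 5 + 2 * 1024]

print([division_hash(k, M_POW2) for k in keys])   # [5, 5, 5]
print([division_hash(k, M_PRIME) for k in keys])  # [5, 1029, 354]
print(round(4500 / M_PRIME, 1))                   # 2.6 (load factor)
```

With m = 1024 all three keys collide because the high-order bits are ignored, whereas the prime modulus separates them.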
- ### Multiplication method
$h(k) = \lfloor m\,(kA \bmod 1) \rfloor$, where $A \approx (\sqrt{5} - 1)/2 \approx 0.618$
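In other words, take the fractional part of $kA$ and multiply it by $m$.
Here $y = \lfloor x \rfloor$ means that $y$ is the largest integer not exceeding $x$,
and $m$ is the number of buckets in the hash table.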
For example, if the table contains m = 2000 positions and we hash the key k = 6341, the hash coding is
$\lfloor 2000 \cdot (6341 \cdot 0.618 \bmod 1) \rfloor = \lfloor 2000 \cdot (3918.738 \bmod 1) \rfloor = \lfloor 2000 \cdot 0.738 \rfloor = 1476$
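The worked example can be reproduced with a short Python sketch. The name `multiplication_hash` is an assumption for illustration. The constant A is represented as the exact fraction 618/1000 rather than a float, because floating-point rounding could place the product just below an integer and change the floor.

```python
from fractions import Fraction
from math import floor

# A = 0.618, kept exact so the arithmetic matches the worked example.
A = Fraction(618, 1000)


def multiplication_hash(k: int, m: int, a: Fraction = A) -> int:
    """Multiplication method: h(k) = floor(m * (k*a mod 1))."""
    fractional_part = (k * a) % 1  # fractional part of k*a
    return floor(m * fractional_part)


print(multiplication_hash(6341, 2000))  # 1476
```

A float such as `(math.sqrt(5) - 1) / 2` would also work as the constant in practice; the exact fraction is used here only so that the result matches the hand calculation.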
Types of Hash Tables
Chained hash table
Open-addressed hash table
open-addressed hash table
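The number of elements stored cannot exceed the number of positions in the table ($n \le m$), so the load factor is always less than or equal to 1.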
concept
@ load factor
$\alpha = n/m$
where $n$ is the number of elements to be stored and $m$ is the number of positions (distinct hash codings) in the table.
probe
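Collisions in an open-addressed hash table are resolved by probing the table. To insert an element, we probe positions until we find an empty one and place the element there. To remove or look up an element, we likewise probe until we either find the element or reach an empty position. If we reach an empty position before finding the element, or have traversed every position, the element is not in the table.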
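The probing procedure can be sketched as follows. This is a minimal illustration, not a complete implementation: the class name `OpenAddressedTable` is hypothetical, and it uses linear probing (step to the next position) purely as an assumption to keep the example short. Removal is left out, since a real table needs to mark vacated slots so that later lookups do not stop early.

```python
from typing import List, Optional


class OpenAddressedTable:
    """Open-addressed table of integer keys (illustrative sketch)."""

    def __init__(self, m: int) -> None:
        self.m = m
        self.slots: List[Optional[int]] = [None] * m

    def _probe(self, k: int, i: int) -> int:
        # Linear probing: an assumption made for this sketch.
        return (k % self.m + i) % self.m

    def insert(self, k: int) -> int:
        """Probe until an empty slot (or the key itself) is found."""
        for i in range(self.m):
            pos = self._probe(k, i)
            if self.slots[pos] is None or self.slots[pos] == k:
                self.slots[pos] = k
                return pos
        raise OverflowError("hash table is full")

    def lookup(self, k: int) -> Optional[int]:
        """Return the key's position, or None if it is absent."""
        for i in range(self.m):
            pos = self._probe(k, i)
            if self.slots[pos] is None:  # empty slot: key is not present
                return None
            if self.slots[pos] == k:
                return pos
        return None  # every position was probed without success


table = OpenAddressedTable(7)
for key in (10, 17, 24):  # all three keys hash to position 3 when m = 7
    table.insert(key)

print(table.slots)                         # [None, None, None, 10, 17, 24, None]
print(table.lookup(24), table.lookup(31))  # 5 None
```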
goal
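The main goal is to minimize the number of probes. How many probes are needed depends mainly on two factors:
- the load factor
- the degree to which elements are distributed uniformly
Assuming uniform hashing, the expected number of positions probed is
$\frac{1}{1 - \alpha}$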
Load Factor (%) | Expected Probes |
---|---|
< 50 | < 1 / (1 - 0.5) = 2 |
80 | 1 / (1 - 0.8) = 5 |
90 | 1 / (1 - 0.9) = 10 |
95 | 1 / (1 - 0.95) = 20 |
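The figures in the table follow directly from the formula 1 / (1 - α). A small Python sketch, with the hypothetical helper name `expected_probes`, recomputes them under the same uniform-hashing assumption:

```python
def expected_probes(alpha: float) -> float:
    """Expected probes under uniform hashing: 1 / (1 - alpha)."""
    if not 0 <= alpha < 1:
        raise ValueError("load factor must satisfy 0 <= alpha < 1")
    return 1 / (1 - alpha)


for percent in (50, 80, 90, 95):
    # Rounded to hide floating-point noise in the last digits.
    print(percent, round(expected_probes(percent / 100), 2))
```

The values grow sharply as α approaches 1, which is why open-addressed tables are usually kept well below full.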
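In practice, how closely the results approach the values in the table depends on how closely we approximate uniform hashing, that is, on the hash function we choose. In an open-addressed hash table, it also depends on how we probe subsequent positions when a collision occurs.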
Generally, a hash function for probing positions in an open-addressed hash table is defined by:
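$h(k, i) = x$
where $k$ is a key, $i$ is the number of times the table has been probed thus far, and $x$ is the resulting hash coding.
One of the most effective approaches for an open-addressed hash table is double hashing, which adds together two auxiliary hash functions:
$h(k, i) = (h_1(k) + i\,h_2(k)) \bmod m$
The auxiliary functions $h_1$ and $h_2$ are chosen like any other hash function, so that they distribute elements as uniformly and randomly as possible. However, to ensure that every position is visited before any position is visited twice, one of the following procedures must be followed: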
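- Choose $m$ to be a power of 2 and make $h_2$ always return an odd value.
- Choose $m$ to be prime and design $h_2$ so that it always returns a positive integer less than $m$.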
Typically, $h_1(k) = k \bmod m$ and $h_2(k) = 1 + (k \bmod m')$, where $m'$ is slightly less than $m$, say $m - 1$ or $m - 2$. For example, if the hash table
contains m = 1699 positions (a prime number) and we hash the key k = 15,385 with m' = 1697, the positions probed are (94 + (0)(113)) mod 1699 = 94 when i = 0, and every 113th
position after this as i increases.
The advantage of double hashing is that it is one of the best forms of probing,
producing a good distribution of elements throughout a hash table.
The disadvantage is that m is constrained in order to ensure that all positions in the table will be visited in a series of probes before any position is probed twice.
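The double-hashing example above can be checked with a short Python sketch. The function names are assumptions for illustration, and m' = m - 2 = 1697 is chosen because it reproduces the step of 113 in the example. The last two lines use small hypothetical values to show why the constraint on m and h2 matters.

```python
from typing import Iterator


def h1(k: int, m: int) -> int:
    return k % m


def h2(k: int, m_prime: int) -> int:
    # Always in 1..m_prime, so never zero.
    return 1 + (k % m_prime)


def probe_sequence(k: int, m: int, m_prime: int) -> Iterator[int]:
    """Yield h(k, i) = (h1(k) + i * h2(k)) mod m for i = 0 .. m-1."""
    start, step = h1(k, m), h2(k, m_prime)
    for i in range(m):
        yield (start + i * step) % m


m, m_prime, k = 1699, 1697, 15385
positions = list(probe_sequence(k, m, m_prime))

print(positions[:4])             # [94, 207, 320, 433]
print(len(set(positions)) == m)  # True: every position is visited once

# With m a power of two, an even step revisits positions early,
# while an odd step still covers the whole table.
print(len({(3 + i * 4) % 16 for i in range(16)}))  # 4
print(len({(3 + i * 5) % 16 for i in range(16)}))  # 16
```

Full coverage holds whenever the step and m share no common factor, which is what both procedures listed earlier guarantee.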