hash

Latest recommended article published on 2022-09-15 16:49:32

AdmireLinux


Category: C. Tags: hash, function

Article link: https://blog.csdn.net/AdmireLinux/article/details/62231261


Selecting a Hash Function

A good hash function approximates uniform hashing: it spreads elements across the hash table in a uniform, random manner.

$h(k) = x$

where $x$ is called the hash coding of the key $k$.

Most hashing methods assume that $k$ is an integer; when $k$ is not an integer, it can be converted to one.

### Division method
$h(k) = k \bmod m$
Avoid choosing $m$ as a power of 2: if $m = 2^{p}$, the hash function simply yields the $p$ lowest-order bits of $k$. Typically $m$ is chosen to be a prime number not too close to a power of 2.

For example, if we expect to insert around $n = 4500$ elements into a chained hash table, we might choose $m = 1699$, a good prime number between $2^{10}$ and $2^{11}$. This results in a load factor of $\alpha = 4500/1699 \approx 2.6$, which indicates that generally two or three elements will reside in each bucket, assuming uniform hashing.
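The division method and the load-factor arithmetic from the example can be sketched in a few lines of C. The function names `hash_division` and `load_factor` are illustrative choices, not taken from the original text, and the sketch assumes the caller passes $m > 0$.

```c
/* Division method: h(k) = k mod m.
 * Assumption: m > 0; ideally m is a prime not too close to a power of 2. */
unsigned long hash_division(unsigned long k, unsigned long m)
{
    return k % m;
}

/* Load factor alpha = n / m for n elements stored in m buckets. */
double load_factor(unsigned long n, unsigned long m)
{
    return (double)n / (double)m;
}
```

With $m = 16 = 2^4$, `hash_division(k, 16)` equals `k & 15`, i.e. only the four lowest-order bits of the key influence the result, which is why powers of 2 are avoided here.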

### Multiplication method
$h(k) = \lfloor m(kA \bmod 1) \rfloor, \quad A \approx \frac{\sqrt{5}-1}{2} \approx 0.618$
That is, multiply the fractional part of $kA$ (written $kA \bmod 1$) by $m$.

$y = \lfloor x \rfloor$ means that $y$ is the largest integer not exceeding $x$.
Here $m$ is the number of buckets in the hash table.
For example, if the table contains $m = 2000$ positions and we hash the key $k = 6341$, the hash coding is

$\lfloor (2000)((6341)(0.618) \bmod 1) \rfloor = \lfloor (2000)(3918.738 \bmod 1) \rfloor = \lfloor (2000)(0.738) \rfloor = 1476$.
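A minimal C sketch of the multiplication method follows. It is an illustration under stated assumptions rather than a reference implementation: the name `hash_multiplication` is hypothetical, the constant uses the full double-precision value of $(\sqrt{5}-1)/2$ instead of the rounded 0.618, and the fractional part is taken by truncation, which assumes $kA$ fits in an `unsigned long long`.

```c
/* Multiplication method: h(k) = floor(m * (k*A mod 1)).
 * Assumptions: m > 0, and k*A is small enough to fit in unsigned long long. */
unsigned long hash_multiplication(unsigned long k, unsigned long m)
{
    const double A = 0.6180339887498949;   /* (sqrt(5) - 1) / 2 */
    double product = (double)k * A;
    /* Fractional part of k*A, i.e. (k*A mod 1), for a non-negative product. */
    double frac = product - (double)(unsigned long long)product;
    /* Truncation equals floor here because m * frac is non-negative. */
    unsigned long h = (unsigned long)((double)m * frac);
    if (h >= m)                             /* guard against rounding up to m */
        h = m - 1;
    return h;
}
```

Because this sketch uses the unrounded constant, `hash_multiplication(6341, 2000)` gives 1907 rather than the 1476 obtained in the worked example with $A = 0.618$; the method is the same, only the precision of $A$ differs.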

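Types of Hash Tables

Chained hash table

Open-addressed hash table

In an open-addressed hash table, the number of stored elements cannot exceed the number of positions in the table ($n \le m$), so its load factor is always less than or equal to 1.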

concept

Load factor

$\alpha = \frac{n}{m}$
where $n$ is the number of elements to be stored and $m$ is the number of distinct hash codings, that is, the number of positions in the table.

probe

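Collisions in an open-addressed hash table are resolved by probing the table. To insert an element, positions are probed until an empty one is found, and the element is placed there. To remove or look up an element, positions are likewise probed until the element is found or an empty position is encountered. If an empty position is reached before the element is found, or every position has been traversed, the element is not in the table.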
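The insert and lookup procedures can be sketched in C as follows. Everything named here (`OpenTable`, `TABLE_SIZE`, `EMPTY_SLOT`, `probe`, `table_insert`, `table_lookup`) is hypothetical, keys are assumed to be non-negative integers, and `probe` is a simple linear placeholder chosen only to keep the sketch short. Removal is omitted because it needs extra bookkeeping for vacated positions.

```c
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 11      /* hypothetical small table, for illustration */
#define EMPTY_SLOT (-1)    /* sentinel; keys are assumed non-negative */

typedef struct {
    int slots[TABLE_SIZE];
} OpenTable;

/* Placeholder probe function (linear probing), an assumption of this sketch:
 * returns the position examined on the i-th probe for the given key. */
size_t probe(int key, size_t i)
{
    return ((size_t)key + i) % TABLE_SIZE;
}

void table_init(OpenTable *t)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        t->slots[i] = EMPTY_SLOT;
}

/* Probe until an empty position is found and store the key there.
 * Returns false if every position was probed without finding room. */
bool table_insert(OpenTable *t, int key)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        size_t pos = probe(key, i);
        if (t->slots[pos] == key)
            return true;                /* already present */
        if (t->slots[pos] == EMPTY_SLOT) {
            t->slots[pos] = key;
            return true;
        }
    }
    return false;                       /* table is full */
}

/* Probe until the key is found, an empty position is reached,
 * or all positions have been traversed. */
bool table_lookup(const OpenTable *t, int key)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        size_t pos = probe(key, i);
        if (t->slots[pos] == key)
            return true;
        if (t->slots[pos] == EMPTY_SLOT)
            return false;
    }
    return false;
}
```

The keys 5, 16 and 27 all map to position 5 on the first probe, so inserting them in turn places them at positions 5, 6 and 7; a lookup of 38 then probes those three positions, reaches the empty position 8, and reports that the key is absent.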

goal

The main goal is to minimize the number of probes. Exactly how many probes are needed depends mainly on two factors:
- the load factor
- the degree to which elements are distributed uniformly

Assuming uniform hashing, the expected number of positions probed is

$\frac{1}{1-\alpha}$

| Load factor (%) | Expected probes |
|---|---|
| < 50 | < 1 / (1 - 0.5) = 2 |
| 80 | 1 / (1 - 0.8) = 5 |
| 90 | 1 / (1 - 0.9) = 10 |
| 95 | 1 / (1 - 0.95) = 20 |
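The table rows can be reproduced with a one-line function. This is a small sketch: the name `expected_probes` and the choice to return `-1.0` for a load factor outside $[0, 1)$ are assumptions made here, not part of the original text.

```c
/* Expected number of probes under uniform hashing: 1 / (1 - alpha).
 * Returns -1.0 as an error value when alpha is outside [0, 1),
 * where the formula does not apply. */
double expected_probes(double alpha)
{
    if (alpha < 0.0 || alpha >= 1.0)
        return -1.0;
    return 1.0 / (1.0 - alpha);
}
```

For instance, `expected_probes(0.5)` evaluates to 2, and the values for 0.8, 0.9 and 0.95 come out at approximately 5, 10 and 20, up to floating-point rounding.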

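In practice, how closely actual results approach the values in the table depends on how closely we approximate uniform hashing, that is, on the hash function we choose. In an open-addressed hash table, however, it also depends on how subsequent positions are probed when a collision occurs.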

Generally, a hash function for probing positions in an open-addressed hash table is defined by:

$h(k, i) = x$
where k is a key, i is the number of times the table has been probed thus far, and x is the resulting hash coding.

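One of the most effective ways to probe an open-addressed hash table is double hashing, which adds the values of two auxiliary hash functions:

$h(k, i) = (h_1(k) + i\,h_2(k)) \bmod m$
The functions $h_1(k)$ and $h_2(k)$ are auxiliary hash functions, chosen like any other hash function so that elements are distributed as uniformly and randomly as possible. However, to ensure that every position is visited before any position is visited twice, one of the following procedures must be followed:
- Choose $m$ to be a power of 2 and make $h_2$ always return an odd value.
- Choose $m$ to be prime and design $h_2$ so that it always returns a positive integer less than $m$.

Typically, $h_1(k) = k \bmod m$ and $h_2(k) = 1 + (k \bmod m')$, where $m'$ is slightly less than $m$, for example $m - 1$ or $m - 2$.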
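The effect of the two conditions on $m$ and $h_2$ can be checked empirically. The sketch below uses a hypothetical helper, `probes_cover_table`, which takes a starting position (the value of $h_1(k)$), a step (the value of $h_2(k)$) and the table size $m$, and reports whether the first $m$ probes land on $m$ distinct positions. It assumes $m > 0$ and values small enough that `start + i * step` does not overflow.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Returns true if the positions (start + i*step) mod m for i = 0 .. m-1
 * are all distinct, i.e. every position is visited before any repeats.
 * Returns false on a repeat or if the scratch array cannot be allocated. */
bool probes_cover_table(unsigned long start, unsigned long step, unsigned long m)
{
    bool *seen = calloc(m, sizeof *seen);
    if (seen == NULL)
        return false;

    bool ok = true;
    for (unsigned long i = 0; i < m; i++) {
        unsigned long pos = (start + i * step) % m;
        if (seen[pos]) {        /* position probed twice too early */
            ok = false;
            break;
        }
        seen[pos] = true;
    }
    free(seen);
    return ok;
}
```

With $m = 16$, an odd step such as 5 covers the whole table while an even step such as 6 revisits positions early; with the prime $m = 1699$, any step between 1 and 1698 covers the table.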

For example, if the hash table contains $m = 1699$ positions (a prime number) and we hash the key $k = 15{,}385$, then $h_1(k) = 94$ and $h_2(k) = 113$ (which corresponds to $m' = m - 2 = 1697$), so the positions probed are $(94 + (0)(113)) \bmod 1699 = 94$ when $i = 0$, and every 113th position after this as $i$ increases.
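The worked example can be reproduced with a short C sketch of double hashing. The function names are illustrative, and the choice $m' = m - 2$ is an assumption made here because it is the value that yields the step of 113 quoted in the example; the sketch also assumes $m > 2$.

```c
/* First auxiliary hash: h1(k) = k mod m. */
unsigned long h1(unsigned long k, unsigned long m)
{
    return k % m;
}

/* Second auxiliary hash: h2(k) = 1 + (k mod m'), with m' = m - 2 assumed.
 * The result is always in the range 1 .. m - 2, so it is never zero. */
unsigned long h2(unsigned long k, unsigned long m)
{
    return 1 + k % (m - 2);
}

/* Double hashing: h(k, i) = (h1(k) + i * h2(k)) mod m,
 * where i is the number of probes made so far. */
unsigned long double_hash(unsigned long k, unsigned long i, unsigned long m)
{
    return (h1(k, m) + i * h2(k, m)) % m;
}
```

For $k = 15385$ and $m = 1699$, the first three probes are positions 94, 207 and 320, each 113 positions after the previous one, wrapping around modulo 1699 as $i$ grows.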

The advantage of double hashing is that it is one of the best forms of probing,
producing a good distribution of elements throughout a hash table.
The disadvantage is that m is constrained in order to ensure that all positions in the table will be visited in a series of probes before any position is probed twice.