Hash Functions and Hash Collisions

Latest recommended article published on 2023-07-17 11:57:57

qq345oo

Category: java. Tags: hash algorithm, hash table, algorithm

Article link: https://blog.csdn.net/qq345oo/article/details/116015307

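Two key points about hashing

Choosing the hash function: a hash function should spread keys as evenly as possible across the hash table.
Handling hash collisions: mapping an unbounded set of keys onto a finite set of slots makes collisions unavoidable, so the question is how to resolve them when they occur.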

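1: How to choose a hash function

If the key is a number, several arithmetic functions are available:

Modulo (division method)
Direct addressing (plug the key into a linear formula, e.g. H(key) = key or H(key) = a·key + b, where a and b are constants)
Mid-square method (square the key, then take the middle few digits of the square as the hash address)
Random number method, and so on.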
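The first three methods in the list can be sketched in a few lines of Java. This is only an illustration: the class name, method names, and the decimal-digit interpretation of "middle digits" are assumptions made here, not something the article prescribes.

```java
// Illustrative sketch; the class and method names are assumptions, not from the article.
final class NumericHashSketch {

    // Modulo (division) method: floorMod keeps the index non-negative for negative keys.
    static int modHash(int key, int tableSize) {
        return Math.floorMod(key, tableSize);
    }

    // Direct addressing: H(key) = a * key + b, used as the address itself.
    static int directHash(int key, int a, int b) {
        return a * key + b;
    }

    // Mid-square method: square the key and keep `digits` decimal digits from the middle.
    static int midSquareHash(int key, int digits) {
        long square = (long) key * key;
        int totalDigits = Long.toString(square).length();
        int drop = Math.max(0, (totalDigits - digits) / 2); // digits cut off on the right
        return (int) ((square / pow10(drop)) % pow10(digits));
    }

    private static long pow10(int n) {
        long p = 1;
        for (int i = 0; i < n; i++) p *= 10;
        return p;
    }

    public static void main(String[] args) {
        System.out.println(modHash(17, 10));        // 7
        System.out.println(directHash(5, 2, 3));    // 13
        System.out.println(midSquareHash(1234, 3)); // 1234^2 = 1522756, middle digits 227
    }
}
```

Direct addressing only makes sense when the table is large enough to hold every value of a·key + b, which is why it is rarely used for arbitrary keys.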

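If the key is a string, hash functions went through some evolution before reasonably reliable ones were found:

Sum the ASCII codes of all characters in the string. (Drawbacks: 1. reordering the characters gives the same hash, so anagrams collide; 2. when the table is large and the keys are short, the sum stays small, so keys never land in the higher indices and the distribution is uneven.)
Multiply the running hash by a constant before adding each character, so that character position matters. The String class's hashCode() computes its hash this way; where some versions of this method use 37 as the multiplier, String uses 31. Both are primes, and the difference is minor.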

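    public int hashCode() {
        int h = hash;                      // cached value; 0 means "not computed yet"
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];       // multiply by 31, then add the next character
            }
            hash = h;                      // cache the result for later calls
        }
        return h;
    }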
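To see why the plain character sum is a weak string hash while the multiply-and-add form is not, the two can be compared side by side. The sketch below is a minimal illustration; `sumHash` and `polyHash` are hypothetical helper names introduced here.

```java
// Illustrative sketch; sumHash and polyHash are hypothetical helpers, not JDK methods.
final class StringHashSketch {

    // Adds up the character codes; character order has no effect on the result.
    static int sumHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h += s.charAt(i);
        }
        return h;
    }

    // Multiply-and-add form: h = multiplier * h + c for each character c.
    static int polyHash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(sumHash("abc") == sumHash("cba"));           // true: anagrams collide
        System.out.println(polyHash("abc", 31) == polyHash("cba", 31)); // false
        System.out.println(polyHash("hello", 31) == "hello".hashCode()); // true
    }
}
```

With multiplier 31, `polyHash` reproduces what `String.hashCode()` returns, since both run the same loop starting from 0.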

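2: How to resolve hash collisions

Separate chaining: when keys collide, link the colliding entries together in a linked list and walk that list on the next lookup.
Open addressing: when keys collide, compute another free slot in the table and store the entry there. Depending on the formula used to find that slot, open addressing is further divided into linear probing, quadratic probing, double hashing, and so on.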
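Separate chaining is easy to show in code. The following is a minimal sketch of a chained map, assuming non-null keys and a fixed number of buckets; the class `ChainedMapSketch` is hypothetical and far simpler than a real implementation (no resizing, no removal).

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Illustrative sketch of separate chaining; assumes non-null keys and a fixed capacity.
final class ChainedMapSketch<K, V> {

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    // One linked list (chain) per bucket.
    private final List<LinkedList<Entry<K, V>>> buckets = new ArrayList<>();

    ChainedMapSketch(int capacity) {
        for (int i = 0; i < capacity; i++) buckets.add(new LinkedList<>());
    }

    private LinkedList<Entry<K, V>> bucketFor(K key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    // Returns the previous value for the key, or null if the key was absent.
    V put(K key, V value) {
        LinkedList<Entry<K, V>> chain = bucketFor(key);
        for (Entry<K, V> e : chain) {
            if (e.key.equals(key)) { V old = e.value; e.value = value; return old; }
        }
        chain.add(new Entry<>(key, value)); // collision or empty bucket: append to the chain
        return null;
    }

    // Walks the chain of the key's bucket; returns null if the key is not found.
    V get(K key) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

Creating the map with a capacity of 1 forces every key into the same bucket, which is a quick way to check that lookups still work when everything collides.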

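Linear probing: the key hashes to slot i, but i is occupied, so try i+1; if that is occupied too, try i+2, and so on, wrapping around at the end of the table.
Quadratic probing: use h_i = (h(key) + i*i) % m, 0 ≤ i ≤ m-1. If the slot computed for the current i is occupied, increase i by 1 and compute again.
Double hashing: use h_i = (h(key) + i*h1(key)) % m, 0 ≤ i ≤ m-1. Its advantage over quadratic probing is that it has two hash functions, h and h1, so the step size depends on the key itself, which makes the probe sequence more random and the distribution more even.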
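The three probe formulas differ only in how the i-th candidate slot is derived from the home slot. Here is a small sketch, assuming non-negative inputs and a table of m slots; the names are hypothetical, and `insertLinear` shows only the linear variant end to end.

```java
// Illustrative sketch of open-addressing probe sequences; all names are assumptions.
final class ProbeSketch {

    // Linear probing: home, home+1, home+2, ... (mod m).
    static int linearProbe(int home, int i, int m) {
        return (home + i) % m;
    }

    // Quadratic probing: h_i = (home + i*i) % m.
    static int quadraticProbe(int home, int i, int m) {
        return (home + i * i) % m;
    }

    // Double hashing: h_i = (home + i*step) % m, where step comes from a second hash.
    // step must be non-zero, and ideally coprime with m so every slot can be reached.
    static int doubleHashProbe(int home, int step, int i, int m) {
        return (home + i * step) % m;
    }

    // Inserts a key with linear probing; returns the slot used, or -1 if the table is full.
    static int insertLinear(Integer[] table, int key) {
        int m = table.length;
        int home = Math.floorMod(key, m);
        for (int i = 0; i < m; i++) {
            int slot = linearProbe(home, i, m);
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[7];
        // 10, 17 and 24 all have home slot 3 in a table of 7 slots.
        System.out.println(insertLinear(table, 10)); // 3
        System.out.println(insertLinear(table, 17)); // 4
        System.out.println(insertLinear(table, 24)); // 5
    }
}
```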

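3: HashMap's hash function

HashMap uses several techniques to deal with hash collisions, including separate chaining and converting a bin's linked list into a red-black tree once it grows past 8 nodes (provided the table has at least 64 buckets; below that, the table is resized instead). Within its hash function, XORing the hashCode with its own upper 16 bits also helps.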

/**
     * Computes key.hashCode() and spreads (XORs) higher bits of hash
     * to lower.  Because the table uses power-of-two masking, sets of
     * hashes that vary only in bits above the current mask will
     * always collide. (Among known examples are sets of Float keys
     * holding consecutive whole numbers in small tables.)  So we
     * apply a transform that spreads the impact of higher bits
     * downward. There is a tradeoff between speed, utility, and
     * quality of bit-spreading. Because many common sets of hashes
     * are already reasonably distributed (so don't benefit from
     * spreading), and because we use trees to handle large sets of
     * collisions in bins, we just XOR some shifted bits in the
     * cheapest possible way to reduce systematic lossage, as well as
     * to incorporate impact of the highest bits that would otherwise
     * never be used in index calculations because of table bounds.
     */
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

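In plain terms:

The method computes key.hashCode() and XORs the upper bits of the hash into the lower bits. Because the table length is a power of two and the index is obtained by masking, sets of hashes that differ only in bits above the current mask always collide. (A well-known example is a set of Float keys holding consecutive whole numbers in a small table.) So a transform is applied that pushes the influence of the higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (and so gain nothing from spreading), and because trees handle large sets of collisions in bins, the implementation simply XORs some shifted bits in the cheapest possible way. This reduces systematic loss and brings in the highest bits, which would otherwise never take part in index calculations because of the table bounds.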
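The Float case mentioned in the comment can be reproduced with a short experiment. The sketch below counts how many distinct buckets the keys 1.0f through 8.0f occupy, with and without the XOR step. The table size of 256 and the class name are assumptions picked for illustration; `spread` copies the expression from `hash()` above.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative experiment; the table size and the key range are assumptions.
final class FloatSpreadSketch {

    // Same transform as HashMap.hash(): fold the upper 16 bits into the lower 16.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Counts the distinct bucket indexes used by the Float keys 1.0f .. 8.0f.
    static int distinctBuckets(int tableSize, boolean useSpread) {
        Set<Integer> used = new HashSet<>();
        for (int k = 1; k <= 8; k++) {
            int h = Float.hashCode((float) k); // the IEEE 754 bit pattern of the float
            if (useSpread) {
                h = spread(h);
            }
            used.add((tableSize - 1) & h);
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctBuckets(256, false)); // 1: every key lands in bucket 0
        System.out.println(distinctBuckets(256, true));  // more than 1 after spreading
    }
}
```

Small whole-number floats keep all their varying bits in the exponent and the top of the mantissa, so their low bits are all zero and the mask sees no difference between them until the high bits are folded in.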

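The hash() function is used in many places in HashMap. Consider one key spot, the putVal method that inserts data:
Here hash is the value computed by hash(), and n is the number of buckets in the table (the number of bins, i.e. the array length). If tab[i] == null, meaning the slot is still empty, the new node is placed there directly.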

/**
     * Implements Map.put and related methods.
     *
     * @param hash hash for key
     * @param key the key
     * @param value the value to put
     * @param onlyIfAbsent if true, don't change existing value
     * @param evict if false, the table is in creation mode.
     * @return previous value, or null if none
     */
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        // hash was computed by hash(); n is the number of buckets (bins), i.e. the array length
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            // ... (the rest of the method is omitted in this excerpt)
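The index expression `(n - 1) & hash` in the excerpt stands in for a modulo operation, which works because n is always a power of two. A quick way to convince yourself is to compare it with `Math.floorMod`; the helper class below is a hypothetical sketch written for this comparison.

```java
// Illustrative sketch; indexFor is a hypothetical helper mirroring (n - 1) & hash.
final class IndexMaskSketch {

    // Valid as a modulo replacement only when n is a power of two.
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int[] samples = {0, 1, 15, 16, 17, 99162322, -1, Integer.MIN_VALUE};
        for (int h : samples) {
            // Both columns print the same value, including for negative hashes.
            System.out.println(indexFor(h, n) + " " + Math.floorMod(h, n));
        }
    }
}
```

The mask is cheaper than a division and never yields a negative index, but it only looks at the low bits of the hash, which is exactly why the bit-spreading step matters.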

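The following explains why the hash is XORed with its upper 16 bits.

Assume the table has the default 16 buckets (new HashMap()) and we insert two entries: {"hello": "1"} and {"world": "2"}.

Suppose we use only the plain hashCode() of "hello" and "world", without XORing in the upper 16 bits. The values are, respectively: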

0000,0101,1110,1001,0001,1000,1101,0010
0000,0110,1100,0001,0001,1011,1001,0010

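Now compute (n - 1) & hash, i.e. 15 & hash:

   "hello" : 0000,0101,1110,1001,0001,1000,1101,0010     &
             0000,0000,0000,0000,0000,0000,0000,1111
             result                         ===> 0010     = 2

   "world" : 0000,0110,1100,0001,0001,1011,1001,0010     &
             0000,0000,0000,0000,0000,0000,0000,1111
             result                         ===> 0010     = 2

The two keys collide, so once one of them is in the table, inserting the other means it has to go into the linked list (or red-black tree) of the same bin.

With HashMap's hash() function, the low 16 bits of the result are no longer determined by the low 16 bits of hashCode() alone; they are also affected by the upper 16 bits. After XORing with the upper 16 bits, the values are: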

0000,0101,1110,1001,0001,1101,0011,1011
0000,0110,1100,0001,0001,1101,0101,0011

Computing (n - 1) & hash, i.e. 15 & hash, again gives:

   "hello" : 0000,0101,1110,1001,0001,1101,0011,1011     &
             0000,0000,0000,0000,0000,0000,0000,1111
             result                         ===> 1011     = 11

   "world" : 0000,0110,1100,0001,0001,1101,0101,0011     &
             0000,0000,0000,0000,0000,0000,0000,1111
             result                         ===> 0011     = 3

The hashes no longer collide, so the insert does not have to resolve a collision and is naturally more efficient.
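The bit patterns above can be checked with a few lines of Java. This is a small verification sketch, assuming a table of 16 buckets as in the example; the class and method names are made up for the demonstration.

```java
// Illustrative check of the "hello" / "world" example; names are assumptions.
final class HelloWorldBuckets {

    // Same transform as HashMap.hash() for a non-null key.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index in a table of n buckets, n being a power of two.
    static int bucket(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"hello", "world"}) {
            int raw = key.hashCode();
            System.out.printf("%s: raw bucket = %d, spread bucket = %d%n",
                    key, bucket(raw, 16), bucket(spread(raw), 16));
        }
        // hello: raw bucket = 2, spread bucket = 11
        // world: raw bucket = 2, spread bucket = 3
    }
}
```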

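Summary: HashMap's hash() function XORs the hashCode with its own upper 16 bits, spreading the influence of the high bits into the low bits and lowering the probability of hash collisions.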
