Java_集合操作_集合中的哈希码不要重复

最新推荐文章于 2021-10-31 19:20:26 发布

tianshuang.me

最新推荐文章于 2021-10-31 19:20:26 发布

阅读量465

点赞数

文章标签：哈希 hashmap

本文链接：https://blog.csdn.net/u012552052/article/details/45011341

版权

在一个列表中查找某值是非常耗费资源的，随机存取的列表是遍历查找，顺序存储列表是链表查找，或者是Collections的二分法查找，但这都不够快，毕竟都是遍历嘛，最快的还要数以Hash开头的集合（如HashMap、HashSet等类）查找，我们以HashMap为例，看看它是如何查找Key值的，代码如下：

package deep;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Client {

    public static void main(String[] args) {
        int size = 10000;
        List<String> list = new ArrayList<String>(size);
        // 初始化
        for (int i = 0; i < size; ++i) {
            list.add("value" + i);
        }
        // 记录开始时间，单位纳秒
        long start = System.nanoTime();
        // 开始查找
        list.contains("value" + (size - 1));
        // 记录结束时间，单位纳秒
        long end = System.nanoTime();
        System.out.println("list执行时间：" + (end - start) + "ns");
        // Map的查找时间
        Map<String, String> map = new HashMap<String, String>(size);
        for (int i = 0; i < size; ++i) {
            map.put("key" + i, "value" + i);
        }
        start = System.nanoTime();
        map.containsKey("key" + (size - 1));
        end = System.nanoTime();
        System.out.println("map执行时间：" + (end - start) + "ns");
    }
}

运行结果：
list执行时间：18005156ns
map执行时间：16535ns

两者的contains方法都是判断是否包含指定值，为何差距如此巨大呢？而且如果数据量增大，差距也会非线性地增大。
我们先来看ArrayList，它的contains就是一个遍历对比，for循环逐个进行遍历，判断equals的返回值是否为true，为true即找到结果，不再遍历，这很简单，不再多说。
我们再来看看HashMap的containsKey方法是如何实现的，代码如下：

    public boolean containsKey(Object key) {
        //判断getEntry是否为空
        return getEntry(key) != null;
    }

getEntry会根据key值查找它的键值对（也就是Entry对象），如果没有找到，则返回null。我们再来看看该方法又是如何实现的，代码如下：

    final Entry<K,V> getEntry(Object key) {
        if (size == 0) {
            return null;
        }
        //计算key的哈希码
        int hash = (key == null) ? 0 : hash(key);
        //定位Entry，indexFor方法是根据hash定位数组的位置的
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash &&
                ((k = e.key) == key || (key != null && key.equals(k))))
                return e;
        }
        return null;
    }

通过indexFor方法定位Entry在数组table中的位置，这是HashMap实现的一个关键点，怎么能根据hashCode定位它在数组中的位置呢？
要解释清楚这个问题，还需要从HashMap的table数组是如何存储元素的说起。首先要说明以下三点：

table数组的长度永远是2的N次幂。
table数组中的元素是Entry类型。
table数组中的元素位置是不连续的。

table数组为何如此设计呢？下面逐步来说明。先来看HashMap是如何插入元素的，代码如下：

    public V put(K key, V value) {
        if (table == EMPTY_TABLE) {
            inflateTable(threshold);
        }
        if (key == null)
            return putForNullKey(value);
        //计算hash码，并定位元素    
        int hash = hash(key);
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            //哈希码相同，并key相等，则覆盖
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        //插入新元素，或者替换同哈希的旧元素并建立链表
        addEntry(hash, key, value, i);
        return null;
    }

HashMap每次增加元素时都会使用hash方法和indexFor方法将哈希码转变成数组的下标。源代码如下：

    final int hash(Object k) {
        int h = hashSeed;
        if (0 != h && k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }

        h ^= k.hashCode();

        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
        return h & (length-1);
    }

null值也是可以作为key值的，它的位置永远是在Entry数组的第一位。
现在有一个很重要的问题摆在面前了：哈希运算存在着哈希冲突的问题，即对于一个固定的哈希算法f(k)，允许出现f(k1)=f(k2)，但f1!=f2的情况，也就是说两个不同的Entry，可能产生相同的哈希码，HashMap是如何处理这种冲突的问题的呢？答案是通过链表，每个键值对都是一个Entry，其中每个Entry都有一个next变量，也就是说它会指向下一个键值对——很明显，这应该是一个单向链表，该链表是由addEntry方法完成的，其代码如下：

    void addEntry(int hash, K key, V value, int bucketIndex) {
        if ((size >= threshold) && (null != table[bucketIndex])) {
            resize(2 * table.length);
            hash = (null != key) ? hash(key) : 0;
            bucketIndex = indexFor(hash, table.length);
        }

        createEntry(hash, key, value, bucketIndex);
    }

    void createEntry(int hash, K key, V value, int bucketIndex) {
        //取得当前位置元素
        Entry<K,V> e = table[bucketIndex];
        //生成新的键值对，并进行替换，建立链表
        table[bucketIndex] = new Entry<>(hash, key, value, e);
        size++;
    }

这段程序涵盖了两个业务逻辑：如果新加入的键值对的hashCode是唯一的，那直接插入到数组中，它Entry的next值为null；如果新加入的键值对的hashCode与其他元素冲突，则替换掉数组中的当前值，并把新加入的Entry的next变量指向被替换掉的元素——于是，一个链表就生成了。
HashMap的存储主线还是数组，遇到哈希冲突的时候则使用链表解决。了解了HashMap是如何存储的，查找也就一目了然了：使用hashCode定位元素，若有哈希冲突，则遍历对比，换句话说在没有哈希冲突的情况下，HashMap的查找则是依赖hashCode定位的，因为是直接定位，那效率当然就高了！
知道HashMap的查找原理，我们就应该很清楚：如果哈希码相同，它的查找效率就与ArrayList没什么两样了，遍历对比，性能会大打折扣。特别是在那些进度紧张的项目中，虽重写了hashCode方法但返回值却是固定的，此时如果把这些对象插入到HashMap中，查找就相当耗时了。