Collection: the Java collections framework & HashMap source code analysis

pNull

Category: JAVA核心技术系列 (Java Core Technology Series)

Article link: https://blog.csdn.net/u011138533/article/details/100022154

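Preface

1 Interfaces and abstract classes

2 Collections framework class diagram

2.1 List: comparing ArrayList and LinkedList

2.2 Set: comparing TreeSet, HashSet, and LinkedHashSet

2.3 Map: comparing Hashtable, HashMap, and TreeMap

3 HashMap source code analysis

3.1 HashMap internal structure

3.2 What is the hash in Node<K,V>?

3.3 putVal: from HashMap initialization and resizing to treeification

3.4 Why HashMap's capacity is a power of two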

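Preface

A collection (set) is a basic concept in mathematics: a whole made up of definite elements.

From this definition it follows that the wrapper classes for primitive types, such as Integer and Boolean, do not represent a whole made up of elements and so are not part of the collections framework, whereas data types such as Map, List, and Queue do and therefore belong to it.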

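1 Interfaces and abstract classes

Interfaces and abstract classes are two basic mechanisms of object-oriented design in Java.

An interface is an abstraction of behavior; its main purpose is to separate the definition of an API from its implementation. An abstract class exists mainly for code reuse. A Java class implements an interface with the implements keyword and inherits from an abstract class with the extends keyword.

A class can extend only one parent class but can implement multiple interfaces. An interface, in turn, can extend several interfaces at once but cannot implement any interface. In this sense, Java interfaces support multiple inheritance.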
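The inheritance rules above can be sketched in a few lines. All type names here (Walker, Swimmer, Amphibian, Animal, Frog) are hypothetical and exist only for illustration.

```java
// Hypothetical types for illustration only.
interface Walker {
    String walk();
}

interface Swimmer {
    String swim();
}

// An interface may extend several interfaces at once.
interface Amphibian extends Walker, Swimmer {
}

// An abstract class holds shared state and code for its subclasses.
abstract class Animal {
    private final String name;

    Animal(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }
}

// A class extends exactly one parent but may implement many interfaces.
class Frog extends Animal implements Amphibian, Comparable<Frog> {
    Frog(String name) {
        super(name);
    }

    @Override
    public String walk() {
        return name() + " hops";
    }

    @Override
    public String swim() {
        return name() + " swims";
    }

    @Override
    public int compareTo(Frog other) {
        return name().compareTo(other.name());
    }
}

class InheritanceDemo {
    public static void main(String[] args) {
        Amphibian frog = new Frog("Kermit");
        System.out.println(frog.walk()); // Kermit hops
        System.out.println(frog.swim()); // Kermit swims
    }
}
```

Writing `class Frog extends Animal, Object` would not compile, while the two-interface `implements` clause does.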

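Interfaces, abstract classes, and concrete classes differ as follows:

Multiple inheritance: interfaces support it; abstract classes do not; classes do not.
Abstract methods: interfaces support them (a method without a body is implicitly abstract); abstract classes support them; classes do not.
Method implementations: interfaces allow them only as default and static methods (since Java 8; before that, not at all); abstract classes allow them; classes allow them.
Instantiation: interfaces cannot be instantiated; abstract classes cannot; classes can.
Partial implementation (some methods implemented, others left abstract): interfaces only through default methods (since Java 8); abstract classes allow it; classes do not.
Permitted members: an interface may contain only public methods (plus private methods since Java 9) and public static final constants; abstract classes and classes have no such restriction.

When to use which: use an interface when you need multiple inheritance or want to define a type; use an abstract class when you want to provide a "template" class with a partial implementation and defer some functionality to subclasses; use a class when you are providing a complete, concrete implementation.

The Java standard library defines a great many interfaces, such as java.util.List; in the collections framework, much of the shared logic is factored out into abstract classes, such as java.util.AbstractList.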
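As a minimal sketch of the "template" role of an abstract class, the following extends java.util.AbstractList and supplies only get and size; the remaining read-only List behavior is inherited. The class name RangeList and its constructor parameters are hypothetical.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical read-only list of the integers start, start + 1, ..., start + count - 1.
class RangeList extends AbstractList<Integer> {
    private final int start;
    private final int count;

    RangeList(int start, int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be >= 0");
        }
        this.start = start;
        this.count = count;
    }

    // The two methods AbstractList leaves abstract for an unmodifiable list.
    @Override
    public Integer get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return start + index;
    }

    @Override
    public int size() {
        return count;
    }
}

class RangeListDemo {
    public static void main(String[] args) {
        List<Integer> list = new RangeList(5, 3);
        System.out.println(list);             // [5, 6, 7], toString() is inherited
        System.out.println(list.contains(6)); // true, contains() is inherited
        System.out.println(list.indexOf(7));  // 2, indexOf() is inherited
    }
}
```

Calling list.add(8) on this class throws UnsupportedOperationException, because AbstractList implements the mutating methods that way until a subclass overrides them.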

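2 Collections framework class diagram

(Figure: hand-drawn class diagram of the Java collections framework; the author notes that it is a bit rough.)

As the diagram shows, the Collection interface is the root of all collections in the framework (Map forms a separate hierarchy). It is extended by three main kinds of collection: List, Set, and Queue. The logic common to each kind is factored out into a corresponding abstract class.

List: an ordered collection.

Set: does not allow duplicate elements, which is the most obvious difference from List; it is used where element uniqueness must be guaranteed.

Queue: the standard queue structure provided by Java. It supports specific behaviors such as first-in-first-out (FIFO) or last-in-first-out (LIFO). BlockingQueue is not covered here: it is mostly used in concurrent code, so it lives in the concurrency package (java.util.concurrent).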

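2.1 List: comparing ArrayList and LinkedList

The List part of the framework, drawn separately:

ArrayList, LinkedList, and Vector all implement the framework's List interface, the so-called ordered collection. Their functionality is therefore similar: all of them support lookup, insertion, and removal by position. Because their implementations differ (see the class diagram), they also differ in usage, performance, and thread safety.

1. Underlying implementation: ArrayList and Vector are backed by an array; LinkedList is a doubly linked list.

2. Read and write performance

ArrayList:

Search: finding an element by value requires traversing the array, and non-null elements are compared with equals; access by index is a direct array read.
Removal: removing elements does not shrink the capacity of the backing array.
Insertion: when an insertion exceeds the current capacity of the backing array, the array must grow, and growing copies the existing elements into a new, larger array.

LinkedList:

Search: traverses the list.
Removal: traverses the list to find the element to remove.
Insertion: a new node object must be created (named Entry in older JDK versions), and the previous and next references of the neighboring nodes must be updated.

In summary, ArrayList is backed by an array, so inserting or removing anywhere but the end shifts the elements that follow, and growth reallocates and copies the array; insertion and removal are therefore slow, but access by index is fast. LinkedList stores its data in linked nodes, so adding or removing an element is fast once its position is known, but access by index is slow because the list must be traversed.

3. Neither ArrayList nor LinkedList is thread-safe. The Collections utility class, however, provides a family of synchronized wrapper methods, such as Collections.synchronizedList.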
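A small sketch of the synchronized wrapper mentioned in point 3, assuming several threads append to one list. The class SyncListDemo and its helper fill are hypothetical names; Collections.synchronizedList is the real JDK method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SyncListDemo {
    // Hypothetical helper: each of `threads` threads appends `perThread` elements,
    // then the final size of the list is returned.
    static int fill(List<Integer> list, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    list.add(i);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return list.size();
    }

    public static void main(String[] args) {
        List<Integer> safe = Collections.synchronizedList(new ArrayList<>());
        System.out.println(fill(safe, 4, 10_000)); // 40000

        // Iteration is not atomic: callers must lock the wrapper themselves.
        synchronized (safe) {
            int zeros = 0;
            for (int v : safe) {
                if (v == 0) {
                    zeros++;
                }
            }
            System.out.println(zeros); // 4, one zero per thread
        }
    }
}
```

Passing a plain ArrayList to fill instead may lose elements or throw, which is the unsafe behavior the wrapper prevents.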

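2.2 Set: comparing TreeSet, HashSet, and LinkedHashSet

A Set does not allow duplicate elements, which is the most obvious difference from List; it is used where element uniqueness must be guaranteed.

TreeSet: supports access in sorted order, but add, remove, and similar operations are slower, at O(log n).
HashSet: makes no ordering guarantee; it is hash-based, and as long as the hash spreads elements well, add and remove run in O(1).
LinkedHashSet: internally maintains a doubly linked list that records insertion order, so it can be iterated in insertion order. Insertion and removal are slightly slower than in HashSet because maintaining the list has a cost.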
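The ordering differences among the three sets can be observed directly. In this sketch the class SetOrderDemo and the helper fill are hypothetical names; the same four strings, one of them a duplicate, are added to each set.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

class SetOrderDemo {
    // Hypothetical helper: adds the same elements (with one duplicate) to any Set.
    static Set<String> fill(Set<String> set) {
        Collections.addAll(set, "pear", "apple", "fig", "apple");
        return set;
    }

    public static void main(String[] args) {
        System.out.println(fill(new TreeSet<>()));        // [apple, fig, pear], sorted
        System.out.println(fill(new LinkedHashSet<>()));  // [pear, apple, fig], insertion order
        System.out.println(fill(new HashSet<>()).size()); // 3, the duplicate is dropped
    }
}
```

The iteration order of the HashSet itself is deliberately not printed, since it is unspecified.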

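2.3 Map: comparing Hashtable, HashMap, and TreeMap

The overall Map hierarchy, again drawn separately:

Hashtable: thread-safe; it is itself synchronized, with its methods declared synchronized. Neither keys nor values may be null.
HashMap: not synchronized and not thread-safe; keys and values may be null. put and get run in O(1) on average. If a HashMap needs to be synchronized, either (1) wrap it with Collections.synchronizedMap, or (2) use ConcurrentHashMap instead.
LinkedHashMap: iterates in insertion order by default.
TreeMap: a Map based on a red-black tree that provides ordered access. It implements the SortedMap interface and is sorted by key; put, get, and remove all run in O(log n).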
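The null-handling and ordering differences listed above can be checked with a short sketch. The class MapCompareDemo and its helpers acceptsNullKey and keyOrder are hypothetical names.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class MapCompareDemo {
    // Hypothetical helper: reports whether the map accepts a null key.
    static boolean acceptsNullKey(Map<String, Integer> map) {
        try {
            map.put(null, 0);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    // Hypothetical helper: inserts three keys and returns the iteration order of the keys.
    static String keyOrder(Map<String, Integer> map) {
        map.put("pear", 1);
        map.put("apple", 2);
        map.put("fig", 3);
        return map.keySet().toString();
    }

    public static void main(String[] args) {
        System.out.println(acceptsNullKey(new HashMap<>()));   // true
        System.out.println(acceptsNullKey(new Hashtable<>())); // false
        System.out.println(keyOrder(new LinkedHashMap<>()));   // [pear, apple, fig]
        System.out.println(keyOrder(new TreeMap<>()));         // [apple, fig, pear]
    }
}
```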

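3 HashMap source code analysis

HashMap in a nutshell: HashMap reads and writes data by hashing. When we pass a key-value pair to put(), it calls the key's hashCode() method to compute the hash code and then locates the bucket in which to store the entry. When retrieving, it locates the bucket from the key's hash in the same way, uses the key's equals() method to find the right key-value pair, and returns its value. HashMap resolves collisions with linked lists: when a collision occurs, the entry is stored in the next node of the list, and each list node stores the whole key-value pair. When two different keys have the same hash code, they are stored in the linked list of the same bucket, and equals() on the key tells them apart. Once a list reaches the threshold (TREEIFY_THRESHOLD, 8), it is converted into a tree, provided the table has at least MIN_TREEIFY_CAPACITY (64) buckets; otherwise the table is resized instead.

3.1 HashMap internal structure

Internally, HashMap is built from an array and linked lists. The array is divided into buckets, and the hash value determines where in the array a key-value pair is placed. Pairs whose hashes map to the same bucket are stored as a linked list, and once that list reaches the threshold (TREEIFY_THRESHOLD, 8) it is converted into a tree, subject to the minimum table capacity MIN_TREEIFY_CAPACITY (64).

From the source we can see that the default initial array size is 16, the treeification threshold for a list is 8, and the load factor is 0.75.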

public class HashMap<K,V> extends AbstractMap<K,V>
    implements Map<K,V>, Cloneable, Serializable {

    private static final long serialVersionUID = 362498820763181265L;

    /**
     * The default initial capacity - MUST be a power of two.
     */
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

    /**
     * The maximum capacity, used if a higher value is implicitly specified
     * by either of the constructors with arguments.
     * MUST be a power of two <= 1<<30.
     */
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /**
     * The load factor used when none specified in constructor.
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    /**
     * The bin count threshold for using a tree rather than list for a
     * bin.  Bins are converted to trees when adding an element to a
     * bin with at least this many nodes. The value must be greater
     * than 2 and should be at least 8 to mesh with assumptions in
     * tree removal about conversion back to plain bins upon
     * shrinkage.
     */
    static final int TREEIFY_THRESHOLD = 8;

    /**
     * The bin count threshold for untreeifying a (split) bin during a
     * resize operation. Should be less than TREEIFY_THRESHOLD, and at
     * most 6 to mesh with shrinkage detection under removal.
     */
    static final int UNTREEIFY_THRESHOLD = 6;

    /**
     * The smallest table capacity for which bins may be treeified.
     * (Otherwise the table is resized if too many nodes in a bin.)
     * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
     * between resizing and treeification thresholds.
     */
    static final int MIN_TREEIFY_CAPACITY = 64;

    /**
     * Basic hash bin node, used for most entries.  (See below for
     * TreeNode subclass, and in LinkedHashMap for its Entry subclass.)
     */
    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;

        Node(int hash, K key, V value, Node<K,V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }

        public final K getKey()        { return key; }
        public final V getValue()      { return value; }
        public final String toString() { return key + "=" + value; }

        public final int hashCode() {
            return Objects.hashCode(key) ^ Objects.hashCode(value);
        }

        public final V setValue(V newValue) {
            V oldValue = value;
            value = newValue;
            return oldValue;
        }
        // equals(Object) omitted from this excerpt
    }

    // ... remaining members of HashMap omitted ...
}

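Now let's look at how the array (Node<K,V>[] table) and the linked list are defined:

transient Node<K,V>[] table;   // the bucket array

static class Node<K,V> implements Map.Entry<K,V> {   // a list node
        final int hash;   // not the key's raw hashCode() but the spread value from hash(key), to reduce collisions
        final K key;
        V value;
        Node<K,V> next;   // link to the next node, which forms the list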

        Node(int hash, K key, V value, Node<K,V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
}
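To make the array-plus-linked-list layout concrete, here is a toy map written for this explanation. It is a deliberately reduced sketch, not the JDK implementation: ToyHashMap is a hypothetical class with a fixed table of 16 buckets, no resizing, no treeification, and new nodes are prepended rather than appended.

```java
import java.util.Objects;

// Toy chained hash map for illustration only; not how java.util.HashMap is written.
class ToyHashMap<K, V> {
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Fixed-size bucket array; the length is a power of two so masking works.
    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    // Same spreading expression as HashMap.hash().
    private static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int h = hash(key);
        int i = (table.length - 1) & h;
        for (Node<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == h && Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        table[i] = new Node<>(h, key, value, table[i]); // prepend to the bucket's chain
        return null;
    }

    V get(Object key) {
        int h = hash(key);
        for (Node<K, V> e = table[(table.length - 1) & h]; e != null; e = e.next) {
            if (e.hash == h && Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyHashMap<String, Integer> map = new ToyHashMap<>();
        // "Aa" and "BB" both have hashCode 2112, so they share a bucket.
        map.put("Aa", 1);
        map.put("BB", 2);
        System.out.println(map.get("Aa") + " " + map.get("BB")); // 1 2
    }
}
```

The two colliding keys end up as two nodes in one chain, and equals() is what separates them on lookup.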

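3.2 What is the hash in Node<K,V>?

When we put data, the call is putVal(hash(key), key, value, false, true). As this shows, hash is not the key's hashCode() itself: the high 16 bits of the hash code are shifted down and XORed into the low 16 bits. The reason is that for some data the computed hash codes differ mainly in their high bits, while bucket addressing in HashMap ignores the bits above the table capacity. This treatment effectively avoids collisions in such cases.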

/* ---------------- Static utilities -------------- */

    /**
     * Computes key.hashCode() and spreads (XORs) higher bits of hash
     * to lower.  Because the table uses power-of-two masking, sets of
     * hashes that vary only in bits above the current mask will
     * always collide. (Among known examples are sets of Float keys
     * holding consecutive whole numbers in small tables.)  So we
     * apply a transform that spreads the impact of higher bits
     * downward. There is a tradeoff between speed, utility, and
     * quality of bit-spreading. Because many common sets of hashes
     * are already reasonably distributed (so don't benefit from
     * spreading), and because we use trees to handle large sets of
     * collisions in bins, we just XOR some shifted bits in the
     * cheapest possible way to reduce systematic lossage, as well as
     * to incorporate impact of the highest bits that would otherwise
     * never be used in index calculations because of table bounds.
     */
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
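The effect of the spreading step can be seen with hash codes that differ only above bit 15. In this sketch, HashSpreadDemo, spread, and index are hypothetical names; spread applies the same expression as hash() above to a raw hash code, and the table size of 16 is an assumption chosen for the example.

```java
class HashSpreadDemo {
    // Same expression as HashMap.hash(), applied to an already computed hashCode.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table of length n (n must be a power of two).
    static int index(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int[] rawHashes = {0x10000, 0x20000, 0x30000}; // differ only in the high bits
        for (int h : rawHashes) {
            // left: index without spreading, right: index with spreading
            System.out.println(index(h, n) + " -> " + index(spread(h), n));
        }
        // prints:
        // 0 -> 1
        // 0 -> 2
        // 0 -> 3
    }
}
```

Without spreading, all three hash codes fall into bucket 0; with it, they land in three different buckets.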

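3.3 putVal: from HashMap initialization and resizing to treeification

Now let's look at the putVal method. Its logic is very concentrated: HashMap initialization, resizing, and treeification all pass through this method.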

/**
     * Implements Map.put and related methods
     *
     * @param hash hash for key
     * @param key the key
     * @param value the value to put
     * @param onlyIfAbsent if true, don't change existing value
     * @param evict if false, the table is in creation mode.
     * @return previous value, or null if none
     */
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
                for (int binCount = 0; ; ++binCount) {
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }
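The return value and the onlyIfAbsent flag in the listing above are visible through the public API: in JDK 8, put passes onlyIfAbsent = false and putIfAbsent passes true. A small sketch, with PutDemo and trace as hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

class PutDemo {
    // Hypothetical helper: records what each call returns, joined with commas.
    static String trace() {
        Map<String, Integer> m = new HashMap<>();
        StringBuilder out = new StringBuilder();
        out.append(m.put("a", 1)).append(',');         // null: no previous mapping
        out.append(m.put("a", 2)).append(',');         // 1: old value returned, value replaced
        out.append(m.putIfAbsent("a", 3)).append(','); // 2: existing value is kept
        out.append(m.get("a")).append(',');            // 2
        m.put("b", null);
        out.append(m.putIfAbsent("b", 9)).append(','); // null: a null value counts as absent
        out.append(m.get("b"));                        // 9
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(trace()); // null,1,2,2,null,9
    }
}
```

The last two entries correspond to the `oldValue == null` branch of putVal: a key mapped to null is overwritten even when onlyIfAbsent is true.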

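In putVal, the statement if (++size > threshold) resize(); means that once size exceeds threshold (capacity × load factor, which is 12 for the default capacity of 16), resize() is called; this is what is commonly called rehashing. The main cost of resizing is that the elements of the old array must be redistributed into the new one.

The initial capacity of a HashMap therefore matters, because it directly determines the number of available buckets: too many buckets waste space, while too few lead to resize operations that hurt performance. How should it be set in practice? We know that load factor × capacity > number of elements, so the preset capacity should be greater than "estimated number of elements / load factor", and the capacity actually used is a power of two. The load factor defaults to 0.75.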
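The sizing rule above works out as follows for an estimated 100 elements. This is a sketch under stated assumptions: CapacityDemo, initialCapacityFor, and nextPowerOfTwo are hypothetical helpers, and nextPowerOfTwo only illustrates the rounding that HashMap applies internally, for inputs between 1 and 1 << 30.

```java
import java.util.HashMap;
import java.util.Map;

class CapacityDemo {
    // Smallest capacity c with c * loadFactor >= expected.
    static int initialCapacityFor(int expected, float loadFactor) {
        return (int) Math.ceil(expected / (double) loadFactor);
    }

    // Rounds n up to a power of two; valid for 1 <= n <= 1 << 30.
    static int nextPowerOfTwo(int n) {
        int high = Integer.highestOneBit(n);
        return (high == n) ? n : high << 1;
    }

    public static void main(String[] args) {
        int requested = initialCapacityFor(100, 0.75f);
        System.out.println(requested + " -> " + nextPowerOfTwo(requested)); // 134 -> 256

        // Passing the estimate to the constructor avoids intermediate resizes
        // while the 100 entries are inserted.
        Map<Integer, String> map = new HashMap<>(requested);
        for (int i = 0; i < 100; i++) {
            map.put(i, "v" + i);
        }
        System.out.println(map.size()); // 100
    }
}
```

With a table of 256 buckets the resize threshold is 192, comfortably above the 100 expected entries.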

3.4 Why HashMap's capacity is a power of two

We now know what hash is, and HashMap decides which bucket a value goes into based on the hash of its key. In putVal we can also see:

if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;

if ((p = tab[i = (n - 1) & hash]) == null)   
    tab[i] = newNode(hash, key, value, null);

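In n = (tab = resize()).length;, n is the length of the table array (the number of buckets), and tab[i = (n - 1) & hash] computes the index of the key in tab. If n is always a power of two, then the binary form of n - 1 is always a run of consecutive 1 bits (for example, 16 - 1 = 15 = 00001111), so (n - 1) & hash simply keeps the low x bits of hash, where n = 2^x.

For example: 10110111 & 00001111 = 00000111 (00000111 is 7), so with a table of size 16, a hash of 10110111 goes into bucket 7.

The benefits are:

The & operation is fast, at least faster than the % modulo operation.
The index is guaranteed to fall within the capacity and never exceeds the array length.
When n is a power of two, (n - 1) & hash satisfies the identity (n - 1) & hash = hash % n for non-negative hash values.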
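The masking identity in the list above can be verified with a short check. MaskDemo and byMask are hypothetical names. For negative hash values Java's % operator yields a negative remainder, so the comparison there uses Math.floorMod.

```java
class MaskDemo {
    // Bucket index by masking; n must be a power of two.
    static int byMask(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int hash = 0b10110111;                // 183
        System.out.println(byMask(hash, n));  // 7
        System.out.println(hash % n);         // 7, same as the mask for non-negative hash

        int negative = -5;
        System.out.println(byMask(negative, n));        // 11
        System.out.println(negative % n);               // -5, not a valid index
        System.out.println(Math.floorMod(negative, n)); // 11, matches the mask
    }
}
```

Masking therefore also gives a valid index for negative hash values, which plain % would not.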

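You may now ask: what happens if I explicitly specify an initial capacity that is not a power of two?

When a HashMap is constructed, as the code below shows, this.threshold = tableSizeFor(initialCapacity); is executed, and tableSizeFor rounds the requested capacity up to a power of two, which guarantees that n is always a power of two.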

public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);
        this.loadFactor = loadFactor;
        this.threshold = tableSizeFor(initialCapacity);
    }

 /**
     * Returns a power of two size for the given target capacity.
     */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }
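Because tableSizeFor is not public, the sketch below copies the JDK 8 method shown above into a hypothetical class TableSizeDemo in order to try a few inputs.

```java
class TableSizeDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Copy of the JDK 8 method from the listing above, for experimentation.
    static int tableSizeFor(int cap) {
        int n = cap - 1;   // so that an exact power of two maps to itself
        n |= n >>> 1;      // each step smears the highest set bit further right
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;     // n is now all ones below its highest set bit
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(0));  // 1
        System.out.println(tableSizeFor(10)); // 16
        System.out.println(tableSizeFor(16)); // 16
        System.out.println(tableSizeFor(17)); // 32
    }
}
```

So new HashMap<>(10) and new HashMap<>(16) both end up with a table of 16 buckets, while a request for 17 yields 32.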

That's all for now; I can't push on any further.