A Brief Look at Collections (Part 1)

Latest recommended article published on 2023-05-24 16:47:28

License: CC 4.0 BY-SA

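This article looks at the characteristics and use cases of HashMap, Hashtable, TreeMap and the different implementations of the List interface in the Java collections framework, analyzes how the fail-fast mechanism works and how to deal with it, and introduces how CopyOnWriteArrayList is implemented.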

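A few questions to start with:

What is the fail-fast mechanism?
What is CAS (Compare And Swap)?
How is ConcurrentHashMap implemented?

Let's begin with a familiar question: the differences between Hashtable and HashMap.

I. HashMap & Hashtable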

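Hashtable is thread-safe; HashMap is not (HashMap's methods are not declared synchronized). Multiple threads can share one Hashtable, whereas a HashMap must not be shared between threads without external synchronization. (PS: Java 5 introduced ConcurrentHashMap, a replacement for Hashtable that scales better.) A HashMap can be wrapped for synchronized access like this: Map m = Collections.synchronizedMap(hashMap);
Hashtable does not allow null keys or null values; HashMap allows one null key and any number of null values.
Hashtable keeps the legacy contains(Object value) method, which does the same thing as containsValue, alongside containsKey and containsValue; HashMap has no contains method, only containsKey and containsValue.
Hashtable extends the Dictionary class and implements the Map, Cloneable and Serializable interfaces (it was retrofitted to implement Map in Java 1.2); HashMap extends AbstractMap and also implements Map, Cloneable and Serializable. AbstractMap is a skeletal implementation of the Map interface.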
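
The differences listed above are easy to check directly. The following is a minimal sketch, not part of the original article; the class name MapDifferencesDemo and the helper rejectsNullKey are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class MapDifferencesDemo {
    /** Returns true if the given map rejects a null key with a NullPointerException. */
    static boolean rejectsNullKey(Map<String, String> map) {
        try {
            map.put(null, "value");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectsNullKey(new HashMap<>()));   // HashMap accepts a null key
        System.out.println(rejectsNullKey(new Hashtable<>())); // Hashtable throws NPE

        // The legacy contains(value) on Hashtable behaves like containsValue(value).
        Hashtable<String, String> table = new Hashtable<>();
        table.put("k", "v");
        System.out.println(table.contains("v") == table.containsValue("v"));

        // Wrapping a HashMap so that each individual call is synchronized.
        Map<String, String> syncMap = Collections.synchronizedMap(new HashMap<>());
        syncMap.put("k", "v");
        System.out.println(syncMap.get("k"));
    }
}
```

Note that synchronizedMap only makes individual calls atomic; compound actions such as check-then-put still need external locking.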

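PS: I've seen people online say that HashMap's Iterator is fail-fast while Hashtable's is not. However, Hashtable's source code also has a modCount field. One might think that, since Hashtable's methods are all synchronized and thread-safe, it wouldn't need this mechanism... but its methods really do use modCount, compare the current value against a saved copy, and throw the fail-fast exception when the two differ. (What the Javadoc actually says is narrower: the iterators of Hashtable's collection views are fail-fast, while the Enumerations returned by keys() and elements() are not.)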

    /**
     * The number of times this HashMap has been structurally modified
     * Structural modifications are those that change the number of mappings in
     * the HashMap or otherwise modify its internal structure (e.g.,
     * rehash).  This field is used to make iterators on Collection-views of
     * the HashMap fail-fast.  (See ConcurrentModificationException).
     */
    transient int modCount;

    // ** some other code, e.g. one example

    @SuppressWarnings("unchecked")
    @Override
    public synchronized void forEach(BiConsumer<? super K, ? super V> action) {
        Objects.requireNonNull(action);     // explicit check required in case
                                            // table is empty.
        final int expectedModCount = modCount;

        Entry<?, ?>[] tab = table;
        for (Entry<?, ?> entry : tab) {
            while (entry != null) {
                action.accept((K)entry.key, (V)entry.value);
                entry = entry.next;

                if (expectedModCount != modCount) {
                    throw new ConcurrentModificationException();
                }
            }
        }
    }

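So the source contains modCount++ in the mutating operations, and a number of operations check expectedModCount != modCount, which shows that Hashtable supports fail-fast too. (Some say this was changed in JDK 6 to add fail-fast support; I haven't confirmed the details.)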

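About modCount

1. Hashtable

First of all: my JDK version is 1.8.0_91.

In Hashtable, modCount++ appears in rehash(), addEntry() (called by put), remove(), clear(), compute(), merge() and so on. In Hashtable's inner classes EntrySet and Enumerator (the iterator), the places that modify the Hashtable object also increment modCount.

clone() sets modCount = 0 on the newly created Hashtable object.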

So how is fail-fast actually triggered? Look again at the forEach() method shown above.

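Before traversing, the method stores the current modCount in a final local variable, expectedModCount. During the traversal, after each entry it checks whether the object's modCount still equals the value saved before the traversal started. If not, the table was structurally modified during the traversal, and a ConcurrentModificationException is thrown. Because forEach() itself is synchronized, the modification here typically comes from the action callback; with an unsynchronized iterator it can equally come from another thread.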
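
As a small illustration of that check (a sketch, not from the original article; the class and method names are made up), a callback that structurally modifies the table inside forEach() trips the expectedModCount comparison even with a single thread:

```java
import java.util.ConcurrentModificationException;
import java.util.Hashtable;

public class HashtableForEachDemo {
    /** Returns true if modifying the table inside forEach triggers a CME. */
    static boolean modifyInsideForEach() {
        Hashtable<String, Integer> table = new Hashtable<>();
        table.put("a", 1);
        table.put("b", 2);
        try {
            // put() with a new key is a structural modification, so modCount changes
            // while forEach still holds the old expectedModCount.
            table.forEach((k, v) -> table.put(k + "-copy", v));
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyInsideForEach());
    }
}
```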

The other places that need fail-fast are implemented the same way. Now let's see how the Enumerator iterator does it.

        /**
         * The modCount value that the iterator believes that the backing
         * Hashtable should have.  If this expectation is violated, the iterator
         * has detected concurrent modification.
         */
        protected int expectedModCount = modCount;

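Hashtable's inner class private class Enumerator<T> implements Enumeration<T>, Iterator<T> has a protected member expectedModCount, initialized to the current modCount. When the table is modified through the iterator (remove()), modCount and expectedModCount are both incremented by one, and each call to next() checks whether expectedModCount still equals modCount.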
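
The contrast between modifying through the iterator and modifying behind its back can be sketched as follows. This is an illustrative example with made-up names, not code from the article.

```java
import java.util.ConcurrentModificationException;
import java.util.Hashtable;
import java.util.Iterator;

public class HashtableIteratorDemo {
    static Hashtable<String, Integer> sample() {
        Hashtable<String, Integer> t = new Hashtable<>();
        t.put("a", 1);
        t.put("b", 2);
        t.put("c", 3);
        return t;
    }

    /** Modifies the table directly while iterating; returns true if a CME is thrown. */
    static boolean directModificationFails() {
        Hashtable<String, Integer> table = sample();
        Iterator<String> it = table.keySet().iterator();
        try {
            while (it.hasNext()) {
                it.next();
                table.put("d", 4); // bypasses the iterator: modCount changes, expectedModCount does not
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Removes every entry through the iterator; returns the final size. */
    static int removeThroughIterator() {
        Hashtable<String, Integer> table = sample();
        Iterator<String> it = table.keySet().iterator();
        while (it.hasNext()) {
            it.next();
            it.remove(); // keeps expectedModCount in step with modCount
        }
        return table.size();
    }

    public static void main(String[] args) {
        System.out.println(directModificationFails());
        System.out.println(removeThroughIterator());
    }
}
```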

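2. HashMap

In HashMap, modCount++ appears in putVal(), removeNode(), clear(), compute(), merge() and, through removeNode(), in HashIterator.remove(), among others.

Fail-fast checks are triggered in forEach() of the KeySet inner class, forEach() of Values, forEach() of EntrySet, HashIterator and so on.

The principle is the same as above: operations that change the object's structure do modCount++, and traversal, replaceAll() and similar operations compare the value before and after. A mismatch means the map was modified in the meantime (by another thread, or by the same thread inside the loop), and a ConcurrentModificationException is thrown.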
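
The same behaviour shows up in the most common single-threaded mistake: removing from a HashMap inside a for-each loop. The sketch below (names are illustrative) shows the failing pattern and one safe alternative, removeIf on the values view, which goes through the iterator's own remove().

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class HashMapFailFastDemo {
    static Map<String, Integer> sample() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        return map;
    }

    /** Removes via the map while iterating its key set; returns true if a CME is thrown. */
    static boolean removeInsideLoopFails() {
        Map<String, Integer> map = sample();
        try {
            for (String key : map.keySet()) {
                map.remove(key); // structural change behind the iterator's back
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Removes odd values through the collection view; returns the remaining size. */
    static int removeIfIsSafe() {
        Map<String, Integer> map = sample();
        map.values().removeIf(v -> v % 2 == 1);
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(removeInsideLoopFails());
        System.out.println(removeIfIsSafe());
    }
}
```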

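TreeMap

TreeMap is an ordered key-value collection implemented with a red-black tree;
TreeMap extends AbstractMap, so it is a Map, i.e. a key-value collection;
TreeMap implements the NavigableMap interface, which provides a family of navigation methods;
TreeMap implements the Cloneable interface, so it can be cloned.

TreeMap is based on a red-black tree. The map is sorted according to the natural ordering of its keys, or by a Comparator provided at map creation time, depending on which constructor is used. The basic operations containsKey, get, put and remove run in O(log n) time.

TreeMap is not synchronized.

TreeMap is essentially a red-black tree with a few important member fields: root, size and comparator. root is the root node of the red-black tree. It is of type Entry, the node type of the tree, which holds the six basic components: key, value, left, right, parent and color. Entry nodes are ordered by key and carry value as their payload. Keys are compared using the comparator (or the keys' natural ordering when comparator is null). size is the number of nodes in the tree.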
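
A short sketch of the ordering and navigation behaviour described above; the class name and sample data are made up for illustration.

```java
import java.util.Comparator;
import java.util.TreeMap;

public class TreeMapOrderDemo {
    /** A map sorted by the natural ordering of its String keys. */
    static TreeMap<String, Integer> natural() {
        TreeMap<String, Integer> m = new TreeMap<>();
        m.put("pear", 3);
        m.put("apple", 1);
        m.put("mango", 2);
        return m;
    }

    /** The same entries, sorted by a Comparator passed to the constructor. */
    static TreeMap<String, Integer> reversed() {
        TreeMap<String, Integer> m = new TreeMap<>(Comparator.<String>reverseOrder());
        m.putAll(natural());
        return m;
    }

    public static void main(String[] args) {
        System.out.println(natural());                  // {apple=1, mango=2, pear=3}
        System.out.println(reversed().firstKey());      // pear
        // NavigableMap methods: greatest key <= "n", and the view of keys < "mango"
        System.out.println(natural().floorKey("n"));    // mango
        System.out.println(natural().headMap("mango")); // {apple=1}
    }
}
```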

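First, a quick look at what a red-black tree is:

Red-black tree

A red-black tree is a self-balancing binary search tree.

A binary search tree must satisfy: the key of every node is greater than the keys in its left subtree and less than the keys in its right subtree. This makes lookups much more efficient, as long as the tree stays balanced. There are many algorithms for keeping a binary tree balanced: AVL, SBT, Red-Black Tree, splay trees and so on (I still recognize the names, but the actual algorithms have mostly been handed back to my teacher; AVL I vaguely remember o(╯□╰)o Sorry, teacher!)

A strictly balanced binary tree (AVL) must have the following property: it is either an empty tree, or the heights of its left and right subtrees differ by at most 1 and both subtrees are themselves balanced binary trees.

As the name suggests, a red-black tree is a binary search tree whose nodes are each red or black; the color constraints keep the tree approximately balanced (the longest root-to-leaf path is at most twice the shortest). A valid red-black tree obeys these additional rules:
1. Every node is either red or black.
2. The root is black.
3. Every leaf (NIL node, i.e. empty node) is black.
4. If a node is red, both of its children are black. In other words, no path contains two adjacent red nodes.
5. Every path from a given node to any of its descendant leaves contains the same number of black nodes.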
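
The five rules can be turned into a small checker. This is a sketch only: the Node type here is hypothetical (it is not TreeMap.Entry), null stands for the black NIL leaf, and the binary-search-tree ordering of keys is not checked.

```java
public class RedBlackCheck {
    static final boolean RED = false;
    static final boolean BLACK = true; // same encoding TreeMap uses

    /** Hypothetical minimal node type, for illustration only. */
    static final class Node {
        final int key;
        final boolean color; // rule 1 holds by construction: only two colors exist
        final Node left, right;

        Node(int key, boolean color, Node left, Node right) {
            this.key = key;
            this.color = color;
            this.left = left;
            this.right = right;
        }
    }

    static boolean isRed(Node n) {
        return n != null && n.color == RED;
    }

    /** Returns the black-height of the subtree, or -1 if rule 4 or rule 5 is violated. */
    static int blackHeight(Node n) {
        if (n == null) return 1; // rule 3: NIL leaves count as black
        if (n.color == RED && (isRed(n.left) || isRed(n.right))) return -1; // rule 4
        int l = blackHeight(n.left);
        int r = blackHeight(n.right);
        if (l == -1 || r == -1 || l != r) return -1; // rule 5
        return l + (n.color == BLACK ? 1 : 0);
    }

    static boolean isValid(Node root) {
        boolean rootIsBlack = (root == null || root.color == BLACK); // rule 2
        return rootIsBlack && blackHeight(root) != -1;
    }

    static Node sampleValid() {
        return new Node(2, BLACK, new Node(1, RED, null, null), new Node(3, RED, null, null));
    }

    static Node sampleInvalid() {
        // The left path has two black nodes, the right path only one: rule 5 is broken.
        return new Node(2, BLACK, new Node(1, BLACK, null, null), null);
    }

    public static void main(String[] args) {
        System.out.println(isValid(sampleValid()));
        System.out.println(isValid(sampleInvalid()));
    }
}
```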

(Figure: a red-black tree; image missing from the source.)

TreeMap source code (JDK 1.8)

TreeMap declaration:

public class TreeMap<K,V>
    extends AbstractMap<K,V>
    implements NavigableMap<K,V>, Cloneable, java.io.Serializable

Definitions of TreeMap's important fields:

    /**
     * The comparator used to maintain order in this tree map, or
     * null if it uses the natural ordering of its keys.
     *
     * @serial
     */
    private final Comparator<? super K> comparator;

    private transient Entry<K,V> root;

    /**
     * The number of entries in the tree
     */
    private transient int size = 0;

    /**
     * The number of structural modifications to the tree.
     */
    private transient int modCount = 0;

    // Red-black mechanics
    private static final boolean RED   = false;
    private static final boolean BLACK = true;

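comparator: the comparator used to order the TreeMap
root: the root node of the red-black tree
size: the total number of nodes in the red-black tree
modCount: as in Hashtable and HashMap above, the number of structural modifications to the TreeMap, used to implement fail-fast
RED & BLACK: the colors of the red-black tree

The tree node type Entry (root is also of type Entry):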

    /**
     * Node in the Tree.  Doubles as a means to pass key-value pairs back to
     * user (see Map.Entry).
     */
    static final class Entry<K,V> implements Map.Entry<K,V> {
        K key;                  // key
        V value;                // value
        Entry<K,V> left;        // left child
        Entry<K,V> right;       // right child
        Entry<K,V> parent;      // parent
        boolean color = BLACK;  // color

        /**
        * methods
        */
    }

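For the rest of the source walkthrough, see these articles:

"TreeMap实现原理" (How TreeMap Is Implemented)

"Java TreeMap源码解析" (Java TreeMap Source Code Analysis)

TreeMap use cases

TreeMap is a Map and likewise stores data as key-value pairs. TreeMap is usually slower than HashMap and Hashtable (especially when inserting or removing key-value pairs), because it manages its entries in a red-black tree: O(log n) per operation versus expected O(1) for the hash-based maps. The advantage of TreeMap is that its key-value pairs are always kept in sorted order, so no separate sorting step is needed. So although its insert and remove performance is relatively poor, it is very useful when data has to be processed in sorted or grouped order, and in-order traversal is fast.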

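The List interface (ArrayList, LinkedList, Vector)

Commonly used implementations of the List interface are ArrayList, LinkedList, and Vector (rarely used; I have never used it).
The most commonly used is of course ArrayList.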

public class ArrayList<E> extends AbstractList<E>
        implements List<E>, RandomAccess, Cloneable, java.io.Serializable

public class LinkedList<E>
    extends AbstractSequentialList<E>
    implements List<E>, Deque<E>, Cloneable, java.io.Serializable

public class Vector<E>
    extends AbstractList<E>
    implements List<E>, RandomAccess, Cloneable, java.io.Serializable

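ArrayList is a data structure based on a dynamic array

Vector, like ArrayList, is also based on a dynamic array, but the differences between them are clear.
1. Vector is thread-safe and ArrayList is not, just like the pairs (Hashtable, HashMap) and (StringBuffer, StringBuilder): none of ArrayList's methods are declared synchronized, while most of Vector's methods (nearly all that touch its state) are synchronized to guarantee thread safety.
2. The second difference is how their capacity grows when add() runs out of space.
ArrayList grows like this: int newCapacity = oldCapacity + (oldCapacity >> 1). As this line shows, it grows by half of the old capacity (oldCapacity >> 1), i.e. to 1.5 times the old capacity.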

    /**
     * Increases the capacity to ensure that it can hold at least the
     * number of elements specified by the minimum capacity argument.
     *
     * @param minCapacity the desired minimum capacity
     */
    private void grow(int minCapacity) {
        // overflow-conscious code
        int oldCapacity = elementData.length;
        int newCapacity = oldCapacity + (oldCapacity >> 1); // grow by 50% of the old capacity
        if (newCapacity - minCapacity < 0) // still not enough after growing: use minCapacity directly
            newCapacity = minCapacity;
        if (newCapacity - MAX_ARRAY_SIZE > 0)
            newCapacity = hugeCapacity(minCapacity);
        // minCapacity is usually close to size, so this is a win:
        elementData = Arrays.copyOf(elementData, newCapacity);
    }

Vector doubles its capacity by default, or grows by capacityIncrement when that value is greater than 0: int newCapacity = oldCapacity + ((capacityIncrement > 0) ? capacityIncrement : oldCapacity);

    private void grow(int minCapacity) {
        // overflow-conscious code
        int oldCapacity = elementData.length;
        int newCapacity = oldCapacity + ((capacityIncrement > 0) ?
                                         capacityIncrement : oldCapacity);
        if (newCapacity - minCapacity < 0)
            newCapacity = minCapacity;
        if (newCapacity - MAX_ARRAY_SIZE > 0)
            newCapacity = hugeCapacity(minCapacity);
        elementData = Arrays.copyOf(elementData, newCapacity);
    }
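
To make the two growth rules concrete, here is a small arithmetic sketch. The helper names are made up, and the minCapacity and MAX_ARRAY_SIZE adjustments in the real grow() methods are deliberately left out.

```java
public class GrowthDemo {
    /** ArrayList's rule from grow(): old capacity plus half of it. */
    static int arrayListNext(int oldCapacity) {
        return oldCapacity + (oldCapacity >> 1);
    }

    /** Vector's rule from grow(): add capacityIncrement if positive, otherwise double. */
    static int vectorNext(int oldCapacity, int capacityIncrement) {
        return oldCapacity + ((capacityIncrement > 0) ? capacityIncrement : oldCapacity);
    }

    public static void main(String[] args) {
        int a = 10;
        int v = 10;
        for (int step = 0; step < 3; step++) {
            a = arrayListNext(a);
            v = vectorNext(v, 0);
            System.out.println("ArrayList: " + a + ", Vector: " + v);
        }
        // ArrayList: 15, 22, 33   Vector: 20, 40, 80
    }
}
```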

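Note: Vector has one more difference, the capacityIncrement growth increment; I'll leave it aside for now. If you're interested, see this blog post:

"Vector和ArrayList的比较" (Comparing Vector and ArrayList)

LinkedList is a data structure based on a (doubly) linked list

These three data structures should be the ones everyone knows best, so I won't say more here.

Everything so far has been the basics of collections (Set hasn't come up yet). I only covered the two most commonly used kinds, List and Map, and used the source code to fill in things I knew about but whose underlying implementation I didn't. (IDEA really is nice: no decompiler needed, the source just shows up. Reading source code with it couldn't be more convenient.)

What I mainly wanted to talk about, though, is the fail-fast mechanism and the solutions to the problems it raises. So here comes the main part.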

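III. The fail-fast mechanism

The fail-fast mechanism has come up several times above: each of these collection implementations has a field, modCount, that records how many times the collection object has been structurally modified.
Fail-fast is an error-detection mechanism of the Java collections. While a Collection is being iterated, it must not be structurally modified, whether by another thread or by the iterating thread itself, other than through the iterator's own methods. Take the class Javadoc of ArrayList as an example: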

 * <p><a name="fail-fast">
 * The iterators returned by this class's {@link #iterator() iterator} and
 * {@link #listIterator(int) listIterator} methods are <em>fail-fast</em>:</a>
 * if the list is structurally modified at any time after the iterator is
 * created, in any way except through the iterator's own
 * {@link ListIterator#remove() remove} or
 * {@link ListIterator#add(Object) add} methods, the iterator will throw a
 * {@link ConcurrentModificationException}.  Thus, in the face of
 * concurrent modification, the iterator fails quickly and cleanly, rather
 * than risking arbitrary, non-deterministic behavior at an undetermined
 * time in the future.
 *
 * <p>Note that the fail-fast behavior of an iterator cannot be guaranteed
 * as it is, generally speaking, impossible to make any hard guarantees in the
 * presence of unsynchronized concurrent modification.  Fail-fast iterators
 * throw {@code ConcurrentModificationException} on a best-effort basis.
 * Therefore, it would be wrong to write a program that depended on this
 * exception for its correctness:  <i>the fail-fast behavior of iterators
 * should be used only to detect bugs.</i>

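In other words, the fail-fast behavior of an iterator cannot be guaranteed: generally speaking, no hard guarantees can be made in the presence of unsynchronized concurrent modification. Fail-fast iterators throw ConcurrentModificationException on a best-effort basis. It is therefore wrong to write a program whose correctness depends on this exception: the fail-fast behavior of iterators should be used only to detect bugs.

HashMap's class comment contains a similar passage.

ArrayList (like HashMap and the other collections with this mechanism) has a field modCount. Every structural modification of the ArrayList object, such as add, remove or clear, increments modCount by one. Before iteration starts, the iterator records
int expectedModCount = modCount (in HashMap it is int mc = modCount, which amounts to the same thing),
and at every step of the iteration it compares expectedModCount with modCount; if they differ, it throws ConcurrentModificationException. That is the whole fail-fast mechanism.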
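
The mechanism is small enough to rebuild by hand. The class below is a hypothetical, stripped-down list written only to show the modCount / expectedModCount handshake; it is not how ArrayList is actually laid out.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical minimal list, for illustration only. */
public class ModCountSketch<E> implements Iterable<E> {
    private final List<E> items = new ArrayList<>(); // backing storage
    private int modCount = 0;                        // counts structural modifications

    public void add(E e) {
        items.add(e);
        modCount++; // every structural change bumps the counter
    }

    @Override
    public Iterator<E> iterator() {
        return new Iterator<E>() {
            private final int expectedModCount = modCount; // snapshot taken at creation
            private int cursor = 0;

            @Override
            public boolean hasNext() {
                return cursor < items.size();
            }

            @Override
            public E next() {
                if (expectedModCount != modCount) {
                    throw new ConcurrentModificationException();
                }
                return items.get(cursor++);
            }
        };
    }

    /** Adds to the list inside a for-each loop; returns true if a CME is thrown. */
    static boolean addWhileIteratingFails() {
        ModCountSketch<String> list = new ModCountSketch<>();
        list.add("a");
        list.add("b");
        try {
            for (String s : list) {
                list.add(s + "!");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(addWhileIteratingFails());
    }
}
```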

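Solutions

Under concurrency with multiple threads, there are two ways to keep fail-fast from breaking the program:

Option 1: guard every place that changes modCount with synchronized, including the whole duration of a traversal, or use Collections.synchronizedList (whose iterators still require the caller to synchronize on the list manually while iterating). This solves the problem but is not recommended: the locks taken for adds and removes can block traversals, and vice versa.
Option 2: replace ArrayList with CopyOnWriteArrayList. (Recommended.)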
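
A minimal sketch of option 1, assuming one writer thread and one reader; the class name and the loop counts are arbitrary. The point it illustrates is that the reader holds the list's lock for the whole traversal, as the Collections.synchronizedList Javadoc requires.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SynchronizedListDemo {
    /** Iterates repeatedly while another thread appends; returns the final size. */
    static int run() {
        List<Integer> list = Collections.synchronizedList(new ArrayList<>());
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                list.add(i); // each add() locks the list internally
            }
        });
        writer.start();
        for (int round = 0; round < 100; round++) {
            synchronized (list) { // manual lock: the iterator itself is not synchronized
                long sum = 0;
                for (int value : list) {
                    sum += value;
                }
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return list.size();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

While the reader is inside the synchronized block the writer is blocked, which is exactly the cost that makes this option less attractive for read-heavy workloads.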

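CopyOnWriteArrayList is a thread-safe variant of ArrayList.
All of its mutative operations (add, set and so on) are implemented by making a fresh copy of the underlying array, so they are relatively expensive. It is a good fit in these two situations, though:
1. When you cannot or do not want to synchronize traversals, yet need to preclude interference among concurrent threads.
2. When traversal operations vastly outnumber mutations.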
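
The snapshot behaviour can be seen even in a single thread. In this sketch (illustrative names), the loop adds to the list it is iterating: no exception is thrown, and the iterator only sees the elements that existed when it was created.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CopyOnWriteDemo {
    /** Adds to the list while iterating it; returns the elements the iterator saw. */
    static List<String> iterateWhileAdding(List<String> list) {
        List<String> seen = new ArrayList<>();
        for (String s : list) {
            seen.add(s);
            list.add(s + "!"); // writes go to a fresh copy of the array
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> cow = new CopyOnWriteArrayList<>(Arrays.asList("a", "b"));
        System.out.println(iterateWhileAdding(cow)); // [a, b]
        System.out.println(cow);                     // [a, b, a!, b!]
    }
}
```

The trade-off is that the iterator is read-only: calling remove() on a CopyOnWriteArrayList iterator throws UnsupportedOperationException.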

public class CopyOnWriteArrayList<E>
    implements List<E>, RandomAccess, Cloneable, java.io.Serializable {
    private static final long serialVersionUID = 8673264195747942595L;

    /** The lock protecting all mutators */
    final transient ReentrantLock lock = new ReentrantLock();

    // .......
}

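A few additional notes:
1. CopyOnWriteArrayList is much like ArrayList in data structure and definition: both implement the List interface, both are backed by an array, and they offer the same methods.
2. Iterating a CopyOnWriteArrayList never throws ConcurrentModificationException. It has no modCount at all.

How CopyOnWriteArrayList avoids the fail-fast problem (taking add() as the example):

ArrayList's add() method: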

    /**
     * Appends the specified element to the end of this list.
     *
     * @param e element to be appended to this list
     * @return <tt>true</tt> (as specified by {@link Collection#add})
     */
    public boolean add(E e) {
        ensureCapacityInternal(size + 1);  // Increments modCount!!
        elementData[size++] = e;
        return true;
    }

    private void ensureCapacityInternal(int minCapacity) {
        if (elementData == DEFAULTCAPACITY_EMPTY_ELEMENTDATA) {
            minCapacity = Math.max(DEFAULT_CAPACITY, minCapacity);
        }

        ensureExplicitCapacity(minCapacity);
    }

    private void ensureExplicitCapacity(int minCapacity) {
        modCount++;

        // overflow-conscious code
        if (minCapacity - elementData.length > 0)
            grow(minCapacity);
    }

It simply appends the element to the end of the array (first checking that there is enough room, growing if not, and doing modCount++).

CopyOnWriteArrayList's add() method:

    /**
     * Appends the specified element to the end of this list.
     *
     * @param e element to be appended to this list
     * @return {@code true} (as specified by {@link Collection#add})
     */
    public boolean add(E e) {
        final ReentrantLock lock = this.lock;
        lock.lock();
        try {
            Object[] elements = getArray();
            int len = elements.length;
            Object[] newElements = Arrays.copyOf(elements, len + 1);
            newElements[len] = e;
            setArray(newElements);
            return true;
        } finally {
            lock.unlock();
        }
    }

    final Object[] getArray() {
        return array;
    }

    final void setArray(Object[] a) {
        array = a;
    }

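First, there is no modCount. Second, the implementation is clearly more involved: it takes a lock, copies the existing array, and applies the modification to the new copy, so other threads that are concurrently reading or iterating the old array are not disturbed. Once the add is done, the new array is swapped in. (The code is clear and self-explanatory, so I won't belabor it.)

CopyOnWriteArrayList truly lives up to its name: Copy first, operate on the copy, then Write it back, and callers of add() never notice any of it.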

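References

ConcurrentHashMap

"Java集合——ConcurrentHashMap" (Java Collections: ConcurrentHashMap)

"ConcurrentHashMap的实现原理" (How ConcurrentHashMap Is Implemented)

"Java并发编程之ConcurrentHashMap" (Java Concurrent Programming: ConcurrentHashMap)

"聊聊并发（四）——深入分析ConcurrentHashMap" (Talking About Concurrency (4): An In-Depth Analysis of ConcurrentHashMap)

"探索jdk8之ConcurrentHashMap 的实现机制" (Exploring the Implementation of ConcurrentHashMap in JDK 8)

It's fairly complex and so far I only have a rough understanding, so for now I'm just listing a few of the blog posts I read.

Note: ConcurrentHashMap uses CAS in operations such as resizing.
So...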

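CAS (Compare And Swap)

CAS is a special instruction, widely supported by modern CPUs, for operating on shared data in memory. It performs an atomic read-modify-write on that shared data.

CAS has three operands: the memory location V to operate on, the expected value A to compare against, and the new value B to write.

CAS atomically updates V to the new value B if and only if V currently holds A; otherwise it does nothing. In either case it returns the value that was in V. (The variant called compare-and-set instead returns whether the operation succeeded.) CAS means: "I think V should have the value A; if it does, set it to B, otherwise don't change it and tell me what the value actually is." CAS is an optimistic technique: it proceeds with the update in the hope of success, and can detect failure if another thread has updated the variable since it was last examined. (From Java Concurrency in Practice, §15.2)

V, A and B can be a bit confusing. Put simply: V is the memory location of the value, A is the old value we expect to find there, and B is the new value to write.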

e.g.

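import net.jcip.annotations.GuardedBy; // from the JCIP annotations library

public class SimulatedCAS {
    @GuardedBy("this")
    private int value;

    public synchronized int get() {
        return value;
    }

    // Returns the value that was in place before the call, whether or not the swap happened.
    public synchronized int compareAndSwap(int expectedValue, int newValue) {
        int oldValue = value;
        if (oldValue == expectedValue) {
            value = newValue;
        }
        return oldValue;
    }

    // Succeeds only if the old value matched the expected value.
    public synchronized boolean compareAndSet(int expectedValue, int newValue) {
        return (expectedValue == compareAndSwap(expectedValue, newValue));
    }
}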
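
In real code the hardware-backed version of this is available through java.util.concurrent.atomic. The sketch below (the CasCounter class and its runThreads helper are made up for illustration) shows the usual retry loop: read the current value (A), compute the new one (B), and try compareAndSet until no other thread got in between.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    /** Increments with an explicit CAS retry loop; returns the new value. */
    public int increment() {
        while (true) {
            int current = value.get();   // A: the value we expect to find
            int next = current + 1;      // B: the value we want to write
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Another thread changed the value in between: loop and retry.
        }
    }

    public int get() {
        return value.get();
    }

    /** Runs several threads that each increment the counter; returns the final count. */
    static int runThreads(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runThreads(4, 10_000));
    }
}
```

No lock is taken anywhere; a thread that loses the race simply retries, which is the optimistic behaviour described above. AtomicInteger.incrementAndGet() offers the same result in a single call.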