HashMap 1.8 Source Code (2)

1. Why must the default initial capacity be a power of two?

2. Why is the load factor 0.75?

3. Why is the threshold for converting a list into a tree 8?

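1. Default initial capacity

The default initial capacity is 1<<4: binary 0001 shifted left by 4 bits gives 10000, which is 16. The source comment states that the initial capacity must be a power of two.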

/**
* The default initial capacity - MUST be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

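In the source of put(), the array index of a bucket in the hash table is obtained with (n-1)&hash, that is, (capacity-1)&hash decides where an entry is stored.

The usual way to pick a bucket is the remainder operation, hash%capacity. A bitwise AND is cheaper than a remainder, and when capacity is a power of two, hash%capacity and (capacity-1)&hash give the same result for any non-negative hash. The snippet below verifies this: the lines that use 16, 8 and 32 print true, while the lines that use 10 and 9 print false. This is why the capacity MUST be a power of two.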

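        System.out.println(78%16==(78&15));  // true: 16 is a power of two
        System.out.println(5%8==(5&7));      // true: 8 is a power of two
        System.out.println(33%10==(33&9));   // false: 10 is not a power of two (3 vs 1)
        System.out.println(2%32==(2&31));    // true: 32 is a power of two
        System.out.println(4%9==(4&8));      // false: 9 is not a power of two (4 vs 0)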
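The equivalence above only holds for non-negative hash values, because in Java the result of % takes the sign of the dividend. The following is a minimal sketch of that difference; the class and method names (BucketIndexDemo, indexByMask, indexByRemainder) are made up for illustration and do not appear in the HashMap source.

```java
class BucketIndexDemo {
    // Index by bit mask; only meaningful when capacity is a power of two.
    static int indexByMask(int hash, int capacity) {
        return (capacity - 1) & hash;
    }

    // Index by remainder; in Java the result takes the sign of the dividend.
    static int indexByRemainder(int hash, int capacity) {
        return hash % capacity;
    }

    public static void main(String[] args) {
        int capacity = 16;
        for (int hash : new int[] {78, -78}) {
            System.out.println("hash=" + hash
                    + " mask=" + indexByMask(hash, capacity)
                    + " remainder=" + indexByRemainder(hash, capacity)
                    + " floorMod=" + Math.floorMod(hash, capacity));
        }
    }
}
```

For a negative hash the remainder is negative and cannot be used as an array index, while the mask always lands in the range 0 to capacity-1 and agrees with Math.floorMod.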

2. The maximum capacity is 2^30.

    /**
     * The maximum capacity, used if a higher value is implicitly specified
     * by either of the constructors with arguments.
     * MUST be a power of two <= 1<<30.
     */
    static final int MAXIMUM_CAPACITY = 1 << 30;

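An int is 4 bytes (32 bits) and its highest bit is the sign bit, so 1<<31 is already negative (Integer.MIN_VALUE). Because the capacity must be a positive power of two, the largest possible value is 1 shifted left by 30 bits.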
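The limit can be checked directly. This is a small sketch; the helper isPositivePowerOfTwo and the class name are hypothetical and only serve the illustration.

```java
class MaxCapacityDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // True when n is a positive power of two (exactly one bit set).
    static boolean isPositivePowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(MAXIMUM_CAPACITY);                        // 1073741824
        System.out.println(1 << 31);                                 // -2147483648
        System.out.println(isPositivePowerOfTwo(1 << 30));           // true
        System.out.println(isPositivePowerOfTwo(1 << 31));           // false, the value is negative
        System.out.println(isPositivePowerOfTwo(Integer.MAX_VALUE)); // false, 2^31 - 1
    }
}
```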

3. The default load factor is 0.75f.

    /**
     * The load factor used when none specified in constructor.
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

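The load factor drives the resize mechanism: once the number of entries exceeds capacity * load factor (the threshold), the table is resized. With the defaults the threshold is 16 * 0.75 = 12.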
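As a rough illustration of how the threshold follows from capacity and load factor, here is a sketch with a hypothetical helper named threshold; it mirrors the arithmetic only and ignores the special cases the real class handles near the maximum capacity.

```java
class ThresholdDemo {
    // Number of entries a table of the given capacity can hold before a resize is triggered.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        float loadFactor = 0.75f;
        // Capacities grow by doubling, so the threshold doubles as well.
        for (int capacity = 16; capacity <= 128; capacity <<= 1) {
            System.out.println("capacity=" + capacity
                    + " threshold=" + threshold(capacity, loadFactor));
        }
    }
}
```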

Why 0.75?

 * <p>As a general rule, the default load factor (.75) offers a good
 * tradeoff between time and space costs.  Higher values decrease the
 * space overhead but increase the lookup cost (reflected in most of
 * the operations of the <tt>HashMap</tt> class, including
 * <tt>get</tt> and <tt>put</tt>).  The expected number of entries in
 * the map and its load factor should be taken into account when
 * setting its initial capacity, so as to minimize the number of
 * rehash operations.  If the initial capacity is greater than the
 * maximum number of entries divided by the load factor, no rehash
 * operations will ever occur.

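As a general rule, the default load factor of 0.75 offers a good tradeoff between time and space costs. A larger load factor reduces the space overhead but increases the lookup cost. In terms of the space and time costs of the data structure, 0.75 keeps the table reasonably full while still limiting hash collisions.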
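The last sentence of the quoted Javadoc suggests a practical rule: choose an initial capacity greater than the expected number of entries divided by the load factor. A minimal sketch follows, assuming a hypothetical helper capacityFor; HashMap itself rounds the requested capacity up to a power of two.

```java
import java.util.HashMap;
import java.util.Map;

class PresizeDemo {
    // Smallest int strictly greater than expectedEntries / loadFactor.
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) (expectedEntries / loadFactor) + 1;
    }

    public static void main(String[] args) {
        int expected = 1000;
        // Requesting this capacity up front should avoid rehashing while the map is filled.
        Map<Integer, String> map = new HashMap<>(capacityFor(expected, 0.75f));
        for (int i = 0; i < expected; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(capacityFor(expected, 0.75f) + " requested, " + map.size() + " entries stored");
    }
}
```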

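4. The threshold for converting a list into a tree is 8, the threshold for converting a tree back into a list is 6, and the minimum table capacity for treeification is 64.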

    /**
     * The bin count threshold for using a tree rather than list for a
     * bin.  Bins are converted to trees when adding an element to a
     * bin with at least this many nodes. The value must be greater
     * than 2 and should be at least 8 to mesh with assumptions in
     * tree removal about conversion back to plain bins upon
     * shrinkage.
     */
    static final int TREEIFY_THRESHOLD = 8;

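Why is the treeify threshold 8?

The source explains it as follows: when user hashCodes are well distributed, tree bins are rarely used. Ideally, the number of nodes in each bucket of the hash table follows a Poisson distribution.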

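Probability that a bucket holds 0 nodes: 0.60653066
Probability that a bucket holds 1 node: 0.30326533
Probability that a bucket holds 2 nodes: 0.07581633
Probability that a bucket holds 3 nodes: 0.01263606
Probability that a bucket holds 4 nodes: 0.00157951
Probability that a bucket holds 5 nodes: 0.00015795
Probability that a bucket holds 6 nodes: 0.00001316
Probability that a bucket holds 7 nodes: 0.00000094
Probability that a bucket holds 8 nodes: 0.00000006
The probability that a bucket holds 8 nodes is therefore less than one in ten million. The corresponding comment in the source reads: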
* Because TreeNodes are about twice the size of regular nodes, we
* use them only when bins contain enough nodes to warrant use
* (see TREEIFY_THRESHOLD). And when they become too small (due to
* removal or resizing) they are converted back to plain bins. In 
* usages with well-distributed user hashCodes, tree bins are
* rarely used.  Ideally, under random hashCodes, the frequency of
* nodes in bins follows a Poisson distribution.
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average for the default resizing
* threshold of 0.75, although with a large variance because of
* resizing granularity. Ignoring variance, the expected
* occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
* factorial(k)). The first values are:
*
* 0:    0.60653066
* 1:    0.30326533
* 2:    0.07581633
* 3:    0.01263606
* 4:    0.00157952
* 5:    0.00015795
* 6:    0.00001316
* 7:    0.00000094
* 8:    0.00000006
* more: less than 1 in ten million
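The table in the comment can be reproduced from the formula it quotes, exp(-0.5) * pow(0.5, k) / factorial(k). The following sketch does that; the class and method names are illustrative only.

```java
class PoissonBinSizes {
    // Poisson probability mass function: P(k) = exp(-lambda) * lambda^k / k!
    static double poisson(double lambda, int k) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        double lambda = 0.5; // average parameter quoted in the source comment
        for (int k = 0; k <= 8; k++) {
            System.out.println(String.format("%d: %.8f", k, poisson(lambda, k)));
        }
    }
}
```

The printed values may differ from the quoted table in the last digit because of rounding.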

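With well-distributed hash codes a bucket that reaches 8 nodes is extremely rare, so the conversion from list to red-black tree normally does not happen. However, hashCode is implemented by the user, and a poor implementation can distribute nodes unevenly and produce very long lists. Lookups then become slow (a list lookup is O(n)), and converting the bucket into a tree (a red-black tree lookup is O(log(n))) keeps lookup performance acceptable.

On the other hand, converting a short list into a red-black tree brings no noticeable lookup benefit and costs extra memory. When the number of nodes is small, the memory overhead of the tree outweighs its lookup advantage, so a list is the better choice. Weighing space against time, 8 was chosen as the conversion threshold, a size that occurs with a probability of less than one in ten million.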
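The degenerate case described above, a key type whose hashCode sends every instance to the same bucket, can be sketched as follows. BadKey is a hypothetical class written for this example; it implements Comparable, which is an assumption made here so that the keys can be ordered. The sketch only shows that the map stays correct under total collision; it does not measure performance or inspect the internal bin structure.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeysDemo {
    // Hypothetical key type: every instance reports the same hash code.
    static final class BadKey implements Comparable<BadKey> {
        final int id;

        BadKey(int id) {
            this.id = id;
        }

        @Override
        public int hashCode() {
            return 42; // deliberately poor: all keys collide
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int compareTo(BadKey other) {
            return Integer.compare(id, other.id);
        }
    }

    // Builds a map in which all n keys fall into a single bucket.
    static Map<BadKey, Integer> build(int n) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new BadKey(i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = build(10_000);
        System.out.println(map.size());
        System.out.println(map.get(new BadKey(9_999)));
    }
}
```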

    /**
     * The bin count threshold for untreeifying a (split) bin during a
     * resize operation. Should be less than TREEIFY_THRESHOLD, and at
     * most 6 to mesh with shrinkage detection under removal.
     */
    static final int UNTREEIFY_THRESHOLD = 6;

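Why is 6 chosen as the threshold for converting back to a list?

Leaving the value 7 as a gap between the two thresholds prevents frequent conversion between list and tree. Suppose a single threshold were used instead: more than 8 nodes converts the list into a tree, and fewer than 8 nodes converts the tree back into a list. If a HashMap keeps inserting and deleting elements so that the node count of a bucket hovers around 8, it would convert back and forth between tree and list all the time, which is very inefficient.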
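The effect of the gap can be illustrated with a deliberately simplified model. This is not how HashMap tracks bins (in the real class the conversion back to a list happens during resize splits and removals); it is only a sketch of the hysteresis idea, and all names are hypothetical.

```java
class HysteresisDemo {
    // Counts list<->tree conversions for a toy bin whose size follows the given sequence.
    // The bin becomes a tree when size > treeifyAbove and a list again when size <= untreeifyAtOrBelow.
    static int countConversions(int[] sizes, int treeifyAbove, int untreeifyAtOrBelow) {
        boolean isTree = false;
        int conversions = 0;
        for (int size : sizes) {
            if (!isTree && size > treeifyAbove) {
                isTree = true;
                conversions++;
            } else if (isTree && size <= untreeifyAtOrBelow) {
                isTree = false;
                conversions++;
            }
        }
        return conversions;
    }

    public static void main(String[] args) {
        // A bin whose size oscillates between 8 and 9 nodes.
        int[] sizes = {8, 9, 8, 9, 8, 9, 8, 9};
        System.out.println("single threshold: " + countConversions(sizes, 8, 8));
        System.out.println("thresholds 8 and 6: " + countConversions(sizes, 8, 6));
    }
}
```

With a single threshold every step after the first flips the representation; with the gap the bin converts once and then stays a tree.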

    /**
     * The smallest table capacity for which bins may be treeified.
     * (Otherwise the table is resized if too many nodes in a bin.)
     * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
     * between resizing and treeification thresholds.
     */
    static final int MIN_TREEIFY_CAPACITY = 64;

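A list may be converted into a red-black tree only when the capacity of the hash table is at least 64; otherwise, when a bucket holds too many nodes, the table is resized instead. To avoid conflicts between the resize and treeify decisions, this value should not be smaller than 4 * TREEIFY_THRESHOLD.

The list-to-tree conversion method (treeifyBin) checks whether the table capacity is below the minimum treeify capacity and, if so, resizes instead. When the capacity is below 64, hash collisions are relatively likely and long lists are somewhat more probable. Long lists that arise for this reason are better handled by resizing, which avoids unnecessary treeification.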
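The decision described above can be summarized in a small sketch. It models only the choice between doing nothing, resizing, and treeifying when a node is appended to a bin; the enum Action and the method onAppend are hypothetical names, and the two constants are copied from the source excerpts above.

```java
class TreeifyDecisionDemo {
    static final int TREEIFY_THRESHOLD = 8;
    static final int MIN_TREEIFY_CAPACITY = 64;

    enum Action { NONE, RESIZE, TREEIFY }

    // binSizeBeforeInsert: number of nodes already in the bin when a new node is appended.
    static Action onAppend(int tableLength, int binSizeBeforeInsert) {
        if (binSizeBeforeInsert < TREEIFY_THRESHOLD) {
            return Action.NONE; // the list is still short
        }
        // Long list: small tables are resized, larger ones get a tree bin.
        return tableLength < MIN_TREEIFY_CAPACITY ? Action.RESIZE : Action.TREEIFY;
    }

    public static void main(String[] args) {
        System.out.println(onAppend(16, 7));  // NONE
        System.out.println(onAppend(16, 8));  // RESIZE
        System.out.println(onAppend(64, 8));  // TREEIFY
    }
}
```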

 
