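Since Java 8, HashMap has included a red-black tree structure: its underlying layout changed from an array plus linked lists to an array plus linked lists or red-black trees. When an element is added to a HashMap bucket, the bucket's linked list is converted to a red-black tree once it holds more than 8 nodes, provided the table (array) length is at least 64. With a smaller table, HashMap resizes instead of converting.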
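The conversion rule can be written down as a small decision function. This is only an illustrative sketch, not the JDK code: the class BinPolicy, the enum Action, and both method names are hypothetical. The three constant values are the ones declared in java.util.HashMap, where UNTREEIFY_THRESHOLD (6) governs turning a tree bin back into a list when a resize splits it.

```java
// Hypothetical illustration of the thresholds; this is not the JDK source.
final class BinPolicy {
    // Same values as the constants declared in java.util.HashMap (JDK 8+).
    static final int TREEIFY_THRESHOLD = 8;
    static final int UNTREEIFY_THRESHOLD = 6;
    static final int MIN_TREEIFY_CAPACITY = 64;

    enum Action { KEEP_LIST, RESIZE_TABLE, TREEIFY }

    // Decision taken after a node has been appended to a list bin.
    static Action afterInsert(int binSizeAfterInsert, int tableLength) {
        if (binSizeAfterInsert <= TREEIFY_THRESHOLD) {
            return Action.KEEP_LIST;
        }
        // A long list in a small table is handled by growing the table first.
        return tableLength < MIN_TREEIFY_CAPACITY ? Action.RESIZE_TABLE : Action.TREEIFY;
    }

    // When a resize splits a tree bin, a half this small goes back to a list.
    static boolean untreeifyAfterSplit(int halfSize) {
        return halfSize <= UNTREEIFY_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(afterInsert(8, 64));     // KEEP_LIST
        System.out.println(afterInsert(9, 16));     // RESIZE_TABLE
        System.out.println(afterInsert(9, 64));     // TREEIFY
        System.out.println(untreeifyAfterSplit(6)); // true
        System.out.println(untreeifyAfterSplit(7)); // false
    }
}
```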
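A TreeNode in a red-black tree takes about twice the space of a Node in a linked list. Lookup in a red-black tree is O(log N), which beats the O(N) of a linked list, but when the list is short even a full traversal costs little. What is needed is a balance between time and space: convert to a red-black tree only after the list length reaches a threshold. So why is HashMap's treeify threshold 8?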
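The first reason is the Poisson distribution of hashCode collisions. With a load factor of 0.75 (the HashMap default), the probability that a single hash bucket holds 8 elements is less than one in ten million. A list is converted to a red-black tree when it grows beyond 8 nodes, and a tree is converted back to a list when it has 6 or fewer nodes. The original author chose 8 as the list-length threshold on the basis of these probabilities.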
The comment in the source code:
Because TreeNodes are about twice the size of regular nodes, we use them only when bins contain enough nodes to warrant use (see TREEIFY_THRESHOLD). And when they become too small (due to removal or resizing) they are converted back to plain bins. In usages with well-distributed user hashCodes, tree bins are rarely used. Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
more: less than 1 in ten million
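The values above can be recomputed directly from the formula in the comment. The following is a minimal sketch; the class name PoissonBins and the method probability are hypothetical, and the parameter 0.5 is the average taken from the comment. The printed values should agree with the table up to rounding in the last digit.

```java
import java.util.Locale;

// Hypothetical helper that evaluates the Poisson formula from the comment.
final class PoissonBins {
    // Probability that a bin holds exactly k nodes, for a Poisson mean of lambda.
    static double probability(double lambda, int k) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        // One line per bin size, in the same "k: value" layout as the table.
        for (int k = 0; k <= 8; k++) {
            System.out.println(String.format(Locale.ROOT, "%d: %.8f", k, probability(0.5, k)));
        }
    }
}
```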
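The threshold is 8 because, by the estimate of the Java source contributors, the probability of 8 entries colliding in a single bucket has already dropped to 0.00000006, which is practically an impossible event. If 8 collisions really do occur, the cause is most likely the elements themselves or the hash function (for example, a flawed user-written hashCode), so collisions in this bucket are very likely to keep happening. That is the point at which the linked list should be converted to a red-black tree.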
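How rare long bins are under a well-distributed hash can also be checked empirically. The sketch below is a simplified model, not HashMap itself: it throws keys into buckets uniformly at random with java.util.Random and counts how many buckets end up with each size. The class name BinSizeSimulation, the method histogram, the seed, and the load of 0.5 keys per bucket (chosen to match the average parameter in the comment) are all assumptions made for illustration.

```java
import java.util.Random;

// Hypothetical simulation of bin sizes under an ideally random hash.
final class BinSizeSimulation {
    // Returns hist where hist[k] is the number of buckets holding exactly k keys.
    static int[] histogram(int buckets, int keys, long seed) {
        Random random = new Random(seed);
        int[] sizes = new int[buckets];
        for (int i = 0; i < keys; i++) {
            sizes[random.nextInt(buckets)]++; // each key lands in a uniformly random bucket
        }
        int max = 0;
        for (int size : sizes) {
            max = Math.max(max, size);
        }
        int[] hist = new int[max + 1];
        for (int size : sizes) {
            hist[size]++;
        }
        return hist;
    }

    public static void main(String[] args) {
        int buckets = 1 << 16;
        int keys = 1 << 15; // about 0.5 keys per bucket on average
        int[] hist = histogram(buckets, keys, 42L);
        // Print the observed fraction of buckets for each bin size.
        for (int k = 0; k < hist.length; k++) {
            System.out.println(k + ": " + (double) hist[k] / buckets);
        }
    }
}
```

In a run like this the observed fractions should fall off roughly as the Poisson values do. A key class whose hashCode sends many keys to the same bucket breaks that assumption, and that is the case the tree bins are meant to handle.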