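Why convert at all?
Each bucket in a HashMap initially stores its elements in a linked list, where a lookup costs O(n); a tree structure brings that down to O(log(n)). While a list is short, even a full traversal is very fast, but as the list keeps growing, lookup performance degrades, which is why the list is converted to a tree.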
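Why is the threshold 8?
The TreeNodes used after conversion take about twice the space of regular Nodes, so a bin is converted to TreeNodes only when it contains enough nodes to justify the cost. What counts as "enough" is determined by the value of TREEIFY_THRESHOLD.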
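The rule above can be sketched as a small decision function. This is a simplified illustration, not the JDK's code: the class `TreeifyDecision`, the `Action` enum, and the method `decide` are hypothetical names, and the exact point at which HashMap counts nodes during insertion is not modeled. The constant `MIN_TREEIFY_CAPACITY` (64 in the JDK 8 source) is not discussed in this article; it is included because JDK 8 resizes a small table instead of treeifying a bin.

```java
final class TreeifyDecision {
    // Values of the same-named constants in java.util.HashMap (JDK 8).
    static final int TREEIFY_THRESHOLD = 8;
    static final int MIN_TREEIFY_CAPACITY = 64;

    // Hypothetical names for the three possible outcomes.
    enum Action { KEEP_LIST, RESIZE_TABLE, TREEIFY }

    /**
     * Simplified decision: what to do with a bin of the given size
     * in a table of the given length.
     */
    static Action decide(int binSize, int tableLength) {
        if (binSize < TREEIFY_THRESHOLD) {
            return Action.KEEP_LIST;      // short list: traversal is cheap
        }
        if (tableLength < MIN_TREEIFY_CAPACITY) {
            return Action.RESIZE_TABLE;   // small table: spread entries out instead
        }
        return Action.TREEIFY;            // long list in a large table
    }

    public static void main(String[] args) {
        System.out.println(decide(3, 16));
        System.out.println(decide(8, 16));
        System.out.println(decide(8, 64));
    }
}
```

The middle branch reflects the same space-versus-time reasoning: when the table is still small, resizing is likely to shorten the list without paying for TreeNodes.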
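When hashCodes are well distributed, tree bins are very rarely used (a bin is a bucket: the slot in a HashMap that holds the entries whose hashes map to the same table index). Because the data is spread evenly across the bins, hardly any bin's list ever reaches the threshold. In fact, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution).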
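With the default load factor (resizing threshold) of 0.75, the distribution has a parameter of about 0.5 on average, although with a large variance because of resizing granularity. Ignoring the variance, the expected frequency of a list of size k is
$$P(k) = \frac{e^{-0.5} \cdot 0.5^{k}}{k!}$$
which gives the following probabilities: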
Length | Probability |
---|---|
0 | 0.60653066 |
1 | 0.30326533 |
2 | 0.07581633 |
3 | 0.01263606 |
4 | 0.00157952 |
5 | 0.00015795 |
6 | 0.00001316 |
7 | 0.00000094 |
8 | 0.00000006 |
As shown above, the probability that the list in a bin reaches 8 elements is 0.00000006, which makes it a practically impossible event.
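The table can be reproduced with a few lines of Java. This is a minimal sketch that evaluates the expression from the JDK comment, exp(-0.5) * pow(0.5, k) / factorial(k); the class name `BinSizeDistribution` and the method `probability` are made up for illustration.

```java
final class BinSizeDistribution {
    // Average number of nodes per bin assumed in the JDK comment.
    static final double LAMBDA = 0.5;

    /** Poisson probability that a bin holds exactly k nodes. */
    static double probability(int k) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.exp(-LAMBDA) * Math.pow(LAMBDA, k) / factorial;
    }

    public static void main(String[] args) {
        // Print one row per list length, in the same layout as the table.
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d | %.8f%n", k, probability(k));
        }
    }
}
```

Computed in double precision, the entry for k = 4 rounds to 0.00157951, one unit in the last digit away from the 0.00157952 listed in the table; the other rows match.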
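In the vast majority of cases, linked-list storage saves space while still giving good lookup performance. In the rare case where a bin reaches 8 nodes, it is converted to a red-black tree for better lookup performance, and because such cases are rare, they do not cost much extra space.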
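How rare long bins are can also be checked empirically. The sketch below is an illustration under stated assumptions rather than a model of HashMap itself: it drops uniformly random keys into bins with `java.util.Random` instead of using real hashCodes, fills the table to an average of 0.5 keys per bin, and uses a hypothetical class name `BinSimulation` and an arbitrary seed.

```java
import java.util.Random;

final class BinSimulation {
    /** Returns histogram[k] = number of bins that ended up holding exactly k keys. */
    static int[] histogram(int bins, int keys, long seed) {
        Random random = new Random(seed);
        int[] sizes = new int[bins];
        for (int i = 0; i < keys; i++) {
            sizes[random.nextInt(bins)]++;   // uniformly random bin for each key
        }
        int max = 0;
        for (int size : sizes) {
            max = Math.max(max, size);
        }
        int[] histogram = new int[max + 1];
        for (int size : sizes) {
            histogram[size]++;
        }
        return histogram;
    }

    public static void main(String[] args) {
        int bins = 1 << 16;
        int[] h = histogram(bins, bins / 2, 42L);   // 0.5 keys per bin on average
        for (int k = 0; k < h.length; k++) {
            System.out.printf("%d | %.8f%n", k, (double) h[k] / bins);
        }
    }
}
```

With this many bins, the observed fractions for small k should land close to the Poisson values in the table, and bins anywhere near 8 entries should be absent or extremely scarce.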
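So the threshold of 8 is a trade-off between time and space, chosen on the basis of probability and statistics. It is hard not to admire how rigorous and well-reasoned the changes and optimizations in Java are after some 30 years of development.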
Appendix: the relevant comments in the JDK (1.8.0_45)
HashMap class, lines 174-197:
* Because TreeNodes are about twice the size of regular nodes, we
* use them only when bins contain enough nodes to warrant use
* (see TREEIFY_THRESHOLD). And when they become too small (due to
* removal or resizing) they are converted back to plain bins. In
* usages with well-distributed user hashCodes, tree bins are
* rarely used. Ideally, under random hashCodes, the frequency of
* nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average for the default resizing
* threshold of 0.75, although with a large variance because of
* resizing granularity. Ignoring variance, the expected
* occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
* factorial(k)). The first values are:
*
* 0: 0.60653066
* 1: 0.30326533
* 2: 0.07581633
* 3: 0.01263606
* 4: 0.00157952
* 5: 0.00015795
* 6: 0.00001316
* 7: 0.00000094
* 8: 0.00000006
* more: less than 1 in ten million
ConcurrentHashMap has a very similar passage at lines 327-349:
* The main disadvantage of per-bin locks is that other update
* operations on other nodes in a bin list protected by the same
* lock can stall, for example when user equals() or mapping
* functions take a long time. However, statistically, under
* random hash codes, this is not a common problem. Ideally, the
* frequency of nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average, given the resizing threshold
* of 0.75, although with a large variance because of resizing
* granularity. Ignoring variance, the expected occurrences of
* list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The
* first values are:
*
* 0: 0.60653066
* 1: 0.30326533
* 2: 0.07581633
* 3: 0.01263606
* 4: 0.00157952
* 5: 0.00015795
* 6: 0.00001316
* 7: 0.00000094
* 8: 0.00000006
* more: less than 1 in ten million