Reading the HashMap Source
JDK 1.8 HashMap
Class Javadoc
/**
* Hash table based implementation of the <tt>Map</tt> interface. This
* implementation provides all of the optional map operations, and permits
* <tt>null</tt> values and the <tt>null</tt> key. (The <tt>HashMap</tt>
* class is roughly equivalent to <tt>Hashtable</tt>, except that it is
* unsynchronized and permits nulls.) This class makes no guarantees as to
* the order of the map; in particular, it does not guarantee that the order
* will remain constant over time.
> HashMap is a hash-table-based implementation of Map; it permits null keys and null values
> HashMap is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls
> HashMap makes no guarantee about iteration order
*
* <p>This implementation provides constant-time performance for the basic
* operations (<tt>get</tt> and <tt>put</tt>), assuming the hash function
* disperses the elements properly among the buckets. Iteration over
* collection views requires time proportional to the "capacity" of the
* <tt>HashMap</tt> instance (the number of buckets) plus its size (the number
* of key-value mappings). Thus, it's very important not to set the initial
* capacity too high (or the load factor too low) if iteration performance is
* important.
*
> Provides the basic operations get and put in constant time, assuming a good hash function
> The hash function decides where data lands, dispersing elements among the buckets
> Buckets: the underlying structure is an array plus linked lists; the array slots hold the elements
> Iterating over the collection views takes time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings)
> So do not set the initial capacity too high, or the load factor too low, if iteration performance matters
* <p>An instance of <tt>HashMap</tt> has two parameters that affect its
* performance: <i>initial capacity</i> and <i>load factor</i>. The
* <i>capacity</i> is the number of buckets in the hash table, and the initial
* capacity is simply the capacity at the time the hash table is created. The
* <i>load factor</i> is a measure of how full the hash table is allowed to
* get before its capacity is automatically increased. When the number of
* entries in the hash table exceeds the product of the load factor and the
* current capacity, the hash table is <i>rehashed</i> (that is, internal data
* structures are rebuilt) so that the hash table has approximately twice the
* number of buckets.
*
> A HashMap instance has two parameters that affect its performance: initial capacity and load factor
> Initial capacity: the capacity is the number of buckets in the hash table, and the initial capacity is the one specified when the instance is created, e.g. HashMap<String, Object> map = new HashMap<>(16);
> Load factor: a measure of how full the hash table may get before its capacity is automatically increased
*
* <p>As a general rule, the default load factor (.75) offers a good
* tradeoff between time and space costs. Higher values decrease the
* space overhead but increase the lookup cost (reflected in most of
* the operations of the <tt>HashMap</tt> class, including
* <tt>get</tt> and <tt>put</tt>). The expected number of entries in
* the map and its load factor should be taken into account when
* setting its initial capacity, so as to minimize the number of
* rehash operations. If the initial capacity is greater than the
* maximum number of entries divided by the load factor, no rehash
* operations will ever occur.
*
> The default load factor (0.75) offers a good trade-off between time and space costs <discussed in detail at the end of this article>
> Setting the load factor too high increases lookup cost
> When setting the initial capacity, take the expected number of entries and the load factor into account, so as to minimize the number of rehash operations
> If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash will ever occur
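The no-rehash rule above can be sketched in Java. `capacityFor` is a hypothetical helper, not part of the JDK; note also that HashMap rounds the requested capacity up to the next power of two internally.

```java
public class CapacitySizing {
    // Hypothetical helper: the smallest initial capacity that can hold
    // `expected` entries without a resize, per the Javadoc rule
    // "initial capacity > max entries / load factor => no rehash".
    static int capacityFor(int expected, float loadFactor) {
        return (int) Math.ceil(expected / loadFactor);
    }

    public static void main(String[] args) {
        // To store 100 entries with the default 0.75 load factor,
        // request at least ceil(100 / 0.75) = 134 buckets up front.
        System.out.println(capacityFor(100, 0.75f)); // prints 134
    }
}
```

HashMap will round a requested 134 up to a 256-slot table, whose resize threshold (256 * 0.75 = 192) comfortably exceeds 100 entries, so no rehash occurs while filling it.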
* <p>If many mappings are to be stored in a <tt>HashMap</tt>
* instance, creating it with a sufficiently large capacity will allow
* the mappings to be stored more efficiently than letting it perform
* automatic rehashing as needed to grow the table. Note that using
* many keys with the same {@code hashCode()} is a sure way to slow
* down performance of any hash table. To ameliorate impact, when keys
* are {@link Comparable}, this class may use comparison order among
* keys to help break ties.
*
* <p><strong>Note that this implementation is not synchronized.</strong>
* If multiple threads access a hash map concurrently, and at least one of
* the threads modifies the map structurally, it <i>must</i> be
* synchronized externally. (A structural modification is any operation
* that adds or deletes one or more mappings; merely changing the value
* associated with a key that an instance already contains is not a
* structural modification.) This is typically accomplished by
* synchronizing on some object that naturally encapsulates the map.
*
> HashMap is not thread-safe: if multiple threads access the same map and at least one modifies it structurally, access must be synchronized externally (typically by synchronizing on some object that encapsulates the map)
* If no such object exists, the map should be "wrapped" using the
* {@link Collections#synchronizedMap Collections.synchronizedMap}
* method. This is best done at creation time, to prevent accidental
* unsynchronized access to the map:<pre>
* Map m = Collections.synchronizedMap(new HashMap(...));</pre>
> Map m = Collections.synchronizedMap(new HashMap(...));
> This wraps the new HashMap so that access to it is synchronized, keeping the data consistent across threads
* <p>The iterators returned by all of this class's "collection view methods"
* are <i>fail-fast</i>: if the map is structurally modified at any time after
* the iterator is created, in any way except through the iterator's own
* <tt>remove</tt> method, the iterator will throw a
* {@link ConcurrentModificationException}. Thus, in the face of concurrent
* modification, the iterator fails quickly and cleanly, rather than risking
* arbitrary, non-deterministic behavior at an undetermined time in the
* future.
> Unless the iterator's own remove method is used, any structural modification of the map during iteration makes the iterator throw ConcurrentModificationException
*
* <p>Note that the fail-fast behavior of an iterator cannot be guaranteed
* as it is, generally speaking, impossible to make any hard guarantees in the
* presence of unsynchronized concurrent modification. Fail-fast iterators
* throw <tt>ConcurrentModificationException</tt> on a best-effort basis.
* Therefore, it would be wrong to write a program that depended on this
* exception for its correctness: <i>the fail-fast behavior of iterators
* should be used only to detect bugs.</i>
*/
Source code
// In JDK 1.8 the underlying structure of HashMap is array + linked list + red-black tree
public class HashMap<K,V> extends AbstractMap<K,V>
implements Map<K,V>, Cloneable, Serializable {
private static final long serialVersionUID = 362498820763181265L;
/*
* An excerpt from the implementation notes:
* Ideally, under random hashCodes, the frequency of
* nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average for the default resizing
* threshold of 0.75, although with a large variance because of
* resizing granularity. Ignoring variance, the expected
* occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
* factorial(k)). The first values are:
*
* 0: 0.60653066
* 1: 0.30326533
* 2: 0.07581633
* 3: 0.01263606
* 4: 0.00157952
* 5: 0.00015795
* 6: 0.00001316
* 7: 0.00000094
* 8: 0.00000006
* more: less than 1 in ten million
*
> With 0.75 as the load factor, a chain longer than 8 at any collision slot is all but impossible
*/
/**
* The default initial capacity - MUST be a power of two.
* Default initial capacity of the map
*/
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
/**
* The maximum capacity, used if a higher value is implicitly specified
* by either of the constructors with arguments.
* MUST be a power of two <= 1<<30.
* Maximum capacity (2^30), used if a higher value is implicitly specified by either of the constructors with arguments
*/
static final int MAXIMUM_CAPACITY = 1 << 30;
/**
* The load factor used when none specified in constructor.
* Default load factor: 0.75, a good balance between time and space cost
*/
static final float DEFAULT_LOAD_FACTOR = 0.75f;
/**
* The bin count threshold for using a tree rather than list for a
* bin. Bins are converted to trees when adding an element to a
* bin with at least this many nodes. The value must be greater
* than 2 and should be at least 8 to mesh with assumptions in
* tree removal about conversion back to plain bins upon
* shrinkage.
* Threshold at which a bin's linked list is converted to a red-black tree
*/
static final int TREEIFY_THRESHOLD = 8;
/**
* The bin count threshold for untreeifying a (split) bin during a
* resize operation. Should be less than TREEIFY_THRESHOLD, and at
* most 6 to mesh with shrinkage detection under removal.
* When the table is resized, a tree bin that has shrunk to 6 or fewer nodes is converted back into a linked list
*/
static final int UNTREEIFY_THRESHOLD = 6;
/**
* The smallest table capacity for which bins may be treeified.
* (Otherwise the table is resized if too many nodes in a bin.)
* Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
* between resizing and treeification thresholds.
* Treeification only happens once the table capacity reaches 64; below that, a crowded bin triggers a resize instead.
* This avoids needless conversions early in the table's life, when several keys may happen to collide in the same bin.
*/
static final int MIN_TREEIFY_CAPACITY = 64;
}
put: the putVal() flow
final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict){……}
get: the getNode() flow
final Node<K,V> getNode(int hash, Object key){……}
resize: the resize() flow
1. When copying a chain to the new table during resize, why does (node hash AND old capacity) == 0 tell us whether the node's index changes in the new table, and by how much?
The source uses (e.hash & oldCap) == 0 to decide whether e's index changes in the new table. Assume some constants:
old table capacity oldCap: 2^4 = 16; hash of e's key (low 8 bits shown, high 24 bits all 0): 11010011
hash & (oldCap - 1) gives e's index in the old table: 3
Computation:   11010011 (hash)
             & 00001111 (oldCap - 1 in binary)
             ——————————————
               00000011 = 3 (decimal)
hash & oldCap is non-zero, so e's index changes in the new table
Computation:   11010011 (hash)
             & 00010000 (oldCap in binary)
             ——————————————
               00010000 = 16 (decimal, non-zero)
hash & (newCap - 1) gives e's index in the new table: 19
Computation: newCap = oldCap << 1 = 32, so newCap - 1 = 31 (binary 11111)
               11010011 (hash)
             & 00011111 (newCap - 1 in binary)
             ——————————————
               00010011 = 19 (decimal)
Now keep everything else fixed but let e's hash be 11000011; the same computation gives hash & oldCap = 0 and hash & (newCap - 1) = 3, i.e. the index does not move.
From these examples, (e.hash & oldCap) == 0 tells us whether e's index changes in the new table, and when it does change, the new index is (old index + old capacity).
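The worked example can be checked directly in Java; the two hash values below are the ones assumed above:

```java
public class ResizeIndexDemo {
    public static void main(String[] args) {
        int oldCap = 16;          // 2^4, old table length
        int newCap = oldCap << 1; // 32 after doubling

        int h1 = 0b11010011;      // bit 4 set: index moves
        int h2 = 0b11000011;      // bit 4 clear: index stays

        System.out.println(h1 & (oldCap - 1)); // 3,  old index
        System.out.println(h1 & oldCap);       // 16, non-zero: index moves
        System.out.println(h1 & (newCap - 1)); // 19 = 3 + oldCap, new index

        System.out.println(h2 & (oldCap - 1)); // 3,  old index
        System.out.println(h2 & oldCap);       // 0:  index stays
        System.out.println(h2 & (newCap - 1)); // 3,  new index unchanged
    }
}
```

In other words, only one extra bit of the hash (the bit at the old capacity's position) participates in the new index, which is why a node either stays put or moves by exactly oldCap.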
Interview questions
Why is HashMap's load factor 0.75?
- The load factor (also called fill factor) measures how full the hash table may get; the larger it is, the more entries the table holds before resizing
- HashMap resolves collisions by chaining, so lookup cost averages O(1 + n) where n is the chain length. A higher load factor uses space more fully at the cost of slower lookups; a lower one leaves the table sparse and wastes space
- 0.75 is HashMap's compromise between time and space
Why is a chain converted to a red-black tree once its length reaches 8?
From the source comment: with the default load factor of 0.75, node counts in bins follow a Poisson distribution with parameter about 0.5
Poisson formula: (exp(-0.5) * pow(0.5, k) / factorial(k))
In Java: Math.exp(-0.5) * Math.pow(0.5, k) / IntMath.factorial(k) (IntMath is Guava's; the JDK has no factorial)
With a well-dispersing hash function, the more nodes a bin already holds, the less likely the next node lands there:
- probability of 1 node in a bin: 0.3032653299
- probability of 2 nodes in a bin: 0.0758163325
- probability of 3 nodes in a bin: 0.0126360554
- probability of 4 nodes in a bin: 0.0015795069
- probability of 5 nodes in a bin: 0.0001579507
- probability of 6 nodes in a bin: 0.0000131626
- probability of 7 nodes in a bin: 0.0000009402
- probability of 8 nodes in a bin: 0.0000000588 // about 6 in 100 million
- probability of 9 nodes in a bin: 0.0000000033
The data show the probability of 8 nodes in one bin is about 6 in 100 million, under one in ten million. In other words, with 0.75 as the load factor, a chain longer than 8 at any collision slot is practically impossible, so under these conditions treeification essentially never happens.
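The table above can be reproduced with a few lines of Java; `p` is a small helper of my own implementing the Poisson formula from the source comment:

```java
public class PoissonBins {
    // P(k nodes in a bin) under Poisson(0.5): exp(-0.5) * 0.5^k / k!
    static double p(int k) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) factorial *= i;
        return Math.exp(-0.5) * Math.pow(0.5, k) / factorial;
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, p(k));
        }
        // k = 8 prints 0.00000006, matching the source comment's table
    }
}
```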
Poorly dispersing hash functions: converting a chain to a red-black tree once it exceeds 8 mainly guards against user-supplied hash functions that disperse badly, where chains grow long and lookups become slow
One might object that this still doesn't explain why 8 specifically
Linked list lookup: O(n); red-black tree lookup: O(log n)
Consider: with a good hash function, the Poisson numbers say an 8th node in a bin is already vanishingly unlikely, and with even dispersion and a modestly sized map you would never need to convert a list to a tree for lookup speed, since below length 8 the two structures cost about the same
- good hash function: a single bin almost never holds more than 8 entries
- bad hash function: once a bin exceeds 8 entries it is treeified (at that point the tree is a fallback that keeps lookups efficient in the extreme case)
Why does a tree revert to a linked list at length 6?
The gap between 8 and 6 provides hysteresis, avoiding frequent flipping between red-black tree and linked list
What is modCount for? Why was it designed?
/**
* The number of times this HashMap has been structurally modified
* Structural modifications are those that change the number of mappings in
* the HashMap or otherwise modify its internal structure (e.g.,
* rehash). This field is used to make iterators on Collection-views of
* the HashMap fail-fast. (See ConcurrentModificationException).
*/
transient int modCount;
From the comment we learn:
- it counts structural modifications
- structural modifications are those that change the number of mappings or otherwise modify the internal structure (e.g. HashMap's put of a new key, remove, rehash; ArrayList's add, remove, ...)
- it backs the fail-fast behavior of the collection-view iterators
Purpose: to detect that the underlying collection was changed mid-iteration for some reason, which could otherwise lead to unpredictable results, by throwing a concurrent modification exception. Before each next(), the iterator compares its own expectedModCount with the collection's modCount; if they differ it throws ConcurrentModificationException.
public static void main(String[] args) {
    HashMap<String, Integer> map = new HashMap<>(8);
    map.put("1", 1);
    map.put("2", 2);
    map.put("3", 3);
    // iterator  is created while modCount == 3, so its expectedModCount is 3
    // iterator1 is created while modCount == 4, so its expectedModCount is 4
    // iterator2 is created while modCount == 5, so its expectedModCount is 5
    for (Iterator<Map.Entry<String, Integer>> iterator = map.entrySet().iterator(); iterator.hasNext();) {
        Map.Entry<String, Integer> next = iterator.next();
        if ("1".equals(next.getKey())) {
            map.remove("1");  // modCount -> 4
            Iterator<Map.Entry<String, Integer>> iterator1 = map.entrySet().iterator();
            map.put("4", 4);  // modCount -> 5
            Iterator<Map.Entry<String, Integer>> iterator2 = map.entrySet().iterator();
        }
    }
}
This throws ConcurrentModificationException: the original iterator's expectedModCount is 3 while the map's modCount has become 5, so they are not equal.
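For completeness, a sketch of the safe variant: removing through the iterator itself updates expectedModCount together with modCount, so no exception is thrown.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class SafeRemoval {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>(8);
        map.put("1", 1);
        map.put("2", 2);
        map.put("3", 3);

        // Iterator.remove() keeps expectedModCount in sync with modCount,
        // so iteration continues without ConcurrentModificationException.
        for (Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator(); it.hasNext();) {
            if ("1".equals(it.next().getKey())) {
                it.remove();
            }
        }
        System.out.println(map); // "1" is gone; "2" and "3" remain
    }
}
```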