Reading the Alibaba Java Manual: Why HashMap Should Be Given an Initial Size, and What Size to Choose

csdn_life18

Last modified 2024-05-30 22:46:09

Category: Interview Knowledge. Tags: java, hash algorithm, hash table

First published 2024-05-30 11:02:31

Original link: https://blog.csdn.net/ren365880/article/details/108083998

Introduction to HashMap

Before we begin, here is how the official documentation describes HashMap:

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

Two points in this passage matter for what follows: the capacity is the number of buckets in the hash table, and the load factor (0.75 by default) determines how full the table may get before it grows.

In summary: a HashMap resizes automatically when its number of entries (size) exceeds a threshold, where threshold = loadFactor * capacity. Each resize doubles the capacity (capacity *= 2).
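The resize rule can be traced with a few lines of arithmetic. The sketch below is a simplified model, not HashMap's own code: the class name `ResizeTrace` and the method `countResizes` are hypothetical names introduced here, and the model only assumes the rule stated above (resize when size exceeds loadFactor * capacity, then double the capacity).

```java
class ResizeTrace {

    // Simplified model: count how many times the table doubles while
    // `entries` distinct keys are inserted, starting from `initialCapacity`
    // (assumed to already be a power of 2).
    static int countResizes(int entries, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        int resizes = 0;
        // A resize is triggered once size exceeds loadFactor * capacity.
        while (entries > (int) (capacity * loadFactor)) {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        int capacity = 16;         // HashMap's default initial capacity
        float loadFactor = 0.75f;  // HashMap's default load factor

        // Print the threshold for the first few capacities.
        for (int step = 0; step < 5; step++) {
            int threshold = (int) (capacity * loadFactor);
            System.out.println("capacity=" + capacity + ", threshold=" + threshold);
            capacity *= 2;
        }

        System.out.println("resizes for 100 entries: " + countResizes(100, 16, loadFactor));
    }
}
```

Under this model, a map that starts at the default capacity of 16 doubles four times (to 32, 64, 128 and 256) before it can hold 100 entries.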

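Showing the benefit of an initial capacity in code

As the passage above shows, the initial capacity is one factor that affects performance. The following code makes this concrete: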

import java.util.HashMap;
import java.util.Map;

public class Test {

	public static void main(String[] args) {
		int num = 100000;
		// No initial capacity given
		Map<Integer, Integer> map1 = new HashMap<Integer, Integer>();
		long s1 = System.currentTimeMillis();
		for (int i = 0; i < num; i++) {
			map1.put(i, i);
		}
		long e1 = System.currentTimeMillis();
		System.out.println("No initial capacity: " + (e1 - s1));

		// Initial capacity set to half the number of entries
		Map<Integer, Integer> map2 = new HashMap<Integer, Integer>(num/2);
		long s2 = System.currentTimeMillis();
		for (int i = 0; i < num; i++) {
			map2.put(i, i);
		}
		long e2 = System.currentTimeMillis();
		System.out.println("Half the size: " + (e2 - s2));

		// Initial capacity set to the number of entries
		Map<Integer, Integer> map3 = new HashMap<Integer, Integer>(num);
		long s3 = System.currentTimeMillis();
		for (int i = 0; i < num; i++) {
			map3.put(i, i);
		}
		long e3 = System.currentTimeMillis();
		System.out.println("Same size: " + (e3 - s3));
	}

}

The result (elapsed time in milliseconds):

No initial capacity: 16
Half the size: 12
Same size: 8

As the numbers show, giving HashMap a sensible initial size can improve performance. But is that enough? Why is the load factor also said to affect performance?

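The load factor in HashMap

When we set the capacity through HashMap(int initialCapacity), HashMap does not use the initialCapacity we pass in directly as its initial capacity. The JDK computes a suitable value instead: the smallest power of 2 that is greater than or equal to the value passed in. For example, if we pass 7, the smallest power of 2 that is not below 7 is 2^3 = 8, so the initial capacity is 8.

If we know the map will hold 7 entries, is a capacity of 8 good enough? The value looks reasonable, but it is not. When HashMap derives the capacity from the value the user passes in, it does not take loadFactor into account; it mechanically computes the first power of 2 that is greater than or equal to that number.

loadFactor is the load factor: when the number of entries in the HashMap (size) exceeds threshold = loadFactor * capacity, the map resizes.

In other words, if we pass 7, the JDK sets the capacity to 8, but the threshold is then 8 * 0.75 = 6, so the map resizes when the 7th entry is inserted. That is clearly not what we want.

So what value should we set?

It can be computed as follows: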

return (int) ((float) expectedSize / 0.75F + 1.0F);
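To make the formula concrete, here is a worked example for 7 expected entries. It is a sketch under stated assumptions: `SizingFor7`, `roundUpToPowerOfTwo`, `suggestedCapacity` and `fitsWithoutResize` are hypothetical names introduced for illustration, and the rounding loop only models the power-of-2 rounding described above for small inputs; it is not the JDK's implementation.

```java
class SizingFor7 {

    // Models the rounding described above: smallest power of 2 >= requested.
    // Intended for small positive inputs only (no overflow handling).
    static int roundUpToPowerOfTwo(int requested) {
        int capacity = 1;
        while (capacity < requested) {
            capacity <<= 1;
        }
        return capacity;
    }

    // The formula from the text, with the default load factor of 0.75.
    static int suggestedCapacity(int expectedSize) {
        return (int) ((float) expectedSize / 0.75F + 1.0F);
    }

    // True if `expectedSize` entries stay within the threshold of the
    // table that a constructor argument of `requested` would produce.
    static boolean fitsWithoutResize(int requested, int expectedSize) {
        int capacity = roundUpToPowerOfTwo(requested);
        int threshold = (int) (capacity * 0.75f);
        return expectedSize <= threshold;
    }

    public static void main(String[] args) {
        int expected = 7;
        int naive = expected;                        // pass 7 directly
        int computed = suggestedCapacity(expected);  // 7 / 0.75 + 1, truncated

        System.out.println("naive request " + naive + " -> capacity "
                + roundUpToPowerOfTwo(naive) + ", fits: " + fitsWithoutResize(naive, expected));
        System.out.println("computed request " + computed + " -> capacity "
                + roundUpToPowerOfTwo(computed) + ", fits: " + fitsWithoutResize(computed, expected));
    }
}
```

For 7 entries the formula yields 10, which rounds up to a capacity of 16 with a threshold of 12, so all 7 entries fit. Passing 7 directly gives a capacity of 8 with a threshold of 6, which the 7th entry exceeds.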

Verifying the chosen size in code

import java.util.HashMap;
import java.util.Map;

public class Test {

	public static void main(String[] args) {
		int num = 100000;
		// No initial capacity given
		Map<Integer, Integer> map1 = new HashMap<Integer, Integer>();
		long s1 = System.currentTimeMillis();
		for (int i = 0; i < num; i++) {
			map1.put(i, i);
		}
		long e1 = System.currentTimeMillis();
		System.out.println("No initial capacity: " + (e1 - s1));

		// Initial capacity set to half the number of entries
		Map<Integer, Integer> map2 = new HashMap<Integer, Integer>(num/2);
		long s2 = System.currentTimeMillis();
		for (int i = 0; i < num; i++) {
			map2.put(i, i);
		}
		long e2 = System.currentTimeMillis();
		System.out.println("Half the size: " + (e2 - s2));

		// Initial capacity set to the number of entries
		Map<Integer, Integer> map3 = new HashMap<Integer, Integer>(num);
		long s3 = System.currentTimeMillis();
		for (int i = 0; i < num; i++) {
			map3.put(i, i);
		}
		long e3 = System.currentTimeMillis();
		System.out.println("Same size: " + (e3 - s3));

		// Initial capacity computed with the load factor taken into account
		Map<Integer, Integer> map4 = new HashMap<Integer, Integer>((int)(num/0.75+1.0));
		long s4 = System.currentTimeMillis();
		for (int i = 0; i < num; i++) {
			map4.put(i, i);
		}
		long e4 = System.currentTimeMillis();
		System.out.println("With load factor considered: " + (e4 - s4));
	}

}

Result (elapsed time in milliseconds):

No initial capacity: 16
Half the size: 12
Same size: 8
With load factor considered: 4
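One way to read these timings is to count how many resizes each of the four configurations needs for 100000 entries. The sketch below does that with the same simplified model of the resize rule. `BenchmarkResizeCounts`, `roundUpToPowerOfTwo` and `resizesNeeded` are hypothetical names, 16 stands in for the default capacity of the no-argument constructor, and the load factor is assumed to be the default 0.75.

```java
class BenchmarkResizeCounts {

    // Smallest power of 2 >= requested (small positive inputs only).
    static int roundUpToPowerOfTwo(int requested) {
        int capacity = 1;
        while (capacity < requested) {
            capacity <<= 1;
        }
        return capacity;
    }

    // Number of doublings needed before `entries` fits under the threshold,
    // given the value passed to the HashMap constructor.
    static int resizesNeeded(int requestedCapacity, int entries) {
        int capacity = roundUpToPowerOfTwo(requestedCapacity);
        int resizes = 0;
        while (entries > (int) (capacity * 0.75f)) {
            capacity <<= 1;
            resizes++;
        }
        return resizes;
    }

    public static void main(String[] args) {
        int num = 100000;
        System.out.println("No initial capacity: " + resizesNeeded(16, num));
        System.out.println("Half the size: " + resizesNeeded(num / 2, num));
        System.out.println("Same size: " + resizesNeeded(num, num));
        System.out.println("With load factor considered: "
                + resizesNeeded((int) (num / 0.75 + 1.0), num));
    }
}
```

Under this model the four configurations need 14, 2, 1 and 0 resizes respectively, which matches the ordering of the measured times. The absolute millisecond figures come from a single run each, so they are also influenced by JIT warm-up and the order in which the maps are filled.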

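Why does HashMap need an initial size?

Performance:
- HashMap stores key-value pairs in a hash table. A core concept of the hash table is the load factor, that is, how full the table is allowed to get. The default load factor is 0.75, which means the table resizes once it is more than 75% full. Setting an initial size reduces the number of resizes, which improves performance, because every resize redistributes all existing key-value pairs into a new table, and that is expensive.
Avoiding unnecessary resizes:
- If the initial size is too small, HashMap resizes repeatedly while a large amount of data is inserted. Each resize allocates a new, larger array and moves every existing key-value pair into it. This costs extra time and leaves the old arrays behind for the garbage collector. A sensible initial size avoids these repeated resizes.
Lower memory overhead:
- An initial size that is too large wastes memory, because unused hash buckets also take up space. A sensible initial size balances memory use against performance.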

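What initial size should HashMap be given?

Estimate from the expected number of entries:
- A common strategy is to derive the initial size from the expected maximum number of entries and the load factor:
```
int initialCapacity = (int) (expectedMaxEntries / loadFactor) + 1;
```
  For example, to store 150 entries with the default load factor of 0.75, the initial capacity would be:
```
int initialCapacity = (int) (150 / 0.75) + 1; // 201
```
  With this value, inserting these entries does not trigger a resize.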
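Rounding to a power of 2:
- HashMap's capacity is always a power of 2, which lets it map a hash to a bucket index with a bit mask. The value passed to the constructor is therefore rounded up to the smallest power of 2 that is greater than or equal to it. For example, a computed initial capacity of 201 becomes 256.
Rule of thumb:
- If the number of entries cannot be estimated precisely, the initial size can be chosen by rule of thumb. For example, many developers pick the default 16, or 32, depending on the application's needs and historical data.
For reference, in Java: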

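// First, use this formula to compute the approximate capacity
int initialCapacity = (int) (expectedMaxEntries / loadFactor) + 1;

// Then, to see which power of 2 that value rounds up to, run this method locally.
// HashMap applies the same rounding internally, so calling it yourself is optional.
static final int tableSizeFor(int cap) {
    final int MAXIMUM_CAPACITY = 1 << 30;
    // Subtract 1 so that an exact power of 2 maps to itself
    int n = cap - 1;
    // Copy the highest set bit into every lower bit position
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    // n is now 2^k - 1; adding 1 gives the power of 2, clamped to [1, MAXIMUM_CAPACITY]
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}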
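The two steps can be put together in a small runnable class. This is a sketch, not a recommended library API: `PresizedMaps` and `newMapFor` are hypothetical names, the load factor is assumed to be the default 0.75, and `tableSizeFor` is a local copy of the method above, used only to print the rounded capacity.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedMaps {

    // Local copy of the rounding method shown above.
    static int tableSizeFor(int cap) {
        final int MAXIMUM_CAPACITY = 1 << 30;
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    // Hypothetical helper: create a HashMap sized for the expected number
    // of entries, assuming the default load factor of 0.75.
    static <K, V> Map<K, V> newMapFor(int expectedMaxEntries) {
        int initialCapacity = (int) (expectedMaxEntries / 0.75f) + 1;
        return new HashMap<>(initialCapacity);
    }

    public static void main(String[] args) {
        // How a few requested capacities are rounded.
        int[] requests = {7, 8, 10, 201};
        for (int request : requests) {
            System.out.println(request + " -> " + tableSizeFor(request));
        }

        // Fill a map that was sized for 150 entries.
        Map<Integer, Integer> map = newMapFor(150);
        for (int i = 0; i < 150; i++) {
            map.put(i, i);
        }
        System.out.println("entries stored: " + map.size());
    }
}
```

For 150 entries the helper requests 201, which rounds to 256 with a threshold of 192, so the inserts stay below the resize threshold. On JDK 19 or newer, the static factory `HashMap.newHashMap(int numMappings)` performs this kind of sizing and can replace a hand-written helper.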