A First Look at Hash Tables

山有梧桐

Tags: java, programming languages, backend

Article link: https://blog.csdn.net/qq_51194826/article/details/123831418

Introduction

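A hash table (哈希表, also called 散列表) is a data structure whose main advantage is that both insertion and lookup take O(1) time on average. Because of this, it shows up wherever massive amounts of data have to be processed.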

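The "hash" here refers to a hash value. Hash values are also used heavily in information security, where a cryptographic hash function turns an input into a fixed-length, seemingly random code (128 bits in the case of MD5, for example); that code is called the hash value. Seen from another angle, in a hash table a hash is simply a mapping between a piece of data and the address where that data is stored.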

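Readers with some familiarity with the Java source code will know that HashSet is implemented on top of HashMap, and that HashMap in turn relies mainly on hashCode. The first block comment in the HashMap source reads: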
/*
* Implementation notes.
*
* This map usually acts as a binned (bucketed) hash table, but
* when bins get too large, they are transformed into bins of
* TreeNodes, each structured similarly to those in
* java.util.TreeMap. Most methods try to use normal bins, but
* relay to TreeNode methods when applicable (simply by checking
* instanceof a node). Bins of TreeNodes may be traversed and
* used like any others, but additionally support faster lookup
* when overpopulated. However, since the vast majority of bins in
* normal use are not overpopulated, checking for existence of
* tree bins may be delayed in the course of table methods.
*
* Tree bins (i.e., bins whose elements are all TreeNodes) are
* ordered primarily by hashCode, but in the case of ties, if two
* elements are of the same "class C implements Comparable<C>",
* type then their compareTo method is used for ordering. (We
* conservatively check generic types via reflection to validate
* this -- see method comparableClassFor). The added complexity
* of tree bins is worthwhile in providing worst-case O(log n)
* operations when keys either have distinct hashes or are
* orderable, Thus, performance degrades gracefully under
* accidental or malicious usages in which hashCode() methods
* return values that are poorly distributed, as well as those in
* which many keys share a hashCode, so long as they are also
* Comparable. (If neither of these apply, we may waste about a
* factor of two in time and space compared to taking no
* precautions. But the only known cases stem from poor user
* programming practices that are already so slow that this makes
* little difference.)
*
* Because TreeNodes are about twice the size of regular nodes, we
* use them only when bins contain enough nodes to warrant use
* (see TREEIFY_THRESHOLD). And when they become too small (due to
* removal or resizing) they are converted back to plain bins. In
* usages with well-distributed user hashCodes, tree bins are
* rarely used. Ideally, under random hashCodes, the frequency of
* nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average for the default resizing
* threshold of 0.75, although with a large variance because of
* resizing granularity. Ignoring variance, the expected
* occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
* factorial(k)). The first values are:
*
* 0: 0.60653066
* 1: 0.30326533
* 2: 0.07581633
* 3: 0.01263606
* 4: 0.00157952
* 5: 0.00015795
* 6: 0.00001316
* 7: 0.00000094
* 8: 0.00000006
* more: less than 1 in ten million
*
* The root of a tree bin is normally its first node. However,
* sometimes (currently only upon Iterator.remove), the root might
* be elsewhere, but can be recovered following parent links
* (method TreeNode.root()).
*
* All applicable internal methods accept a hash code as an
* argument (as normally supplied from a public method), allowing
* them to call each other without recomputing user hashCodes.
* Most internal methods also accept a "tab" argument, that is
* normally the current table, but may be a new or old one when
* resizing or converting.
*
* When bin lists are treeified, split, or untreeified, we keep
* them in the same relative access/traversal order (i.e., field
* Node.next) to better preserve locality, and to slightly
* simplify handling of splits and traversals that invoke
* iterator.remove. When using comparators on insertion, to keep a
* total ordering (or as close as is required here) across
* rebalancings, we compare classes and identityHashCodes as
* tie-breakers.
*
* The use and transitions among plain vs tree modes is
* complicated by the existence of subclass LinkedHashMap. See
* below for hook methods defined to be invoked upon insertion,
* removal and access that allow LinkedHashMap internals to
* otherwise remain independent of these mechanics. (This also
* requires that a map instance be passed to some utility methods
* that may create new nodes.)
*
* The concurrent-programming-like SSA-based coding style helps
* avoid aliasing errors amid all of the twisty pointer operations.
*/
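Hash table based implementation of the Map interface. This implementation provides all of the optional map operations, and permits null values and the null key. (The HashMap class is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). So if iteration performance is important, do not set the initial capacity too high (or the load factor too low).

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the table is rehashed (its internal structures are rebuilt) so that it has roughly twice as many buckets.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting the initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large initial capacity allows the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table. (This passage is excerpted from the API documentation.)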
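The sizing rule from the API excerpt can be sketched in a few lines. This is only an illustration: the helper names `resizeThreshold` and `initialCapacityFor` are hypothetical and not part of the JDK; the only real API used is the `HashMap(int initialCapacity, float loadFactor)` constructor.

```java
import java.util.HashMap;
import java.util.Map;

public class CapacitySketch {
    // Entry count above which a table of this capacity is grown:
    // load factor times current capacity, as described in the excerpt.
    static int resizeThreshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    // Smallest int strictly greater than expectedEntries / loadFactor,
    // so that (per the excerpt) no rehash should be needed.
    static int initialCapacityFor(int expectedEntries, float loadFactor) {
        return (int) (expectedEntries / loadFactor) + 1;
    }

    public static void main(String[] args) {
        // Default capacity 16 with load factor 0.75 gives a threshold of 12.
        System.out.println(resizeThreshold(16, 0.75f));

        // Pre-size a map that is expected to hold 1000 entries.
        int capacity = initialCapacityFor(1000, 0.75f);
        Map<String, Integer> map = new HashMap<>(capacity, 0.75f);
        for (int i = 0; i < 1000; i++) {
            map.put("key" + i, i);
        }
        System.out.println(capacity + " " + map.size());
    }
}
```

The map behaves the same either way; pre-sizing only avoids the intermediate rehash steps while it fills up.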

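Common hashing methods:

(These methods are reproduced from 自由の翼.)
1. Division method (modulo method)

index = value % 16;

This is the most intuitive method. Its drawback is that, although taking the remainder spreads entries fairly evenly, some continuity from the key space to the hash address space remains: adjacent keys map to adjacent hash addresses. This is what motivates the MAD method.
2. MAD method
If the constants a and b are chosen well, the MAD method largely overcomes the continuity of the division method. The division method can be seen as the special case of MAD with a = 1 and b = 0, in which the two constants play no real role.

hash(key) = (a * key + b) % M;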
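A minimal sketch comparing the two methods above. The constants `m = 17`, `a = 31` and `b = 2`, and the method names, are illustrative assumptions and do not come from the original text.

```java
public class DivisionVsMad {
    // Division method: the index is the remainder of key divided by m.
    static int divisionHash(int key, int m) {
        return Math.floorMod(key, m);
    }

    // MAD method: multiply by a, add b, then take the remainder modulo m.
    // The product is computed as a long to avoid int overflow.
    static int madHash(int key, int a, int b, int m) {
        return (int) Math.floorMod((long) a * key + b, (long) m);
    }

    public static void main(String[] args) {
        int m = 17, a = 31, b = 2; // illustrative constants (assumed)
        for (int key = 35; key <= 37; key++) {
            System.out.println(key + ": division=" + divisionHash(key, m)
                    + ", mad=" + madHash(key, a, b, m));
        }
    }
}
```

With these constants, the adjacent keys 35, 36, 37 land in adjacent slots 1, 2, 3 under the division method, while MAD places them in slots 16, 13, 10. Calling `madHash` with a = 1 and b = 0 gives the same result as `divisionHash`.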

3. Pseudo-random number method

hash(key) = rand(key) = [rand(0) * pow(a, key)] % M
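One way to read this formula, assuming the underlying generator is the multiplicative recurrence rand(x + 1) = (a * rand(x)) % M, is sketched below. The recurrence, the parameter names and the sample constants are assumptions made for illustration; `BigInteger.modPow` is used only to keep the power from overflowing.

```java
import java.math.BigInteger;

public class PseudoRandomHash {
    // Computes [seed * a^key] % m, i.e. the key-th value of the assumed
    // generator rand(x + 1) = (a * rand(x)) % m started from rand(0) = seed.
    static int hash(int key, int seed, int a, int m) {
        if (key < 0) {
            throw new IllegalArgumentException("key must be non-negative");
        }
        BigInteger mod = BigInteger.valueOf(m);
        return BigInteger.valueOf(a)
                .modPow(BigInteger.valueOf(key), mod)
                .multiply(BigInteger.valueOf(seed))
                .mod(mod)
                .intValue();
    }

    public static void main(String[] args) {
        int seed = 3, a = 5, m = 17; // illustrative constants (assumed)
        for (int key = 1; key <= 3; key++) {
            System.out.println(key + " -> " + hash(key, seed, a, m));
        }
    }
}
```

With these constants the keys 1, 2, 3 map to 15, 7, 1.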

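4. Mid-square method: square the key and take the middle few digits as the hash address. Why the middle? Squaring can be viewed as a series of shift-and-add operations, and the middle digits of the square are sums that involve more of the key's digits, so they carry more of the key's characteristics; the digits at either end carry fewer. Taking the middle therefore reflects as much of the key as possible and reduces collisions.

5. Folding method: cut the key into pieces and combine them. For example, 123456789 can be turned into the address 123 + 456 + 789 = 1368. There are many other ways to fold; which one to use depends on the situation.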
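The mid-square and folding methods can be sketched as follows. This is a minimal illustration working on decimal digits and assuming non-negative keys; the method names and the choice of how many digits to keep are assumptions, not part of the original description.

```java
public class MidSquareAndFolding {
    // Mid-square: square the key and keep `digits` decimal digits
    // from the middle of the result.
    static int midSquare(int key, int digits) {
        String squared = Long.toString((long) key * key);
        if (squared.length() <= digits) {
            return Integer.parseInt(squared);
        }
        int start = (squared.length() - digits) / 2;
        return Integer.parseInt(squared.substring(start, start + digits));
    }

    // Folding: cut the decimal digits into groups of `groupSize`
    // from the left and add the groups together.
    static int fold(long key, int groupSize) {
        String s = Long.toString(key);
        int sum = 0;
        for (int i = 0; i < s.length(); i += groupSize) {
            int end = Math.min(i + groupSize, s.length());
            sum += Integer.parseInt(s.substring(i, end));
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(midSquare(1234, 4));   // 1234 * 1234 = 1522756
        System.out.println(fold(123456789L, 3));  // 123 + 456 + 789
    }
}
```

For the key 1234 the square is 1522756 and the middle four digits are 5227; folding 123456789 in groups of three reproduces the 1368 from the example above.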

The underlying code of hashCode, here as implemented in java.lang.String:

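public int hashCode() {
    int h = hash;                    // cached value; 0 means not computed yet
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];     // polynomial accumulation with base 31
        }
        hash = h;                    // cache the result for later calls
    }
    return h;
}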

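This code reveals the weak point of hashCode. **The number of possible hash values is finite!** The result is an int, so once there is enough data, different values will inevitably end up with the same hash value (a collision). There are several main approaches to resolving such collisions; HashMap, as the implementation notes above describe, keeps colliding entries together in one bin and converts a bin into a tree when it grows too large.
With that, you have a first understanding of hashing!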
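A small, concrete collision makes the point. The sketch below re-implements the same polynomial in a hypothetical helper `polyHash` (not a JDK method) and uses the strings "Aa" and "BB", which are known to share a hash code.

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Same polynomial as the hashCode listing above: h = 31 * h + c per char.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // 'A' * 31 + 'a' = 65 * 31 + 97 = 2112 and 'B' * 31 + 'B' = 66 * 31 + 66 = 2112
        System.out.println(polyHash("Aa") + " " + polyHash("BB"));

        // HashMap still keeps both keys apart, because it also compares with equals().
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);
        System.out.println(map.size() + " " + map.get("Aa") + " " + map.get("BB"));
    }
}
```

Both strings hash to 2112, yet the map ends up with two separate entries: a collision costs some lookup time but does not lose data.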
