The hashCode Method and 31

Locating entries by hash code

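I have long had the notion that hashing makes storing and retrieving data fast, but I never looked closely at how it is implemented. Recently, while learning how to write a custom hashCode method, I read a book's remarks on efficiency and decided to explore how HashMap uses the hash to locate entries (HashSet is also implemented on top of HashMap internally).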

HashMap's storage structure

HashMap stores its data in a Node array named table (Node[]). The most basic Node is a node of a singly linked list:

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}

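Storing and retrieving data in HashMap comes down to converting the hash value of the key into an index into the table array; entries that map to the same index are chained together on a singly linked list.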
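To make the bucket-plus-chain idea concrete, here is a minimal sketch of my own. TinyChainedMap is a hypothetical teaching class, not the JDK's HashMap: it assumes a fixed table of 16 buckets, never resizes, and never converts long chains into trees. It borrows the index and hash formulas that this article quotes from the HashMap source.

```java
import java.util.Objects;

/** Hypothetical teaching sketch, not the JDK's HashMap: fixed table, no resizing, no trees. */
final class TinyChainedMap<K, V> {
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // The length is a power of two so that (length - 1) works as a bit mask.
    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    // Same transform as the HashMap.hash(Object) method quoted in this article.
    private static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    /** Stores a mapping; returns the previous value, or null if there was none. */
    V put(K key, V value) {
        int h = hash(key);
        int i = (table.length - 1) & h;
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (n.hash == h && Objects.equals(n.key, key)) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        // Simplification: a new node is put at the head of the chain.
        table[i] = new Node<>(h, key, value, table[i]);
        return null;
    }

    /** Walks the chain of the key's bucket; returns null if the key is absent. */
    V get(Object key) {
        int h = hash(key);
        for (Node<K, V> n = table[(table.length - 1) & h]; n != null; n = n.next) {
            if (n.hash == h && Objects.equals(n.key, key)) {
                return n.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TinyChainedMap<String, Integer> map = new TinyChainedMap<>();
        map.put("Aa", 1);
        map.put("BB", 2); // "Aa" and "BB" share hashCode 2112, so they land in the same bucket
        System.out.println(map.get("Aa") + " " + map.get("BB")); // prints "1 2"
    }
}
```

The two keys in main collide on purpose: both lookups still succeed, but each has to walk the chain and compare keys with equals, which is exactly the cost that a well-distributed hashCode avoids.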

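HashMap's index calculation
  1. First, the index computed from the hash value can never go out of bounds:
// n is the length of the Node array table (a power of two)
// hash is the hash value computed from the key (note: this is not the raw return value of Object.hashCode)
table[(n - 1) & hash]
  2. The hash value mentioned above is computed as follows: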
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

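XORing the high bits into the low bits here is a trade-off among several concerns, including the distribution of hash values, the bounds imposed by the table size, and the speed of bit-spreading. The official explanation in the source comment is: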

Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.) So we apply a transform that spreads the impact of higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (so don’t benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.
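A small experiment of my own shows what the comment describes. IndexDemo is a hypothetical class; the table size n = 16 and the hash values k << 16 are made-up inputs chosen so that the hashes differ only in bits above the mask.

```java
final class IndexDemo {
    // Same transform as HashMap.hash(Object): fold the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Index formula from HashMap; n must be a power of two.
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16; // assumed table size for the illustration
        for (int k = 1; k <= 3; k++) {
            int raw = k << 16; // differs from the other hashes only above the mask
            System.out.println(indexFor(raw, n) + " -> " + indexFor(spread(raw), n));
        }
        // For a power-of-two n the mask agrees with a non-negative modulo, even for negative hashes.
        System.out.println(indexFor(-7, n) == Math.floorMod(-7, n));
    }
}
```

Running main prints `0 -> 1`, `0 -> 2`, `0 -> 3` and then `true`: without spreading, all three hashes fall into bucket 0; after the XOR they land in different buckets.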

hashCode() and "31"

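If most entries end up with the same hash value, every lookup has to traverse a linked list, and efficiency drops. So hashCode should be overridden sensibly when necessary. The Objects class provides a hash method, which delegates to Arrays.hashCode; the two implementations are as follows: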

public static int hash(Object... values) {
    return Arrays.hashCode(values);
}
public static int hashCode(Object a[]) {
    if (a == null)
        return 0;

    int result = 1;

    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());

    return result;
}
As in the implementation above, the number 31 shows up in code all the time. Why was 31 chosen?

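Here is what I have found so far in reference [1]:
First, 31 is an (odd) prime. The more unique a hash value is, the better, and according to [1] a product involving a prime has a better chance of being unique than a product of other numbers.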

Primes are unique numbers. They are unique in that, the product of a prime with any other number has the best chance of being unique (not as unique as the prime itself of-course) due to the fact that a prime is used to compose it. This property is used in hashing functions.

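Reference [1] says that primes yield sufficiently unique keys, but that they are not the only option; going deeper means studying hash functions themselves: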

However using primes is an old technique. The key here to understand that as long as you can generate a sufficiently unique key you can move to other hashing techniques too. Go here for more on this topic about hashes without primes.

Some other material makes one reluctant to dig any further:

Excerpt [1]

Researchers found that using a prime of 31 gives a better distribution to the keys, and lesser no of collisions. No one knows why, the last i know and i had this question answered by Chris Torek himself, who is generally credited with coming up with 31 hash, on the C++ or C mailing list a while back.

Excerpt [2]

You can read Bloch’s original reasoning under “Comments” in http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. He investigated the performance of different hash functions in regards to the resulting “average chain size” in a hash table. P(31) was one of the common functions during that time which he found in K&R’s book (but even Kernighan and Ritchie couldn’t remember where it came from). In the end he basically had to chose one and so he took P(31) since it seemed to perform well enough. Even though P(33) was not really worse and multiplication by 33 is equally fast to calculate (just a shift by 5 and an addition), he opted for 31 since 33 is not a prime

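Some people even say that the hash implementation may change in future versions, which is not something a humble programmer like me needs to worry about. Personally, I think it is enough to remember a few points: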

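  1. Primes have been used as multipliers in hash algorithms for a long time
  2. 31 may not be the optimal choice, but it is good enough
  3. Multiplication by 31 can be optimized into a shift and a subtraction: 31 * i == (i << 5) - i
  4. 31 is already widespread in the source code of Java's core libraries and third-party libraries, and developers at large have accepted it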
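Point 3 is easy to check. The sketch below is my own; ThirtyOneDemo and polyHash are hypothetical names. It verifies the shift identity, including a value that overflows, and rebuilds the 31-based polynomial hash that String.hashCode is documented to compute.

```java
final class ThirtyOneDemo {
    // Polynomial hash over the chars of s, with the multiply written as shift-and-subtract.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 5) - h + s.charAt(i); // same value as 31 * h + s.charAt(i)
        }
        return h;
    }

    public static void main(String[] args) {
        // The identity holds under int overflow as well, because both sides wrap modulo 2^32.
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            System.out.println(31 * i == (i << 5) - i); // prints "true" for every sample
        }
        System.out.println(polyHash("hello") == "hello".hashCode()); // prints "true"
    }
}
```

Whether the JIT actually emits a shift and a subtraction is up to the JVM; the sketch only shows that the two expressions always produce the same int.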

References

[1] : Why do hash functions use prime numbers?

[2] : Stack Overflow - Why does Java's hashCode() in String use 31 as a multiplier?

[3] : Hash functions.

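In Java, hashCode returns an object's hash code as an int. The default implementation in Object typically derives the value from the object's identity (often described as its memory address, although this depends on the JVM). A class that overrides hashCode can compute the hash code to suit its own needs.

An overriding hashCode should follow these rules:
  1. If equals returns true for two objects, their hashCode values must be the same;
  2. If equals returns false for two objects, their hashCode values do not have to differ, but distinct values improve hash table performance;
  3. hashCode must keep returning the same value as long as the fields used by equals do not change. It should not depend on state that is modified while the object sits in a hash-based collection, because once the hash code changes the object can no longer be found in its bucket.

The general steps for overriding hashCode are:
  1. Declare an int variable result and initialize it to a non-zero value such as 17;
  2. For each significant field (a field that affects equality), compute its hash code and fold it into result with a mix of multiplication and addition, one field at a time. For two significant fields a and b: result = 31 * result + a.hashCode(), then result = 31 * result + b.hashCode();
  3. Return result.

The following example shows an overridden hashCode:
```
public class Person {
    private String name;
    private int age;

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public int hashCode() {
        int result = 17;
        // A null name contributes 0 instead of throwing NullPointerException
        result = 31 * result + (name == null ? 0 : name.hashCode());
        result = 31 * result + age;
        return result;
    }

    // equals method omitted
}
```
In this example the hash code is computed from the two significant fields name and age. result starts at 17; for each field in turn, result is multiplied by 31 and the field's hash (name.hashCode(), then age) is added; finally result is returned.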
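For comparison, here is a sketch of the same idea written with the Objects.hash method discussed earlier, this time with equals included so that the first rule can be observed in a HashSet. PersonKey is a hypothetical class of my own, and the field values are made up.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class PersonKey {
    private final String name;
    private final int age;

    PersonKey(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PersonKey)) return false;
        PersonKey other = (PersonKey) o;
        return age == other.age && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields that equals compares; null-safe for name.
        return Objects.hash(name, age);
    }

    public static void main(String[] args) {
        Set<PersonKey> set = new HashSet<>();
        set.add(new PersonKey("Ann", 30));
        // A distinct but equal object is found because equals and hashCode agree.
        System.out.println(set.contains(new PersonKey("Ann", 30))); // prints "true"
    }
}
```

Objects.hash starts from 1 rather than 17, so it does not produce the same number as the hand-written version; what matters is only that equal objects get equal hash codes. It also allocates a varargs array on every call, which is a reason some code keeps the hand-written form.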