When I Found That Different Keys in a Python Dict Can Share the Same Hash Value: Tracing the Stream Back to Its Source

云中君不见

Last modified 2023-03-23 23:17:50

Tags: python, hash table

First published 2022-11-11 21:29:26

Article link: https://blog.csdn.net/cendrier/article/details/127812746

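Thick Fog

We know that the keys of a dict cannot repeat. So how does Python decide whether two keys are the same? In other words, what counts as a duplicate? Before answering, let's look at a piece of code.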

class Position():
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def __hash__(self):
        return hash((self.x, self.y))
    
    # def __eq__(self, other):
    #     return self.x == other.x and self.y == other.y

I defined a Position class and overrode its __hash__() method.

p1 = Position(1,2)
p2 = Position(1,2)
print(f"Hash values are equal: {hash(p1) == hash(p2)}")

d = {p1:2}
d[p2] = 4
print(f"Length of d is {len(d)}")
print(f"d[p1]={d[p1]}")
print(f"d[p2]={d[p2]}")

for key in d.keys():
    print(f"The memory address of key is {id(key)}")

print(f"The memory address of p1 is {id(p1)}")
print(f"The memory address of p2 is {id(p2)}")

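Next, I instantiated two positions, p1 and p2, with the same coordinates.
Running the code gives the following result:
[Screenshot of the program output]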

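The hash comparison is no surprise, since we overrode how the hash is computed. The interesting part is that the dict's length is 2: it treats p1 and p2 as different keys, even though their hash values are identical!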
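The observations above can be restated as a small self-checking sketch. The class name `PositionNoEq` is made up for this example; it is meant to mirror the `Position` class above, which defines `__hash__` but not `__eq__`.

```python
class PositionNoEq:
    """Custom __hash__, but no __eq__ (hypothetical name for illustration)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash((self.x, self.y))


p1 = PositionNoEq(1, 2)
p2 = PositionNoEq(1, 2)

d = {p1: 2}
d[p2] = 4

same_hash = hash(p1) == hash(p2)   # both hash the tuple (1, 2)
equal = p1 == p2                   # no __eq__ defined, so identity is compared

print(same_hash, equal, len(d))    # True False 2
print(d[p1], d[p2])                # 2 4
```

Equal hashes alone are not enough to merge two keys: `p1 == p2` is `False` here, so the dict keeps both entries.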

The Plot Thickens

If you are confused, hold on and look at one more piece of code. This time I also define an __eq__() method:

class Position():
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def __hash__(self):
        return hash((self.x, self.y))
    
    def __eq__(self, other):
        return self.x == other.x and self.y == other.y

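Now we run the same operations as above, and this time the result is different:
[Screenshot of the program output]
This time the dict's length is 1: Python treats p1 and p2 as the same key and stores only one of them. Looking closer, the memory address of the only key in the dict is p1's address, and both d[p1] and d[p2] reach the single value, 4.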
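The same behaviour can be checked without comparing `id()` values by eye. This is a minimal sketch; `PositionEq` is a made-up name for a class that mirrors the second `Position` definition.

```python
class PositionEq:
    """Both __hash__ and __eq__ defined (hypothetical name for illustration)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        return self.x == other.x and self.y == other.y


p1 = PositionEq(1, 2)
p2 = PositionEq(1, 2)

d = {p1: 2}
d[p2] = 4                      # p2 == p1, so this updates the existing entry

stored_key = next(iter(d))     # the only key in the dict

print(len(d))                  # 1
print(stored_key is p1)        # True: the original key object is kept
print(stored_key is p2)        # False
print(d[p1], d[p2])            # 4 4
```

Assigning through an equal key replaces the value but leaves the originally stored key object in place, which is why the surviving key is `p1` while the value is `4`.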

Clearing Up the Confusion

With these questions in mind, I found an answer on Stack Overflow: Why can a Python dict have multiple keys with the same hash?

Python dictionaries are implemented as hash tables;
Each entry in the table is actually a combination of the three values: <hash of key, key, value>;
When a new dict is initialized it starts with 8 slots;
The dict will be resized if it is two-thirds full. This avoids slowing down lookups.
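One rough way to watch a dict grow in steps is to track `sys.getsizeof` while inserting keys. This is only a sketch: the reported sizes and the exact item counts at which they change are CPython implementation details that differ between versions, so no expected output is shown.

```python
import sys

d = {}
sizes = [sys.getsizeof(d)]          # size before any insertion

for i in range(100):
    d[i] = None
    sizes.append(sys.getsizeof(d))  # size after i + 1 insertions

# Item counts at which the reported size changed.
growth_points = [n for n in range(1, len(sizes)) if sizes[n] != sizes[n - 1]]
print(growth_points)
```

The size should stay flat for a run of insertions and then jump, which is consistent with the table being reallocated in larger chunks rather than one slot at a time.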

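Python's dict is implemented as a hash table, and each slot stores a <hash of key, key, value> triple. Logically it looks roughly like this: a table made of many slots, which you can picture as drawers: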

# Logical model of Python Hash table
-+-----------------+
0| <hash|key|value>|
-+-----------------+
1|      ...        |
-+-----------------+
.|      ...        |
-+-----------------+
i|      ...        |
-+-----------------+
.|      ...        |
-+-----------------+
n|      ...        |
-+-----------------+

When a key-value pair is inserted into the dict, Python first uses the hash of the key to work out which slot (drawer) it belongs in.
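As a rough illustration of that first step, the sketch below maps a hash to a slot in an 8-slot table by keeping the low bits of the hash. The helper `initial_slot` is made up for this example, and the code relies on the CPython detail that small non-negative integers hash to themselves.

```python
TABLE_SIZE = 8           # a new dict starts with 8 slots
MASK = TABLE_SIZE - 1    # valid because the size is a power of two


def initial_slot(key, mask=MASK):
    """Return the first slot a toy table would try for `key` (hypothetical helper)."""
    return hash(key) & mask


for key in (1, 9, 17, 2):
    print(key, "->", initial_slot(key))
# 1 -> 1
# 9 -> 1
# 17 -> 1
# 2 -> 2
```

Keys 1, 9 and 17 have different hashes yet land in the same drawer, so sharing a slot is even more common than sharing a hash. Either way the table needs a rule for what to do when the drawer is taken.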

If that slot is empty, the entry is added to the slot (by entry, I mean, <hash|key|value>).
If the slot is occupied, CPython (and even PyPy) compares the hash AND the key (by compare I mean == comparison not the is comparison) of the entry in the slot against the key of the current entry to be inserted (dictobject.c:337,344-345). If both match, then it thinks the entry already exists, gives up and moves on to the next entry to be inserted. If either hash or the key don’t match, it starts probing.

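If the drawer is empty, the key-value pair goes straight in. If the drawer is already occupied, Python checks whether the stored key (key1) and the new key (key2) have the same hash value and whether key1 and key2 are equal (equal in the sense of ==, not is). If both checks are True, key1 and key2 are the same key, and the value already in the drawer is updated. If either check fails, Python probes for another slot, as the quoted answer describes.

That answers the opening question of how Python decides whether two keys are the same.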
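The rule can be sketched as a toy table. Everything here is an assumption made for illustration: `ToyDict`, `HashOnly` and `HashAndEq` are made-up names, the table has a fixed size, and it uses simple linear probing, whereas CPython's real probing sequence and memory layout are different.

```python
class ToyDict:
    """Toy open-addressing table of (hash, key, value) entries.

    Assumptions: fixed size, linear probing, no deletion, no resizing.
    """

    def __init__(self, size=8):
        self.slots = [None] * size   # each slot is None or (hash, key, value)

    def _find(self, key):
        """Return the index of key's slot, or of the empty slot it would use."""
        h = hash(key)
        n = len(self.slots)
        i = h % n
        for _ in range(n):
            entry = self.slots[i]
            if entry is None:
                return i
            stored_hash, stored_key, _value = entry
            # Same key: the identical object, or equal hash AND ==.
            if stored_key is key or (stored_hash == h and stored_key == key):
                return i
            i = (i + 1) % n          # collision: probe the next slot
        raise RuntimeError("toy table is full")

    def __setitem__(self, key, value):
        i = self._find(key)
        entry = self.slots[i]
        # On update, keep the key object that was stored first.
        stored_key = key if entry is None else entry[1]
        self.slots[i] = (hash(key), stored_key, value)

    def __getitem__(self, key):
        entry = self.slots[self._find(key)]
        if entry is None:
            raise KeyError(key)
        return entry[2]

    def __len__(self):
        return sum(entry is not None for entry in self.slots)


class HashOnly:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __hash__(self):
        return hash((self.x, self.y))


class HashAndEq(HashOnly):
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    # Defining __eq__ resets __hash__ to None in this class, so restore it.
    __hash__ = HashOnly.__hash__


t1 = ToyDict()
a, b = HashOnly(1, 2), HashOnly(1, 2)
t1[a] = 2
t1[b] = 4
print(len(t1), t1[a], t1[b])   # 2 2 4

t2 = ToyDict()
c, e = HashAndEq(1, 2), HashAndEq(1, 2)
t2[c] = 2
t2[e] = 4
print(len(t2), t2[c], t2[e])   # 1 4 4
```

The toy table reproduces both experiments: without `__eq__` the second key is probed into a new slot, and with `__eq__` it overwrites the value in the first one.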

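In the first section's example, p1 and p2 have the same hash value, so Python works out that they belong in the same drawer and then compares the two keys. Because the class does not define __eq__(), Python uses the __eq__() inherited from object, which compares the identity of the two instances (their memory addresses in CPython), the same thing that is does. p1 and p2 clearly live at different addresses, so Python treats them as two different keys even though their hash values are equal.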
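The identity fallback is easy to check in isolation. In this minimal sketch, `Plain` is a made-up class that defines nothing at all.

```python
class Plain:
    """No __eq__ defined, so == falls back to identity (hypothetical name)."""


a = Plain()
b = Plain()
alias = a   # a second name for the same object

inherits_default = Plain.__eq__ is object.__eq__

print(inherits_default)   # True: __eq__ comes straight from object
print(a == b, a is b)     # False False: two distinct objects
print(a == alias)         # True: the same object
```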

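In the second section's example we overrode __eq__(), so p1 == p2 evaluates to True, and the hash values of p1 and p2 are also equal. Python therefore treats p1 and p2 as the same key, and the dict's length is 1.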
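One practical consequence: because the dict calls == whenever hashes match, `__eq__` can be handed objects of other types. The tuple `(1, 2)` has the same hash as a position at (1, 2), so with the `__eq__` from the second section, looking up that tuple would, as far as I can tell, try to read `.x` on a tuple and raise AttributeError. Below is a sketch of a more defensive variant; `SafePosition` is a made-up name.

```python
class SafePosition:
    """Position variant whose __eq__ tolerates other types (hypothetical name)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        if not isinstance(other, SafePosition):
            return NotImplemented   # let Python fall back to its default
        return (self.x, self.y) == (other.x, other.y)


p1 = SafePosition(1, 2)
d = {p1: "here"}

# The tuple (1, 2) has the same hash as p1, so the dict has to call == to decide.
print(hash((1, 2)) == hash(p1))   # True
print((1, 2) in d)                # False: same hash, but not equal
print(SafePosition(1, 2) in d)    # True: same hash and equal
```

Returning `NotImplemented` lets the comparison come out as "not equal" instead of raising, so the lookup simply moves on.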