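Python dictionaries are implemented as hash tables. The article "Python dictionary implementation" covers the details thoroughly.
My own rough understanding:
To build a mapping, each key is first mapped to a hash value (the session below is from Python 2):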
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
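For readers on Python 3, here is a small sketch of how the same experiment behaves there. The list comprehensions are my substitution for map, which returns an iterator in Python 3, and the variable names are only illustrative.

```python
# Small non-negative integers hash to themselves in CPython.
int_hashes = [hash(n) for n in (0, 1, 2, 3)]
print(int_hashes)  # [0, 1, 2, 3]

# String hashes are salted per process since Python 3.3 (see PYTHONHASHSEED),
# so these numbers change between runs and will not match the session above.
names = ("namea", "nameb", "namec", "named")
str_hashes = [hash(s) for s in names]
print(str_hashes)

# Within one process, equal strings still hash to the same value.
print(hash("namea") == str_hashes[0])  # True
```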
For strings, Python computes the hash with a function like this:
arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash
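I did not fully follow this pseudocode. The gist is that the characters of the string, read through the pointer p starting at the first one, are combined by shifts, XORs and multiplications into a single number. For example, given "a", the resulting hash is 12416037344.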
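As a rough aid to reading the pseudocode, here is a Python 3 port. It is a sketch under assumptions, not CPython's source: the name string_hash comes from the pseudocode, while the bits parameter (emulating the width of a C long), the Latin-1 encoding step and the omission of the cache are choices made for this illustration. The final adjustment from -1 to -2 mirrors CPython reserving -1 as an error code.

```python
def string_hash(s, bits=64):
    """Port of the string_hash pseudocode; `bits` emulates the width of a C long."""
    if not s:
        return 0
    mask = (1 << bits) - 1
    data = s.encode("latin-1")  # assumption: one byte per character

    x = (data[0] << 7) & mask
    for byte in data:           # the first byte is mixed in again, as in the pseudocode
        x = ((1000003 * x) ^ byte) & mask  # masking emulates C integer overflow
    x ^= len(data)

    # Reinterpret the unsigned result as a signed integer.
    if x >= 1 << (bits - 1):
        x -= 1 << bits
    return -2 if x == -1 else x


print(string_hash("a"))               # 12416037344
print(string_hash("namea", bits=32))  # -1658398457
```

With bits=64 the result for "a" matches the value quoted above, and with bits=32 the result for "namea" matches the first value in the interpreter session at the top of the post.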
This hash value is then reduced to an index into the table. The example in the article keeps the lowest three bits:
If the size of the array is 8, the index for ‘a’ will be: hash(‘a’) & 7 = 0. The index for ‘b’ is 3, the index for ‘c’ is 2, the index for ‘z’ is 3 which is the same as ‘b’, here we have a collision.
The example keeps three bits because the keys array is assumed to have 8 slots. The following sentence explains why a table of size 8 is ANDed with 7:
If an array of size x is used to store the key/value pairs then we use a mask equal to x-1 to calculate the slot index of the pair in the array. This makes the computation of the slot index fast.
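The text above says the index for the hash of "a" is 0, which is easy to verify:
12416037344 in binary is 1011100100000011011011000111100000, and ANDing it with 7 does give 0. The resulting index can be viewed as an offset into the table's memory. For example, the index of "a" is 0, so its value, 1, is stored at offset 0, as shown in the figure: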
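The same check can be written as a few lines of Python. This is only a sketch; the variable names are mine, and the hash value is hard-coded from the text rather than computed by the interpreter.

```python
h = 12416037344   # hash value quoted for "a"
size = 8          # table size, a power of two
mask = size - 1   # 0b111, keeps the lowest three bits

index = h & mask
print(bin(h))     # 0b1011100100000011011011000111100000
print(index)      # 0

# For a power-of-two size, masking gives the same result as the slower modulo.
print(index == h % size)  # True
```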
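This makes it clear why dictionaries in this implementation are unordered: where an entry lands depends on the hash of its key, not on the order of insertion. One point is worth noting:
When a value is stored, its key is stored alongside it, and the key acts as a check. Given a key, you cannot compute its index and simply take whatever value sits in that slot, because a collision may have pushed the entry elsewhere. Comparing against the stored key confirms that you have found the right entry.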
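To make the role of the stored key concrete, here is a toy 8-slot table. It is a minimal sketch, not CPython's code: HASHES, probe, insert and lookup are hypothetical names, the hash values are hard-coded from the pseudocode's 64-bit results so the example does not depend on the interpreter's hash(), and collisions are resolved with the recurrence quoted further down in this post. It also assumes the table never fills up, so there is no resizing.

```python
# Hard-coded 64-bit results of the string_hash pseudocode for four keys.
HASHES = {"a": 12416037344, "b": 12544037731, "c": 12672038114, "z": 15616046971}

SIZE = 8
MASK = SIZE - 1
PERTURB_SHIFT = 5


def probe(h):
    """Yield candidate slots for hash h: first h & MASK, then the perturbed sequence."""
    j = h & MASK
    perturb = h
    while True:
        yield j & MASK
        j = 5 * j + 1 + perturb
        perturb >>= PERTURB_SHIFT


def insert(table, key, value):
    for slot in probe(HASHES[key]):
        entry = table[slot]
        if entry is None or entry[0] == key:  # free slot, or same key: overwrite
            table[slot] = (key, value)
            return slot


def lookup(table, key):
    for slot in probe(HASHES[key]):
        entry = table[slot]
        if entry is None:    # reached an empty slot: the key is not present
            raise KeyError(key)
        if entry[0] == key:  # the stored key confirms this is the right entry
            return entry[1]


table = [None] * SIZE
for k, v in [("a", 1), ("b", 2), ("c", 3), ("z", 26)]:
    insert(table, k, v)

print(table)
print(lookup(table, "z"))  # 26
```

"b" and "z" both start at slot 3, so "z" is pushed to another slot, and it is the key comparison in lookup that tells the two entries apart.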
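As for collisions:
To keep lookups O(1) on average, Python does not resolve collisions by chaining. It uses open addressing, and the probe sequence is not plain linear probing: the next slot comes from a recurrence that also mixes in the higher bits of the hash.
For the details, look at these two parts: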
j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;
The result:
Just out of curiosity, let’s look at the probing sequence when the table size is 32 and j = 3.
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…
The author's code for producing this sequence:
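i = 3
mask = 31
perturb = hash('z')  # 15616046971 on 64-bit Python 2; other interpreters give other values
while True:  # runs until interrupted
    i = (i << 2) + i + perturb + 1  # same as 5*i + perturb + 1
    slot = i & mask
    print(slot)
    perturb >>= 5  # PERTURB_SHIFT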
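The loop above never terminates and depends on the interpreter's own hash('z'), so here is a bounded variant that can be run on any Python 3. It is a sketch: probe_sequence is a hypothetical helper name, and the perturb value 15616046971 is hard-coded as the 64-bit result of the string hash pseudocode for "z".

```python
from itertools import islice

PERTURB_SHIFT = 5


def probe_sequence(start, perturb, mask):
    """Yield successive slots using the same recurrence as the loop above."""
    i = start
    while True:
        i = (i << 2) + i + perturb + 1
        yield i & mask
        perturb >>= PERTURB_SHIFT


# Table size 32 (mask 31), starting slot 3, perturb = 64-bit hash of "z".
slots = list(islice(probe_sequence(3, 15616046971, 31), 10))
print(slots)  # [11, 19, 29, 5, 6, 16, 31, 28, 13, 2]
```

The output reproduces the sequence quoted from the article after the starting slot 3. Once perturb has been shifted down to zero, the recurrence reduces to i = 5*i + 1.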