Python是如何实现dictionary

最新推荐文章于 2024-03-15 15:00:00 发布

huntzw

最新推荐文章于 2024-03-15 15:00:00 发布

阅读量4.3k

点赞数

分类专栏： Python 文章标签： dictionary python object string function equals

Python 专栏收录该内容

7 篇文章 0 订阅

订阅专栏

原文地址： http://www.laurentluce.com/posts/python-dictionary-implementation/

本文描述了Python是如何实现dictionary。dictionary是由主键key索引，可以被看作是关联数组，类似于STL的map。有如下的基本操作。

`1`	`>>> d = {'a': 1,` `'b': 2}`
`2`	`>>> d['c'] = 3`

`3`	`>>> d`
`4`	`{'a': 1,` `'b': 2,` `'c': 3}`

元素的访问方式如下：

`01`	`>>> d['a']`
`02`	`1`

`03`	`>>> d['b']`
`04`	`2`

`05`	`>>> d['c']`
`06`	`3`

`07`	`>>> d['d']`
`08`	`Traceback (most recent call last):`

`09`	`File` `"<stdin>", line 1,` `in` `<module>`
`10`	`KeyError:` `'d'`

如果key不存在，将会返回KeyError的异常。

Hash tables

Python中dictionary是以hash表实现，而hash表以数组表示，数组的索引作为主键key。Hash函数的目标是key均匀的分布于数组中，良好的hash函数能够最小化减少冲突的次数。

在本文中，我们采用string作为key为例子。

`01`	`arguments: string object`
`02`	`returns: hash`

`03`	`function string_hash:`
`04`	`if hash cached:`

`05`	`return it`
`06`	`set len to string's length`

`07`	`initialize var p pointing to 1st char of string object`
`08`	`set x to value pointed by p left shifted by 7 bits`

`09`	`while len >= 0:`
`10`	`set var x to (1000003 * x) xor value pointed by p`

`11`	`increment pointer p`
`12`	`set x to x xor length of string object`

`13`	`cache x as the hash so we don't need to calculate it again`
`14`	`return x as the hash`

如果你执行hash(‘a’)，函数将会执行string_hash()返回12416037344，在此我们假设采用的是64位系统。

假设数组的大小是x，该数组用来存储key/value，我们用掩码x-1计算这个数组的索引。例如，如果数组的大小是8，‘a’的索引是hash(‘a’)&7=0， ‘b’的索引是3， ‘c’的索引是2。’z’的索引是3与’b’的索引一致，由此导致冲突。

我们看到Python中的hash函数在key是连续的时候表现很好的性能，原因是它对于处理这种类型的数据有通用性。但是，一旦我们加入了’z’，由于数据的不连贯性产生了冲突。

当产生冲突的时候，我们可以采用链表来存储这对key/value，但是这样增加了查找时间，不再是O(1)的复杂度。在下一节中，我们将要具体描述一下Python中dictionary解决这种冲突的方法。

Open addressing

散列地址方法是解决hash冲突的探测策略。对于’z’，由于索引3已经被占用，所以我们需要寻找一个没有使用的索引，这样添加key/value的时候肯能花费更多的时间，但是查找时间依然是O(1)，这是我们所期望的结果。

这里采用quadratic探测序列来寻找可用的索引，代码如下：

`1`	`i is the current slot index`
`2`	`set perturb to hash`

`3`	`forever loop:`
`4`	`set i to i << 2 + i + perturb + 1`

`5`	`set slot index to i & mask`
`6`	`if slot is free:`

`7`	`return it`
`8`	`right shift perturb by 5 bits`

让我们来看看它是如何工作的，令i=3
3 -> 3 -> 5 -> 5 -> 6 -> 0…

索引5将被用作’z’的索引值，这不是一个很好的例子，因为我们采用了一个大小是8的数组。对于大数组，算法显示它的优势。

出于好奇，我们来看看当数组大小是32，mask是31时，探测序列为
3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…

如果你想了解更多，可以察看dictobject.c的源代码。

接下来，让我们看看Python内部的代码是如何实现的？

Dictionary C structures

如下的C语言结构体用来存储一个dictionary条目：key/value对，结构体内有Hash值，key值和value值。PyObject是Python对象的基类。

`1`	`typedef` `struct` `{`
`2`	`Py_ssize_t me_hash;`

`3`	`PyObject *me_key;`
`4`	`PyObject *me_value;`

`5`	`} PyDictEntry;`

如下的结构体代表了一个dictionary。ma_fill是已使用slot与未使用slot之和。当一对key/value被删除了，该slot被标记为未使用。ma_used是已使用slot的数目。ma_mask等于数组大小减一，用来计算slot的索引。ma_table是该数组，ma_smalltable是初始数组大小为8.

`01`	`typedef` `struct` `_dictobject PyDictObject;`
`02`	`struct` `_dictobject {`

`03`	`PyObject_HEAD`
`04`	`Py_ssize_t ma_fill;`

`05`	`Py_ssize_t ma_used;`
`06`	`Py_ssize_t ma_mask;`

`07`	`PyDictEntry *ma_table;`
`08`	`PyDictEntry (ma_lookup)(PyDictObject mp, PyObject key,` `long` `hash);`

`09`	`PyDictEntry ma_smalltable[PyDict_MINSIZE];`
`10`	`};`

Dictionary initialization

当我们第一次创建一个dictionary的时候，将会调用PyDict_New()。我们以伪码表示该函数的主要功能。

`1`	`returns new dictionary object`
`2`	`function PyDict_New:`

`3`	`allocate new dictionary object`
`4`	`clear dictionary's table`

`5`	`set dictionary's number of used slots + dummy slots (ma_fill) to 0`
`6`	`set dictionary's number of active slots (ma_used) to 0`

`7`	`set dictionary's mask (ma_value) to dictionary size - 1 = 7`
`8`	`set dictionary's lookup function to lookdict_string`

`9`	`return allocated dictionary object`

Adding items

当增加一对新的key/value时，将调用PyDict_SetItem()。该函数的参数有dictionary对象的指针和key/value对。它检测key是否为字符串以及计算hash值或者重复使用已经存在的index。Insertdict()用来增加新的key/value，在dictionary的大小被重新调整之后，而且已使用slot的数量超过数组大小的2/3.

为什么是2/3？这样能够保证probing序列能够快速地找到空闲的slot。稍后我们将看看调整数组大小的函数。

`01`	`arguments: dictionary, key, value`
`02`	`returns: 0 if OK or -1`

`03`	`function PyDict_SetItem:`
`04`	`set mp to point to dictionary object`

`05`	`if key's hash cached:`
`06`	`use hash`

`07`	`else:`
`08`	`calculate hash`

`09`	`set n_used to dictionary's number of active slots (ma_used)`
`10`	`call insertdict with dictionary object, key, hash and value`

`11`	`if key/value pair added successfully and capacity over 2/3:`
`12`	`call dictresize to resize dictionary's table`

Insertdict()使用查找函数来找到一个未使用的slot。接下来我们来看看这个函数，lookdict_string()利用hash值和掩码来计算slot的索引。如果它找不到slot索引的key等于hash&mask，它会使用扰码来探测。

`01`	`arguments: dictionary object, key, hash`
`02`	`returns: dictionary entry`

`03`	`function lookdict_string:`
`04`	`calculate slot index based on hash and mask`

`05`	`if slot's key matches or slot's key is not set:`
`06`	`returns slot's entry`

`07`	`if slot's key marked as dummy (was active):`
`08`	`set freeslot to this slot's entry`

`09`	`else:`
`10`	`if slot's hash equals to hash and slot's key equals to key:`

`11`	`return slot's entry`
`12`	`set var freeslot to null`

`13`	`we are here because we couldn't find the key so we start probing`
`14`	`set perturb to hash`

`15`	`forever loop:`
`16`	`set i to i << 2 + i + perturb + 1`

`17`	`calculate slot index based on i and mask`
`18`	`if slot's key is null:`

`19`	`if freeslot is null:`
`20`	`return slot's entry`

`21`	`else:`
`22`	`return freeslot`

`23`	`if slot's key equals to key or slot's hash equals to hash`
`24`	`and slot is not marked as dummy:`

`25`	`return slot's entry`
`26`	`if slot marked as dummy and freeslot is null:`

`27`	`set freeslot to slot's entry`
`28`	`right shift perturb by 5 bits`

我们在dictionary中增加如下的key/value：{‘a’: 1, ‘b’: 2′, ‘z’: 26, ‘y’: 25, ‘c’: 5, ‘x’: 24}。流程如下：

A dictionary structure is allocated with internal table size of 8.

PyDict_SetItem: key = ‘a’, value = 1
- hash = hash(‘a’) = 12416037344
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12416037344 & 7 = 0
    - slot 0 is not used so return it
  - init entry at index 0 with key, value and hash
  - ma_used = 1, ma_fill = 1
PyDict_SetItem: key = ‘b’, value = 2
- hash = hash(‘b’) = 12544037731
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12544037731 & 7 = 3
    - slot 3 is not used so return it
  - init entry at index 3 with key, value and hash
  - ma_used = 2, ma_fill = 2
PyDict_SetItem: key = ‘z’, value = 26
- hash = hash(‘z’) = 15616046971
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15616046971 & 7 = 3
    - slot 3 is used so probe for a different slot: 5 is free
  - init entry at index 5 with key, value and hash
  - ma_used = 3, ma_fill = 3
PyDict_SetItem: key = ‘y’, value = 25
- hash = hash(‘y’) = 15488046584
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15488046584 & 7 = 0
    - slot 0 is used so probe for a different slot: 1 is free
  - init entry at index 1 with key, value and hash
  - ma_used = 4, ma_fill = 4
PyDict_SetItem: key = ‘c’, value = 3
- hash = hash(‘c’) = 12672038114
- insertdict
  - lookdict_string
    - slot index = hash & mask = 12672038114 & 7 = 2
    - slot 2 is free so return it
  - init entry at index 2 with key, value and hash
  - ma_used = 5, ma_fill = 5
PyDict_SetItem: key = ‘x’, value = 24
- hash = hash(‘x’) = 15360046201
- insertdict
  - lookdict_string
    - slot index = hash & mask = 15360046201 & 7 = 1
    - slot 1 is used so probe for a different slot: 7 is free
  - init entry at index 7 with key, value and hash
  - ma_used = 6, ma_fill = 6

This is what we have so far:

由于8个slot中的6已经被使用了即超过了数组容量的2/3，dictresize()被调用来分配更大的内存，该函数同时拷贝旧数据表的数据进入新数据表。

在这个例子中，dictresize()调用时，minused是24也就是4×ma_used。当已使用slot的数目很大的时候(超过50000)minused是2×ma_used。为什么是4倍？这样能够减少调整步骤的数目而且增加稀疏度。

v新数据表的大少应该大于24，它是通过把当前数据表的大小左移一位实现的，直到她大于24.例如 8 -> 16 -> 32.

`01`	`arguments: dictionary object, (2 or 4) * active slots`
`02`	`returns: 0 if OK, -1 otherwise`

`03`	`function dictresize:`
`04`	`calculate new dictionary size:`

`05`	`set var newsize to dictionary size`
`06`	`while newsize less or equal than (2 or 4) * active slots:`

`07`	`set newsize to newsize left shifted by 1 bit`
`08`	`set oldtable to dictionary's table`

`09`	`allocate new dictionary table`
`10`	`set dictionary's mask to newsize - 1`

`11`	`clear dictionary's table`
`12`	`set dictionary's active slots (ma_used) to 0`

`13`	`set var i to dictionary's active + dummy slots (ma_fill)`
`14`	`set dictionary's active + dummy slots (ma_fill) to 0`

`15`	`copy oldtable entries to dictionary's table using new mask`
`16`	`return 0`

17

}

当数据表大小进行调整的时候所发生的：分配了一个大小是32的新数据表

Removing items

函数PyDict_DelItem()用来删除一个条目。计算key的hash值，然后调用查找函数返回这个条目。该条目的key设置为哑元。这个哑元包含了过去使用过的key值但现在还是未使用的状态。

`01`	`arguments: dictionary object, key`
`02`	`returns 0 if OK, -1 otherwise`

`03`	`function PyDict_DelItem:`
`04`	`if key's hash cached:`

`05`	`use hash`
`06`	`else:`

`07`	`calculate hash`
`08`	`look for key in dictionary using hash`

`09`	`if slot not found:`
`10`	`return -1`

`11`	`set slot's key to dummy`
`12`	`set slot's value to null`

`13`	`decrement dictionary active slots`
`14`	`return 0`

如果我要从dictionary中删除key ’c’，将以如下的数组结束。

注意：删除操作并没有改变数组的大小，即使已使用slot的数目远小于slot的总数目。

Thanks

huntzw

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录