Python Source Code Analysis: PyDictObject

Latest recommended article published on 2024-06-17 17:35:06

choumianxian2107

Tags: python, runtime, data structures and algorithms

Original link: https://my.oschina.net/mopidick/blog/751210


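CPython is the most widely used Python implementation, so this article looks at how it implements the dictionary in C. The structures quoted below match the CPython 2.x sources.

### Data structures

#### 1. PyDictObject

PyDictObject is the C object behind a Python dict. It is essentially the combination of the basic parts of a hash table, of which there are three:

- A table (which can be thought of as an array)
- A hash function
- The items stored in the table: entries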

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;  /* # Active + # Dummy */
    Py_ssize_t ma_used;  /* # Active */

    /* The table contains ma_mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t ma_mask;

    /* ma_table points to ma_smalltable for small tables, else to
     * additional malloc'ed memory.  ma_table is never NULL!  This rule
     * saves repeated runtime null-tests in the workhorse getitem and
     * setitem calls.
     */
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

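PyDictObject starts with PyObject_HEAD, as every Python object does. PyObject_HEAD holds a reference count (ob_refcnt) and a pointer to the object's type description (ob_type); in debug builds compiled with Py_TRACE_REFS it also holds the two pointers of a doubly linked list of all live objects. Its main purposes are reference-counted memory management and identifying the object's type.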
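ma_table and ma_smalltable together play the role of the hash table's table. Why are there two? The interpreter itself uses PyDictObject heavily, and most dicts hold only a few elements, so for convenience every dict object is created with room for PyDict_MINSIZE entries embedded in it. Once the number of elements in the table passes a threshold, the table is resized automatically. So ma_table initially points to ma_smalltable, and as the number of entries grows it is redirected to a larger, separately allocated table.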
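ma_mask (a Py_ssize_t) is used to turn a hash value into a table index; its value is the table length minus one. Understanding this field is essential to understanding how Python maps a key to a slot. The mapping itself is very simple: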

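  ma_mask = len(table) - 1  # len(table) must be a power of two (2**N), so ma_mask is 2**N - 1: its low N bits are all 1
  index = key & ma_mask     # same as index = key % len(table); index is the slot in the table. Here key is the key's hash value; where it comes from is the crucial part, covered later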
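The equivalence between masking and taking the remainder can be checked directly. The following is a minimal sketch, not CPython code; `slot_index` is a hypothetical helper name introduced only for this illustration.

```python
def slot_index(hash_value, table_size):
    """Return the initial slot for a hash in a table whose size is a power of two."""
    if table_size <= 0 or (table_size & (table_size - 1)) != 0:
        raise ValueError("table_size must be a power of two")
    mask = table_size - 1          # plays the role of ma_mask
    return hash_value & mask


# For power-of-two sizes, masking gives the same result as Python's % operator,
# including for negative hash values.
for h in (0, 5, 8, 13, 12345, -7):
    assert slot_index(h, 8) == h % 8

print(slot_index(13, 8))   # 13 is 0b1101, the low three bits are 0b101
```

The mask trick only works because the table length is a power of two; for any other length the low bits of `len(table) - 1` are not all ones and some slots would never be selected.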

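ma_lookup is the function that finds the entry for a given key. If the index computation is this simple, why is a dedicated lookup function needed? Because an entry in the table is not a bare number or string but a PyDictEntry, which goes through its own life cycle, and because different keys can land on the same slot. A lookup therefore has to inspect entries and compare keys, which is a little more involved than computing an index.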
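ma_fill and ma_used: as mentioned above, a PyDictEntry has a life cycle with three states: Unused, Active and Dummy. ma_fill is the number of slots that have been occupied (Active + Dummy), and ma_used is the number of slots currently holding a key/value pair (Active). A Dummy slot is one whose entry was inserted and later deleted.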

#code: python
d = {'name': 'wxg', 'age': 23, 'sex': 'male'}   # unused=5 (the table has PyDict_MINSIZE=8 slots by default), active=3, dummy=0
del d['sex'] # unused=5, active=2, dummy=1

#### 2. PyDictEntry

PyDictEntry is the type of each item stored in the table.

typedef struct {
    /* Cached hash code of me_key.  Note that hash codes are C longs.
     * We have to use Py_ssize_t instead because dict_popitem() abuses
     * me_hash to hold a search finger.
     */
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

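me_hash caches the hash of the key, me_key is the key object (it can be of any hashable type, since everything in Python is an object and every object is a PyObject), and me_value is the stored value.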
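The three entry states and the two counters can be modelled in a few lines. This is a sketch under assumptions: `Entry`, `DUMMY`, `entry_state` and `count_fill_and_used` are hypothetical names, `None` stands in for C's NULL, and the slots chosen below are arbitrary rather than the ones a real dict would pick.

```python
# None stands in for C's NULL here; a real dict can of course use None as a key.
DUMMY = object()  # hypothetical stand-in for the sentinel key that marks a deleted entry


class Entry:
    """Toy counterpart of PyDictEntry."""

    def __init__(self, me_hash=0, me_key=None, me_value=None):
        self.me_hash = me_hash
        self.me_key = me_key
        self.me_value = me_value


def entry_state(entry):
    """Classify an entry by looking only at its key."""
    if entry.me_key is None:
        return "unused"
    if entry.me_key is DUMMY:
        return "dummy"
    return "active"


def count_fill_and_used(table):
    """Return (fill, used): fill = active + dummy, used = active."""
    states = [entry_state(e) for e in table]
    used = states.count("active")
    fill = used + states.count("dummy")
    return fill, used


# Eight slots, three of them filled (slot positions are arbitrary in this sketch).
table = [Entry() for _ in range(8)]
table[0] = Entry(hash("name"), "name", "wxg")
table[3] = Entry(hash("age"), "age", 23)
table[5] = Entry(hash("sex"), "sex", "male")
print(count_fill_and_used(table))   # fill and used are both 3

# Deleting 'sex' turns its slot into a dummy: used drops, fill does not.
table[5] = Entry(table[5].me_hash, DUMMY, None)
print(count_fill_and_used(table))   # fill stays 3, used becomes 2
```

Keeping the deleted slot as a dummy instead of resetting it to unused matters for lookups: a probe sequence that passed through this slot to reach another key must not stop early.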

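### Hash function analysis

To understand a hash table implementation, the most important things to understand are its hash function and how it resolves collisions.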

#### 1. Implementation of the hash function

The index computation was introduced above:

  ma_mask = len(table) - 1  # len(table) must be a power of two, so ma_mask is 2**N - 1
  index = key & ma_mask     # where key comes from is the crucial part, covered next
  # Suppose d = {'name': 'wxg'}
  key = get_key('name')     # how get_key works is described below

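PyDictObject's own index computation can be this simple because key is already a hash value: get_key stands for obtaining the hash of an object (a string, an integer, or something more complex). In the CPython source the corresponding function is: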

long
PyObject_Hash(PyObject *v)
{
   PyTypeObject *tp = v->ob_type;
   /* 1. Look up the object's type and, if it defines tp_hash, call it to get the hash */
   if (tp->tp_hash != NULL)
       return (*tp->tp_hash)(v);
   /* To keep to the general practice that inheriting
    * solely from object in C code should work without
    * an explicit call to PyType_Ready, we implicitly call
    * PyType_Ready here and then check the tp_hash slot again
    */
   if (tp->tp_dict == NULL) {
       if (PyType_Ready(tp) < 0)
           return -1;
       if (tp->tp_hash != NULL)
           return (*tp->tp_hash)(v);
   }
   /* 2. If the type defines neither tp_hash nor any comparison, use the object's address as the hash */
   if (tp->tp_compare == NULL && RICHCOMPARE(tp) == NULL) {
       return _Py_HashPointer(v); /* Use address as hash value */
   }
   /* If there's a cmp but no hash defined, the object can't be hashed */
   return PyObject_HashNotImplemented(v);
}
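The same dispatch can be observed from Python code. The sketch below runs on Python 3, where the rules differ in detail from the 2.x C function above (for instance, a class that defines `__eq__` without `__hash__` becomes unhashable), so it is an analogy rather than a trace of `PyObject_Hash`. The class names and `is_hashable` are hypothetical.

```python
class WithHash:
    """Defines its own hash, like a C type that fills in tp_hash."""

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return self.n

    def __eq__(self, other):
        return isinstance(other, WithHash) and self.n == other.n


class Plain:
    """Defines neither hash nor comparison, so the identity-based default applies."""


class EqOnly:
    """Defines comparison but no hash, so Python 3 refuses to hash it."""

    def __eq__(self, other):
        return True


def is_hashable(obj):
    """Return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


assert hash(WithHash(42)) == 42      # the type's own hash function is used
p = Plain()
assert hash(p) == hash(p)            # default hash is stable for the object's lifetime
assert is_hashable(p)
assert not is_hashable(EqOnly())     # comparison without hash: cannot be hashed
```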

###### Example: how the hash of a string object is obtained

To see how a string object's hash is obtained, start with the definition of the string object:

typedef struct {
   PyObject_VAR_HEAD
   long ob_shash;
   int ob_sstate;
   char ob_sval[1];

   /* Invariants:
    *     ob_sval contains space for 'ob_size+1' elements.
    *     ob_sval[ob_size] == 0.
    *     ob_shash is the hash of the string or -1 if not computed yet.
    *     ob_sstate != 0 iff the string object is in stringobject.c's
    *       'interned' dictionary; in this case the two references
    *       from 'interned' to this object are *not counted* in ob_refcnt.
    */
} PyStringObject;

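Every string object has an ob_shash field that caches the string's hash; as the comment in the struct says, it is -1 until the hash has been computed. The value is produced by the string type's tp_hash function; see static long string_hash() in Objects/stringobject.c.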
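The caching behaviour of ob_shash can be sketched as follows. The hashing loop is my reading of the classic CPython 2.x string hash (multiplier 1000003, no hash randomization, a 64-bit C long assumed) and should be treated as an assumption, not a quotation of `string_hash()`; `ToyString`, `string_hash_sketch` and `_to_signed64` are hypothetical names.

```python
def _to_signed64(x):
    """Wrap a Python int to a signed 64-bit value, imitating C long overflow."""
    x &= 0xFFFFFFFFFFFFFFFF
    return x - (1 << 64) if x >= (1 << 63) else x


def string_hash_sketch(data):
    """Hash a bytes object in the style of the classic 2.x string hash (assumed)."""
    if not data:
        return 0
    x = data[0] << 7
    for byte in data:
        x = _to_signed64((1000003 * x) ^ byte)
    x ^= len(data)
    # -1 is reserved to mean "not computed yet" (and to signal errors), so it is never returned.
    return -2 if x == -1 else x


class ToyString:
    """Toy counterpart of PyStringObject, keeping only the fields relevant here."""

    def __init__(self, data):
        self.ob_sval = data
        self.ob_shash = -1        # -1 means the hash has not been computed yet
        self.computations = 0     # instrumentation for this sketch only

    def hash(self):
        if self.ob_shash != -1:
            return self.ob_shash  # cached: no need to walk the string again
        self.computations += 1
        self.ob_shash = string_hash_sketch(self.ob_sval)
        return self.ob_shash


s = ToyString(b"name")
first = s.hash()
second = s.hash()
assert first == second
assert s.computations == 1        # the second call was served from the cache
```

Because strings are immutable, the cached value can never go stale, which is what makes repeated dict lookups with the same string key cheap.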

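To sum up: before an object is placed in the table, its hash value is obtained. If the object's type implements tp_hash, that function is called; if the type defines neither a hash nor a comparison, the object's memory address is used as the hash. The hash is then ANDed with ma_mask (equivalent to taking it modulo the table length) to get the slot in the table where the object is stored.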

#### 2. Collision resolution

The main techniques for resolving hash collisions are:

- Open addressing
- Rehashing
- Separate chaining, and so on

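Python's dict resolves a collision by rehashing the index to find another slot in the same table (a form of open addressing), using the following recurrence: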

 j = (5*j) + 1 + perturb;
 perturb >>= PERTURB_SHIFT;    /* PERTURB_SHIFT defaults to 5 */
 use j % 2**i as the next table index;

Here perturb starts out as the object's hash value, j starts out as the initial index, and 2**i is the table length.
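The recurrence can be turned into a small generator of probe positions. This is a sketch that follows the recurrence exactly as written above; `probe_sequence` is a hypothetical name, and treating the hash as an unsigned 64-bit value is an assumption about the platform.

```python
PERTURB_SHIFT = 5


def probe_sequence(hash_value, table_size, count):
    """Return the first `count` slots probed for a hash (table_size a power of two)."""
    mask = table_size - 1
    perturb = hash_value & 0xFFFFFFFFFFFFFFFF   # assumed: perturb is unsigned, 64 bits wide
    j = hash_value & mask                       # initial index
    slots = [j]
    while len(slots) < count:
        # Only the low bits of j matter, so masking at every step is safe.
        j = (5 * j + 1 + perturb) & mask
        perturb >>= PERTURB_SHIFT
        slots.append(j)
    return slots


# With perturb == 0 the recurrence j = 5*j + 1 visits every slot of the table once.
print(probe_sequence(0, 8, 8))    # [0, 1, 6, 7, 4, 5, 2, 3]

# 32 also starts at slot 0, but its higher bits steer it onto a different path.
print(probe_sequence(32, 8, 4))   # [0, 1, 7, 4]
```

The perturb term is what lets hashes that share their low bits diverge after the first collision; once perturb has been shifted down to zero, the remaining `5*j + 1` steps still reach every slot, so a free slot is eventually found if one exists.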

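#### 3. Resizing the table

When does the table need to be resized? The performance of a hash table depends mainly on its load factor.

The load factor of a hash table is defined as: α = number of elements stored in the table / length of the table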

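In Python's dict implementation, the table is resized once the load factor reaches 2/3. Resizing allocates a new table of a newly computed size and then reinserts every entry from the old table into it, recomputing each entry's slot.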

    /* If fill >= 2/3 size, adjust size.  Normally, this doubles or
     * quadruples the size, but it's also possible for the dict to shrink
     * (if ma_fill is much larger than ma_used, meaning a lot of dict
     * keys have been deleted).
     *
     * Quadrupling the size improves average dictionary sparseness
     * (reducing collisions) at the cost of some memory and iteration
     * speed (which loops over every possible entry).  It also halves
     * the number of expensive resize operations in a growing dictionary.
     *
     * Very large dictionaries (over 50K items) use doubling instead.
     * This may help applications with severe memory constraints.
     */

Note that the fill mentioned above is ma_fill (ma_fill = Active + Dummy). In other words, the load factor takes deleted entries into account: an entry that has been deleted still counts.
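The resize rule described in the comment can be sketched as two small functions. The trigger follows the "fill >= 2/3 size" condition quoted above; the size computation (request 4x the active count, or 2x above 50000 items, then round up to a power of two starting from 8) is my reading of the 2.x source and should be taken as an assumption. `needs_resize` and `new_table_size` are hypothetical names.

```python
PYDICT_MINSIZE = 8


def needs_resize(fill, table_size):
    """True once fill / table_size reaches 2/3 (integer arithmetic avoids rounding)."""
    return fill * 3 >= table_size * 2


def new_table_size(used):
    """Smallest power of two, at least PYDICT_MINSIZE, strictly greater than the request."""
    min_used = (2 if used > 50000 else 4) * used
    size = PYDICT_MINSIZE
    while size <= min_used:
        size <<= 1
    return size


# An 8-slot table tolerates 5 occupied slots; the 6th triggers a resize.
assert not needs_resize(5, 8)
assert needs_resize(6, 8)

# Six live entries: the request is 24 slots, rounded up to 32.
print(new_table_size(6))

# Six occupied slots of which only one is live (five dummies): the "resize"
# rebuilds a table of the minimum size, which simply purges the dummies.
print(new_table_size(1))
```

The trigger looks at fill while the new size is derived from the number of live entries, which is how a table full of dummies can end up the same size or smaller after a resize.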

### Creating a PyDictObject, insertion and deletion

This part is fairly simple and can be read directly from the source; it will be analyzed later.
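Until then, the pieces described in this article can be combined into a toy dictionary. This is a simplified sketch, not a port of dictobject.c: `ToyDict` and `_DUMMY` are hypothetical names, entries are plain lists of `[hash, key, value]`, and the resize policy is the assumed one (4x the live count, 2x above 50000).

```python
PERTURB_SHIFT = 5
_DUMMY = object()   # hypothetical stand-in for the key that marks a deleted entry


class ToyDict:
    def __init__(self, size=8):
        self.table = [None] * size   # None = unused, else [hash, key, value]
        self.fill = 0                # active + dummy
        self.used = 0                # active

    def _lookup(self, key, h):
        """Return the slot holding key, or the slot where key should be inserted."""
        mask = len(self.table) - 1
        perturb = h & 0xFFFFFFFFFFFFFFFF
        i = h & mask
        free = None                  # first dummy slot seen, reusable on insert
        while True:
            entry = self.table[i]
            if entry is None:
                return i if free is None else free
            if entry[1] is _DUMMY:
                if free is None:
                    free = i
            elif entry[0] == h and entry[1] == key:
                return i
            i = (5 * i + 1 + perturb) & mask
            perturb >>= PERTURB_SHIFT

    def __setitem__(self, key, value):
        h = hash(key)
        i = self._lookup(key, h)
        entry = self.table[i]
        if entry is None:            # unused -> active
            self.fill += 1
            self.used += 1
        elif entry[1] is _DUMMY:     # dummy -> active, fill is unchanged
            self.used += 1
        else:                        # key already present: overwrite the value
            entry[2] = value
            return
        self.table[i] = [h, key, value]
        if self.fill * 3 >= len(self.table) * 2:
            self._resize()

    def __getitem__(self, key):
        entry = self.table[self._lookup(key, hash(key))]
        if entry is None or entry[1] is _DUMMY:
            raise KeyError(key)
        return entry[2]

    def __delitem__(self, key):
        i = self._lookup(key, hash(key))
        entry = self.table[i]
        if entry is None or entry[1] is _DUMMY:
            raise KeyError(key)
        self.table[i] = [entry[0], _DUMMY, None]   # active -> dummy
        self.used -= 1                             # fill stays the same

    def _resize(self):
        live = [e for e in self.table if e is not None and e[1] is not _DUMMY]
        min_used = (2 if self.used > 50000 else 4) * self.used
        size = 8
        while size <= min_used:
            size <<= 1
        self.table = [None] * size
        self.fill = self.used = 0
        for _, key, value in live:   # reinsert, recomputing every slot
            self[key] = value


d = ToyDict()
d[1] = "a"
d[9] = "b"            # 9 & 7 == 1, so this collides with key 1 and is probed elsewhere
del d[1]              # slot 1 becomes a dummy; the probe path to key 9 stays intact
assert d[9] == "b"
assert (d.fill, d.used) == (2, 1)
```

Small non-negative integers hash to themselves in Python, which is why keys 1 and 9 are a convenient way to force a collision in an 8-slot table.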

Reposted from: https://my.oschina.net/mopidick/blog/751210
