Python Implementation Details: To Truly Understand Something, Try It Yourself

云中君不见

Last modified: 2022-11-13 17:49:19

Tags: python

First published: 2022-11-12 20:08:45

Article link: https://blog.csdn.net/cendrier/article/details/127823135

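The official document Design and History FAQ covers some of Python's design details and the reasons behind them.

I picked a few interesting questions and answers that should help deepen our understanding of Python.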

How does Python manage memory?

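CPython, the reference implementation of Python (it compiles source code to bytecode and then interprets it), performs garbage collection with reference counting plus a second mechanism that can detect reference cycles. Reference counting, put simply, means that once the number of references to an object drops to 0, the object can be reclaimed.
We can read an object's reference count with the sys.getrefcount() function. The value is one higher than you might expect, because the call itself holds a temporary reference to its argument. Someone ran an interesting experiment on exactly this: sys.getrefcount() prints one more than the expected number of references to an object?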
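The reference-counting behaviour described above can be observed directly. The following is a minimal sketch for CPython; the class name Thing and the list holders are hypothetical names introduced only for this illustration. It compares counts relative to a baseline rather than printing absolute numbers, since those can vary between interpreter versions.

```python
import sys


class Thing:
    """Hypothetical placeholder class, used only for this illustration."""


obj = Thing()
holders = []

# The reported count already includes the temporary reference
# created by passing obj to sys.getrefcount().
base = sys.getrefcount(obj)

holders.append(obj)                 # the list now holds one more reference
with_extra = sys.getrefcount(obj)

holders.clear()                     # drop that reference again
after_clear = sys.getrefcount(obj)

print(base, with_extra, after_clear)
```

Only the differences matter here: storing the object in a container raises the count by one, and removing it lowers the count by one.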

A reference cycle is when several objects refer to one another. The simplest example is:

a = []
a.append(a)

The standard implementation of Python, CPython, uses reference counting to detect inaccessible objects, and another mechanism to collect reference cycles, periodically executing a cycle detection algorithm which looks for inaccessible cycles and deletes the objects involved.
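To see why reference counting alone is not enough, here is a small sketch for CPython. The Node class and the probe variable are hypothetical names used only here. Automatic collection is switched off temporarily so that the only thing reclaiming the cycle is the explicit gc.collect() call.

```python
import gc
import weakref


class Node:
    """Hypothetical class whose instances can point at each other."""

    def __init__(self):
        self.partner = None


gc_was_enabled = gc.isenabled()
gc.disable()  # keep automatic cycle collection out of the way

a = Node()
b = Node()
a.partner = b
b.partner = a             # a and b now form a reference cycle

probe = weakref.ref(a)    # a weak reference does not keep a alive
del a, b                  # the names are gone, but the cycle remains

# Each object is still referenced by its partner, so the
# reference counts never reach zero on their own.
alive_before = probe() is not None

gc.collect()              # run the cycle detector explicitly
alive_after = probe() is not None

if gc_was_enabled:
    gc.enable()

print(alive_before, alive_after)
```

Under CPython the objects survive the del and only disappear after the cycle detector has run.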

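Note that the garbage collection mechanism above is specific to CPython. Other implementations, such as Jython and PyPy, may use different mechanisms. This difference can cause trouble: the same Python code may behave differently under different implementations. The official documentation gives an example of this.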

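In that example, a loop walks over a long list of files, opening each one and reading from it. Under CPython's reference counting, each new assignment to f closes and reclaims the previous file object. Other garbage collection mechanisms may not do this. To avoid surprises, the documentation recommends opening files with a with statement, which closes the file for us automatically.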
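A portable version of that pattern might look like the sketch below. The function name first_chars and the temporary files are made up for this illustration; the point is only that the with statement closes each file as soon as its block ends, whichever garbage collector is in use.

```python
import os
import tempfile


def first_chars(paths):
    """Return the first character of every file in paths."""
    result = []
    for path in paths:
        # The file is closed when the with block exits,
        # independent of how the interpreter collects garbage.
        with open(path) as f:
            result.append(f.read(1))
    return result


# Build a few throwaway files so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, text in enumerate(["alpha", "beta", "gamma"]):
        path = os.path.join(tmp, f"file{i}.txt")
        with open(path, "w") as out:
            out.write(text)
        paths.append(path)
    chars = first_chars(paths)

print(chars)  # ['a', 'b', 'g']
```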

The block, pool, arena memory management mechanism

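When an object is created and memory is requested from Python, a request larger than 512 bytes goes straight to C's malloc function, and exactly the requested amount is allocated.
A request of 512 bytes or less is instead served from memory that Python has obtained from the system in advance. This pre-allocated memory is organized in three levels: arena, pool, and block. An arena is a large chunk of memory obtained from the system and divided into pools; every pool has the same size, typically 4 KB, and is divided into blocks of a single size. When memory is taken from here, Python rounds the request up to an aligned size: a request of 1 to 8 bytes gets 8 bytes, and a request of 9 to 16 bytes gets 16 bytes. Note in particular that when these small objects are freed, the memory is only marked as free for reuse by Python and is generally not returned to the operating system right away.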
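The rounding rule can be modelled in a few lines. This is a pure-Python illustration of the arithmetic described above, not CPython's actual allocator code; the names rounded_block_size, SMALL_REQUEST_THRESHOLD and ALIGNMENT are chosen for this sketch, and the 8-byte alignment follows the text (the exact value may differ between CPython versions and platforms).

```python
SMALL_REQUEST_THRESHOLD = 512  # largest request served from pools (per the text)
ALIGNMENT = 8                  # assumed alignment, as described above


def rounded_block_size(nbytes):
    """Model which block size a request of nbytes would receive.

    Returns None for requests above the threshold, which in this
    model are passed to the system allocator instead of a pool.
    """
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    if nbytes > SMALL_REQUEST_THRESHOLD:
        return None
    # Round up to the next multiple of ALIGNMENT.
    return (nbytes + ALIGNMENT - 1) // ALIGNMENT * ALIGNMENT


for request in (1, 8, 9, 16, 17, 512, 513):
    print(request, rounded_block_size(request))
```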

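The material above draws on this video, which explains it more thoroughly and clearly: 【python】内存管理结构初探 (a first look at Python's memory management structure).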

How does CPython implement list?

list is one of the most commonly used data structures in Python and is very convenient to work with. But how is it implemented underneath in CPython?

CPython’s lists are really variable-length arrays, not Lisp-style linked lists. The implementation uses a contiguous array of references to other objects, and keeps a pointer to this array and the array’s length in a list head structure.

In CPython, a list is implemented as a dynamically resized array in C. The array stores a reference to each element of the list.

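The length of a list and the capacity of its underlying array are two different things. The former is the number of elements in the list as we normally understand it; the latter is usually somewhat larger, so that memory does not have to be requested again on every append.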
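The gap between length and allocated space can be observed indirectly with sys.getsizeof(). The sketch below records the lengths at which the reported size of the list changes; the variable names are chosen for this illustration, and the exact growth points depend on the CPython version, so none are shown as expected output.

```python
import sys

items = []
last_size = sys.getsizeof(items)
growth_points = []  # list lengths at which the allocation grew

for i in range(64):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:
        # The underlying array was resized on this append.
        growth_points.append(len(items))
        last_size = size

print(len(items), growth_points)
```

If every append triggered a resize, growth_points would contain all 64 lengths. Under CPython it contains far fewer, which is the over-allocation at work.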

There is a good blog post on this. It is fairly old, but it still helps with understanding: Python list implementation

This makes indexing a list a[i] an operation whose cost is independent of the size of the list or the value of the index.
When items are appended or inserted, the array of references is resized. Some cleverness is applied to improve the performance of appending items repeatedly; when the array must be grown, some extra space is allocated so the next few times don’t require an actual resize.

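Because this resizable array is contiguous in memory, accessing an index of a list takes time that depends neither on the length of the list nor on the value of the index.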

Using del with list

del is a python statement that removes a name from a namespace, or an item from a dictionary, or an item from a list by using the index.

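del removes a variable name, which in turn decrements the reference count of the object that name pointed to.
Note: del does not delete the object, only the reference. When an object's reference count reaches 0, CPython reclaims it automatically.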

Here are two examples:

from copy import copy
l1 = [1,2,3]
l2 = copy(l1)
l3 = l1
# print(id(l1))
del l1
print(l2)
print(l3)
# print(id(l3))

Output:

[1, 2, 3]
[1, 2, 3]

Deleting l1 did not delete the list; it is still reachable through the variable l3.

Second example:

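from copy import copy  # needed when this example is run on its own

l1 = [1,2,3]
l2 = copy(l1)  # independent shallow copy
l3 = l1  # second name for the same list
# print(id(l1))
del l1[0]
print(l2)  # [1, 2, 3]
print(l3)  # [2, 3]
# print(id(l3))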

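Deleting l1[0] does not affect l2, because l2 is a separate, independent list, while l3 loses its first element because it is the same list as l1. (This is a shallow copy, but since the list holds no nested containers, the effect is the same as a deep copy.)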

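Note that del l1[0] is O(n): every element after the deleted one has to shift one slot to the left, so it is relatively expensive on a long list.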
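When items are removed from the front often, one common alternative is collections.deque from the standard library, which is designed for fast removal at both ends. This is a suggestion added for illustration, not something the FAQ prescribes; the variable names are arbitrary.

```python
from collections import deque

items = list(range(5))
queue = deque(items)   # same contents, different data structure

del items[0]           # list: every remaining reference shifts one slot left
queue.popleft()        # deque: removes from the left end without that shift

print(items)           # [1, 2, 3, 4]
print(list(queue))     # [1, 2, 3, 4]
```

Both containers end up with the same contents; the difference lies in how much work the removal takes as the container grows.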

How does CPython implement dictionary?

In CPython, a dictionary is implemented as a resizable hash table written in C.

CPython’s dictionaries are implemented as resizable hash tables.
Dictionaries work by computing a hash code for each key stored in the dictionary using the hash() built-in function. The hash code varies widely depending on the key and a per-process seed. The hash code is then used to calculate a location in an internal array where the value will be stored. Assuming that you’re storing keys that all have different hash values, this means that dictionaries take constant time – O(1) – to retrieve a key.

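If the keys in a dictionary all have different hash values, looking up the value for a key takes constant time, regardless of the size of the dictionary. As for why different keys in a dictionary can have the same hash value, see my earlier post on that topic: When I Found That Different Keys in a Python Dictionary Can Share a Hash Value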
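That different keys can share a hash value is easy to demonstrate. In the sketch below, the class Key is hypothetical and deliberately returns the same hash for every instance; the dictionary still keeps the keys apart because it also compares them with __eq__.

```python
class Key:
    """Hypothetical key type whose instances all hash to the same value."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # deliberate collision for every instance

    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name


d = {Key("a"): 1, Key("b"): 2}

print(hash(Key("a")) == hash(Key("b")))  # True
print(len(d))                            # 2
print(d[Key("a")], d[Key("b")])          # 1 2
```

Lookups remain correct, but with many colliding keys they no longer run in constant time, which is why the FAQ's O(1) claim assumes distinct hash values.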

Why does Python allow a comma after the last element of a list or tuple?

Like this:

[1, 2, 3,]
('a', 'b', 'c',)
d = {
    "A": [1, 5],
    "B": [6, 7],  # last trailing comma is optional but good style
}

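In fact, Python not only allows this; the documentation also describes it as good style. A trailing comma makes it easier to add elements later, and it helps prevent small mistakes such as this one: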

x = [
  "fee",
  "fie"
  "foo",
  "fum"
]

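x looks like it has 4 elements but actually has 3, because the adjacent string literals "fie" and "foo" are concatenated into "fiefoo". A likely cause is that fie was originally the last element, and foo and fum were added later without adding a comma after fie. Bugs like this are often hard to spot, and a good coding habit can prevent them.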
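The effect is easy to check. The sketch below reproduces the list with the missing comma next to a version that uses trailing commas; the names buggy and fixed are chosen for this illustration.

```python
# Missing comma after "fie": the two adjacent string literals
# are joined into a single string.
buggy = [
    "fee",
    "fie"
    "foo",
    "fum"
]

# Every line ends with a comma, including the last one.
fixed = [
    "fee",
    "fie",
    "foo",
    "fum",
]

print(len(buggy), buggy)  # 3 ['fee', 'fiefoo', 'fum']
print(len(fixed), fixed)  # 4 ['fee', 'fie', 'foo', 'fum']
```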
