String Objects in Python (Notes on 《Python源码剖析》, Part 3) - CSDN Blog

This is the third installment of my notes on the book 《Python源码剖析》. Learn Python by Analyzing Python Source Code · GitBook

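String Objects in Python

In Python 3, the str type is what Python 2 called unicode, and the old str type became the new bytes type. We could analyze the implementation of bytes, which is what 《Python源码剖析》 covers, but since we use str far more often and understand it less well, we will dissect this considerably more complex type instead.

In the previous analysis, the Python 2 integer object was a fixed-length object, whereas a string object is a variable-length object. A string object is also immutable: once it has been created, its value cannot be changed.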
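Both properties can be observed from Python itself. The following is a minimal sketch, assuming CPython, where sys.getsizeof reports the struct together with its inline character buffer; the variable names are only illustrative.

```python
import sys

s = "hello"

# Immutable: item assignment on a str raises TypeError.
try:
    s[0] = "H"
    mutated = True
except TypeError:
    mutated = False

# Variable-length: the size of the object depends on how much text it holds.
short_size = sys.getsizeof("a" * 10)
long_size = sys.getsizeof("a" * 1000)

print("mutated:", mutated)
print("extra bytes for 990 more characters:", long_size - short_size)
```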

The Four Forms of Unicode

In Python 3, a unicode string takes one of four forms:

compact ascii
compact
legacy string, not ready
legacy string, ready

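Compact means that the string object uses a single memory block to hold its content; in other words, the characters sit in memory immediately after the struct. A non-compact object, that is, a PyUnicodeObject, uses one memory block for the PyUnicodeObject struct and a second block for the characters.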

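ASCII-only strings are created with PyUnicode_New and stored in a PyASCIIObject struct. Because ASCII text is byte-for-byte identical to its UTF-8 encoding, the UTF-8 string is the data itself; the two are equivalent, so no separate UTF-8 buffer is needed.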

Legacy strings are stored in a PyUnicodeObject.

Let's look at the source first and then discuss the rest.
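The difference between the compact ASCII form and the general compact form shows up as a difference in fixed overhead. The sketch below estimates that overhead from sys.getsizeof; header_size is a hypothetical helper written for this note, and the exact numbers depend on the CPython version and platform, so only the comparison is meaningful.

```python
import sys

def header_size(ch, n=16):
    """Estimate the fixed overhead of a string made of n copies of ch."""
    total = sys.getsizeof(ch * n)
    per_char = sys.getsizeof(ch * (n + 1)) - total
    # Remove the payload: n characters plus the terminating NUL.
    return total - per_char * (n + 1)

ascii_header = header_size("a")        # expected to use the PyASCIIObject layout
latin1_header = header_size("\xe9")    # non-ASCII, needs the larger compact struct

print("ASCII header bytes    :", ascii_header)
print("non-ASCII header bytes:", latin1_header)
```

The non-ASCII header is larger because the larger struct carries the extra utf8 fields that an ASCII string does not need.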

typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;       
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
    Py_ssize_t wstr_length;     /* Number of code points in wstr, possible
                                 * surrogates count as two code points. */
} PyCompactUnicodeObject;

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

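As the code shows, the whole string object machinery is built on PyASCIIObject, so we start there. length stores the number of code points in the string. hash stores the string's hash value: because a string is immutable, its hash never changes, so Python caches it in the hash field to avoid the cost of recomputing it. The state struct holds information about the object that relates to the four forms introduced above. wstr points to the wchar_t representation of the string; for a legacy string that is not yet ready this is where the actual value lives, whereas a compact string keeps its characters right after the struct.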

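What do the fields of the state struct mean? I removed the comments to save space, so let's go through them one by one. interned relates to the string intern mechanism and can take three values: SSTATE_NOT_INTERNED (0), SSTATE_INTERNED_MORTAL (1), and SSTATE_INTERNED_IMMORTAL (2), meaning not interned, interned but reclaimable, and permanently interned, respectively. The mechanism itself is covered later. kind records how many bytes are used to store each character (1, 2, or 4). compact has already been explained, and ascii is self-explanatory. ready indicates whether the object's layout has been initialized: if it is 1, either the object is compact or its data pointer has been filled in.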
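The effect of kind can be measured indirectly by checking how much one additional character adds to the object's size. This is a small sketch assuming CPython; bytes_per_char is a hypothetical helper, and the labels in the dictionary are just names for the four character ranges.

```python
import sys

def bytes_per_char(ch, n=16):
    """Size increase caused by appending one more copy of ch."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

widths = {
    "ascii": bytes_per_char("a"),             # U+0061
    "latin-1": bytes_per_char("\xe9"),        # U+00E9
    "bmp": bytes_per_char("\u4e2d"),          # U+4E2D
    "astral": bytes_per_char("\U0001F600"),   # U+1F600
}

for name, width in widths.items():
    print(name, width)
```

The widest character in a string decides the width for every character in it, which is why a single emoji makes an otherwise ASCII string four times larger per character.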

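As mentioned earlier, an ASCII string is created with PyUnicode_New and stored in a PyASCIIObject struct. A non-ASCII string created with PyUnicode_New is stored in a PyCompactUnicodeObject struct instead. A PyUnicodeObject is created through PyUnicode_FromUnicode(NULL, len); the actual string data initially lives in the wstr block and is later copied into the data block by _PyUnicode_Ready.

Now let's look at PyUnicode_Type: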

PyTypeObject PyUnicode_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",              /* tp_name */
    sizeof(PyUnicodeObject),        /* tp_basicsize */
    ……
    unicode_repr,           /* tp_repr */
    &unicode_as_number,         /* tp_as_number */
    &unicode_as_sequence,       /* tp_as_sequence */
    &unicode_as_mapping,        /* tp_as_mapping */
    (hashfunc) unicode_hash,        /* tp_hash*/
    ……
};

As we can see, str in Python 3 is indeed the former unicode.

Creating String Objects

PyObject *PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
{
    PyObject *unicode;
    Py_UCS4 maxchar = 0;
    Py_ssize_t num_surrogates;

    if (u == NULL)
        return (PyObject*)_PyUnicode_New(size);

    /* If the Unicode data is known at construction time, we can apply
       some optimizations which share commonly used objects. */

    /* Optimization for empty strings */
    if (size == 0)
        _Py_RETURN_UNICODE_EMPTY();

    /* Single character Unicode objects in the Latin-1 range are
       shared when using this constructor */
    if (size == 1 && (Py_UCS4)*u < 256)
        return get_latin1_char((unsigned char)*u);

    /* If not empty and not single character, copy the Unicode data
       into the new object */
    if (find_maxchar_surrogates(u, u + size,
                                &maxchar, &num_surrogates) == -1)
        return NULL;

    unicode = PyUnicode_New(size - num_surrogates, maxchar);
    if (!unicode)
        return NULL;

    switch (PyUnicode_KIND(unicode)) {
    case PyUnicode_1BYTE_KIND:
        _PyUnicode_CONVERT_BYTES(Py_UNICODE, unsigned char,
                                u, u + size, PyUnicode_1BYTE_DATA(unicode));
        break;
    case PyUnicode_2BYTE_KIND:
#if Py_UNICODE_SIZE == 2
        memcpy(PyUnicode_2BYTE_DATA(unicode), u, size * 2);
#else
        _PyUnicode_CONVERT_BYTES(Py_UNICODE, Py_UCS2,
                                u, u + size, PyUnicode_2BYTE_DATA(unicode));
#endif
        break;
    case PyUnicode_4BYTE_KIND:
#if SIZEOF_WCHAR_T == 2
        /* This is the only case which has to process surrogates, thus
           a simple copy loop is not enough and we need a function. */
        unicode_convert_wchar_to_ucs4(u, u + size, unicode);
#else
        assert(num_surrogates == 0);
        memcpy(PyUnicode_4BYTE_DATA(unicode), u, size * 4);
#endif
        break;
    default:
        assert(0 && "Impossible state");
    }

    return unicode_result(unicode);
}

PyObject *PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
{
    PyObject *obj;
    PyCompactUnicodeObject *unicode;
    void *data;
    enum PyUnicode_Kind kind;
    int is_sharing, is_ascii;
    Py_ssize_t char_size;
    Py_ssize_t struct_size;

    /* Optimization for empty strings */
    if (size == 0 && unicode_empty != NULL) {
        Py_INCREF(unicode_empty);
        return unicode_empty;
    }

    is_ascii = 0;
    is_sharing = 0;
    struct_size = sizeof(PyCompactUnicodeObject);
    if (maxchar < 128) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
        is_ascii = 1;
        struct_size = sizeof(PyASCIIObject);
    }
    else if (maxchar < 256) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
    }
    else if (maxchar < 65536) {
        kind = PyUnicode_2BYTE_KIND;
        char_size = 2;
        if (sizeof(wchar_t) == 2)
            is_sharing = 1;
    }
    else {
        if (maxchar > MAX_UNICODE) {
            PyErr_SetString(PyExc_SystemError,
                            "invalid maximum character passed to PyUnicode_New");
            return NULL;
        }
        kind = PyUnicode_4BYTE_KIND;
        char_size = 4;
        if (sizeof(wchar_t) == 4)
            is_sharing = 1;
    }

    /* Ensure we won't overflow the size. */
    if (size < 0) {
        PyErr_SetString(PyExc_SystemError,
                        "Negative size passed to PyUnicode_New");
        return NULL;
    }
    if (size > ((PY_SSIZE_T_MAX - struct_size) / char_size - 1))
        return PyErr_NoMemory();

    /* Duplicated allocation code from _PyObject_New() instead of a call to
     * PyObject_New() so we are able to allocate space for the object and
     * it's data buffer.
     */
    obj = (PyObject *) PyObject_MALLOC(struct_size + (size + 1) * char_size);
    if (obj == NULL)
        return PyErr_NoMemory();
    obj = PyObject_INIT(obj, &PyUnicode_Type);
    if (obj == NULL)
        return NULL;

    unicode = (PyCompactUnicodeObject *)obj;
    if (is_ascii)
        data = ((PyASCIIObject*)obj) + 1;
    else
        data = unicode + 1;
    _PyUnicode_LENGTH(unicode) = size;
    _PyUnicode_HASH(unicode) = -1;
    _PyUnicode_STATE(unicode).interned = 0;
    _PyUnicode_STATE(unicode).kind = kind;
    _PyUnicode_STATE(unicode).compact = 1;
    _PyUnicode_STATE(unicode).ready = 1;
    _PyUnicode_STATE(unicode).ascii = is_ascii;
    if (is_ascii) {
        ((char*)data)[size] = 0;
        _PyUnicode_WSTR(unicode) = NULL;
    }
    else if (kind == PyUnicode_1BYTE_KIND) {
        ((char*)data)[size] = 0;
        _PyUnicode_WSTR(unicode) = NULL;
        _PyUnicode_WSTR_LENGTH(unicode) = 0;
        unicode->utf8 = NULL;
        unicode->utf8_length = 0;
    }
    else {
        unicode->utf8 = NULL;
        unicode->utf8_length = 0;
        if (kind == PyUnicode_2BYTE_KIND)
            ((Py_UCS2*)data)[size] = 0;
        else /* kind == PyUnicode_4BYTE_KIND */
            ((Py_UCS4*)data)[size] = 0;
        if (is_sharing) {
            _PyUnicode_WSTR_LENGTH(unicode) = size;
            _PyUnicode_WSTR(unicode) = (wchar_t *)data;
        }
        else {
            _PyUnicode_WSTR_LENGTH(unicode) = 0;
            _PyUnicode_WSTR(unicode) = NULL;
        }
    }
#ifdef Py_DEBUG
    unicode_fill_invalid((PyObject*)unicode, 0);
#endif
    assert(_PyUnicode_CheckConsistency((PyObject*)unicode, 0));
    return obj;
}

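Let's walk through PyUnicode_FromUnicode first. If u is a null pointer, it calls _PyUnicode_New(size) and directly returns a PyUnicodeObject of the given size that has no content yet. If size == 0, it returns the shared empty string through _Py_RETURN_UNICODE_EMPTY(). If the input is a single character in the Latin-1 range, it returns the shared object for that character; this resembles the small-integer object pool from the previous chapter, because there is a character cache here as well. If none of these cases applies, it creates a new object and copies the data into it.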
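The shared empty string and the Latin-1 character cache are visible from Python through object identity. The sketch below assumes CPython and uses chr(), str() and "".join(), which do not go through PyUnicode_FromUnicode itself but, as far as I can tell, reach the same shared objects. Identity of strings is an implementation detail, so treat this as an observation rather than a guarantee.

```python
# Values are built at run time so the compiler cannot merge equal constants.
code = 233                       # U+00E9, inside the Latin-1 range
a = chr(code)
b = chr(code)
same_latin1 = a is b             # both names refer to the cached character

empty_1 = "".join([])
empty_2 = str()
same_empty = empty_1 is empty_2  # both names refer to the shared empty string

wide_1 = chr(0x4E2D)             # outside Latin-1, so no cache is expected
wide_2 = chr(0x4E2D)
same_wide = wide_1 is wide_2

print("Latin-1 char shared:", same_latin1)
print("empty string shared:", same_empty)
print("wide char shared   :", same_wide)
```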

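PyUnicode_New is easy to follow: given the size and maxchar, it picks the character width and decides whether the new object uses the PyASCIIObject layout (maxchar < 128) or the PyCompactUnicodeObject layout. It always produces a compact object; the legacy PyUnicodeObject layout comes from _PyUnicode_New instead.

The Intern Mechanism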
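The branch structure of PyUnicode_New can be restated as a small Python model. This is only an illustration of the thresholds in the C code above: choose_layout and layout_for are hypothetical names, and the model raises ValueError where the C function sets SystemError.

```python
MAX_UNICODE = 0x10FFFF

def choose_layout(maxchar):
    """Return (struct name, bytes per character) for a given maxchar."""
    if maxchar < 128:
        return ("PyASCIIObject", 1)            # 1-byte kind, ascii flag set
    if maxchar < 256:
        return ("PyCompactUnicodeObject", 1)   # 1-byte kind, Latin-1
    if maxchar < 65536:
        return ("PyCompactUnicodeObject", 2)   # 2-byte kind
    if maxchar > MAX_UNICODE:
        raise ValueError("invalid maximum character")
    return ("PyCompactUnicodeObject", 4)       # 4-byte kind

def layout_for(text):
    """Apply choose_layout to the largest code point in text."""
    return choose_layout(max(map(ord, text), default=0))

for sample in ["abc", "caf\xe9", "\u4e2d\u6587", "hi \U0001F600"]:
    print(ascii(sample), layout_for(sample))
```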

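We mentioned the intern mechanism earlier. It means that when a new string object is created and a string object with the same value already exists, a reference to the existing object is returned instead of the newly created one. Where does Python look for it? Python maintains a dictionary named interned, in which each interned string is stored as both key and value. The mechanism does not apply to every string object, though. Roughly speaking, Python interns strings that look like Python identifiers, that is, strings made up only of letters, digits, and underscores. The standard library also offers a function that forces interning for a given string: sys.intern(). Its documentation reads: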

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys.

Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
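A short demonstration of sys.intern() follows. It is a sketch assuming CPython; the strings are assembled at run time with join so that they are not already shared as compile-time constants.

```python
import sys

parts = ["hello", " ", "world"]
a = "".join(parts)
b = "".join(parts)

equal_values = a == b        # same value
same_before = a is b         # typically two distinct objects

ia = sys.intern(a)
ib = sys.intern(b)
same_after = ia is ib        # both are the single interned object

print("equal values       :", equal_values)
print("same object before :", same_before)
print("same object after  :", same_after)
```

Keeping ia and ib alive matters here, because the documentation quoted above says interned strings are not immortal.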

The concrete mechanism is shown in the code below:

PyObject *PyUnicode_InternFromString(const char *cp)
{
    PyObject *s = PyUnicode_FromString(cp);
    if (s == NULL)
        return NULL;
    PyUnicode_InternInPlace(&s);
    return s;
}

void PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
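}

When Python calls PyUnicode_InternFromString, it gets back an interned object; the actual work is done by PyUnicode_InternInPlace.

In fact, even when Python is going to intern a string, it first creates a PyUnicodeObject and only then checks whether an object with the same value already exists. If one does, the object stored in interned is returned, and the freshly created object is reclaimed because its reference count drops to zero.

Interned objects fall into two categories, mortal and immortal. The former can be reclaimed; the latter are never reclaimed and live as long as the Python virtual machine does.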
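The core of PyUnicode_InternInPlace is the PyDict_SetDefault call, which can be modelled in a few lines of Python. This is a simplified sketch: intern_in_place and the module-level interned dictionary are stand-ins written for this note, and the reference-count adjustment and the state flag from the C code are not modelled.

```python
interned = {}

def intern_in_place(s):
    """Return the canonical object for the value of s."""
    # dict.setdefault plays the role of PyDict_SetDefault: it stores s if no
    # equal key exists, otherwise it returns the object stored earlier.
    return interned.setdefault(s, s)

# Two equal strings built at run time, so they start as separate objects.
first = "".join(["py", "thon"])
second = "".join(["py", "thon"])

canon_1 = intern_in_place(first)    # first becomes the canonical object
canon_2 = intern_in_place(second)   # second is swapped for the canonical one

print(canon_1 is first, canon_2 is first, len(interned))
```

As in the C code, the second string has to exist before the lookup can happen; it is simply dropped once the canonical object is returned in its place.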

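Efficiency Issues Related to PyUnicodeObject

The original 《Python源码剖析》 book says that joining strings with + is extremely inefficient, because every concatenation creates a new string object, and recommends the string join method instead. In my tests under Python 3.6, joining strings with + took about as long as using join. That is only a measurement in my own environment, though, and I do not yet know the real answer.

Summary

In Python 3, str is implemented on top of unicode, which neatly resolves the many awkward problems Python 2 had with non-ASCII strings. Under the hood, Python treats ASCII and non-ASCII strings differently, and because UTF-8 is compatible with ASCII, this balances performance against simplicity. Immutable objects in Python often come with something like the intern mechanism, which cuts unnecessary memory use, but the actual implementation aims for a balance: applying interning indiscriminately could cause extra computation and lookups, which would defeat the purpose of the optimization.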
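A measurement along these lines can be reproduced with timeit. The sketch below is an assumption about how such a test might look, not the test I originally ran; concat_plus and concat_join are hypothetical helper names, and the timings vary with the interpreter version, the machine, and the size of the pieces.

```python
import timeit

def concat_plus(parts):
    """Build the result by repeated += on a local variable."""
    result = ""
    for p in parts:
        result += p
    return result

def concat_join(parts):
    """Build the result with a single str.join call."""
    return "".join(parts)

parts = [str(i) for i in range(1000)]
same_result = concat_plus(parts) == concat_join(parts)

t_plus = timeit.timeit(lambda: concat_plus(parts), number=200)
t_join = timeit.timeit(lambda: concat_join(parts), number=200)

print("same result:", same_result)
print(f"+=   : {t_plus:.4f}s")
print(f"join : {t_join:.4f}s")
```

One plausible reason the gap is small is that CPython can sometimes extend the left-hand string in place when nothing else refers to it, but that is an implementation detail and should not be relied on.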