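Writing Your Own Python C Extension!

Preface

These are study notes on "1. Extending Python with C or C++", the first chapter of Extending and Embedding the Python Interpreter.

Python is approachable but comparatively slow, so we sometimes want to implement a piece of functionality in C and call it from Python.

For example, suppose we want to write a Python module named spam and call it like this: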

import spam
status = spam.system("ls -l")

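As you would expect, we first have to implement the system function in C and build it into an extension module before Python can call it. The steps are:

  • Implement the C function

  • Define a method table and add the C function to it

  • Define the Python module and associate the method table with it

  • Define the module initialization function

  • Compile to a .so (or .pyd) file

Note that the approach described here only applies to the CPython implementation of Python. Quoting the tutorial: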

The C extension interface is specific to CPython

First create a file named spam.c, then fill it in step by step as follows.

spam.c

header

To write a C function that can be called from Python, first include the Python.h header:

#define PY_SSIZE_T_CLEAN
#include <Python.h>

Note: #define PY_SSIZE_T_CLEAN makes PyArg_ParseTuple (covered below) treat the length argument of the s# format as Py_ssize_t rather than int. For details, see "What the PY_SSIZE_T_CLEAN macro does".

The C function

Because the function is called as spam.system from Python, we name it spam_system here and define it as follows:

// the function that will be called by spam.system(string) from python
static PyObject*
spam_system(PyObject* self, PyObject* args) {
    // The C function always has two arguments, conventionally named self and args
    // The self argument points to the module object for module-level functions; for a method it would point to the object instance
    // The args argument will be a pointer to a Python tuple object containing the arguments
    const char* command;
    int sts;

    // checks the argument types and converts Python objects to C values
    // on success, the string value of the argument will be copied to the local variable command, and returns true
    // returns false, and may raise PyExc_TypeError on failure
    if(!PyArg_ParseTuple(args, "s", &command))
        // for functions returning object pointers, NULL is the error indicator 
        return NULL;
    sts = system(command);
    if(sts < 0){
        // raise a spam.error exception defined in PyInit_spam
        PyErr_SetString(SpamError, "System command failed");
        return NULL;
    }
    // return an integer object
    return PyLong_FromLong(sts);
    // for function without return value
    // Method 1:
    // Py_INCREF(Py_None);
    // return Py_None;
    // Note: Py_None is the C name for the special Python object None
    // Method 2:
    // Py_RETURN_NONE
}

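This code covers several key points, analyzed one by one below.

Parameters

The C function always takes two parameters, self and args:

  • self: points to the module object for a module-level function, or to the object instance for a method
  • args: points to a Python tuple containing the function's arguments

Parsing the arguments

Given args, PyArg_ParseTuple(args, "s", &command) checks that it holds a single string (format "s") and stores a pointer to its contents in command. If parsing fails, PyArg_ParseTuple returns false with an exception already set, and spam_system returns the NULL pointer. For more on PyArg_ParseTuple, see Extracting Parameters in Extension Functions.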

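Function body

On success, the function calls the C library's system function, sts = system(command);, which yields the int return value sts. If system fails (that is, sts is less than 0), error handling is needed; this is described in detail later.

Building the return value

The function returns a pointer to a Python object (PyObject). Before returning anything, a C function must convert the C value to a PyObject* using the functions declared in Python.h.

Here, PyObject *PyLong_FromLong(long v) converts sts (a C int) into a Python int object (PyObject*).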
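For comparison, the control flow of spam_system can be sketched in pure Python. This only illustrates the parse / call / check / return sequence and is not how the extension is implemented; SpamError and system_sketch are hypothetical names, and os.system stands in for the C library call.

```python
import os


class SpamError(Exception):
    """Hypothetical stand-in for the spam.error exception."""


def system_sketch(command):
    """Mirror the steps of spam_system: parse, call, check, convert."""
    # PyArg_ParseTuple(args, "s", &command): accept exactly one str argument
    if not isinstance(command, str):
        raise TypeError("argument 1 must be str, not " + type(command).__name__)
    # sts = system(command);
    sts = os.system(command)
    # if (sts < 0): set the exception and return the error indicator
    if sts < 0:
        raise SpamError("System command failed")
    # return PyLong_FromLong(sts);
    return sts


status = system_sketch("exit 0")
print(status)
```

In the C version each of these steps is explicit: the type check is done by the format string, and raising is split into setting the exception and returning NULL.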

method table

Next, register spam_system in the method table SpamMethods. This table is later associated with the module named spam, so that the function can be called from Python as spam.system.

// method table
static PyMethodDef SpamMethods[] = {
    {"system", // name
     spam_system, // address
     METH_VARARGS, // or "METH_VARARGS | METH_KEYWORDS"
     // METH_VARARGS: expect the Python-level parameters to be passed in as a tuple acceptable for parsing via PyArg_ParseTuple()
     // METH_KEYWORDS: the C function should accept a third PyObject * parameter which will be a dictionary of keywords. Use PyArg_ParseTupleAndKeywords() to parse
     "Execute a shell command."},
    {NULL, NULL, 0, NULL} // sentinel
};

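The third field, METH_VARARGS, means the Python-level arguments are passed positionally, packed into a tuple.

For the exact meaning of each PyMethodDef field, see PyMethodDef.

Module definition

PyDoc_STRVAR creates a variable named spam_doc that can be used as a docstring.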

// Creates a variable with name name that can be used in docstrings. If Python is built without docstrings, the value will be empty.
PyDoc_STRVAR(spam_doc, "Spam module that calls the system function.");

We want a Python module named spam, so here we need to define its C-side counterpart, a PyModuleDef structure named spammodule:

// module definition structure
static struct PyModuleDef spammodule = {
    PyModuleDef_HEAD_INIT,
    "spam", // name of module
    spam_doc, // module documentation, may be NULL // Docstring for the module; usually a docstring variable created with PyDoc_STRVAR is used.
    -1, // size of per-interpreter state of the module, or -1 if the module keeps state in global variables.
    SpamMethods // the method table
};

For the exact meaning of each PyModuleDef field, see PyModuleDef.

Module initialization function

The PyInit_spam function is responsible for initializing the module:

// PyInit_spam is module’s initialization function
// must be named PyInit_name
// it will be called when python program imports module spam for the first time
// should be the only non-static item defined in the module file!
// if adding "static", variables and functions can only be used in the specific file, can't be linked through "extern"
// PyMODINIT_FUNC declares the function as PyObject * return type, declares any special linkage declarations required by the platform, and for C++ declares the function as extern "C"
PyMODINIT_FUNC
PyInit_spam(void){
    PyObject* m;

    // returns a module object, and inserts built-in function objects into the newly created module based upon the table (an array of PyMethodDef structures) found in the module definition
    // The init function must return the module object to its caller, so that it then gets inserted into sys.modules
    m = PyModule_Create(&spammodule);
    if(m == NULL)
        return NULL;

    // if the last 2 arguments are NULL, it creates a class whose base class is Exception
    // arguments: name (in the form module.classname), base class, class dictionary
    SpamError = PyErr_NewException("spam.error", NULL, NULL);
    // retains a reference to the newly created exception class
    // Since the exception could be removed from the module by external code, an owned reference to the class is needed to ensure that it will not be discarded, causing SpamError to become a dangling pointer
    // Should it become a dangling pointer, C code which raises the exception could cause a core dump or other unintended side effects
    Py_XINCREF(SpamError);
    if(PyModule_AddObject(m, "error", SpamError) < 0){
        // clean up garbage (by making Py_XDECREF() or Py_DECREF() calls for objects you have already created) when you return an error indicator
        // Decrement the reference count for object o. The object may be NULL, in which case the macro has no effect; otherwise the effect is the same as for Py_DECREF(), and the same warning applies.
        Py_XDECREF(SpamError);
        // Decrement the reference count for object o. The object may be NULL, in which case the macro has no effect; otherwise the effect is the same as for Py_DECREF(), except that the argument is also set to NULL.
        Py_CLEAR(SpamError);
        // Decrement the reference count for object o.
        // If the reference count reaches zero, the object’s type’s deallocation function (which must not be NULL) is invoked.
        Py_DECREF(m);
        return NULL;
    }

    return m;
}

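The module's initialization function is called the first time a Python program imports the module. It must be named PyInit_<name of the Python module>.

The key step in the initialization function is PyModule_Create, which creates a module object and inserts the functions from the method table referenced by the module definition into it.

PyInit_spam finally returns the module object m to its caller. Note the PyMODINIT_FUNC in front of the function name: its main job is to declare the return type as PyObject *. In addition, for C++ it declares the function as extern "C", and it adds whatever linkage declarations the platform requires.

This code uses the SpamError object, which is defined as follows: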

// define your own new exception
static PyObject* SpamError;

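The PyErr_NewException call creates the SpamError exception class. Because external code could later remove SpamError from the module, Py_XINCREF is used to take an extra reference so that the C-level pointer stays valid.

Next, PyModule_AddObject tries to add SpamError to the module m. On success it can then be accessed as spam.error; on failure, the reference counts of SpamError and m have to be decremented to clean up.

Py_XDECREF and Py_CLEAR both decrement an object's reference count, so why are both called on SpamError? At that point the code holds two references to it: the one returned by PyErr_NewException and the one added by Py_XINCREF. PyModule_AddObject only steals a reference on success, so on failure both must be released; Py_CLEAR additionally resets SpamError to NULL.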
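The bookkeeping behind the extra reference can be observed from Python with sys.getrefcount. A minimal sketch, assuming CPython; the module object and class below are throwaway stand-ins, not the real spam module.

```python
import sys
import types

module = types.ModuleType("spam_demo")          # stand-in for the module m
error_class = type("error", (Exception,), {})   # stand-in for SpamError

before = sys.getrefcount(error_class)
module.error = error_class        # the module now holds its own reference
with_module = sys.getrefcount(error_class)

del module.error                  # "external code" removes it from the module
after = sys.getrefcount(error_class)

# error_class is still usable because this script kept its own reference,
# which is the role Py_XINCREF plays for the C-level SpamError pointer.
still_alive = issubclass(error_class, Exception)

print(with_module - before, after - before, still_alive)
```

The absolute counts are an implementation detail; only the differences matter here.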

Raising the exception

Once SpamError has been initialized, the C function spam_system can raise it with PyErr_SetString:

PyErr_SetString(SpamError, "System command failed");

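main function

The last step is to write a main function that registers the PyInit_spam function defined above so that the module can be initialized. This part is only needed when embedding Python in your own executable; a module that is only imported as a .so does not require it.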

int main(int argc, char* argv[]){
    wchar_t* program = Py_DecodeLocale(argv[0], NULL);
    if(program == NULL){
    	fprintf(stderr, "Fatal error: cannot decode argv[0]\n");
    	exit(1);
    }

    //add a built-in module, before Py_Initialize
    //When embedding Python, the PyInit_spam() function is not called automatically unless there’s an entry in the PyImport_Inittab table. To add the module to the initialization table, use PyImport_AppendInittab(), optionally followed by an import of the module
    if(PyImport_AppendInittab("spam", PyInit_spam) == -1){
    	fprintf(stderr, "Error: could not extend in-built modules table\n");
    	exit(1);
    }

    // Pass argv[0] to the Python interpreter
    Py_SetProgramName(program);

    //Initialize the Python interpreter.  Required
    //If this step fails, it will be a fatal error.
    Py_Initialize();

    // Optionally import the module; alternatively,
    // import can be deferred until the embedded script imports it.
    PyObject* pmodule = PyImport_ImportModule("spam");
    if(!pmodule){
    	PyErr_Print();
    	fprintf(stderr, "Error: could not import module 'spam'\n");
    }

    PyMem_RawFree(program);
    return 0;
}

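Before calling Py_Initialize to initialize the Python interpreter, PyInit_spam has to be added to the PyImport_Inittab table with PyImport_AppendInittab. This registers spam as a built-in module, so that PyInit_spam is called when the module is first imported.

To test whether module initialization succeeds, the program ends by trying to import the spam module with PyImport_ImportModule.

Complete code

Create a file named spam.c with the following contents: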

// pulls in the Python API
#define PY_SSIZE_T_CLEAN // Make "s#" use Py_ssize_t rather than int
#include <Python.h> // must be included before any standard headers

// define your own new exception
static PyObject* SpamError;

// the function that will be called by spam.system(string) from python
static PyObject*
spam_system(PyObject* self, PyObject* args) {
    // The C function always has two arguments, conventionally named self and args
    // The self argument points to the module object for module-level functions; for a method it would point to the object instance
    // The args argument will be a pointer to a Python tuple object containing the arguments
    const char* command;
    int sts;

    // checks the argument types and converts Python objects to C values
    // on success, the string value of the argument will be copied to the local variable command, and returns true
    // returns false, and may raise PyExc_TypeError on failure
    if(!PyArg_ParseTuple(args, "s", &command))
        // for functions returning object pointers, NULL is the error indicator 
        return NULL;
    sts = system(command);
    if(sts < 0){
        // raise a spam.error exception defined in PyInit_spam
        PyErr_SetString(SpamError, "System command failed");
        return NULL;
    }
    // return an integer object
    return PyLong_FromLong(sts);
    // for function without return value
    // Method 1:
    // Py_INCREF(Py_None);
    // return Py_None;
    // Note: Py_None is the C name for the special Python object None
    // Method 2:
    // Py_RETURN_NONE
}

// method table
static PyMethodDef SpamMethods[] = {
    {"system", // name
     spam_system, // address
     METH_VARARGS, // or "METH_VARARGS | METH_KEYWORDS"
     // METH_VARARGS: expect the Python-level parameters to be passed in as a tuple acceptable for parsing via PyArg_ParseTuple()
     // METH_KEYWORDS: the C function should accept a third PyObject * parameter which will be a dictionary of keywords. Use PyArg_ParseTupleAndKeywords() to parse
     "Execute a shell command."},
    {NULL, NULL, 0, NULL} // sentinel
};

// Creates a variable with name name that can be used in docstrings. If Python is built without docstrings, the value will be empty.
PyDoc_STRVAR(spam_doc, "Spam module that calls the system function.");

// module definition structure
static struct PyModuleDef spammodule = {
    PyModuleDef_HEAD_INIT,
    "spam", // name of module
    spam_doc, // module documentation, may be NULL // Docstring for the module; usually a docstring variable created with PyDoc_STRVAR is used.
    -1, // size of per-interpreter state of the module, or -1 if the module keeps state in global variables.
    SpamMethods // the method table
};

// PyInit_spam is module’s initialization function
// must be named PyInit_name
// it will be called when python program imports module spam for the first time
// should be the only non-static item defined in the module file!
// if adding "static", variables and functions can only be used in the specific file, can't be linked through "extern"
// PyMODINIT_FUNC declares the function as PyObject * return type, declares any special linkage declarations required by the platform, and for C++ declares the function as extern "C"
PyMODINIT_FUNC
PyInit_spam(void){
    PyObject* m;

    // returns a module object, and inserts built-in function objects into the newly created module based upon the table (an array of PyMethodDef structures) found in the module definition
    // The init function must return the module object to its caller, so that it then gets inserted into sys.modules
    m = PyModule_Create(&spammodule);
    if(m == NULL)
        return NULL;

    // if the last 2 arguments are NULL, it creates a class whose base class is Exception
    // arguments: name (in the form module.classname), base class, class dictionary
    SpamError = PyErr_NewException("spam.error", NULL, NULL);
    // retains a reference to the newly created exception class
    // Since the exception could be removed from the module by external code, an owned reference to the class is needed to ensure that it will not be discarded, causing SpamError to become a dangling pointer
    // Should it become a dangling pointer, C code which raises the exception could cause a core dump or other unintended side effects
    Py_XINCREF(SpamError);
    if(PyModule_AddObject(m, "error", SpamError) < 0){
        // clean up garbage (by making Py_XDECREF() or Py_DECREF() calls for objects you have already created) when you return an error indicator
        // Decrement the reference count for object o. The object may be NULL, in which case the macro has no effect; otherwise the effect is the same as for Py_DECREF(), and the same warning applies.
        Py_XDECREF(SpamError);
        // Decrement the reference count for object o. The object may be NULL, in which case the macro has no effect; otherwise the effect is the same as for Py_DECREF(), except that the argument is also set to NULL.
        Py_CLEAR(SpamError);
        // Decrement the reference count for object o.
        // If the reference count reaches zero, the object’s type’s deallocation function (which must not be NULL) is invoked.
        Py_DECREF(m);
        return NULL;
    }

    return m;
}

int main(int argc, char* argv[]){
    wchar_t* program = Py_DecodeLocale(argv[0], NULL);
    if(program == NULL){
        fprintf(stderr, "Fatal error: cannot decode argv[0]\n");
        exit(1);
    }

    //add a built-in module, before Py_Initialize
    //When embedding Python, the PyInit_spam() function is not called automatically unless there’s an entry in the PyImport_Inittab table. To add the module to the initialization table, use PyImport_AppendInittab(), optionally followed by an import of the module
    if(PyImport_AppendInittab("spam", PyInit_spam) == -1){
        fprintf(stderr, "Error: could not extend in-built modules table\n");
        exit(1);
    }

    // Pass argv[0] to the Python interpreter
    Py_SetProgramName(program);

    //Initialize the Python interpreter.  Required
    //If this step fails, it will be a fatal error.
    Py_Initialize();

    // Optionally import the module; alternatively,
    // import can be deferred until the embedded script imports it.
    PyObject* pmodule = PyImport_ImportModule("spam");
    if(!pmodule){
        PyErr_Print();
        fprintf(stderr, "Error: could not import module 'spam'\n");
    }

    PyMem_RawFree(program);
    return 0;
}

Compiling and linking

Using gcc

Following "Python进阶笔记C语言拓展篇(二)动态链接库pyd+存根文件pyi——学会优雅地使用C\C++拓展Python模块的正确/官方姿势" and "Build .so file from .c file using gcc command line", compile and link with the commands below.

There are two steps. First compile to get spam.o:

gcc -c -I /usr/include/python3.8 -o spam.o -fPIC spam.c

Then link to get spam.so:

gcc -shared -L /usr/lib/python3.8/config-3.8-x86_64-linux-gnu -lpython3.8 -o spam.so spam.o

Note: on Windows, change .so to .pyd.

The two steps can be combined into:

gcc -shared -I /usr/include/python3.8 -L /usr/lib/python3.8/config-3.8-x86_64-linux-gnu -o spam.so -fPIC spam.c

Using gcc + Python flags

The flags reported by python-config spare us from locating Python's include and lib directories by hand.

Again in two steps:

gcc -c $(python3.8-config --includes) -o spam.o -fPIC spam.c
gcc -shared $(python3.8-config --ldflags) -o spam.so -fPIC spam.o

These can be combined into:

gcc -shared $(python3.8-config --includes) $(python3.8-config --ldflags) -o spam.so -fPIC spam.c
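The same information is also available from inside Python through the standard sysconfig module. A small sketch that prints the include directory and the filename suffix expected for extension modules on the running interpreter:

```python
import sysconfig

# Directory containing Python.h for the running interpreter
include_dir = sysconfig.get_paths()["include"]

# Filename suffix for extension modules,
# e.g. ".cpython-38-x86_64-linux-gnu.so" on the setup used in this post
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

print("include dir:", include_dir)
print("extension suffix:", ext_suffix)
```

This is handy when python3.8-config is not installed or when several Python versions coexist on the machine.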

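Using distutils

Following "如Py似C:Python 與 C 的共生法則" and "4. Building C and C++ Extensions", create setup.py with the following contents (distutils was deprecated in Python 3.10 and removed in 3.12; setuptools exports the same setup and Extension names):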

from distutils.core import setup, Extension
spammodule = Extension('spam', sources=['spam.c'])

setup(name='Spam',
  description='',
  ext_modules=[spammodule],
)

Then build the .so with:

python3 setup.py build_ext --inplace

The result looks like this:

.
├── build
│   └── temp.linux-x86_64-3.8
│       └── spam.o
└── spam.cpython-38-x86_64-linux-gnu.so

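Calling the C function from Python

Direct import

Once the C/C++ code has been built into a .so or .pyd (a dynamic library, the equivalent of a DLL), Python can import it directly: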

>>> import spam
>>> spam.system("ls")
spam.c  spam.o  spam.so
0 # return value of spam.system

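Using distutils

A .so built with distutils is used exactly like the direct import above.

What if we want to change the module name from spam to spammodule?

Because the initialization function must be named PyInit_<modname>, first rename PyInit_spam to PyInit_spammodule. The first argument of PyImport_AppendInittab and the argument of PyImport_ImportModule are also the Python module name, so they have to be changed to spammodule as well. After these changes the code compiles, but importing fails: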

>>> import spam
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dynamic module does not define module export function (PyInit_spam)
>>> import spammodule
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'spammodule'

See Unable to solve “ImportError: dynamic module does not define module export function”:

In PyInit_<modname> modname should match the filename.

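Next rename the file from spam.c to spammodule.c, remembering to update the sources argument of Extension as well; the result is still the same as above.

That is still not enough. According to distutils.core.Extension, its name and sources parameters mean: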

name: the full name of the extension, including any packages — ie. not a filename or pathname, but Python dotted name
sources: list of source filenames, relative to the distribution root (where the setup script lives), in Unix form (slash-separated) for portability. Source files may be C, C++, SWIG (.i), platform-specific resource files, or whatever else is recognized by the build_ext command as source for a Python extension.

The name parameter of Extension is the name under which the module is imported in Python, so it has to be changed too. After that it runs successfully:

>>> import spam
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'spam'
>>> import spammodule
>>> spammodule.system("ls")
build  setup.py  spammodule.c  spammodule.cpython-38-x86_64-linux-gnu.so
0
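The naming rule behind the errors above can be written as a tiny helper: the part of the filename before the first dot is the module name, and the loader looks for PyInit_ followed by that name. The helper names are hypothetical, and the sketch only handles plain ASCII, top-level module names.

```python
import os


def module_name_from_filename(path):
    """Return the module name implied by an extension module filename."""
    # "spam.cpython-38-x86_64-linux-gnu.so" -> "spam"
    return os.path.basename(path).split(".", 1)[0]


def expected_init_symbol(path):
    """Return the init function name the import machinery will look up."""
    return "PyInit_" + module_name_from_filename(path)


print(expected_init_symbol("spam.cpython-38-x86_64-linux-gnu.so"))
print(expected_init_symbol("./spammodule.so"))
```

If the symbol compiled into the file does not match this name, the import fails with the "does not define module export function" error.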

As for the first argument of distutils.core.setup, see Distutils Examples - Pure Python distribution:

Note that the name of the distribution is specified independently with the name option, and there’s no rule that says it has to be the same as the name of the sole module in the distribution

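It determines the name of the distribution (package) at release time, so it may differ from the module name.

Calling the .so through ctypes

Following How to run .so files using through python script, load the .so with cdll.LoadLibrary: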

>>> from ctypes import cdll, c_char_p, c_wchar_p
>>> spam = cdll.LoadLibrary("spam.cpython-38-x86_64-linux-gnu.so")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
    return self._dlltype(name)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: spam.cpython-38-x86_64-linux-gnu.so: cannot open shared object file: No such file or directory

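The .so in the current directory is not found. Prefixing the name with ./ fixes it: a name without a slash is looked up only in the dynamic loader's library search paths, which normally do not include the current directory, whereas a name containing a slash is treated as a path.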

>>> spam = cdll.LoadLibrary("./spam.cpython-38-x86_64-linux-gnu.so")
>>> spam = cdll.LoadLibrary("./spam.so")

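Now try calling spam.system with a string argument passed directly, and it fails. Note that spam_system is declared static and is therefore not exported from the library; through ctypes, spam.system appears to resolve to the C library's system function instead, which takes a char*: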

>>> spam.system("ls")
sh: 1: l: not found
32512

The ctypes documentation says:

None, integers, bytes objects and (unicode) strings are the only native Python objects that can directly be used as parameters in these function calls. None is passed as a C NULL pointer, bytes objects and strings are passed as pointer to the memory block that contains their data (char * or wchar_t *). Python integers are passed as the platforms default C int type, their value is masked to fit into the C type.

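So only None, integers, bytes objects and (unicode) strings can be passed directly as arguments, and a str is passed as wchar_t*. On a little-endian machine, a function expecting char* then sees 'l' followed by a zero byte, which is why only "l" reached the shell. Other types have to be converted with the types listed under Fundamental data types.

Note that although c_char_p represents a char*, it only accepts a bytes object; passing a str raises an error: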

>>> c_char_p("ls")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: bytes or integer address expected instead of str instance

c_wchar_p does accept a str, but it represents a wchar_t*, so the call fails in the same way:

>>> spam.system(c_wchar_p("ls"))
sh: 1: l: not found
32512
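Why only l reaches the shell can be checked without calling system at all. The sketch below stores "ls" as a wchar_t buffer and then reads the same memory as a C char string. It assumes a little-endian machine, where the first byte of the wide character l is followed by a zero byte.

```python
import ctypes

# "ls" stored as wchar_t[], the representation ctypes uses for str arguments
wide = ctypes.create_unicode_buffer("ls")

# Read the same memory the way a char* consumer such as system() would:
# everything up to the first zero byte.
seen_as_char_p = ctypes.string_at(ctypes.addressof(wide))

print(seen_as_char_p)
```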

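Following Different behaviour of ctypes c_char_p?, the correct approach is to first convert the Python string to a bytes object and wrap it in c_char_p to get a char* before calling the function. There are two ways to do the conversion; see 【Python】隨記:印出來常見「b」,但究竟什麼是b呢?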

>>> spam.system(c_char_p(b"ls"))
build  demo  setup.py  spam.c  spam.cpython-38-x86_64-linux-gnu.so  system.c  test.py
0

Or:

>>> spam.system(c_char_p("ls".encode('utf-8')))
build  demo  setup.py  spam.c  spam.cpython-38-x86_64-linux-gnu.so  system.c  test.py
0
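Declaring the C signature up front makes ctypes reject a str instead of silently passing a wchar_t*. A minimal sketch, assuming a POSIX system where ctypes.CDLL(None) exposes the C library's system function; it calls libc directly rather than going through the spam .so.

```python
import ctypes

# POSIX assumption: CDLL(None) gives access to symbols already loaded into
# the process, including libc's system().
libc = ctypes.CDLL(None)

# int system(const char *command);
libc.system.argtypes = [ctypes.c_char_p]
libc.system.restype = ctypes.c_int

status = libc.system(b"exit 0")   # bytes are accepted as char*

try:
    libc.system("exit 0")         # str no longer slips through as wchar_t*
    rejected = False
except ctypes.ArgumentError:
    rejected = True

print(status, rejected)
```

With argtypes set, the mistake shows up as a Python exception at the call site rather than as a confusing shell error.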

What is the difference between prefixing the literal with b and appending .encode('utf-8')? A quick experiment:

>>> s = "str"
>>> s.encode("utf-8")
b'str'
>>> s.encode("ascii")
b'str'
>>> s.encode("utf-8") == b'str'
True
>>> s.encode("ascii") == b'str'
True
>>> s.encode("utf-8") == s.encode("ascii")
True
>>> s = "str"
>>> c_char_p(s.encode('utf-8'))
c_char_p(140096259724976)
>>> c_char_p(b"str")
c_char_p(140096243429648)

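The strings encoded as UTF-8 and as ASCII both equal the b-prefixed literal. Why do the two encodings give the same byte string? UTF-8 is backward compatible with ASCII: the first 128 code points are encoded as the same single bytes, so for text containing only ASCII characters both encodings produce identical bytes.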
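The equality stops holding as soon as the text contains a non-ASCII character, which a short check makes concrete:

```python
ascii_text = "str"
# Pure ASCII text: both codecs produce the same bytes as the b"" literal
same_for_ascii = (
    ascii_text.encode("utf-8") == ascii_text.encode("ascii") == b"str"
)

non_ascii_text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
# UTF-8 encodes U+00E9 as two bytes
utf8_bytes = non_ascii_text.encode("utf-8")

# The ASCII codec cannot represent it at all
try:
    non_ascii_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(same_for_ascii, utf8_bytes, ascii_failed)
```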
