C++ 与 Python相互调用

最新推荐文章于 2024-03-20 07:45:17 发布

黑贝是条狗

最新推荐文章于 2024-03-20 07:45:17 发布

阅读量2.5k

点赞数

分类专栏： python c/c++ 文章标签： python c++ 开发语言 Powered by 金山文档

原文链接：https://cloud.tencent.com/developer/article/1919872

版权

c/c++ 同时被 2 个专栏收录

150 篇文章 13 订阅

订阅专栏

python

26 篇文章 0 订阅

订阅专栏

1. 背景

目前AI算法开发特别是训练基本都以Python为主，主流的AI计算框架如TensorFlow、PyTorch等都提供了丰富的Python接口。有句话说得好，人生苦短，我用Python。但由于Python属于动态语言，解释执行并缺少成熟的JIT方案，计算密集型场景多核并发受限等原因，很难直接满足较高性能要求的实时Serving需求。在一些对性能要求高的场景下，还是需要使用C/C++来解决。但是如果要求算法同学全部使用C++来开发线上推理服务，成本又非常高，导致开发效率和资源浪费。因此，如果有轻便的方法能将Python和部分C++编写的核心代码结合起来，就能达到既保证开发效率又保证服务性能的效果。本文主要介绍pybind11在腾讯广告多媒体AI Python算法的加速实践，以及过程中的一些经验总结。

2. 业内方案

2.1 原生方案

Python官方提供了Python/C API，可以实现「用C语言编写Python库」，先上一段代码感受一下：

static PyObject *spam_system(PyObject *self, PyObject *args){const char *command;
    int sts;if(!PyArg_ParseTuple(args,"s",&command))returnNULL;
    sts =system(command);returnPyLong_FromLong(sts);}

复制

可见改造成本非常高，所有的基本类型都必须手动改为CPython解释器封装的binding类型。由此不难理解，为何Python官网也建议大家使用第三方解决方案1。

2.2 Cython

Cython主要打通的是Python和C，方便为Python编写C扩展。Cython 的编译器支持转化 Python 代码为 C 代码，这些 C 代码可以调用 Python/C 的 API。从本质上来说，Cython 就是包含 C 数据类型的 Python。目前Python的numpy，以及我厂的tRPC-Python框架有所应用。

缺点：

需要手动植入Cython自带语法（cdef 等），移植和复用成本高

需要增加其他文件，如setup.py、*.pyx来让你的Python代码最后能够转成性能较高的C代码

对于C++的支持程度存疑

2.3 SIWG

SIWG主要解决其他高级语言与C和C++语言交互的问题，支持十几种编程语言，包括常见的java、C#、javascript、Python等。使用时需要用*.i文件定义接口，然后用工具生成跨语言交互代码。但由于支持的语言众多，因此在Python端性能表现不是太好。

值得一提的是，TensorFlow早期也是使用SWIG来封装Python接口，正式由于SIWG存在性能不够好、构建复杂、绑定代码晦涩难读等问题，TensorFlow已于2019年将SIWG切换为pybind112。

2.4 Boost.Python

C++中广泛应用的Boost开源库，也提供了Python binding功能。使用上，通过宏定义和元编程来简化Python的API调用。但最大的缺点是需要依赖庞大的Boost库，编译和依赖关系包袱重，只用于解决Python binding的话有一种高射炮打蚊子的既视感。

2.5 pybind11

可以理解为以Boost.Python为蓝本，仅提供Python & C++ binding功能的精简版，相对于Boost.Python在binary size以及编译速度上有不少优势。对C++支持非常好，基于C++11应用了各种新特性，也许pybind11的后缀11就是出于这个原因。

Pybind11 通过 C++ 编译时的自省来推断类型信息，来最大程度地减少传统拓展 Python 模块时繁杂的样板代码，且实现了常见数据类型，如 STL 数据结构、智能指针、类、函数重载、实例方法等到Python的自动转换，其中函数可以接收和返回自定义数据类型的值、指针或引用。

特点：

轻量且功能单一，聚焦于提供C++ & Python binding，交互代码简洁

对常见的C++数据类型如STL、Python库如numpy等兼容很好，无人工转换成本

only header方式，无需额外代码生成，编译期即可完成绑定关系建立，减小binary大小

支持C++新特性，对C++的重载、继承，debug方式便捷易用

完善的官方文档支持，应用于多个知名开源项目

“Talk is cheap, show me your code.” 三行代码即可快速实现绑定，你值得拥有：

PYBIND11_MODULE(libcppex, m){
    m.def("add",[](int a, int b)-> int {return a + b;});}

复制

3. Python调C++

3.1 从GIL锁说起

GIL（Global Interpreter Lock）全局解释器锁：同一时刻在一个进程只允许一个线程使用解释器，导致多线程无法真正用到多核。由于持有锁的线程在执行到I/O密集函数等一些等待操作时会自动释放GIL锁，所以对于I/O密集型服务来说，多线程是有效果的。但对于CPU密集型操作，由于每次只能有一个线程真正执行计算，对性能的影响可想而知。

这里必须说明的是，GIL并不是Python本身的缺陷，而是目前Python默认使用的CPython解析器引入的线程安全保护锁。我们一般说Python存在GIL锁，其实只针对于CPython解释器。那么如果我们能想办法避开GIL锁，是不是就能有很不错的加速效果？答案是肯定的，一种方案是改为使用其他解释器如pypy等，但对于成熟的C扩展库兼容不够好，维护成本高。另一种方案，就是通过C/C++扩展来封装计算密集部分代码，并在执行时移除GIL锁。

3.2 Python算法性能优化

pybind11就提供了在C++端手动释放GIL锁的接口，因此，我们只需要将密集计算的部分代码，改造成C++代码，并在执行前后分别释放/获取GIL锁，Python算法的多核计算能力就被解锁了。当然，除了显示调用接口释放GIL锁的方法之外，也可以在C++内部将计算密集型代码切换到其他C++线程异步执行，也同样可以规避GIL锁利用多核。

下面以100万次城市间球面距离计算为例，对比C++扩展前后性能差异：

C++端：

#include <math.h>
#include <stdio.h>
#include <time.h>
#include <pybind11/embed.h>


namespace py = pybind11;const double pi =3.1415926535897932384626433832795;

double rad(double d){return d * pi /180.0;}

double geo_distance(double lon1, double lat1, double lon2, double lat2, int test_cnt){py::gil_scoped_release release;// 释放GIL锁

    double a, b, s;
    double distance =0;for(int i =0; i < test_cnt; i++){
        double radLat1 =rad(lat1);
        double radLat2 =rad(lat2);
        a = radLat1 - radLat2;
        b =rad(lon1)-rad(lon2);
        s =pow(sin(a/2),2)+cos(radLat1)*cos(radLat2)*pow(sin(b/2),2);
        distance =2*asin(sqrt(s))*6378*1000;}py::gil_scoped_acquire acquire;// C++执行结束前恢复GIL锁return distance;}PYBIND11_MODULE(libcppex, m){
    m.def("geo_distance",&geo_distance,R"pbdoc(
        Compute geography distance between two places.)pbdoc");}

复制

Python调用端：

import sys
import time
import math
import threading

from libcppex import*

def rad(d):return d *3.1415926535897932384626433832795/180.0


def geo_distance_py(lon1, lat1, lon2, lat2, test_cnt):
    distance =0for i inrange(test_cnt):
        radLat1 =rad(lat1)
        radLat2 =rad(lat2)
        a = radLat1 - radLat2
        b =rad(lon1)-rad(lon2)
        s = math.sin(a/2)**2+ math.cos(radLat1)* math.cos(radLat2)* math.sin(b/2)**2
        distance =2* math.asin(math.sqrt(s))*6378*1000print(distance)return distance


def call_cpp_extension(lon1, lat1, lon2, lat2, test_cnt): 
    res =geo_distance(lon1, lat1, lon2, lat2, test_cnt)print(res)return res 


if __name__ =="__main__":
    threads =[]
    test_cnt =1000000
    test_type = sys.argv[1]
    thread_cnt =int(sys.argv[2])
    start_time = time.time()for i inrange(thread_cnt):if test_type =='p':
            t = threading.Thread(target=geo_distance_py, 
                args=(113.973129,22.599578,114.3311032,22.6986848, test_cnt,))
        elif test_type =='c':
            t = threading.Thread(target=call_cpp_extension, 
                args=(113.973129,22.599578,114.3311032,22.6986848, test_cnt,))
        threads.append(t)
        t.start()for thread inthreads:
        thread.join()print('calc time = %d'%int((time.time()- start_time)*1000))

复制

性能对比：

单线程时耗：Python 1500ms，C++ 8ms

10线程时耗：Python 15152ms，C++ 16ms

CPU利用率：

△ Python多线程无法同时刻多核并行计算，仅相当于单核利用率

并行计算，仅相当于单核利用率]

△ C++可以吃满DevCloud机器的10个CPU核

结论：

计算密集型代码，单纯改为C++实现即可获得不错的性能提升，在多线程释放GIL锁的加持下，充分利用多核，性能轻松获得线性加速比，大幅提升资源利用率。虽然实际场景中也可以用Python多进程的方式来利用多核，但是在模型越来越大动辄数十G的趋势下，内存占用过大不说，进程间频繁切换的context switching overhead，以及语言本身的性能差异，导致与C++扩展方式依然有不少差距。

（注：以上测试demo github地址：https://github.com/jesonxiang/cpp_extension_pybind11，测试环境为CPU 10核容器，大家有兴趣也可以做性能验证。）

3.3 编译环境

编译指令：

g++-Wall -shared -std=gnu++11-O2-fvisibility=hidden -fPIC -I./ perfermance.cc -o libcppex.so `Python3-config --cflags --ldflags --libs`

复制

如果Python环境未正确配置可能报错：

这里对Python的依赖是通过Python3-config --cflags --ldflags --libs来自动指定，可先单独运行此命令来验证Python依赖是否配置正确。Python3-config正常执行依赖Python3-dev，可以通过以下命令安装：

yum install Python3-devel

4. C++调Python

一般pybind11都是用于给C++代码封装Python端接口，但是反过来C++调Python也是支持的。只需#include <pybind11/embed.h>头文件即可使用，内部是通过嵌入CPython解释器来实现。使用上也非常简单易用，同时有不错的可读性，与直接调用Python接口非常类似。比如对一个numpy数组调用一些方法，参考示例如下：

// C++
pyVec = pyVec.attr("transpose")().attr("reshape")(pyVec.size());

复制

# Python
pyVec = pyVec.transpose().reshape(pyVec.size)

复制

以下以我们开发的C++ GPU高性能版抽帧so为例，除了提供抽帧接口给到Python端调用，还需要回调给Python从而通知抽帧进度以及帧数据。

Python端回调接口：

def on_decoding_callback(task_id:str,progress:int):print("decoding callback, task id: %s, progress: %d"%(task_id, progress))if __name__ =="__main__":
    decoder =DecoderWrapper()
    decoder.register_py_callback(os.getcwd()+"/decode_test.py","on_decoding_callback")

复制

C++端接口注册 & 回调Python：

#include <pybind11/embed.h>

int DecoderWrapper::register_py_callback(conststd::string &py_path,conststd::string &func_name){
        int ret =0;conststd::string &pyPath =py_get_module_path(py_path);conststd::string &pyName =py_get_module_name(py_path);SoInfo("get py module name: %s, path: %s", pyName.c_str(), pyPath.c_str());py::gil_scoped_acquire acquire;py::object sys = py::module::import("sys");
        sys.attr("path").attr("append")(py::str(pyPath.c_str()));//Python脚本所在的路径py::module pyModule = py::module::import(pyName.c_str());if(pyModule ==NULL){LogError("Failed to load pyModule ..");py::gil_scoped_release release;returnPYTHON_FILE_NOT_FOUND_ERROR;}if(py::hasattr(pyModule, func_name.c_str())){
            py_callback = pyModule.attr(func_name.c_str());}else{
            ret =PYTHON_FUNC_NOT_FOUND_ERROR;}py::gil_scoped_release release;return ret;}
    
int DecoderListener::on_decoding_progress(std::string &task_id, int progress){if(py_callback !=NULL){try{py::gil_scoped_acquire acquire;py_callback(task_id, progress);py::gil_scoped_release release;}catch(py::error_already_set const&PythonErr){LogError("catched Python exception: %s", PythonErr.what());}catch(conststd::exception &e){LogError("catched exception: %s", e.what());}catch(...){LogError("catched unknown exception");}}}

复制

5. 数据类型转换

5.1 类成员函数

对于类和成员函数的binding，首先需要构造对象，所以分为两步：第一步是包装实例构造方法，另一步是注册成员函数的访问方式。同时，也支持通过def_static、def_readwrite来绑定静态方法或成员变量，具体可参考官方文档3。

#include <pybind11/pybind11.h>classHello{public:Hello(){}voidsay(conststd::string s){std::cout << s << std::endl;}};PYBIND11_MODULE(py2cpp, m){
    m.doc()="pybind11 example";pybind11::class_<Hello>(m,"Hello").def(pybind11::init())//构造器，对应c++类的构造函数，如果没有声明或者参数不对，会导致调用失败.def("say",&Hello::say );}/* 
Python 调用方式：
c = py2cpp.Hello()
c.say()
*/

复制

5.2 STL容器

pybind11支持STL容器自动转换，当需要处理STL容器时，只要额外包括头文件<pybind11/stl.h>即可。pybind11提供的自动转换包括：std::vector<>/std::list<>/std::array<> 转换成 Python list ；std::set<>/std::unordered_set<> 转换成 Python set ; std::map<>/std::unordered_map<> 转换成dict等。此外 std::pair<> 和 std::tuple<>的转换也在 <pybind11/pybind11.h> 头文件中提供了。

#include <iostream>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>classContainerTest{public:ContainerTest(){}voidSet(std::vector<int> v){
        mv = v;}private:std::vector<int> mv;};PYBIND11_MODULE(py2cpp, m){
    m.doc()="pybind11 example";pybind11::class_<ContainerTest>(m,"CTest").def( pybind11::init()).def("set",&ContainerTest::Set );}/* 
Python 调用方式：
c = py2cpp.CTest()
c.set([1,2,3])
*/

复制

5.3 bytes、string类型传递

由于在Python3中 string类型默认为UTF-8编码，如果从C++端传输string类型的protobuf数据到Python，则会出现 “UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte” 的报错。

解决方案：pybind11提供了非文本数据的binding类型py::bytes：

m.def("return_bytes",[](){std::string s("\xba\xd0\xba\xd0");// Not valid UTF-8returnpy::bytes(s);// Return the data without transcoding});

复制

5.4 智能指针

std::unique_ptrpybind11支持直接转换：

std::unique_ptr<Example>create_example(){returnstd::unique_ptr<Example>(newExample());}
m.def("create_example",&create_example);

复制

std::shared_ptr需要特别注意的是，不能直接使用裸指针。如下的get_child函数在Python端调用会报内存访问异常（如segmentation fault）。

classChild{};classParent{public:Parent():child(std::make_shared<Child>()){}
   Child *get_child(){return child.get();}/* Hint: ** DON'T DO THIS ** */private:std::shared_ptr<Child> child;};PYBIND11_MODULE(example, m){py::class_<Child,std::shared_ptr<Child>>(m,"Child");py::class_<Parent,std::shared_ptr<Parent>>(m,"Parent").def(py::init<>()).def("get_child",&Parent::get_child);}

复制

5.5 cv::Mat到numpy转换

抽帧结果返回给Python端时，由于目前pybind11暂不支持自动转换cv::Mat数据结构，因此需要手动处理C++ cv::Mat和Python端numpy之间的绑定。转换代码如下：

/*
 Python->C++ Mat
*/cv::Mat numpy_uint8_3c_to_cv_mat(py::array_t<uint8_t>& input){if(input.ndim()!=3)throwstd::runtime_error("3-channel image must be 3 dims ");py::buffer_info buf = input.request();cv::Mat mat(buf.shape[0], buf.shape[1],CV_8UC3,(uint8_t*)buf.ptr);return mat;}/*
 C++ Mat ->numpy
*/py::array_t<uint8_t>cv_mat_uint8_3c_to_numpy(cv::Mat& input){py::array_t<uint8_t> dst = py::array_t<uint8_t>({ input.rows,input.cols,3}, input.data);return dst;}

复制

5.6 zero copy

一般来说跨语言调用都产生性能上的overhead，特别是对于大数据块的传递。因此，pybind11也支持了数据地址传递的方式，避免了大数据块在内存中的拷贝操作，性能上提升很大。

classMatrix{public:Matrix(size_t rows, size_t cols):m_rows(rows),m_cols(cols){
        m_data =newfloat[rows*cols];}
    float *data(){return m_data;}
    size_t rows()const{return m_rows;}
    size_t cols()const{return m_cols;}private:
    size_t m_rows, m_cols;
    float *m_data;};py::class_<Matrix>(m,"Matrix",py::buffer_protocol()).def_buffer([](Matrix &m)-> py::buffer_info {returnpy::buffer_info(
            m.data(),/* Pointer to buffer */sizeof(float),/* Size of one scalar */py::format_descriptor<float>::format(),/* Python struct-style format descriptor */2,/* Number of dimensions */{ m.rows(), m.cols()},/* Buffer dimensions */{sizeof(float)* m.cols(),/* Strides (in bytes) for each index */sizeof(float)});});

复制

6. 落地 & 行业应用

上述方案，我们已在广告多媒体AI的色彩提取相关服务、GPU高性能抽帧等算法中落地，取得了非常不错的提速效果。业内来说，目前市面上大部分AI计算框架，如TensorFlow、Pytorch、阿里X-Deep Learning、百度PaddlePaddle等，均使用pybind11来提供C++到Python端接口封装，其稳定性以及性能均已得到广泛验证。

7. 结语

在AI领域普遍开源节流、降本提效的大背景下，如何充分利用好现有资源，提升资源利用率是关键。本文提供了一个非常便捷的提升Python算法服务性能，以及CPU利用率的解决方案，并在线上取得了不错的效果。除此之外，腾讯内部也有一些其他Python加速方案，比如目前TEG的编译优化团队正在做Python解释器的优化工作，后续也可以期待一下。

8. 附录

1(https://docs.Python.org/3/extending/index.html#extending-index)

2(https://github.com/tensorflow/community/blob/master/rfcs/20190208-pybind11.md#replace-swig-with-pybind11)

3(https://pybind11.readthedocs.io/en/stable/advanced/cast/index.html)

黑贝是条狗

关注

0
点赞
踩
10

收藏

觉得还不错? 一键收藏
0
评论
C++ 与 Python相互调用

pybind11就提供了在C++端手动释放GIL锁的接口，因此，我们只需要将密集计算的部分代码，改造成C++代码，并在执行前后分别释放/获取GIL锁，Python算法的多核计算能力就被解锁了。Pybind11 通过 C++ 编译时的自省来推断类型信息，来最大程度地减少传统拓展 Python 模块时繁杂的样板代码，且实现了常见数据类型，如 STL 数据结构、智能指针、类、函数重载、实例方法等到Python的自动转换，其中函数可以接收和返回自定义数据类型的值、指针或引用。
复制链接

扫一扫