c++ 编译添加dll_通过ctypes调用dll实现Python和c++混合编程

最新推荐文章于 2025-01-09 11:33:12 发布

最新推荐文章于 2025-01-09 11:33:12 发布 · 402 阅读

文章标签：

#c++ 编译添加dll #c++6.0 图标封装dll #C++如何测试dll #c++调用dll #cadlib不用dll怎么直接调用里面的函数 #cpp 编译dll

本文介绍如何使用Python调用C++编写的动态链接库（DLL），以提高算法执行效率。通过具体示例展示了从DLL开发到Python调用的全过程，并讨论了性能优化技巧。

部署运行你感兴趣的模型镜像

0.问题描述

Python是一个适合算法验证的语言，其优点与缺点都十分明显，无需编译运行十分“顺滑”，但是运行效率较低。

之前利用Python编写了多晶塑性计算程序，比较复杂的算法和功能在很短的时间内可以开发完成，然而运行起来很令人头疼，一个算例需要计算近四十个小时，无法忍受需要优化程序。当然，优化程序的方法和方向有很多，比如并行计算，比如优化Python代码以多利用numpy包，比如利用numba等。在不改变算法的情况下，换一个运行效率高的需要编译的语言是个直接的思路，比如C++。基于Python编写的前后处理程序自认为很好用，不是大团队也不是专业的程序员的情况下，不想花大量时间用c++重构全部的程序，那就只改一改最耗时的计算内核。

利用C++编写程序中的迭代步计算内核，然后用Python封装，是能想到的适合的方法。解决方案为：使用C++开发计算内核的dll动态链接库，python中利用ctypes调用dll，利用Python编写总体计算程序。

下面给出一个非常简单的例子，和问题描述没什么直接关系-__-#，反正就是一个ctypes的小栗子。

1.利用Visual Studio开发C++的64位dll

实现一个特别简单的功能，输入一个数组double[9]作为一个3x3矩阵，输出一个struct，包含一个double作为该矩阵的行列式，包含一个double[9]，作为该矩阵的逆矩阵。

这里利用visual studio的dll模板直接开发。参考https://docs.microsoft.com/en-us/cpp/build/walkthrough-creating-and-using-a-dynamic-link-library-cpp?view=vs-2019。

首先打开VS(这里是VS2019 社区版)，创建新项目->（C++,windows）动态链接库->命名并创建。

向项目中添加头文件nrtools.h和ludcmp.c其中前者主要定义了相关的向量、矩阵和ludcmp函数等，具体可以参考Numerical Recipes in C++，而后者是生成dll的关键，其中定义了dll的输入数据结构，输出数据结构和接口函数：

#pragma once

#ifdef LUDCMPDLL_EXPORTS
#define LUDCMP_API __declspec(dllexport)
#else
#define LUDCMP_API __declspec(dllimport)
#endif

extern "C" {

//input data structure
typedef struct InputData {
    int flagValue;
    double oriMat[9];  //the flatten matrix to be deal with
}IPDATA;

//output data structure
typedef struct OutputData {
    double determin;    //determinant of the matrix
    double invMat[9];   //inverse of the matrix
}OTDATA;

LUDCMP_API OTDATA interface(const IPDATA& fish);

}

其中3~7行是MSVC必须的，12-21定义了输入输出数据结构，而23行是接口函数原型。接口函数在ludcmp.cpp文件中定义，需在项目中添加该源文件。

#include "pch.h"
#include "ludcmp.h"
#include "nrtools.h"

//function used within this cpp---------------------------------
void mat2array(const MatDoub& mat3x3, double(&ary9)[9]);
//--------------------------------------------------------------

OTDATA interface(const IPDATA& fish) {
    //
    MatDoub_I aa(3, 3, fish.oriMat);
    //
    LUdcmp aadcmp(aa);
    MatDoub_O aainv(3, 3);
    aadcmp.inverse(aainv);          //aainv restore the inverse of aa
    //
    OTDATA results;
    // first property
    results.determin = (fish.flagValue > 0) ? aadcmp.det() : -aadcmp.det();
    // second property
    mat2array(aainv, results.invMat);
    //
    return results;
}

//function only used in this file-----------------------------------------
/* turn a define MatDoub class,3x3 to a normal double [9]
*/
void mat2array(const MatDoub& mat3x3, double(&ary9)[9]) {
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            ary9[i * 3 + j] = mat3x3[i][j];
        }
    }
}

非常简单，主要是利用了Numerical Recipes中的LU分解算法，利用LUDCMP类来实现求逆求行列式的。这里21利用了一个小函数将3x3的矩阵转化成我们需要的数组用来输出。

代码全部码完后，利用VS的x64编译器，解决方案选“Debug"-"x64”，点击“生成”-“生成解决方案”，若无错误，将在./x64/Debug路径下生成所需的ludcmpDll.dll和ludcmpDll.lib文件，这就是所需的动态链接库。因为本人的python是64位的，所以要生成64位dll，大家情况不同，要注意版本匹配。

2.利用ctypes在python中调用dll

ctypes是python的标准包（个人使用anaconda3）。可以利用ctypes调用dll，并使用dll中定义的接口函数进行计算。这里文件的结构如下，clibraty文件夹中放进了需要调用的dll文件，编写python脚本调用

./
|--test.py
|--clibray/
        |--ludcmpDLL.dll
        |--ludcmpDll.lib

引用ctypes

import ctypes

在使用dll中的接口函数前，需要完成三件事情1）载入dll，2）结构体声明，3）类似C++中的函数原型声明，若是没有使用结构体或联合体等复杂数据结构，可以忽略2）。

#--1)--调用dll
ludcmpCpp = ctypes.CDLL("./clibrary/ludcmpDll")

#--2)--定义结构体
#输入结构体
class IPDATA(ctypes.Structure):
    _fields_ = [
        ("flagValue",ctypes.c_int),
        ("oriMat",ctypes.c_double*9)
        ]
#输出结构体
class OTDATA(ctypes.Structure):
    _fields_ = [
        ("determin",ctypes.c_double),
        ("invMat",ctypes.c_double*9)
        ]

#--3)--函数声明 as in c++ head file, ludcmp就是我们要用的interface函数了
    ludcmp = ludcmpCpp.interface
    ludcmp.argtypes = [IPDATA]
    ludcmp.restype = OTDATA

这里面可以看到ctypes中定义结构体，其实生成一个ctypes中Structure类的派生类。注意函数的输入是一个IPDATA类型。举个例子，如果想算一个3x3的numpy.narray的逆矩阵和行列式，要把该矩阵转换成IPDATA类型：

mat = np.array([[2.0,0.0,0.0],
                [0.0,1.0,0.0],
                [0.0,0.0,1.0]])
#
aa = IPDATA()
aa.flagValue = 20   #any intergal greater than 0
aa.oriMat = (ctypes.c_double*9)(*mat.ravel().tolist())

然后使用dll中的interface函数计算出一个OTDATA，最后将OTDATA中的属性转换为所需数据：

bb = ludcmp(aa)     #bb is a OTDATA
matInv = np.array(bb.invMat).reshape(3,3)   #逆矩阵，存为3x3
detmin = bb.determin                        #行列式

3.测试

求一个矩阵的逆矩阵和行列式，numpy里面就有很好用的函数：

numpy.linalg.inv()
numpy.linalg.det()

做一个测试函数，计算3x3矩阵的逆矩阵和行列式。使用两种方法，一种是利用上文提到的调用dll的方式，另一种是利用numpy的linalg，两种算法封装为函数：

import ctypes
import numpy as np

#define structure------------------------------
class IPDATA(ctypes.Structure):
    _fields_ = [
        ("flagValue",ctypes.c_int),
        ("oriMat",ctypes.c_double*9)
        ]

class OTDATA(ctypes.Structure):
    _fields_ = [
        ("determin",ctypes.c_double),
        ("invMat",ctypes.c_double*9)
        ]

def fun1(mat):
    # use dll to calculate inverse and determinant
    #----open dll
    ludcmpCpp = ctypes.CDLL("./clibrary/ludcmpDll")
    #----declaim function as in c++ head file
    ludcmp = ludcmpCpp.interface
    ludcmp.argtypes = [IPDATA]
    ludcmp.restype = OTDATA
    #----prepare for input
    aa = IPDATA()
    aa.flagValue = 20   #any intergal greater than 0
    aa.oriMat = (ctypes.c_double*9)(*mat.ravel().tolist())
    #----call interface function in dll
    bb = ludcmp(aa)     #bb is a OTDATA
    #
    return np.array(bb.invMat).reshape(3,3),bb.determin

def fun2(mat):
    # use numpy linalg package
    matInv = np.linalg.inv(mat)
    determ = np.linalg.det(mat)
    return matInv,determ

编写一个测试函数，分别调用fun1和fun2，算三个矩阵的逆矩阵和行列式，为体现性能差异，每个矩阵运行10000次。测试函数为：

def testfun(mat1,mat2,mat3):
    #run n times for each mat
    n = 10000
    for i in range(n):
        inv1,det1 = fun1(mat1)
        inv2,det2 = fun1(mat2)
        inv3,det3 = fun1(mat3)
    #
    for i in range(n):
        inv1,det1 = fun2(mat1)
        inv2,det2 = fun2(mat2)
        inv3,det3 = fun2(mat3)        
    #
    return None

利用cProfile测试一番：

import cProfile as cp
mat1 = np.array([[2.0,0.0,0.0],         #traction along 1
                 [0.0,1.0,0.0],
                 [0.0,0.0,1.0]])
mat2 = np.array([[1.0,2.0,0.0],         #shear in 12
                 [0.0,1.0,0.0],
                 [0.0,0.0,1.0]])
mat3 = np.array([[1.0,3.0,0.0],         #shear in 12 and traction along 2
                 [0.0,0.5,0.0],
                 [0.0,0.0,1.0]])
#
cp.run("testfun(mat1,mat2,mat3)")

输出(截取了有用的部分)

1620004 function calls in 13.609 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    ......
    ......
    30000    0.142    0.000    0.309    0.000 linalg.py:482(inv)
    30000    0.743    0.000   12.892    0.000 test.py:25(fun1)
    30000    0.033    0.000    0.668    0.000 test.py:54(fun2)
        1    0.049    0.049   13.608   13.608 test.py:61(testfun)
    30000   11.142    0.000   11.142    0.000 {built-in method _ctypes.LoadLibrary}
    ......
    ......

令人伤心的结果，一共运行13.6秒，fun1自己就用了12.9秒，所以能用numpy还是用numpy吧，毕竟np。

wait，wait，wait，进一步分析，怀疑真正耗时的是ctypes中的LoadLibrary，每次运行fun1都要载入一次dll，那确实浪费时间。可以进一步修改fun1，让其只负责算，不负责载入dll，载入好的函数作为参数传递过来。

def fun1plus(mat,func):
    #----prepare for input
    aa = IPDATA()
    aa.flagValue = 20   #any intergal greater than 0
    aa.oriMat = (ctypes.c_double*9)(*mat.ravel().tolist())
    #----call interface function in dll
    bb = func(aa)     #bb is a OTDATA
    #
    return np.array(bb.invMat).reshape(3,3),bb.determin

测试函数改为只载入一次dll，然后再循环30000次interface函数计算：

def testfun2(mat1,mat2,mat3):
    #run n times for each mat
    n = 10000
    # load in dll
    ludcmpCpp = ctypes.CDLL("./clibrary/ludcmpDll")
    #----define function as in c++ head file
    ludcmp = ludcmpCpp.interface
    ludcmp.argtypes = [IPDATA]
    ludcmp.restype = OTDATA
    for i in range(n):
        inv1,det1 = fun1plus(mat1,ludcmp)
        inv2,det2 = fun1plus(mat2,ludcmp)
        inv3,det3 = fun1plus(mat3,ludcmp)
    #

    for i in range(n):
        inv1,det1 = fun2(mat1)
        inv2,det2 = fun2(mat2)
        inv3,det3 = fun2(mat3)        
    #
    return None

看一下这次的测试结果：cp.run("testfun2(mat1,mat2,mat3)")

1320014 function calls in 1.066 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    ......
    ......
    30000    0.137    0.000    0.279    0.000 linalg.py:482(inv)
    30000    0.375    0.000    0.435    0.000 test.py:42(fun1plus)
    30000    0.029    0.000    0.591    0.000 test.py:54(fun2)
        1    0.030    0.030    1.066    1.066 test.py:76(testfun2)
        1    0.010    0.010    0.010    0.010 {built-in method _ctypes.LoadLibrary}
    ......

果然如此，载入一次dll，要花掉0.01秒，，然后用dll中的函数算30000次。对比发现，fun1plus比fun2算的时间还要少一些，那用c++和python混合编程的方式还是可以的。。。除了c++比较难码-__-

ps：

计算多晶时，一个多晶需要包含大量单晶信息，比如Taylor假设，此时一个单晶的计算核心可以使用c++编写dll来计算，这样在计算多晶时，首先载入dll，然后循环晶粒个数次单晶计算，这样比全python编程快很多，看情况吧，个人情况是使用dll运算速度快了几百倍。

查资料过程中，很多帖子说numba无脑提速，感觉以后有时间了可以好好试一试，简单上手了一下，感觉这个东西不是那么无脑，脾气很大。

您可能感兴趣的与本文相关的镜像