python打包dll被c调用_通过ctypes调用dll实现Python和c++混合编程

最新推荐文章于 2025-01-09 11:33:12 发布

weixin_39597399

最新推荐文章于 2025-01-09 11:33:12 发布

阅读量708

点赞数

文章标签： python打包dll被c调用

0.问题描述

Python是一个适合算法验证的语言，其优点与缺点都十分明显，无需编译运行十分“顺滑”，但是运行效率较低。

之前利用Python编写了多晶塑性计算程序，比较复杂的算法和功能在很短的时间内可以开发完成，然而运行起来很令人头疼，一个算例需要计算近四十个小时，无法忍受需要优化程序。当然，优化程序的方法和方向有很多，比如并行计算，比如优化Python代码以多利用numpy包，比如利用numba等。在不改变算法的情况下，换一个运行效率高的需要编译的语言是个直接的思路，比如C++。基于Python编写的前后处理程序自认为很好用，不是大团队也不是专业的程序员的情况下，不想花大量时间用c++重构全部的程序，那就只改一改最耗时的计算内核。

利用C++编写程序中的迭代步计算内核，然后用Python封装，是能想到的适合的方法。解决方案为：使用C++开发计算内核的dll动态链接库，python中利用ctypes调用dll，利用Python编写总体计算程序。

下面给出一个非常简单的例子，和问题描述没什么直接关系-__-#，反正就是一个ctypes的小栗子。

1.利用Visual Studio开发C++的64位dll

实现一个特别简单的功能，输入一个数组double[9]作为一个3x3矩阵，输出一个struct，包含一个double作为该矩阵的行列式，包含一个double[9]，作为该矩阵的逆矩阵。

首先打开VS(这里是VS2019 社区版)，创建新项目->(C++,windows)动态链接库->命名并创建。

向项目中添加头文件nrtools.h和ludcmp.c其中前者主要定义了相关的向量、矩阵和ludcmp函数等，具体可以参考Numerical Recipes in C++，而后者是生成dll的关键，其中定义了dll的输入数据结构，输出数据结构和接口函数：

#pragma once

#ifdef LUDCMPDLL_EXPORTS#define LUDCMP_API __declspec(dllexport)#else#define LUDCMP_API __declspec(dllimport)#endif

extern "C" {

//input data structuretypedef struct InputData {

int flagValue;

double oriMat[9]; //the flatten matrix to be deal with}IPDATA;

//output data structuretypedef struct OutputData {

double determin; //determinant of the matrix double invMat[9]; //inverse of the matrix}OTDATA;

LUDCMP_API OTDATA interface(const IPDATA& fish);

}

其中3~7行是MSVC必须的，12-21定义了输入输出数据结构，而23行是接口函数原型。接口函数在ludcmp.cpp文件中定义，需在项目中添加该源文件。

#include "pch.h"#include "ludcmp.h"#include "nrtools.h"

//function used within this cpp---------------------------------void mat2array(const MatDoub& mat3x3, double(&ary9)[9]);

//--------------------------------------------------------------

OTDATA interface(const IPDATA& fish) {

// MatDoub_I aa(3, 3, fish.oriMat);

// LUdcmp aadcmp(aa);

MatDoub_O aainv(3, 3);

aadcmp.inverse(aainv); //aainv restore the inverse of aa // OTDATA results;

// first property results.determin = (fish.flagValue > 0) ? aadcmp.det() : -aadcmp.det();

// second property mat2array(aainv, results.invMat);

// return results;

}

//function only used in this file-----------------------------------------/* turn a define MatDoub class,3x3 to a normal double [9]*/

void mat2array(const MatDoub& mat3x3, double(&ary9)[9]) {

for (int i = 0; i < 3; i++) {

for (int j = 0; j < 3; j++) {

ary9[i * 3 + j] = mat3x3[i][j];

}

非常简单，主要是利用了Numerical Recipes中的LU分解算法，利用LUDCMP类来实现求逆求行列式的。这里21利用了一个小函数将3x3的矩阵转化成我们需要的数组用来输出。

代码全部码完后，利用VS的x64编译器，解决方案选“Debug"-"x64”，点击“生成”-“生成解决方案”，若无错误，将在./x64/Debug路径下生成所需的ludcmpDll.dll和ludcmpDll.lib文件，这就是所需的动态链接库。因为本人的python是64位的，所以要生成64位dll，大家情况不同，要注意版本匹配。

2.利用ctypes在python中调用dll

ctypes是python的标准包(个人使用anaconda3)。可以利用ctypes调用dll，并使用dll中定义的接口函数进行计算。这里文件的结构如下，clibraty文件夹中放进了需要调用的dll文件，编写python脚本调用

|--test.py

|--clibray/

|--ludcmpDLL.dll

|--ludcmpDll.lib

引用ctypes

import ctypes

在使用dll中的接口函数前，需要完成三件事情1)载入dll，2)结构体声明，3)类似C++中的函数原型声明，若是没有使用结构体或联合体等复杂数据结构，可以忽略2)。

#--1)--调用dll

ludcmpCpp = ctypes.CDLL("./clibrary/ludcmpDll")

#--2)--定义结构体

#输入结构体

class IPDATA(ctypes.Structure):

_fields_ = [

("flagValue",ctypes.c_int),

("oriMat",ctypes.c_double*9)

]

#输出结构体

class OTDATA(ctypes.Structure):

_fields_ = [

("determin",ctypes.c_double),

("invMat",ctypes.c_double*9)

]

#--3)--函数声明 as in c++ head file, ludcmp就是我们要用的interface函数了

ludcmp = ludcmpCpp.interface

ludcmp.argtypes = [IPDATA]

ludcmp.restype = OTDATA

这里面可以看到ctypes中定义结构体，其实生成一个ctypes中Structure类的派生类。注意函数的输入是一个IPDATA类型。举个例子，如果想算一个3x3的numpy.narray的逆矩阵和行列式，要把该矩阵转换成IPDATA类型：

mat = np.array([[2.0,0.0,0.0],

[0.0,1.0,0.0],

[0.0,0.0,1.0]])

aa = IPDATA()

aa.flagValue = 20 #any intergal greater than 0

aa.oriMat = (ctypes.c_double*9)(*mat.ravel().tolist())

然后使用dll中的interface函数计算出一个OTDATA，最后将OTDATA中的属性转换为所需数据：

bb = ludcmp(aa) #bb is a OTDATA

matInv = np.array(bb.invMat).reshape(3,3) #逆矩阵，存为3x3

detmin = bb.determin #行列式

3.测试

求一个矩阵的逆矩阵和行列式，numpy里面就有很好用的函数：

numpy.linalg.inv()

numpy.linalg.det()

做一个测试函数，计算3x3矩阵的逆矩阵和行列式。使用两种方法，一种是利用上文提到的调用dll的方式，另一种是利用numpy的linalg，两种算法封装为函数：

import ctypes

import numpy as np

#define structure------------------------------

class IPDATA(ctypes.Structure):

_fields_ = [

("flagValue",ctypes.c_int),

("oriMat",ctypes.c_double*9)

]

class OTDATA(ctypes.Structure):

_fields_ = [

("determin",ctypes.c_double),

("invMat",ctypes.c_double*9)

]

def fun1(mat):

# use dll to calculate inverse and determinant

#----open dll

ludcmpCpp = ctypes.CDLL("./clibrary/ludcmpDll")

#----declaim function as in c++ head file

ludcmp = ludcmpCpp.interface

ludcmp.argtypes = [IPDATA]

ludcmp.restype = OTDATA

#----prepare for input

aa = IPDATA()

aa.flagValue = 20 #any intergal greater than 0

aa.oriMat = (ctypes.c_double*9)(*mat.ravel().tolist())

#----call interface function in dll

bb = ludcmp(aa) #bb is a OTDATA

return np.array(bb.invMat).reshape(3,3),bb.determin

def fun2(mat):

# use numpy linalg package

matInv = np.linalg.inv(mat)

determ = np.linalg.det(mat)

return matInv,determ

编写一个测试函数，分别调用fun1和fun2，算三个矩阵的逆矩阵和行列式，为体现性能差异，每个矩阵运行10000次。测试函数为：

def testfun(mat1,mat2,mat3):

#run n times for each mat

n = 10000

for i in range(n):

inv1,det1 = fun1(mat1)

inv2,det2 = fun1(mat2)

inv3,det3 = fun1(mat3)

for i in range(n):

inv1,det1 = fun2(mat1)

inv2,det2 = fun2(mat2)

inv3,det3 = fun2(mat3)

return None

利用cProfile测试一番：

import cProfile as cp

mat1 = np.array([[2.0,0.0,0.0], #traction along 1

[0.0,1.0,0.0],

[0.0,0.0,1.0]])

mat2 = np.array([[1.0,2.0,0.0], #shear in 12

[0.0,1.0,0.0],

[0.0,0.0,1.0]])

mat3 = np.array([[1.0,3.0,0.0], #shear in 12 and traction along 2

[0.0,0.5,0.0],

[0.0,0.0,1.0]])

cp.run("testfun(mat1,mat2,mat3)")

输出(截取了有用的部分)

1620004 function calls in 13.609 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)

......

30000 0.142 0.000 0.309 0.000 linalg.py:482(inv)

30000 0.743 0.000 12.892 0.000 test.py:25(fun1)

30000 0.033 0.000 0.668 0.000 test.py:54(fun2)

1 0.049 0.049 13.608 13.608 test.py:61(testfun)

30000 11.142 0.000 11.142 0.000 {built-in method _ctypes.LoadLibrary}

......

令人伤心的结果，一共运行13.6秒，fun1自己就用了12.9秒，所以能用numpy还是用numpy吧，毕竟np。

wait，wait，wait，进一步分析，怀疑真正耗时的是ctypes中的LoadLibrary，每次运行fun1都要载入一次dll，那确实浪费时间。可以进一步修改fun1，让其只负责算，不负责载入dll，载入好的函数作为参数传递过来。

def fun1plus(mat,func):

#----prepare for input

aa = IPDATA()

aa.flagValue = 20 #any intergal greater than 0

aa.oriMat = (ctypes.c_double*9)(*mat.ravel().tolist())

#----call interface function in dll

bb = func(aa) #bb is a OTDATA

return np.array(bb.invMat).reshape(3,3),bb.determin

测试函数改为只载入一次dll，然后再循环30000次interface函数计算：

def testfun2(mat1,mat2,mat3):

#run n times for each mat

n = 10000

# load in dll

ludcmpCpp = ctypes.CDLL("./clibrary/ludcmpDll")

#----define function as in c++ head file

ludcmp = ludcmpCpp.interface

ludcmp.argtypes = [IPDATA]

ludcmp.restype = OTDATA

for i in range(n):

inv1,det1 = fun1plus(mat1,ludcmp)

inv2,det2 = fun1plus(mat2,ludcmp)

inv3,det3 = fun1plus(mat3,ludcmp)

for i in range(n):

inv1,det1 = fun2(mat1)

inv2,det2 = fun2(mat2)

inv3,det3 = fun2(mat3)

return None

看一下这次的测试结果：cp.run("testfun2(mat1,mat2,mat3)")

1320014 function calls in 1.066 seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)

......

30000 0.137 0.000 0.279 0.000 linalg.py:482(inv)

30000 0.375 0.000 0.435 0.000 test.py:42(fun1plus)

30000 0.029 0.000 0.591 0.000 test.py:54(fun2)

1 0.030 0.030 1.066 1.066 test.py:76(testfun2)

1 0.010 0.010 0.010 0.010 {built-in method _ctypes.LoadLibrary}

......

果然如此，载入一次dll，要花掉0.01秒，，然后用dll中的函数算30000次。对比发现，fun1plus比fun2算的时间还要少一些，那用c++和python混合编程的方式还是可以的。。。除了c++比较难码-__-

ps：

计算多晶时，一个多晶需要包含大量单晶信息，比如Taylor假设，此时一个单晶的计算核心可以使用c++编写dll来计算，这样在计算多晶时，首先载入dll，然后循环晶粒个数次单晶计算，这样比全python编程快很多，看情况吧，个人情况是使用dll运算速度快了几百倍。

查资料过程中，很多帖子说numba无脑提速，感觉以后有时间了可以好好试一试，简单上手了一下，感觉这个东西不是那么无脑，脾气很大。