起因
我在尝试一篇目标识别的论文的开源代码时,发现由于其NMS、LSTM等模块是使用c++编译的(为了弥补python速度的短板),对Pytorch要求不能使用超过0.3的版本。和FASTER RCNN开源代码很像,不过Faster-rcnn官方更新了其包含的c++文件,从而可以在pytorch大于1的版本上使用。我不想退回pytorch 0.3版本,于是尝试重写c++/cuda 扩展的模块
prerequisite
CUDA10.0
pytorch1.4+cu100
torchvision0.5.0+cu100
setuptools 52.0.0
方法
由于原始代码采用较老的方式(THC模块)实现扩展,这里尝试pytorch官方 tutorial的方法实现
https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-c-extension
然而,这教程有一个很致命的问题,按照它的方法用不了,被坑了。所以…先看看作为参考吧。
概述
以NMS模块为例,文件树如下:
|project name
|---cuda
| |---nms_kernel.cu
| |---nms_kernel.h
|---nms_cuda.cpp
|---nms_cuda.h
|---build.py
其中nms_kernel.cu包括cuda核函数和调用它的函数A,nms_cuda.c又调用A函数,最后通过build.py编译扩展模块
##build.py
import os
import torch
from setuptools import setup, Extension
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
from setuptools import setup, Extension
from torch.utils import cpp_extension
# Might have to export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}} ?
torch._C._GLIBCXX_USE_CXX11_ABI=0 //设置编译的参数
sources = []
headers = []
defines = []
inl = []
with_cuda = False
if torch.cuda.is_available():
print('Including CUDA code.')
sources += ['src/cuda/nms_kernel.cu','src/nms_cuda.cpp'] //待编译的c与cu文件
# sources += ['src/nms_cuda.cpp']
headers += ['src/','src/cuda/'] //头文件目录
defines += [('WITH_CUDA', None)]
with_cuda = True
inl +=['src/','src/cuda/']
this_file = os.path.dirname(os.path.realpath(__file__))
print(this_file)
setup(
name='ext.nms', //拓展包名,不太重要
ext_modules=[
CUDAExtension('extnms',sources=sources,include_dirs= headers ,
define_macros=defines,
extra_compile_flag=('-std=c++11'),//使用c11编译
with_cuda=with_cuda
)],
cmdclass={
'build_ext': BuildExtension
}
)
考虑到有人是源码编译pytorch,那么就需要注意,编译pytorch时的-D GLIBCXX_USE_CXX11_ABI
标志位,extention编译的标志位要和pytorch编译的一致,一般来说,pip安装不用考虑这个问题
cpp文件
该文件主要特点是:使用torch::Tensor代替原本的指针、数组类型(当然可以尝试指定数据类型,例如torch.IntTensor
)
宏PYBIND11_MODULE
中第一个参数即函数名,第二个是该函数指针地址,最后一个参数是模块名(用于import …)
#include pip
#include <math.h>
#include "cuda/nms_kernel.h" //cuda核函数的头文件
#include <torch/extension.h>
#include <pybind11/pybind11.h>
namespace py = pybind11; //用于c++到python的转换
extern THCState *state;
int ApplyNMSGPU( //声明cuda函数,不要也可
torch::Tensor keep_out,
torch::Tensor boxes_dev,
int boxes_num,
torch::Tensor nms_overlap_thresh,
int device_id);
int nms_apply( //需要转换陈python的函数,里面调用了cuda函数ApplyNMSGPU
torch::Tensor keep,
torch::Tensor boxes_sorted,
float nms_thresh
)
{
torch::Tensor keep_data = keep;
torch::Tensor boxes_sorted_data = boxes_sorted;
int boxes_num = boxes_sorted.size(0);
int devId = boxes_sorted.get_device();
return ApplyNMSGPU(keep_data, boxes_sorted_data, boxes_num, nms_thresh, devId);
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
m.def("nms_apply", &nms_apply, "nms");
}
cpp的头文件
#include <torch/extension.h>
int nms_apply(
torch::Tensor keep,
torch::Tensor boxes_sorted,
float nms_thresh
);
cuda文件
你只需要关注:
·数据类型:scalar_t、scalar_t*、size_t
他们都是模板数据类型,在后面确定具体是什么。
·宏AT_DISPATCH_ALL_TYPE
,参数一:决定scalar_t是哪种具体数据类型,参数三:lambla函数,注意核函数的参数数据类型,对于torch:tensor的,需要.data<scalar_t>()
,但是,如果是别的类型(例如指针)还用.data<scalar_t>()
的话会报错:error: expression must have class type
//#include <ATen/ATen.h>
#include <vector>
#include <iostream>
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define CUDA_CHECK(condition) \
/* Code block avoids redefinition of cudaError_t error */ \
do { \
cudaError_t error = condition; \
if (error != cudaSuccess) { \
std::cout << cudaGetErrorString(error) << std::endl; \
} \
} while (0)
#define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0))
const long threadsPerBlock = sizeof(unsigned long long) * 8;
/*
__device__ inline float devIoU(scalar_t* a, scalar_t* b) {
float left = max(a[0], b[0]), right = min(a[2], b[2]);
float top = max(a[1], b[1]), bottom = min(a[3], b[3]);
float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);
float interS = width * height;
float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);
float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);
return interS / (Sa + Sb - interS);
}
*/
cxg script/
template <typename scalar_t>
__device__ __forceinline__ scalar_t devIoU( scalar_t* a, scalar_t* b) {
float left = max(a[0], b[0]), right = min(a[2], b[2]);
float top = max(a[1], b[1]), bottom = min(a[3], b[3]);
float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);
float interS = width * height;
float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);
float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);
return interS / (Sa + Sb - interS);
}
//
///cxg script//
template <typename scalar_t>
__global__ void nms_kernel(
const size_t __restrict__ n_boxes,
scalar_t__restrict__ nms_overlap_thresh,
scalar_t* __restrict__ dev_boxes,
unsigned long long* __restrict__ dev_mask
) {
const int row_start = blockIdx.y;
const int col_start = blockIdx.x;
// if (row_start > col_start) return;
const int row_size =
min(n_boxes - row_start * threadsPerBlock, threadsPerBlock);
const int col_size =
min(n_boxes - col_start * threadsPerBlock, threadsPerBlock);
__shared__ scalar_t block_boxes[threadsPerBlock * 5];
if (threadIdx.x < col_size) {
block_boxes[threadIdx.x * 4 + 0] =
dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 0];
block_boxes[threadIdx.x * 4 + 1] =
dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 1];
block_boxes[threadIdx.x * 4 + 2] =
dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 2];
block_boxes[threadIdx.x * 4 + 3] =
dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 3];
}
__syncthreads();
if (threadIdx.x < row_size) {
const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x;
scalar_t* cur_box = dev_boxes + cur_box_idx * 4;
int i = 0;
unsigned long long t = 0;
int start = 0;
if (row_start == col_start) {
start = threadIdx.x + 1;
}
for (i = start; i < col_size; i++) {
if (devIoU(cur_box, block_boxes + i * 4) > nms_overlap_thresh) {
t |= 1ULL << i;
}
}
const int col_blocks = DIVUP(n_boxes, threadsPerBlock);
dev_mask[cur_box_idx * col_blocks + col_start] = t;
}
}
//
void _set_device(int device_id) {
int current_device;
CUDA_CHECK(cudaGetDevice(¤t_device));
if (current_device == device_id) {
return;
}
// The call to cudaSetDevice must come before any calls to Get, which
// may perform initialization using the GPU.
CUDA_CHECK(cudaSetDevice(device_id));
}
cxg script ///
int ApplyNMSGPU(
torch::Tensor keep_out,
torch::Tensor boxes_dev,
int boxes_num,
float nms_overlap_thresh,
int device_id) {
_set_device(device_id);
unsigned long long* mask_dev = NULL;
const auto col_blocks = DIVUP(boxes_num, threadsPerBlock);
CUDA_CHECK(cudaMalloc(&mask_dev,
boxes_num * col_blocks * sizeof(unsigned long long)));
dim3 blocks(DIVUP(boxes_num, threadsPerBlock),
DIVUP(boxes_num, threadsPerBlock));
dim3 threads(threadsPerBlock);
AT_DISPATCH_ALL_TYPES(
boxes_dev.type(), "ApplyNMSGPU", ([&] {
nms_kernel<scalar_t><<<blocks, threads>>>(
boxes_num,
nms_overlap_thresh,
boxes_dev.data<scalar_t>(),
mask_dev
);
}));
std::vector<unsigned long long> mask_host(boxes_num * col_blocks);
CUDA_CHECK(cudaMemcpy(&mask_host[0],
mask_dev,
sizeof(unsigned long long) * boxes_num * col_blocks,
cudaMemcpyDeviceToHost));
std::vector<unsigned long long> remv(col_blocks);
memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);
int num_to_keep = 0;
for (int i = 0; i < boxes_num; i++) {
int nblock = i / threadsPerBlock;
int inblock = i % threadsPerBlock;
if (!(remv[nblock] & (1ULL << inblock))) {
keep_out[num_to_keep++] = i;
auto p = &mask_host[0] + i * col_blocks;
for (int j = nblock; j < col_blocks; j++) {
remv[j] |= p[j];
}
}
}
CUDA_CHECK(cudaFree(mask_dev));
return num_to_keep;
}
/
cuda头文件
#include <torch/extension.h>
int ApplyNMSGPU(
torch::Tensor keep_out,
torch::Tensor boxes_dev,
int boxes_num,
float nms_overlap_thresh,
int device_id);
编译:
命令:python build.py install
结果:你可以看到所有的nvcc gcc编译器的编译参数,确定是否正确
building 'extnms' extension
/usr/local/cuda/bin/nvcc -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/cuda/nms_kernel.cu -o build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++11
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/nms_cuda.cpp -o build/temp.linux-x86_64-3.8/src/nms_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o build/temp.linux-x86_64-3.8/src/nms_cuda.o -L/usr/local/cuda/lib64 -lcudart -o build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for extnms.cpython-38-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/extnms.py to extnms.cpython-38.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.extnms.cpython-38: module references __file__
creating 'dist/ext.nms-0.0.0-py3.8-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing ext.nms-0.0.0-py3.8-linux-x86_64.egg
removing '/usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Extracting ext.nms-0.0.0-py3.8-linux-x86_64.egg to /usr/local/lib/python3.8/dist-packages
ext.nms 0.0.0 is already the active version in easy-install.pth
Installed /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Processing dependencies for ext.nms==0.0.0
Finished processing dependencies for ext.nms==0.0.0
调用
import pytorch
import extnms
…
小结
我哼多天卡在类似lltm_cpp.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14NMSAPPLYGPUXXXX
这种错误上,这种错误就是编译时,库使用了未知的函数,在这个项目里,体现在必须使用.h文件,然鹅官方教程就是没有用.h文件,导致这个错误
运行setup.py时报错:
ValueError: bad marshal data (unknown type code)
solusion:
pip3 install --upgrade --force-reinstall setuptools