pytorch编写cuda/c++ extention 方法

最新推荐文章于 2024-05-24 09:44:01 发布

RAFA CHANG

最新推荐文章于 2024-05-24 09:44:01 发布

阅读量1.5k

点赞数 3

分类专栏：深度学习&计算机视觉文章标签： pytorch cuda c++ 深度学习

本文链接：https://blog.csdn.net/CCCCXXXXGGGG/article/details/113482675

版权

深度学习&计算机视觉专栏收录该内容

7 篇文章 0 订阅

订阅专栏

起因

我在尝试一篇目标识别的论文的开源代码时，发现由于其NMS、LSTM等模块是使用c++编译的（为了弥补python速度的短板），对Pytorch要求不能使用超过0.3的版本。和FASTER RCNN开源代码很像，不过Faster-rcnn官方更新了其包含的c++文件，从而可以在pytorch大于1的版本上使用。我不想退回pytorch 0.3版本，于是尝试重写c++/cuda 扩展的模块

prerequisite

CUDA10.0
pytorch1.4+cu100
torchvision0.5.0+cu100
setuptools 52.0.0

方法

由于原始代码采用较老的方式（THC模块）实现扩展，这里尝试pytorch官方 tutorial的方法实现

https://pytorch.org/tutorials/advanced/cpp_extension.html#writing-a-c-extension

然而，这教程有一个很致命的问题，按照它的方法用不了，被坑了。所以…先看看作为参考吧。

概述

以NMS模块为例，文件树如下：

|project name
|---cuda
|   |---nms_kernel.cu
|   |---nms_kernel.h
|---nms_cuda.cpp
|---nms_cuda.h
|---build.py

其中nms_kernel.cu包括cuda核函数和调用它的函数A，nms_cuda.c又调用A函数，最后通过build.py编译扩展模块

##build.py

import os
import torch
from setuptools import setup, Extension
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
from setuptools import setup, Extension
from torch.utils import cpp_extension
# Might have to export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}} ?
torch._C._GLIBCXX_USE_CXX11_ABI=0  //设置编译的参数
sources = []
headers = []
defines = []
inl = []
with_cuda = False

if torch.cuda.is_available():
    print('Including CUDA code.')
    sources += ['src/cuda/nms_kernel.cu','src/nms_cuda.cpp']  //待编译的c与cu文件
    # sources += ['src/nms_cuda.cpp']
    headers += ['src/','src/cuda/'] //头文件目录
    defines += [('WITH_CUDA', None)]
    with_cuda = True
    inl +=['src/','src/cuda/']

this_file = os.path.dirname(os.path.realpath(__file__))
print(this_file)
setup(
    name='ext.nms',  //拓展包名，不太重要
    ext_modules=[
        CUDAExtension('extnms',sources=sources,include_dirs= headers ,
    define_macros=defines,  
    extra_compile_flag=('-std=c++11'),//使用c11编译
    with_cuda=with_cuda

                      )],

    cmdclass={
        'build_ext': BuildExtension
    }

)

考虑到有人是源码编译pytorch，那么就需要注意，编译pytorch时的-D GLIBCXX_USE_CXX11_ABI标志位，extention编译的标志位要和pytorch编译的一致，一般来说，pip安装不用考虑这个问题

cpp文件

该文件主要特点是：使用torch::Tensor代替原本的指针、数组类型（当然可以尝试指定数据类型，例如torch.IntTensor）
宏PYBIND11_MODULE中第一个参数即函数名，第二个是该函数指针地址，最后一个参数是模块名（用于import …）

#include pip
#include <math.h>
#include "cuda/nms_kernel.h"  //cuda核函数的头文件
#include <torch/extension.h>
#include <pybind11/pybind11.h>
namespace py = pybind11;    //用于c++到python的转换
extern THCState *state;
int ApplyNMSGPU(         //声明cuda函数，不要也可
    torch::Tensor keep_out,
    torch::Tensor boxes_dev,
    int boxes_num,
    torch::Tensor nms_overlap_thresh,
    int device_id);

int nms_apply(         //需要转换陈python的函数，里面调用了cuda函数ApplyNMSGPU
    torch::Tensor keep,
    torch::Tensor boxes_sorted,
    float nms_thresh
)
{
torch::Tensor keep_data = keep;
torch::Tensor boxes_sorted_data = boxes_sorted;
int boxes_num = boxes_sorted.size(0);
int devId = boxes_sorted.get_device();

 return ApplyNMSGPU(keep_data, boxes_sorted_data, boxes_num, nms_thresh, devId);
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("nms_apply", &nms_apply, "nms");
}

cpp的头文件

#include <torch/extension.h>
int nms_apply(
    torch::Tensor keep,
    torch::Tensor boxes_sorted,
    float nms_thresh
)；

cuda文件

你只需要关注：
·数据类型：scalar_t、scalar_t*、size_t 他们都是模板数据类型，在后面确定具体是什么。
·宏AT_DISPATCH_ALL_TYPE，参数一：决定scalar_t是哪种具体数据类型，参数三：lambla函数，注意核函数的参数数据类型，对于torch:tensor的，需要.data<scalar_t>()，但是，如果是别的类型（例如指针）还用.data<scalar_t>()的话会报错：error: expression must have class type


//#include <ATen/ATen.h>
#include <vector>
#include <iostream>
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>
#define CUDA_CHECK(condition) \
  /* Code block avoids redefinition of cudaError_t error */ \
  do { \
    cudaError_t error = condition; \
    if (error != cudaSuccess) { \
      std::cout << cudaGetErrorString(error) << std::endl; \
    } \
  } while (0)

#define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0))
const long  threadsPerBlock = sizeof(unsigned long long) * 8;
/*
__device__ inline float devIoU(scalar_t*  a, scalar_t*  b) {
  float left = max(a[0], b[0]), right = min(a[2], b[2]);
  float top = max(a[1], b[1]), bottom = min(a[3], b[3]);
  float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);
  float interS = width * height;
  float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);
  float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);
  return interS / (Sa + Sb - interS);
}
*/
cxg script/
template <typename scalar_t>
__device__ __forceinline__ scalar_t devIoU( scalar_t*  a, scalar_t*  b) {
  float left = max(a[0], b[0]), right = min(a[2], b[2]);
  float top = max(a[1], b[1]), bottom = min(a[3], b[3]);
  float width = max(right - left + 1, 0.f), height = max(bottom - top + 1, 0.f);
  float interS = width * height;
  float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1);
  float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1);
  return interS / (Sa + Sb - interS);
}
//


///cxg script//
template <typename scalar_t>
__global__ void nms_kernel(
    const size_t __restrict__ n_boxes,
    scalar_t__restrict__ nms_overlap_thresh,
    scalar_t* __restrict__ dev_boxes,
    unsigned long long* __restrict__ dev_mask
    ) {
 const int row_start = blockIdx.y;
  const int col_start = blockIdx.x;

  // if (row_start > col_start) return;

  const int row_size =
        min(n_boxes - row_start * threadsPerBlock, threadsPerBlock);
  const int col_size =
        min(n_boxes - col_start * threadsPerBlock, threadsPerBlock);

  __shared__ scalar_t block_boxes[threadsPerBlock * 5];
  if (threadIdx.x < col_size) {
    block_boxes[threadIdx.x * 4 + 0] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 0];
    block_boxes[threadIdx.x * 4 + 1] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 1];
    block_boxes[threadIdx.x * 4 + 2] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 2];
    block_boxes[threadIdx.x * 4 + 3] =
        dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 4 + 3];
  }
  __syncthreads();

  if (threadIdx.x < row_size) {
    const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x;
    scalar_t* cur_box = dev_boxes + cur_box_idx * 4;
    int i = 0;
    unsigned long long t = 0;
    int start = 0;
    if (row_start == col_start) {
      start = threadIdx.x + 1;
    }
    for (i = start; i < col_size; i++) {
      if (devIoU(cur_box, block_boxes + i * 4) > nms_overlap_thresh) {
        t |= 1ULL << i;
      }
    }
    const int col_blocks = DIVUP(n_boxes, threadsPerBlock);
    dev_mask[cur_box_idx * col_blocks + col_start] = t;
  }
}


//
void _set_device(int device_id) {
  int current_device;
  CUDA_CHECK(cudaGetDevice(&current_device));
  if (current_device == device_id) {
    return;
  }
  // The call to cudaSetDevice must come before any calls to Get, which
  // may perform initialization using the GPU.
  CUDA_CHECK(cudaSetDevice(device_id));
}

cxg script ///
int ApplyNMSGPU(
    torch::Tensor keep_out,
    torch::Tensor boxes_dev,
    int boxes_num,
    float nms_overlap_thresh,
    int device_id) {
 _set_device(device_id);

  unsigned long long* mask_dev = NULL;

  const auto col_blocks = DIVUP(boxes_num, threadsPerBlock);

  CUDA_CHECK(cudaMalloc(&mask_dev,
                        boxes_num * col_blocks * sizeof(unsigned long long)));

  dim3 blocks(DIVUP(boxes_num, threadsPerBlock),
              DIVUP(boxes_num, threadsPerBlock));
  dim3 threads(threadsPerBlock);
 AT_DISPATCH_ALL_TYPES(
 boxes_dev.type(), "ApplyNMSGPU", ([&] {
    nms_kernel<scalar_t><<<blocks, threads>>>(
        boxes_num,
        nms_overlap_thresh,
       boxes_dev.data<scalar_t>(),
        mask_dev
       );
  }));
  std::vector<unsigned long long> mask_host(boxes_num * col_blocks);
  CUDA_CHECK(cudaMemcpy(&mask_host[0],
                        mask_dev,
                        sizeof(unsigned long long) * boxes_num * col_blocks,
                        cudaMemcpyDeviceToHost));

  std::vector<unsigned long long> remv(col_blocks);
  memset(&remv[0], 0, sizeof(unsigned long long) * col_blocks);

  int num_to_keep = 0;
  for (int i = 0; i < boxes_num; i++) {
    int nblock = i / threadsPerBlock;
    int inblock = i % threadsPerBlock;

    if (!(remv[nblock] & (1ULL << inblock))) {
      keep_out[num_to_keep++] = i;
      auto p = &mask_host[0] + i * col_blocks;
      for (int j = nblock; j < col_blocks; j++) {
        remv[j] |= p[j];
      }
    }
  }

  CUDA_CHECK(cudaFree(mask_dev));
  return num_to_keep;
}



/

cuda头文件

#include <torch/extension.h>
int ApplyNMSGPU(
    torch::Tensor keep_out,
    torch::Tensor boxes_dev,
    int boxes_num,
    float nms_overlap_thresh,
    int device_id);

编译：

命令：python build.py install
结果：你可以看到所有的nvcc gcc编译器的编译参数，确定是否正确

building 'extnms' extension
/usr/local/cuda/bin/nvcc -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/cuda/nms_kernel.cu -o build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++11
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorTypeSet.h(44): warning: integer conversion resulted in a change of sign


x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -Isrc/ -Isrc/cuda/ -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c src/nms_cuda.cpp -o build/temp.linux-x86_64-3.8/src/nms_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=extnms -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.8/src/cuda/nms_kernel.o build/temp.linux-x86_64-3.8/src/nms_cuda.o -L/usr/local/cuda/lib64 -lcudart -o build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so
creating build/bdist.linux-x86_64/egg
copying build/lib.linux-x86_64-3.8/extnms.cpython-38-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating stub loader for extnms.cpython-38-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/extnms.py to extnms.cpython-38.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying ext.nms.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
__pycache__.extnms.cpython-38: module references __file__
creating 'dist/ext.nms-0.0.0-py3.8-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing ext.nms-0.0.0-py3.8-linux-x86_64.egg
removing '/usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg' (and everything under it)
creating /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Extracting ext.nms-0.0.0-py3.8-linux-x86_64.egg to /usr/local/lib/python3.8/dist-packages
ext.nms 0.0.0 is already the active version in easy-install.pth

Installed /usr/local/lib/python3.8/dist-packages/ext.nms-0.0.0-py3.8-linux-x86_64.egg
Processing dependencies for ext.nms==0.0.0
Finished processing dependencies for ext.nms==0.0.0

调用

import pytorch
import extnms
…

小结

我哼多天卡在类似lltm_cpp.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC1ENS_14NMSAPPLYGPUXXXX这种错误上，这种错误就是编译时，库使用了未知的函数，在这个项目里，体现在必须使用.h文件，然鹅官方教程就是没有用.h文件，导致这个错误

运行setup.py时报错：

ValueError: bad marshal data (unknown type code)

solusion:

pip3 install --upgrade --force-reinstall setuptools

RAFA CHANG

关注

3
点赞
踩
5

收藏

觉得还不错? 一键收藏
1
评论
pytorch编写cuda/c++ extention 方法

起因我在尝试一篇目标识别的论文的开源代码时，发现由于其NMS、LSTM等模块是使用c++编译的（为了弥补python速度的短板），对Pytorch要求不能使用超过0.3的版本。和FASTER RCNN开源代码很像，不过Faster-rcnn官方更新了其包含的c++文件，从而可以在pytorch大于1的版本上使用。我不想退回pytorch 0.3版本，于是尝试重写c++/cuda 扩展的模块prerequisiteCUDA10.0pytorch1.4+cu100torchvision0.5.0+cu
复制链接

扫一扫

专栏目录