How to write your own CUDA algorithm and import it into PyTorch

每天都不想学习的Cora

Tags: pytorch cuda

Article link: https://blog.csdn.net/weixin_43978293/article/details/119929287

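Writing your own CUDA algorithm and importing it into PyTorch

I. Files involved
II. CUDA programming
III. Building the extension and calling it from PyTorch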

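I. Files involved

This post is only a rough record of how to use CUDA files you wrote yourself inside a PyTorch project, plus the pieces that have to be in place. It does not explain the CUDA algorithm itself (I don't fully understand it either...).
You don't need to learn CUDA programming from scratch: for the most part you can copy the pattern of existing files (corrections are welcome).
The example is BallQuery from PointNet++, as implemented in the OpenPCDet project.
This post covers the build-first, call-later approach (ahead-of-time compilation). Compiling at run time is also possible but is not covered here.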

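The files you need to write yourself are:
ball_query.cpp
ball_query_gpu.h
ball_query_gpu.cu

The files involved in building the extension and calling it from the project are:
setup.py
pointnet2_api.cpp (if your project doesn't have one, you can write your own)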

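II. CUDA programming

Create a folder at a suitable place in the project to hold your CUDA files. If you already have the three files listed above, skip to the next section.

1. Create three new files, with the extensions .cpp, .h and .cu, named after your algorithm. Give the .cpp file a different base name from the .cu file: when both sit in the same directory, the build can otherwise produce two object files with the same name. For BallQuery the three files are:
ball_query.cpp
ball_query_gpu.h
ball_query_gpu.cu
2. Start with the .cpp file (ball_query.cpp):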

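#include <torch/serialize/tensor.h>
#include <vector>
#include <cstdio>   // fprintf
#include <cstdlib>  // exit
#include <cuda.h>
#include <cuda_runtime_api.h>
#include "ball_query_gpu.h"  // change to your own .h file; keep the other headers

// The original file also had `#include <THC/THC.h>` and `extern THCState *state;`.
// Nothing in this file uses them, and recent PyTorch releases no longer ship
// THC/THC.h, so they are left out here.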

#define CHECK_CUDA(x) do { \
  if (!(x).is_cuda()) { \
    fprintf(stderr, "%s must be CUDA tensor at %s:%d\n", #x, __FILE__, __LINE__); \
    exit(-1); \
  } \
} while (0)
#define CHECK_CONTIGUOUS(x) do { \
  if (!(x).is_contiguous()) { \
    fprintf(stderr, "%s must be contiguous tensor at %s:%d\n", #x, __FILE__, __LINE__); \
    exit(-1); \
  } \
} while (0)
#define CHECK_INPUT(x) CHECK_CUDA(x);CHECK_CONTIGUOUS(x)  // keep these macros as they are if you need the checks

int ball_query_wrapper_stack(int B, int M, float radius, int nsample,
    at::Tensor new_xyz_tensor, at::Tensor new_xyz_batch_cnt_tensor,
    at::Tensor xyz_tensor, at::Tensor xyz_batch_cnt_tensor, at::Tensor idx_tensor) {  // input parameters; follow this pattern
    // check that each input tensor lives on the GPU and is contiguous in memory
    CHECK_INPUT(new_xyz_tensor);
    CHECK_INPUT(xyz_tensor);
    CHECK_INPUT(new_xyz_batch_cnt_tensor);
    CHECK_INPUT(xyz_batch_cnt_tensor);

    // get raw device pointers to the tensor storage (nothing is copied or converted);
    // the template type must match the tensor dtype (float32 / int32)
    const float *new_xyz = new_xyz_tensor.data_ptr<float>();
    const float *xyz = xyz_tensor.data_ptr<float>();
    const int *new_xyz_batch_cnt = new_xyz_batch_cnt_tensor.data_ptr<int>();
    const int *xyz_batch_cnt = xyz_batch_cnt_tensor.data_ptr<int>();
    int *idx = idx_tensor.data_ptr<int>();

    // call the launcher, which is defined in the .cu file and declared in the .h file
    ball_query_kernel_launcher_stack(B, M, radius, nsample, new_xyz, new_xyz_batch_cnt, xyz, xyz_batch_cnt, idx);
    return 1;
}

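The C++ file is the outermost wrapper: ball_query_wrapper_stack is the function that finally gets called from Python. It checks the input tensors, extracts raw pointers from them, and then calls ball_query_kernel_launcher_stack in the .cu file, which performs the actual CUDA work.

3. Next, the .h file (ball_query_gpu.h):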

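#ifndef _STACK_BALL_QUERY_GPU_H  // include guard: keeps the header from being processed twice in one translation unit
#define _STACK_BALL_QUERY_GPU_H  // the macro name must be unique per header, so rename it after your own algorithm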

#include <torch/serialize/tensor.h>
#include <vector>
#include <cuda.h>
#include <cuda_runtime_api.h>

int ball_query_wrapper_stack(int B, int M, float radius, int nsample,
    at::Tensor new_xyz_tensor, at::Tensor new_xyz_batch_cnt_tensor,
    at::Tensor xyz_tensor, at::Tensor xyz_batch_cnt_tensor, at::Tensor idx_tensor);
    // declaration of the wrapper function defined in the .cpp file

void ball_query_kernel_launcher_stack(int B, int M, float radius, int nsample,
    const float *new_xyz, const int *new_xyz_batch_cnt, const float *xyz, const int *xyz_batch_cnt, int *idx);
    // declaration of the launcher function defined in the .cu file
    
#endif

Declare your own functions following this pattern.

4. Finally, the .cu file, which contains the actual implementation of the algorithm (ball_query_gpu.cu):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#include "ball_query_gpu.h" // change to your own .h file
#include "cuda_utils.h"  // defines THREADS_PER_BLOCK and DIVUP, so include it as well

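// the CUDA kernel itself
__global__ void ball_query_kernel_stack(int B, int M, float radius, int nsample,
    const float *new_xyz, const int *new_xyz_batch_cnt, const float *xyz, const int *xyz_batch_cnt, int *idx) {

    int pt_idx = blockIdx.x * blockDim.x + threadIdx.x;
    // one thread handles one data element, so the element index follows from the block and thread indices:
    // blockDim.x  is the number of threads per block along x
    // blockIdx.x  is the x index of the current block
    // threadIdx.x is the x index of the current thread within its block
    // (both the grid and the blocks are one-dimensional here; see the launcher below)
    if (pt_idx >= M) return;  // the last block may contain threads past the end of the data
    // the other per-element data can be located from pt_idx
    // the actual algorithm goes below; you need to write it yourself
}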

// launcher function
void ball_query_kernel_launcher_stack(int B, int M, float radius, int nsample,
    const float *new_xyz, const int *new_xyz_batch_cnt, const float *xyz, const int *xyz_batch_cnt, int *idx){
    cudaError_t err;

    dim3 blocks(DIVUP(M, THREADS_PER_BLOCK));
    // DIVUP is division rounded up. There are M elements to process and
    // THREADS_PER_BLOCK threads per block, so the number of blocks is ceil(M / THREADS_PER_BLOCK).
    // Some kernels use a two-dimensional grid; this one is one-dimensional.
    dim3 threads(THREADS_PER_BLOCK);
    // number of threads per block; THREADS_PER_BLOCK is 256 here

    ball_query_kernel_stack<<<blocks, threads>>>(B, M, radius, nsample, new_xyz, new_xyz_batch_cnt, xyz, xyz_batch_cnt, idx);
    // kernel launch; the syntax is kernel<<<num_blocks, threads_per_block>>>(a, b, c)
    err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
}
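
The index arithmetic used by the kernel and the launcher can be checked on the CPU. The following is a minimal Python sketch, not part of the extension: it loops over every (block, thread) pair the launcher would create and collects the pt_idx values that pass the `pt_idx >= M` guard. The `divup` helper is an assumption standing in for the DIVUP macro (cuda_utils.h is not shown in this post), and `simulate_launch` is a hypothetical name.

```python
THREADS_PER_BLOCK = 256  # value this post states for cuda_utils.h


def divup(m, n):
    """Integer division rounded up; assumed to match the DIVUP macro."""
    return (m + n - 1) // n


def simulate_launch(M, threads_per_block=THREADS_PER_BLOCK):
    """Return (number of blocks, pt_idx values that pass the bounds guard)."""
    num_blocks = divup(M, threads_per_block)
    handled = []
    for block_idx in range(num_blocks):              # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of threadIdx.x
            pt_idx = block_idx * threads_per_block + thread_idx
            if pt_idx >= M:  # same early exit as in the kernel
                continue
            handled.append(pt_idx)
    return num_blocks, handled


if __name__ == "__main__":
    blocks, handled = simulate_launch(1000)
    # 1000 elements need 4 blocks of 256 threads (1024 threads in total),
    # so 24 threads in the last block return early.
    print(blocks, len(handled))
```

Every element index from 0 to M-1 is visited exactly once, which is why the guard is needed whenever M is not a multiple of THREADS_PER_BLOCK.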

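Once these three files are written, they are ready to be called from PyTorch.

III. Building the extension and calling it from PyTorch

1. First, bind the functions from the files above to Python. This is done in pointnet2_api.cpp. If you already have another API file, you can add the bindings there (it must sit in the same directory); if not, create a new one (any file name will do).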

#include <torch/serialize/tensor.h>
#include <torch/extension.h>

#include "ball_query_gpu.h"
#include "group_points_gpu.h"
#include "sampling_gpu.h"
#include "interpolate_gpu.h"
#include "voxel_query_gpu.h" // include your own .h files here


PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
// format: m.def("function name used from Python (your choice)", &name_of_the_cpp_function, "docstring shown in Python")
    m.def("ball_query_wrapper", &ball_query_wrapper_stack, "ball_query_wrapper_stack");
    m.def("voxel_query_wrapper", &voxel_query_wrapper_stack, "voxel_query_wrapper_stack");

    m.def("furthest_point_sampling_wrapper", &furthest_point_sampling_wrapper, "furthest_point_sampling_wrapper");

    m.def("group_points_wrapper", &group_points_wrapper_stack, "group_points_wrapper_stack");
    m.def("group_points_grad_wrapper", &group_points_grad_wrapper_stack, "group_points_grad_wrapper_stack");

    m.def("three_nn_wrapper", &three_nn_wrapper_stack, "three_nn_wrapper_stack");
    m.def("three_interpolate_wrapper", &three_interpolate_wrapper_stack, "three_interpolate_wrapper_stack");
    m.def("three_interpolate_grad_wrapper", &three_interpolate_grad_wrapper_stack, "three_interpolate_grad_wrapper_stack");
}

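2. Next, add the sources to the project's setup.py so that they get compiled.
There are many ways to write this, but the essential part is adding the paths of your own .cpp and .cu files. Look at how the existing entries are written. The name pointnet2_stack_cuda is the name of the module you will import later.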

make_cuda_ext(
                name='pointnet2_stack_cuda',
                module='pcdet.ops.pointnet2.pointnet2_stack',
                sources=[
                    'src/pointnet2_api.cpp',
                    'src/ball_query.cpp',
                    'src/ball_query_gpu.cu',
                    'src/group_points.cpp',
                    'src/group_points_gpu.cu',
                    'src/sampling.cpp',
                    'src/sampling_gpu.cu', 
                    'src/interpolate.cpp', 
                    'src/interpolate_gpu.cu',
                    'src/voxel_query.cpp', 
                    'src/voxel_query_gpu.cu',
                ],
            ),

Then install the package with the following command.

python setup.py install

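Whether to run install, develop or something else depends on your own setup file. install is the usual choice: it builds the extension and copies the package into site-packages. develop builds the extension in place and links the source tree instead of copying it, so edits to the Python code take effect without reinstalling. OpenPCDet's installation instructions use python setup.py develop.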
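
The make_cuda_ext call shown earlier relies on a helper that this post does not define. Below is a sketch of what a complete, minimal setup.py could look like, assuming the helper simply wraps CUDAExtension from torch.utils.cpp_extension. The helper body, the package name pcdet and the shortened source list are assumptions for illustration, not the project's actual file.

```python
# setup.py (sketch; needs PyTorch and the CUDA toolkit installed to build)
import os

from setuptools import find_packages, setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension


def make_cuda_ext(name, module, sources):
    """Assumed helper: build a CUDAExtension importable as `<module>.<name>`."""
    return CUDAExtension(
        name=f"{module}.{name}",
        # source paths are written relative to the module's own directory
        sources=[os.path.join(*module.split("."), src) for src in sources],
    )


if __name__ == "__main__":
    setup(
        name="pcdet",  # assumed package name, taken from the module path
        packages=find_packages(),
        # BuildExtension teaches setuptools to compile .cu files with nvcc
        cmdclass={"build_ext": BuildExtension},
        ext_modules=[
            make_cuda_ext(
                name="pointnet2_stack_cuda",
                module="pcdet.ops.pointnet2.pointnet2_stack",
                sources=[
                    "src/pointnet2_api.cpp",
                    "src/ball_query.cpp",
                    "src/ball_query_gpu.cu",
                ],
            ),
        ],
    )
```

With this naming scheme the compiled module would be imported as `from pcdet.ops.pointnet2.pointnet2_stack import pointnet2_stack_cuda`.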

To use it, call module_name.function_name(arguments), for example pointnet2_stack_cuda.ball_query_wrapper(a,b,c).
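
A slow CPU reference is handy for checking what the compiled function returns. The sketch below is pure Python and rests on assumptions, since this post leaves the kernel body out: points from all batches are stacked back to back, indices are local to each batch, the first neighbour found pads all nsample slots, and a centre with no neighbour gets -1 in its first slot. The name ball_query_reference is hypothetical.

```python
def ball_query_reference(radius, nsample, new_xyz, new_xyz_batch_cnt, xyz, xyz_batch_cnt):
    """Slow CPU version of an assumed stacked ball query.

    new_xyz: list of M (x, y, z) query centres, batches stacked back to back
    xyz:     list of N (x, y, z) points, stacked the same way
    *_batch_cnt: number of centres / points that belong to each batch
    Returns M lists of nsample indices, local to the batch of each centre.
    """
    radius2 = radius * radius
    idx = []
    center_start = 0
    point_start = 0
    for n_centers, n_points in zip(new_xyz_batch_cnt, xyz_batch_cnt):
        points = xyz[point_start:point_start + n_points]
        for cx, cy, cz in new_xyz[center_start:center_start + n_centers]:
            row = [0] * nsample  # output buffer starts zero-filled
            cnt = 0
            for k, (x, y, z) in enumerate(points):
                d2 = (cx - x) ** 2 + (cy - y) ** 2 + (cz - z) ** 2
                if d2 < radius2:
                    if cnt == 0:
                        row = [k] * nsample  # pad every slot with the first hit
                    row[cnt] = k
                    cnt += 1
                    if cnt >= nsample:
                        break
            if cnt == 0:
                row[0] = -1  # marks a centre without any neighbour
            idx.append(row)
        center_start += n_centers
        point_start += n_points
    return idx


if __name__ == "__main__":
    xyz = [(0, 0, 0), (0.1, 0, 0), (5, 5, 5), (1, 1, 1), (1.2, 1, 1)]
    new_xyz = [(0, 0, 0), (9, 9, 9), (1, 1, 1)]
    print(ball_query_reference(0.5, 3, new_xyz, [2, 1], xyz, [3, 2]))
    # The GPU call would follow the argument order of ball_query_wrapper_stack,
    # with idx a pre-allocated int32 CUDA tensor of shape (M, nsample):
    # pointnet2_stack_cuda.ball_query_wrapper(
    #     B, M, radius, nsample, new_xyz, new_xyz_batch_cnt, xyz, xyz_batch_cnt, idx)
```

On the GPU side the inputs must be contiguous CUDA tensors (float32 for coordinates, int32 for counts and indices), because the wrapper checks both properties and reads the storage through raw float and int pointers.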
