PyTorch C++/CUDA Extension
Getting started with building an extension
Start with an introductory blog post --------> an excellent blog on PyTorch extensions
Official documentation
Official documentation entry point --------> PyTorch Extension
C++ Extension
Building with setuptools
Write setup.py following the standard format; it is mainly used to compile the .cpp files.
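A minimal setup.py sketch along these lines (module and file names such as lltm_cpp / lltm.cpp follow the official LLTM tutorial and are only examples):
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name='lltm_cpp',
    ext_modules=[
        # CppExtension adds the PyTorch include paths and compiler flags
        # on top of a normal setuptools.Extension
        CppExtension('lltm_cpp', ['lltm.cpp']),
    ],
    cmdclass={
        # BuildExtension takes care of the actual C++ build steps
        'build_ext': BuildExtension
    })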
Writing the C++ Op
About <torch/extension.h>: it is the one-stop header to include all the necessary PyTorch bits to write C++ extensions. It covers:
- The ATen library, which is our primary API for tensor computation (the library PyTorch uses for tensor operations)
- pybind11, which is how we create Python bindings for our C++ code (pybind11 glues the Python side to the C++ side)
- Headers that manage the details of interaction between ATen and pybind11.
There are three main steps:
- the forward pass (C++)
- the backward pass (C++)
- binding to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &lltm_forward, "LLTM forward");
  m.def("backward", &lltm_backward, "LLTM backward");
}
This part is just the glue: once it is in place, the compiled C++ forward and backward functions can be called from Python as LLTM.forward() or LLTM.backward() (under whatever module name was given in setup.py).
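As a small end-to-end illustration, the tutorial's d_sigmoid helper is compact enough to show the whole flow; a complete toy extension source could look roughly like this (the binding name d_sigmoid is just an example):
#include <torch/extension.h>

// derivative of the sigmoid, written purely with ATen ops
torch::Tensor d_sigmoid(torch::Tensor z) {
  auto s = torch::sigmoid(z);
  return (1 - s) * s;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("d_sigmoid", &d_sigmoid, "sigmoid derivative");
}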
Using the extension
python setup.py install
This command builds and installs the written extension.
The build step generates a build/ folder; the install step (in the narrow sense, since the install command itself already covers both build and install) generates a dist/ folder.
Once installed, the extension can be imported in the current Python environment, under the same name given in setup.py.
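For example, assuming the module was named lltm_cpp in setup.py and has the tutorial's forward signature, usage could look like this (tensor shapes are made up for illustration):
import torch
import lltm_cpp  # the name must match the one given in setup.py

batch_size, input_features, state_size = 16, 32, 128
X = torch.randn(batch_size, input_features)
h = torch.randn(batch_size, state_size)
C = torch.randn(batch_size, state_size)
W = torch.randn(3 * state_size, input_features + state_size)
b = torch.randn(1, 3 * state_size)

# forward returns a list of tensors; the first two are the new hidden and cell states
new_h, new_C = lltm_cpp.forward(X, W, b, h, C)[:2]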
On compatibility issues
A small note on compilers: Due to ABI versioning issues, the compiler you use to build your C++ extension must be ABI-compatible with the compiler PyTorch was built with. In practice, this means that you must use GCC version 4.9 and above on Linux. For Ubuntu 16.04 and other more-recent Linux distributions, this should be the default compiler already. On MacOS, you must use clang (which does not have any ABI versioning issues). In the worst case, you can build PyTorch from source with your compiler and then build the extension with that same compiler.
In short, the GCC you build the extension with must be ABI-compatible with the GCC that PyTorch was compiled with. If PyTorch was installed via conda install, my understanding is that it was built with the official toolchain's GCC; otherwise, in the worst case, you compile the PyTorch source with your own GCC, so that the extension and PyTorch end up built with the same compiler.
Wrapping with torch.autograd.Function and torch.nn.Module
The pattern is to write the forward computation and save the tensors that will be needed later, then retrieve them in backward:
ctx.save_for_backward(*variables)  # save; the * here is Python argument unpacking, not a pointer or reference
*ctx.saved_tensors  # retrieve; the saved tuple is unpacked again, e.g. when passing the tensors on to the C++ backward
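A sketch of this wrapper along the lines of the official tutorial (lltm_cpp, the outputs list, and the gradient ordering below are assumptions taken from that example):
import torch
import lltm_cpp  # the compiled extension from above

class LLTMFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weights, bias, old_h, old_cell):
        outputs = lltm_cpp.forward(input, weights, bias, old_h, old_cell)
        new_h, new_cell = outputs[:2]
        # stash whatever backward will need (the * unpacks the list of tensors)
        ctx.save_for_backward(*(outputs[1:] + [weights]))
        return new_h, new_cell

    @staticmethod
    def backward(ctx, grad_h, grad_cell):
        # retrieve the saved tensors and hand them to the C++ backward
        outputs = lltm_cpp.backward(
            grad_h.contiguous(), grad_cell.contiguous(), *ctx.saved_tensors)
        d_old_h, d_input, d_weights, d_bias, d_old_cell = outputs
        # gradients must be returned in the same order as forward's inputs
        return d_input, d_weights, d_bias, d_old_h, d_old_cell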
C++/CUDA Extension
C++ file
- declare functions that are defined in CUDA (.cu) files -----> declare the CUDA functions
- do some checks -----> check that each tensor is_cuda and is_contiguous
- forward its calls to the CUDA functions -----> run the checks and call into the CUDA functions; essentially a thin shell (see the sketch after this list)
- pybind -----> bind Python to the C++ side
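A rough sketch of what that C++ "shell" file looks like, mirroring the structure of the tutorial's CUDA example (only the forward side shown):
#include <torch/extension.h>
#include <vector>

// declarations of the functions implemented in the .cu file
std::vector<torch::Tensor> lltm_cuda_forward(
    torch::Tensor input, torch::Tensor weights, torch::Tensor bias,
    torch::Tensor old_h, torch::Tensor old_cell);

// the usual input checks
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

std::vector<torch::Tensor> lltm_forward(
    torch::Tensor input, torch::Tensor weights, torch::Tensor bias,
    torch::Tensor old_h, torch::Tensor old_cell) {
  CHECK_INPUT(input);
  CHECK_INPUT(weights);
  CHECK_INPUT(bias);
  CHECK_INPUT(old_h);
  CHECK_INPUT(old_cell);
  // no real work here: just forward the call to the CUDA implementation
  return lltm_cuda_forward(input, weights, bias, old_h, old_cell);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &lltm_forward, "LLTM forward (CUDA)");
}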
CUDA .cu file
Define the basic math helper (device) functions
template <typename scalar_t>
__device__ __forceinline__ scalar_t sigmoid(scalar_t z) {
  // the tutorial also defines this helper, which d_sigmoid below relies on
  return 1.0 / (1.0 + exp(-z));
}
template <typename scalar_t>
__device__ __forceinline__ scalar_t d_sigmoid(scalar_t z) {
  const auto s = sigmoid(z);
  return (1.0 - s) * s;
}
template <typename scalar_t>
__device__ __forceinline__ scalar_t d_tanh(scalar_t z) {
  const auto t = tanh(z);
  return 1 - (t * t);
}
template <typename scalar_t>
__device__ __forceinline__ scalar_t elu(scalar_t z, scalar_t alpha = 1.0) {
  return fmax(0.0, z) + fmin(0.0, alpha * (exp(z) - 1.0));
}
template <typename scalar_t>
__device__ __forceinline__ scalar_t d_elu(scalar_t z, scalar_t alpha = 1.0) {
  const auto e = exp(z);
  const auto d_relu = z < 0.0 ? 0.0 : 1.0;
  return d_relu + (((alpha * (e - 1.0)) < 0.0) ? (alpha * e) : 0.0);
}
These use some CUDA-specific qualifiers and device functions:
- __device__ (the function runs on the GPU and is called from device code)
- __forceinline__ (forces inlining)
- exp (device math function)
- ... and so on
They all follow roughly the same form.
To now actually implement a function, we’ll again need two things: one function that performs operations we don’t wish to explicitly write by hand and calls into CUDA kernels, and then the actual CUDA kernel for the parts we want to speed up.
So the remaining code again splits into the function we don't want to write by hand (which dispatches into the CUDA kernels) and the actual CUDA kernels for the parts we want to speed up.
The part we don't write by hand (the host-side dispatch function)
Inside this function there is always a call into the kernel, typically of the form:
AT_DISPATCH_FLOATING_TYPES(gates.type(), "lltm_forward_cuda", ([&] {
  lltm_cuda_forward_kernel<scalar_t><<<blocks, threads>>>(
      args);
}));
Here AT_DISPATCH_FLOATING_TYPES behaves like a template mechanism: based on the runtime dtype of gates it instantiates the lambda with the matching concrete scalar_t (restricted to float and double); to allow every dtype, use AT_DISPATCH_ALL_TYPES instead. --------------> One question: why not just define it as a plain template? (My understanding may be off; the kernel is already a template, but a tensor's dtype is only known at run time, so the macro's job is precisely that runtime switch which picks the right instantiation to call.)
accessor
- CPU version: .accessor<>()
- kernel (CUDA) version: the packed accessor (.packed_accessor<>() / .packed_accessor32<>()), which can be passed by value into the kernel; a combined sketch follows below
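A combined sketch of how the dispatch macro and a packed accessor fit together (a made-up element-wise kernel, not the real lltm_cuda_forward; the packed_accessor32 variant assumes a reasonably recent PyTorch):
#include <torch/extension.h>
#include <cuda.h>
#include <cuda_runtime.h>

template <typename scalar_t>
__global__ void scale_kernel(
    const torch::PackedTensorAccessor32<scalar_t, 2, torch::RestrictPtrTraits> input,
    torch::PackedTensorAccessor32<scalar_t, 2, torch::RestrictPtrTraits> output,
    scalar_t factor) {
  // one block row per matrix row, threads cover the columns
  const int row = blockIdx.y;
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (col < input.size(1)) {
    output[row][col] = factor * input[row][col];
  }
}

torch::Tensor scale_cuda(torch::Tensor input, double factor) {
  // on the CPU side the analogous call would be input.accessor<float, 2>()
  auto output = torch::empty_like(input);
  const int threads = 256;
  const dim3 blocks((input.size(1) + threads - 1) / threads, input.size(0));

  // pick the concrete scalar_t from the runtime dtype, then launch the kernel
  AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "scale_cuda", ([&] {
    scale_kernel<scalar_t><<<blocks, threads>>>(
        input.packed_accessor32<scalar_t, 2, torch::RestrictPtrTraits>(),
        output.packed_accessor32<scalar_t, 2, torch::RestrictPtrTraits>(),
        static_cast<scalar_t>(factor));
  }));
  return output;
}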
! How does a .cu file actually run things in parallel?
Everything happens in the index computation and the for loop.
index = blockDim.x * blockIdx.x + threadIdx.x
Each thread computes its own global index from its block index and thread index, so all the threads launched in the grid operate on different elements at the same time.
Hence the for loop contains logic like:
stride = blockDim.x * gridDim.x
for (i = index; i < N; i += stride)
This means that if the threads of one grid are not enough to cover all the work in a single pass, each thread loops several times until every operation along the x dimension is done (the grid-stride loop); parallelism along the y and z dimensions follows the same logic.
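A self-contained toy example of this grid-stride pattern (plain CUDA, unrelated to the LLTM kernel; all names are made up for illustration):
#include <cuda_runtime.h>
#include <cstdio>

// y[i] = x[i] * x[i], using the index/stride pattern described above
__global__ void square_kernel(const float* x, float* y, int n) {
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  const int stride = blockDim.x * gridDim.x;
  // each thread handles elements index, index + stride, index + 2*stride, ...
  for (int i = index; i < n; i += stride) {
    y[i] = x[i] * x[i];
  }
}

int main() {
  const int n = 1 << 20;
  float *x, *y;
  cudaMallocManaged(&x, n * sizeof(float));
  cudaMallocManaged(&y, n * sizeof(float));
  for (int i = 0; i < n; ++i) x[i] = static_cast<float>(i);

  const int threads = 256;
  // deliberately launch fewer blocks than n/threads: the stride loop covers the rest
  const int blocks = 64;
  square_kernel<<<blocks, threads>>>(x, y, n);
  cudaDeviceSynchronize();

  printf("y[3] = %f\n", y[3]);  // expect 9.0
  cudaFree(x);
  cudaFree(y);
  return 0;
}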