如何通过C++及CUDA扩展Pytorch

最新推荐文章于 2024-07-16 13:27:06 发布

远1

最新推荐文章于 2024-07-16 13:27:06 发布

阅读量3.5k

点赞数 4

分类专栏： pytorch 算法学习 TensorFlow 文章标签：深度学习 pytorch

本文链接：https://blog.csdn.net/u011509971/article/details/104064178

版权

TensorFlow 同时被 3 个专栏收录

7 篇文章 0 订阅

订阅专栏

pytorch

1 篇文章 0 订阅

订阅专栏

算法学习

1 篇文章 0 订阅

订阅专栏

Pytorch虽然已经使用了NVIDIA cuDNN、Intel MKL和NNPACK这些底层来加快训练速度，但是在某些情况下，比如我们要实现一些特定算法，光靠组合Pytorch已有的操作是不够的。这是因为Pytorch虽然在特定操作上经过了很好的优化，但是对于Pytorch已经写好的这些操作，假如我们组合起来，组成我们的新的算法，Pytorch才不管你的算法的具体执行流程，一般Pytorch只会按照设计好的操作去使用GPU的通道，这样会导致通道不能充分利用或者直接超负载，并且python解释器也不能对此进行优化，导致程序执行速度变慢。

准备工作

由于在服务器上开发，一般用户是没有root权限的，这就导致用户在开发过程中遇到依赖软件不兼容时没有权限去更新依赖软件。为了避免这些问题，首先我们在自己的用户目录下配置所需依赖环境。
第一步需要安装anaconda3环境，在官网下载相应版本，根据官网命令执行安装就行。
安装好anaconda3后输入python,发现启动的还是系统自带的python环境，这时需要通过PATH导入anaconda3的python环境。我执行的命令是export PATH=/home/guangyuan/anaconda3/bin:$PATH，该命令需要根据你安装的anaconda3路径而做相应调整。
第二步通过 conda create -n venv pip python=3.6 创建一个新的虚拟环境，这样可以在新的虚拟环境下进行开发，一旦出现错，直接删除虚拟环境，重新配置即可。安装完虚拟环境需要通过 source activate venv 来激活虚拟化境，从而进入新创建的虚拟环境开发。如果要退出虚拟环境只要输入source deactivate即可。

注意： 激活新创建的虚拟环境后需要通过PYTHONPATH导入该虚拟环境下安装的一些库。命令如下：export PYTHONPATH=/home/guangyuan/anaconda3/envs/venv/lib/python3.6/site-packages:$PYTHONPATH；具体路径需要根据自己配置的环境调整。
第三步安装pytorch1.2(安装参考官网教程https://pytorch.org/)
第四部通过以下命令安装一些依赖库：
pip install pytest
pip install pybind11
至此所需的环境配置基本完成，如果在开发中缺少一些库，可以直接通过pip install 库名安装即可。

用c++及CUDA进行编写程序

本文以向pytorch添加deptwise conv为例来一步步展示如何向pytorcht添加源码。
首先给出我们的工程组织结构，之后会详细介绍每个文件的的作用。

depthwise_conv
- depthwise_conv.py
- depthwise_conv_cuda.cpp
- depthwise_conv_kernel.cu
- setup.py
- __init__.py

首先有个depthwise_conv文件夹，文件夹里有五个文件。
涉及到c++和cuda的是depthwise_conv_cuda.cpp和depthwise_conv_kernel.cu，之后通过setup.py进行编译并绑定到python，最后封装成python模块，即depthwise_conv.py所做内容。通过c++和cuda扩展pytorch，一般需要这几个文件。

首先来看depthwise_conv_kernel.cu文件，该文件包含了向pytorch添加的层的主要实现过程。主要包含cuda kernel实现和对cuda kernel 调用函数的实现，该函数首先根据传入的参数，得到后续要用到的一些信息，比如根据输入的数据张量，我们可以得到该数据的batch size和channel大小等，同时还需要重新分配一些数据张量来保存计算结果。

我们以向pytorch 添加depthwise conv 来具体说明depthwise_conv_kernel.cu实现过成。我们知道通过c++和cuda向一些深度学习框架添加层，一般需要实现该层的forward和backward过程，而为了提高计算速度，forward和backward的核心计算都在cuda kernel里实现。对于depthwise_conv_kernel.cu 我们实现了

__global__ void ConvForward（）
__global__ void ConvBackward（）
__global__ void ConvBackwardWeight（）
__global__ void ConvBackwardBias（）等核函数和 DepthWiseConvForwardLaucher(）、DepthWiseConvBackwarddLaucher（）来封装cuda kernel 函数。

下面以depwise conv 的forward实现过程并结合源码具体讲解。首先要实现__global__ void ConvForward（）函数，代码实现如下：

template <typename scalar_t>
__global__ void ConvForward(const int nthreads,
		const scalar_t* const bottom_data, const int num, const int channels,
		const int height, const int width,const int conved_height,
		const int conved_width,const int kernel_h, const int kernel_w,
		const int stride_h, const int stride_w, const int pad_h, const int pad_w,
		scalar_t* const top_data,const scalar_t* const weight,const scalar_t* const bias,const bool bias_term_) {
	CUDA_KERNEL_LOOP(index, nthreads) {

		const int pw = index % conved_width;
		const int ph = (index / conved_width) % conved_height;
		const int c = (index / conved_width / conved_height) % channels;
		const int n = index / conved_width / conved_height / channels;
		int hstart = ph * stride_h - pad_h;
		int wstart = pw * stride_w - pad_w;
		int hend = min(hstart + kernel_h, height + pad_h);
		int wend = min(wstart + kernel_w, width + pad_w);
		hstart = max(hstart, 0);
		wstart = max(wstart, 0);
		hend = min(hend, height);
		wend = min(wend, width);
		scalar_t aveval = 0;
		const scalar_t* const bottom_slice =
		bottom_data + (n * channels + c) * height * width;
		const scalar_t* const weight_slice =
		weight + c * kernel_h * kernel_w;
		int khstart=hend<kernel_h?kernel_h-hend:0;
		int kwstart=wend<kernel_w?kernel_w-wend:0;
		for (int h = hstart; h < hend; ++h) {
			for (int w = wstart; w < wend; ++w) {

				aveval += bottom_slice[h * width + w]*weight_slice[(khstart+h-hstart) * kernel_w + (kwstart+w-wstart)];

			}
		}
		if(bias_term_) {
			aveval+=bias[c];
		}
		top_data[index] = aveval;
	}
}

该段代码就是depthwise conv的forward cuda kernel实现过程，和pytorch框架无关，需要cuda编程基础。

之后我们需要实现一个函数，该函数需要实现如何把forward cuda kernel封装到pytorch框架，一般需要包含#include <ATen/ATen.h>

#include <THC/THCAtomics.cuh>这两个头文件。

代码具体实现如下：

#include <ATen/ATen.h>
#include <THC/THCAtomics.cuh>

at::Tensor DepthWiseConvForwardLaucher(const at::Tensor input, const at::Tensor weight, const at::Tensor bias, 
                                       const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, 
                                       const int pad_h, const int pad_w, const bool bias_term_){
    //根据input得到batch_size、channels、input_height和input_width等信息以便后面使用
    const auto batch_size = input.size(0);
    const auto channels = input.size(1);
    const auto input_height = input.size(2);
    const auto input_width = input.size(3);
    
	//计算depthwise conv 之后输出Tensor的height和width
    const auto kernal_extent_h = /* dilation_h * */ (kernel_h - 1) + 1;
    const auto conved_height = (input_height + 2 * pad_h - kernal_extent_h) / stride_h + 1;

    const auto kernal_extent_w = /* dilation_w * */ (kernel_w - 1) + 1;
    const auto conved_width = (input_width + 2 * pad_w - kernal_extent_w) / stride_w + 1;
    
    //初始化输出Tensor，以便存放depthwise conv forward cuda kernel计算输出结果，默认值设置为0。
    IntList size = {batch_size, channels, conved_height, conved_width};
    auto output = at::zeros(size, input.options());
    const auto count = batch_size * channels * conved_height * conved_width;
	
    //通过pytorch提供的接口调用cuda kernel(本代码调用ConvForward kernel)
    AT_DISPATCH_FLOATING_TYPES_AND_HALF(
        input.type(), "ConvLaucherForward",
        ([&]{
            const scalar_t *bottom_data = input.data<scalar_t>();
            scalar_t *top_data = output.data<scalar_t>();
            const scalar_t *depthwise_weight = weight.data<scalar_t>();
           
            if (bias_term_){
                const scalar_t *depthwise_bias = bias.data<scalar_t>();
                ConvForward<scalar_t><<<GET_BLOCKS(count), THREADS_PER_BLOCK>>>(count, bottom_data, batch_size,
                    channels, input_height, input_width, conved_height, conved_width, kernel_h, kernel_w, stride_h,
                    stride_w, pad_h, pad_w, top_data, depthwise_weight, depthwise_bias, bias_term_);
            }else{
                ConvForward<scalar_t><<<GET_BLOCKS(count), THREADS_PER_BLOCK>>>(count, bottom_data, batch_size,
                    channels, input_height, input_width, conved_height, conved_width, kernel_h, kernel_w, stride_h,
                    stride_w, pad_h, pad_w, top_data, depthwise_weight, 0, bias_term_);
            }
        }));
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
    return output;
}

该函数一般需要做以下几件事：1、根据输入Tensor得到得到batch_size、channels、input_height和input_width等信息以便后面使用；2、根据得到的信息和计算公式计算出输出Tensor维度大小；3、根据计算出的输出Tensor维度信息为输出Tensor申请内存；4、通过 pytorch提供的AT_DISPATCH_FLOATING_TYPES_AND_HALF接口调用cuda kernel函数。

因为我们的数据存放在 at::Tensor定义的变量中，所以需要了解at::Tensor的一些常用函数，比如通过.size()

可以得到Tensor维度信息，通过.data（）可以得到Tensor的裸指针，通过.type()可以拿到该Tensor的数据类型等。

下面简单介绍一下AT_DISPATCH_FLOATING_TYPES_AND_HALF用法，AT_DISPATCH_FLOATING_TYPES_AND_HALF需要给三个参数，第一个为数据类型可以通过输入tensor调用.type()

得到，第二个为一个字符串，没什么特别要求，第三个参数接受一个匿名函数，调用cuda kernel 函数逻辑都在该匿名函数里实现。至此depthwise conv forward过程已经完成。

对于deptwise conv的backward实现步骤和forward的一样，这里就不做详细介绍，只给出源码实现。

//对输入张量（Tensor）、权重Weight和Bias实现backward过程。
template <typename scalar_t>
__global__ void ConvBackward(const int nthreads,
const scalar_t* const top_diff,
const int num, const int channels, const int height,
const int width, const int conved_height, const int conved_width,
const int kernel_h, const int kernel_w, const int stride_h,
const int stride_w, const int pad_h, const int pad_w,
scalar_t* const bottom_diff,
const scalar_t* const weight) {

	CUDA_KERNEL_LOOP(index, nthreads) {
		const int w = index % width + pad_w;
		const int h = (index / width) % height + pad_h;
		const int c = (index / width / height) % channels;
		const int n = index / width / height / channels;
		
		const int phstart = (h < kernel_h) ? 0 : (h - kernel_h) / stride_h + 1;
		const int phend = min(h / stride_h + 1, conved_height);
		const int pwstart = (w < kernel_w) ? 0 : (w - kernel_w) / stride_w + 1;
		const int pwend = min(w / stride_w + 1, conved_width);
		
		const int khstart=(h >= kernel_h) ? ((h-kernel_h)%stride_h)+(kernel_h-stride_h): h;
		const int kwstart=(w >= kernel_w) ? ((w-kernel_w)%stride_w)+(kernel_w-stride_w) : w;
		
		scalar_t gradient = 0;
		const scalar_t* const top_diff_slice =
		top_diff + (n * channels + c) * conved_height * conved_width;
		
		const scalar_t* const weight_slice =weight + c * kernel_h * kernel_w;
		
		for (int ph = phstart; ph < phend; ++ph) {
			for (int pw = pwstart; pw < pwend; ++pw) {
				int kh=khstart-(ph-phstart)*stride_h;
				int kw=kwstart-(pw-pwstart)*stride_w;
				gradient += top_diff_slice[ph * conved_width + pw] *weight_slice[kh*kernel_w+kw];
			}
		}
		bottom_diff[index] = gradient;
	}
}                             



template <typename scalar_t>
__global__ void ConvBackwardWeight(const int nthreads,
const scalar_t* const top_diff,
const int num, const int channels, const int height,
const int width, const int conved_height, const int conved_width,
const int kernel_h, const int kernel_w, const int stride_h,
const int stride_w, const int pad_h, const int pad_w,
scalar_t* const weight_diff,
const scalar_t* const bottom_data) {

	CUDA_KERNEL_LOOP(index, nthreads) {
		const int kw=index % kernel_w;
		const int kh= (index /kernel_w)%kernel_h;
		const int c=index /kernel_w/kernel_h;
        scalar_t gradient = 0;
		for( int n=0;n<num;n++) {
			
			const scalar_t* const top_diff_slice = top_diff + (n * channels + c) * conved_height * conved_width;
			const scalar_t* const bottom_data_slice = bottom_data + (n * channels + c) * height * width;
		
			
			const int phstart=max(DIVIDE_CEIL((pad_h-kh),stride_h),0);
			const int phend=min(DIVIDE_CEIL((height+pad_h-kh),stride_h),conved_height);
		
			const int pwstart=max(DIVIDE_CEIL((pad_w-kw),stride_w),0);
			
			const int pwend=min(DIVIDE_CEIL((width+pad_w-kw),stride_w),conved_width);

			for(int ph=phstart;ph<phend;ph++){
				for (int pw=pwstart;pw<pwend;pw++){
					const int h=ph*stride_h+kh-pad_h;
					const int w=pw*stride_w+kw-pad_w;
					gradient+=top_diff_slice[ph * conved_width + pw]*bottom_data_slice[h*width+w];
				}
			}
		}
		weight_diff[c * kernel_h * kernel_w+kh*kernel_w+kw]+=gradient;
	}
}

template <typename scalar_t>
__global__ void ConvBackwardBias(const int nthreads,
const scalar_t* const top_diff,
const int num, const int channels, const int height,
const int width, const int conved_height, const int conved_width,
const int kernel_h, const int kernel_w, const int stride_h,
const int stride_w, const int pad_h, const int pad_w,
scalar_t* const bias_diff) {
	CUDA_KERNEL_LOOP(index, nthreads) {
		const int c = index;
		scalar_t gradient=0;
		for( int n=0;n<num;n++) {
			const scalar_t* const top_diff_slice =
			top_diff + (n * channels + c) * conved_height * conved_width;
			for(int ph=0;ph<conved_height;ph++) {
				for (int pw=0;pw<conved_width;pw++) {
					gradient+=top_diff_slice[ph * conved_width + pw];
				}
			}
		}
		bias_diff[c]+=gradient;
	}
}

//封装DepthWiseConvBackwarddLaucher函数调用cuda kernel 函数。
std::vector<at::Tensor> DepthWiseConvBackwarddLaucher(const at::Tensor output_grad, const at::Tensor input, const at::Tensor weight, const at::Tensor bias, 
                                                      const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, 
                                                      const int pad_h, const int pad_w, const bool bias_term_){
    const auto batch_size = input.size(0);
    const auto channels = input.size(1);
    const auto input_height = input.size(2);
    const auto input_width = input.size(3);

    const auto kernal_extent_h = /* dilation_h * */ (kernel_h - 1) + 1;
    const auto conved_height = (input_height + 2 * pad_h - kernal_extent_h) / stride_h + 1;
    const auto kernal_extent_w = /* dilation_w * */ (kernel_w - 1) + 1;
    const auto conved_width = (input_width + 2 * pad_w - kernal_extent_w) / stride_w + 1;

    const int count_weight = channels * kernel_h * kernel_w;
    const int count_input = batch_size * channels * input_height * input_width;

    auto weight_diff = at::zeros_like(weight);
    auto bottom_diff = at::zeros_like(input);
    at::Tensor bias_diff;
    int count_bias = 0;

    if (bias_term_){
        count_bias = channels;
        bias_diff = at::zeros_like(bias);
    }

    AT_DISPATCH_FLOATING_TYPES_AND_HALF(
        output_grad.type(), "ConvLaucherBackward",
        ([&]{
            const scalar_t *bottom_data = input.data<scalar_t>();
            const scalar_t *depthwise_weight = weight.data<scalar_t>();
            const scalar_t *top_diff = output_grad.data<scalar_t>();
            scalar_t *depthwise_weight_diff = weight_diff.data<scalar_t>();
            scalar_t *depthwise_bottom_diff = bottom_diff.data<scalar_t>();

            if (bias_term_){
                scalar_t *depthwise_bias_diff = bias_diff.data<scalar_t>();
                ConvBackwardBias<scalar_t><<<GET_BLOCKS(count_bias), THREADS_PER_BLOCK>>>(count_bias, top_diff, batch_size,
                    channels, input_height, input_width, conved_height, conved_width, kernel_h, kernel_w, stride_h,
                    stride_w, pad_h, pad_w, depthwise_bias_diff);
            }

            ConvBackwardWeight<scalar_t><<<GET_BLOCKS(count_weight), THREADS_PER_BLOCK>>>(count_weight, top_diff, batch_size,
                channels, input_height, input_width, conved_height, conved_width, kernel_h, kernel_w, stride_h,
                stride_w, pad_h, pad_w, depthwise_weight_diff, bottom_data);

            ConvBackward<scalar_t><<<GET_BLOCKS(count_input), THREADS_PER_BLOCK>>>(count_input, top_diff, batch_size,
                channels, input_height, input_width, conved_height, conved_width, kernel_h, kernel_w, stride_h,
                stride_w, pad_h, pad_w, depthwise_bottom_diff, depthwise_weight);
            
        }));
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }

    if (bias_term_){
        return {bottom_diff, weight_diff, bias_diff};
    }
    else{
        return {bottom_diff, weight_diff};
    }
}

至此depthwise_conv_kernel.cu要做的工作已经完成。

之后我们会再创建一个.cpp文件，用了来封装DepthWiseConvForwardLaucher和 DepthWiseConvBackwarddLaucher函数，并把刚才通过c++及cuda实现的层绑定到Python上。

本文我们用depthwise_conv_cuda.cpp来实现上面说的功能，首先给出源码：

#include <torch/torch.h>
#include <cmath>
#include <vector>

at::Tensor DepthWiseConvForwardLaucher(const at::Tensor input, const at::Tensor weight, const at::Tensor bias, 
                                       const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, 
                                       const int pad_h, const int pad_w, const bool bias_term_);

std::vector<at::Tensor> DepthWiseConvBackwarddLaucher(const at::Tensor output_grad, const at::Tensor input, const at::Tensor weight, const at::Tensor bias, 
                                                      const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, 
                                                      const int pad_h, const int pad_w, const bool bias_term_);

#define CHECK_CUDA(x) AT_CHECK(x.type().is_cuda(), #x, " must be a CUDAtensor ")
#define CHECK_CONTIGUOUS(x) \
  AT_CHECK(x.is_contiguous(), #x, " must be contiguous ")
#define CHECK_INPUT(x) \
  CHECK_CUDA(x);       \
  CHECK_CONTIGUOUS(x)

at::Tensor depthwise_conv_forward_cuda(const at::Tensor input, const at::Tensor weight, const at::Tensor bias, 
                                       const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, 
                                       const int pad_h, const int pad_w, const bool bias_term_){
    CHECK_INPUT(input);
    CHECK_INPUT(weight);
    CHECK_INPUT(bias);
    return DepthWiseConvForwardLaucher(input, weight, bias, kernel_h, kernel_w, stride_h, stride_w,
                                       pad_h, pad_w, bias_term_);
}

std::vector<at::Tensor> depthwise_conv_backward_cuda(const at::Tensor output_grad, const at::Tensor input, const at::Tensor weight, const at::Tensor bias, 
                                                      const int kernel_h, const int kernel_w, const int stride_h, const int stride_w, 
                                                      const int pad_h, const int pad_w, const bool bias_term_){
  CHECK_INPUT(output_grad);
  CHECK_INPUT(input);
  CHECK_INPUT(weight);
  if(bias_term_){
    CHECK_INPUT(bias);
  }
  return DepthWiseConvBackwarddLaucher(output_grad, input, weight, bias, kernel_h, kernel_w, stride_h, stride_w, 
                                       pad_h, pad_w, bias_term_);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("forward", &depthwise_conv_forward_cuda, "Depthwise_Conv forward (CUDA)");
  m.def("backward", &depthwise_conv_backward_cuda, "Depthwise_Conv backward (CUDA)");
}

该源码做了以下事,：1、创建depthwise_conv_forward_cuda和depthwise_conv_backward_cuda函数，这两个主要做了两个事，一是通过调用CHECK_INPUT来检查输入Tensors是否符合，之后调用DepthWiseConvForwardLaucher和DepthWiseConvBackwarddLaucher；2、通过PYBIND11_MODULE把depthwise_conv_forward_cuda和depthwise_conv_backward_cuda绑定到python。

至此depthwise_conv_cuda.cpp要完成的工作已经完成，之后我们需要编写setup.py文件对源码进行编译，setup.py的内容如下：

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='depethwise_conv',
    ext_modules=[
        CUDAExtension('depthwise_conv_cuda', [
            './depthwise_conv_cuda.cpp',
            './depthwise_conv_kernel.cu',
        ])
    ],
    cmdclass={'build_ext': BuildExtension})

在setup.py我们使用setuptools去编译我们的C++代码，CUDAExtension和BuildExtension可以很方便地让我们实现对C++和Cuda的编译。该代码可以通过简单修改移植到其它层的添加任务。

之后进入depthwise_conv目录下，通过python setup.py build_ext --inplace 便可以对源码进行编译。编译完成后我们会在depthwise_conv目录下看到编译生成的depthwise_conv_cuda.cpython-36m-x86_64-linux-gnu.so文件。之后我们可以进入python环境经行验证，注意此时我们是在depthwise_conv目录下进入python环境，即python环境的当前目录是在depthwise_conv下，而当前目录下有depthwise_conv_cuda.cpython-36m-x86_64-linux-gnu.so，可以import depthwise_conv_cuda；如果想在任意目录下import depthwise_conv_cuda，我们需要通过PYTHONPATH导入，命令如下export PYTHONPATH=/data/nfs_share/public/guangyuan/workplace/ops/depthwise_conv:$PYTHONPATH

要导入的PYTHONPATH路径需要根据自己的实际路径进行调整。

In [1]: import torch
In [2]: import depthwise_conv_cuda
In [3]: depthwise_conv_cuda.forward
Out[3]: <built-in method forward of PyCapsule object at 0x7fb70c690d50>

然后我们就可以定义我们的自定义层了，我们的自定义层在depthwise_conv.py下，实现如下：

import math
import collections
from itertools import repeat

import torch
import torch.nn as nn
from torch.autograd import Function 
from torch.nn.parameter import Parameter
import depthwise_conv_cuda

def _ntuple(n):
    def parse(x):
        if isinstance(x, collections.Iterable):
            return x
        return tuple(repeat(x, n))
    return parse

_pair = _ntuple(2)


class DepthWiseConvFunction(Function):
    @staticmethod
    def forward(ctx, inputs, weight, bias, kernel_size, stride=1, padding=0, use_bias=False):
        kernel_size = _pair(kernel_size)
        stride = _pair(stride)
        padding = _pair(padding)
        assert inputs.size(1) == weight.size(0)

        if use_bias:
            ctx.save_for_backward(*[inputs, weight, bias])
            # ctx.save_for_backward(inputs, weight, bias)
        else:
            bias = torch.zeros(1, dtype=weight.dtype, device=weight.device)
            ctx.save_for_backward(*[inputs, weight, bias])

        ctx.use_bias = use_bias
        ctx.kernel_size = kernel_size
        ctx.stride = stride
        ctx.padding = padding

        output = depthwise_conv_cuda.forward(inputs, weight, bias, kernel_size[0], kernel_size[1],
                                             stride[0], stride[1], padding[0], padding[1], use_bias) 
        return output
    
    @staticmethod
    def backward(ctx, grad_output):
        assert grad_output.is_cuda
        use_bias = ctx.use_bias
        kernel_size = ctx.kernel_size
        stride = ctx.stride
        padding = ctx.padding
        if use_bias:
            inputs, weight, bias = ctx.saved_variables
            output_grads = depthwise_conv_cuda.backward(grad_output.contiguous(), inputs, weight, bias, kernel_size[0],
                                                        kernel_size[1], stride[0], stride[1], padding[0],
                                                        padding[1], use_bias)

            inputs_grad, weight_grad, bias_grad = output_grads
            return  inputs_grad, weight_grad, bias_grad, None, None, None, None
        else:
            inputs, weight, bias = ctx.saved_variables
            output_grads = depthwise_conv_cuda.backward(grad_output.contiguous(), inputs, weight, bias, kernel_size[0],
                                                        kernel_size[1], stride[0], stride[1], padding[0],
                                                        padding[1], use_bias)
                
            inputs_grad, weight_grad = output_grads
            return  inputs_grad, weight_grad, None, None, None, None, None
       
depthwise_conv = DepthWiseConvFunction.apply


class DepthWiseConv2d(nn.Module):
    def __init__(self, in_channels, kernel_size=3, stride=1, padding=0, use_bias=False):
        super(DepthWiseConv2d, self).__init__()
        self.in_channels = in_channels
        self.kernel_size = _pair(kernel_size)
        self.stride = _pair(stride)
        self.padding = _pair(padding)
        self.use_bias = use_bias
        self.weight = Parameter(torch.Tensor(in_channels, 1, *self.kernel_size))
        if use_bias:
            self.bias = Parameter(torch.Tensor(in_channels))
        else:
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self):
        n = self.in_channels
        for k in self.kernel_size:
            n *= k
        stdv = 1. / math.sqrt(n)
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, inputs):
        return depthwise_conv(inputs, self.weight, self.bias, self.kernel_size, self.stride, self.padding, self.use_bias)



def testDepthWiseConvFunction():
    from torch.autograd import Variable
    from torch.nn import functional as F

    device = "cuda:0"
    inputs = torch.randn(4, 3, 7, 7).to(device)
    w = Variable(torch.randn(3, 1, 3, 3), requires_grad=True).to(device)
    b = Variable(torch.randn(3), requires_grad=True).to(device)

    # opt = F.conv2d(inputs, w, bias=None, stride=1, padding=0, dilation=1, groups=3) 
    # print(opt.size())

    inp = depthwise_conv(inputs, w, b, 3)
    loss = inp.sum()
    loss.backward()

def testDepthWiseConv2d():
    device = "cuda:0"
    inputs = torch.randn(4, 3, 7, 7).to(device)
    depthwcon = DepthWiseConv2d(3)
    depthwcon.to(device)
    outp = depthwcon(inputs)
    loss = outp.sum()
    loss.backward()
    print(outp.size())


if __name__ == '__main__':
   testDepthWiseConvFunction()
   testDepthWiseConv2d()

该代码就是基本的pytorch自定义层的实现步骤，可参考官网教程https://pytorch.org/docs/stable/notes/extending.html，唯一的区别是我们在 DepthWiseConvFunction层中我们的forward和backward具体计算我们是调用的自己通过源码扩展的，而不是通过pytorch的python接口函数进行组合出来的。

远1

关注

4
点赞
踩
14

收藏

觉得还不错? 一键收藏
5
评论
如何通过C++及CUDA扩展Pytorch

如何通过C++及CUDA扩展PytorchPytorch虽然已经使用了NVIDIA cuDNN、Intel MKL和NNPACK这些底层来加快训练速度，但是在某些情况下，比如我们要实现一些特定算法，光靠组合Pytorch已有的操作是不够的。这是因为Pytorch虽然在特定操作上经过了很好的优化，但是对于Pytorch已经写好的这些操作，假如我们组合起来，组成我们的新的算法，Pytorch才不管你...
复制链接

扫一扫

专栏目录