PyTorch Study Notes (3): PyTorch 1.0+ C++/CUDA Extensions

References

The official PyTorch tutorial
Source code of the official PyTorch tutorial

Along the way I also referred to a tutorial by this blogger.

Local system

Ubuntu 18.04 LTS
NVIDIA GeForce GTX 1080 with driver 430.50
CUDA V10.1.243
Python 3.6.8 (virtualenv)
torch.__version__ == '1.2.0'

Source code

All of the source code is available here.

Visual Studio Code configuration

Since I use a Python virtual environment, the location of the C++ headers shipped with PyTorch has to be specified manually before the VS Code C/C++ plugin can find them. On my machine the virtual env lives at
/home/yaoyu/p3pt
and the header that needs to be made visible is
/home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include/torch/extension.h
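
If you prefer not to hard-code this path, torch can report its own include directories; below is a minimal sketch whose output can be pasted into the C/C++ plugin's includePath setting (c_cpp_properties.json).

# Print the include directories used by PyTorch's extension build helpers.
# These are the paths the VS Code C/C++ plugin needs so that
# <torch/extension.h> resolves.
from torch.utils.cpp_extension import include_paths

for p in include_paths():
    print(p)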

Setup.py

Create a new setup.py file inside the module directory with the following content:

from setuptools import setup, Extension
from torch.utils import cpp_extension

setup(name="SigmoidCpp",
      ext_modules=[cpp_extension.CppExtension("SigmoidCpp", ['SigmoidCpp.cpp'])],
      cmdclass={'build_ext': cpp_extension.BuildExtension})

Here SigmoidCpp is the name of the Python module that our C++ PyTorch extension will be built as, and SigmoidCpp.cpp is our C++ source file.

cpp

Create a file named SigmoidCpp.cpp and save it next to setup.py.

#include <torch/extension.h>

#include <iostream>
#include <vector>

std::vector<torch::Tensor> sigmoid_cpp_forward( torch::Tensor input )
{
    return { torch::sigmoid( input ) };
}

std::vector<torch::Tensor> sigmoid_cpp_backward( torch::Tensor grad, torch::Tensor s )
{
    auto sp  = (1 - s) * s;

    return { grad * sp };
}

PYBIND11_MODULE( TORCH_EXTENSION_NAME, m )
{
    m.def("forward", &sigmoid_cpp_forward, "SigmoidCpp forward");
    m.def("backward", &sigmoid_cpp_backward, "SigmoidCpp backward");
}

Both sigmoid_cpp_forward() and sigmoid_cpp_backward() use std::vector as the return type; the Python binding converts these std::vector objects into Python lists. This simple example does not actually need std::vector; it is used here only to demonstrate how to return multiple tensors.

After that, run the following inside the module directory

python setup.py build_ext

If there are no compilation or linking errors, it will report a successful build of build/lib.linux-x86_64-3.6/SigmoidCpp.cpython-36m-x86_64-linux-gnu.so. The build directory is created automatically inside the current module directory.

Then execute

python setup.py install --record .installed_files.txt

to install the freshly built package into the Python environment. Since a virtual env is in use, the files are installed into the virtual env's directory; the contents of .installed_files.txt show exactly which files were installed. During debugging, these files can be removed from the Python environment with

cat .installed_files.txt | xargs rm -rf

Testing the package

Start python or ipython and try the following commands

import torch
import SigmoidCpp

help(SigmoidCpp.forward)

which prints the following:

Help on built-in function forward in module SigmoidCpp:

forward(...) method of builtins.PyCapsule instance
    forward(arg0: at::Tensor) -> List[at::Tensor]
    
    SigmoidCpp forward

The Python function signature shown on my machine differs slightly from the official PyTorch tutorial, mainly in the argument types: here they are in the ATen (at::) namespace, whereas the official tutorial shows the torch namespace. Note that torch must be imported first; otherwise you may get strange errors about C++ shared libraries failing to load. Likewise, help(SigmoidCpp.backward) gives

Help on built-in function backward in module SigmoidCpp:

backward(...) method of builtins.PyCapsule instance
    backward(arg0: at::Tensor, arg1: at::Tensor) -> List[at::Tensor]
    
    SigmoidCpp backward
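
Before wiring the extension into autograd, a quick direct call is a handy sanity check; here is a minimal sketch (assuming the package built above has been installed) that also confirms the list return value mentioned earlier.

import torch      # import torch first, as noted above
import SigmoidCpp

x = torch.rand(2, 2)
out = SigmoidCpp.forward(x)                    # a Python list holding one tensor
s = out[0]
g = SigmoidCpp.backward(torch.ones_like(s), s)[0]

print(isinstance(out, list))                   # True
print(torch.allclose(s, torch.sigmoid(x)))     # True
print(torch.allclose(g, (1.0 - s) * s))        # True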

Subclassing autograd.Function

Create a new Python file, which I named SigmoidCppAG.py (AG stands for AutoGradient; I know the name is a bit silly, it was an off-the-cuff choice). Its content can be

import torch

import SigmoidCpp

class SigmoidCppFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        s = SigmoidCpp.forward(x)
        ctx.save_for_backward( s[0] )
        return s[0]

    @staticmethod
    def backward(ctx, grad):
        sv = ctx.saved_variables

        output = SigmoidCpp.backward( grad, sv[0] )

        return output[0]

class SigmoidCppM(torch.nn.Module):
    def __init__(self):
        super(SigmoidCppM, self).__init__()
    
    def forward(self, x):
        return SigmoidCppFunction.apply( x )

Here, following the recommendation of the official PyTorch tutorial, we subclass both torch.autograd.Function and torch.nn.Module, and the torch.nn.Module subclass calls the torch.autograd.Function defined just before it. A few points are worth noting:

  • When subclassing torch.autograd.Function, the forward and backward functions are static methods.
  • SigmoidCpp is the name of the package built from our C++ extension, as defined in setup.py.
  • SigmoidCpp.forward and SigmoidCpp.backward are the sigmoid_cpp_forward and sigmoid_cpp_backward functions implemented in the C++ extension, mapped to forward and backward by the PYBIND11_MODULE macro.
  • In the current example, SigmoidCpp.forward returns a list, and so does SigmoidCpp.backward.
  • No matter how many variables ctx.save_for_backward stores, ctx.saved_variables always returns a sequence (a tuple), so it has to be indexed even when only one variable was saved.

On my machine I wrote a simple test script to check the correctness of SigmoidCppAG.py. The script is based on the previous PyTorch study note.

import torch
from SigmoidCppAG import SigmoidCppM

if __name__ == "__main__":
    dc = SigmoidCppM()

    x = torch.rand((2,2), requires_grad=True)  # The input data.
    Y = torch.rand((2,2), requires_grad=False) # The true data.

    # Forward.
    y = dc(x)

    # Compute the loss.
    L = Y - y

    # Backward.
    L.backward(torch.ones(2,2))

    print("x = {}. ".format(x))
    print("y = {}. ".format(y))
    print("x.grad = {}. ".format( x.grad ))

    # Manually compute the gradient.
    pLpx = -1.0 * (1.0 - y) * y

    print("PartialLPartialX = {}. ".format(pLpx))

Running the script on my machine produces output similar to

x = tensor([[0.1945, 0.4618],
        [0.5555, 0.2470]], requires_grad=True). 
y = tensor([[0.5485, 0.6134],
        [0.6354, 0.5614]], grad_fn=<SigmoidCppFunctionBackward>). 
x.grad = tensor([[-0.2476, -0.2371],
        [-0.2317, -0.2462]]). 
PartialLPartialX = tensor([[-0.2476, -0.2371],
        [-0.2317, -0.2462]], grad_fn=<MulBackward0>). 

The values of x.grad match the values of pLpx, which shows that the gradient of L with respect to x is computed correctly.
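
As an extra check beyond the manual comparison, torch.autograd.gradcheck can compare the C++ backward against finite differences; here is a minimal sketch (gradcheck expects double-precision inputs).

import torch
from torch.autograd import gradcheck

from SigmoidCppAG import SigmoidCppFunction

# gradcheck perturbs x numerically and compares the result with the
# analytical gradient produced by SigmoidCpp.backward.
x = torch.rand(2, 2, dtype=torch.double, requires_grad=True)
print(gradcheck(SigmoidCppFunction.apply, (x,), eps=1e-6, atol=1e-4))  # True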

Testing on the GPU

A PyTorch module built as a C++ extension like this supports GPU computation directly. With a small modification to our test code:

import torch
from torch.autograd import Variable

from SigmoidCppAG import SigmoidCppM

if __name__ == "__main__":
    dc = SigmoidCppM()

    x = Variable( torch.rand((2,2)).cuda(), requires_grad=True )  # The input data.
    Y = Variable( torch.rand((2,2)).cuda(), requires_grad=False ) # The true data.

    # Forward.
    y = dc(x)

    # Compute the loss.
    L = Y - y

    # Backward.
    L.backward(torch.ones(2,2).cuda())

    print("x = {}. ".format(x))
    print("y = {}. ".format(y))
    print("x.grad = {}. ".format( x.grad ))

    # Manually compute the gradient.
    pLpx = -1.0 * (1.0 - y) * y

    print("PartialLPartialX = {}. ".format(pLpx))

Running it again gives the same kind of result (the actual numbers are, of course, still random):

x = tensor([[0.5006, 0.0769],
        [0.2221, 0.4054]], device='cuda:0', requires_grad=True). 
y = tensor([[0.6226, 0.5192],
        [0.5553, 0.6000]], device='cuda:0',
       grad_fn=<SigmoidCppFunctionBackward>). 
x.grad = tensor([[-0.2350, -0.2496],
        [-0.2469, -0.2400]], device='cuda:0'). 
PartialLPartialX = tensor([[-0.2350, -0.2496],
        [-0.2469, -0.2400]], device='cuda:0', grad_fn=<MulBackward0>). 

Because this example is so simple, the run actually feels slow: a CUDA context has to be created, and that overhead is quite noticeable.

CUDA

Now let us try converting the C++ extension module into a mixed C++/CUDA one.

Modifying the original cpp file

First, change the original cpp file so that it calls CUDA versions of the forward and backward functions, following the implementation in the official tutorial.

#include <torch/extension.h>

#include <vector>

// CUDA interfaces.
std::vector<torch::Tensor> sigmoid_cpp_forward_cuda( torch::Tensor input );
std::vector<torch::Tensor> sigmoid_cpp_backward_cuda( torch::Tensor grad, torch::Tensor s );

// C++ interfaces.

#define CHECK_CUDA(x) TORCH_CHECK(x.type().is_cuda(), #x " must be a CUDA tensor. ")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous. ")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

std::vector<torch::Tensor> sigmoid_cpp_forward( torch::Tensor input )
{
    CHECK_INPUT(input);

    return sigmoid_cpp_forward_cuda(input);
}

std::vector<torch::Tensor> sigmoid_cpp_backward( torch::Tensor grad, torch::Tensor s )
{
    CHECK_INPUT(grad);
    CHECK_INPUT(s);

    return sigmoid_cpp_backward_cuda(grad, s);
}

PYBIND11_MODULE( TORCH_EXTENSION_NAME, m )
{
    m.def("forward", &sigmoid_cpp_forward, "SigmoidCpp forward, CUDA version. ");
    m.def("backward", &sigmoid_cpp_backward, "SigmoidCpp backward, CUDA version. ");
}

In the code above, sigmoid_cpp_forward_cuda() and sigmoid_cpp_backward_cuda() are only declarations of two not-yet-implemented functions. They are implemented in SigmoidCpp_Kernel.cu below. (Visual Studio Code has a CUDA plugin, which makes the syntax highlighting much nicer!)

Following my usual convention, functions prefixed with k_ are CUDA kernel functions and functions prefixed with d_ are CUDA device functions. In this example we assume the input is 3-dimensional with shape (B, H, W), where B can be understood as the mini-batch size, and we apply the sigmoid function element-wise to this data. For simplicity, the kernel launch configuration is fixed to a 2x2x1 thread layout per block, and the number of blocks launched varies with B; each block uses its 2x2 threads to process the 2-D data of one batch entry. In short, the CUDA logic is very simple. In the future, when the input has a shape like (B, C, H, W), the number of threads per block along the third dimension could be tied to C.
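
To make the launch geometry concrete, here is a small worked example of my own (using an arbitrary shape) of how many blocks the interface functions below launch.

# For an input of shape (B, H, W) = (4, 5, 7) and 2x2x1 threads per block,
# the grid computed below is
#   blocks = ((W + 2 - 1) // 2, (H + 2 - 1) // 2, B) = (4, 3, 4),
# i.e. 48 blocks in total. The grid-stride loops in the kernels make any
# thread that falls outside the tensor simply do nothing.
B, H, W = 4, 5, 7
threadsX, threadsY = 2, 2
blocks = ((W + threadsX - 1) // threadsX, (H + threadsY - 1) // threadsY, B)
print(blocks)  # (4, 3, 4)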

This example directly uses the officially recommended PackedTensorAccessor data structure, so there is no more headache about how to index into high-dimensional tensors.

#include <torch/extension.h>

#include <cuda.h>
#include <cuda_runtime.h>

#include <vector>

// ========== Device functions. ==========

template <typename scalar_t> 
__device__ __forceinline__ scalar_t d_sigmoid(scalar_t x)
{
    return 1.0 / ( 1.0 + exp(-x) );
}

// ========== Kernel functions. ==========

template <typename scalar_t>
__global__ void k_sigmoid_cpp_forward( 
    const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> input,
    torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> output )
{
    const int idxX    = blockIdx.x * blockDim.x + threadIdx.x;
    const int idxY    = blockIdx.y * blockDim.y + threadIdx.y;
    const int idxZ    = blockIdx.z * blockDim.z + threadIdx.z;
    const int strideX = gridDim.x * blockDim.x;
    const int strideY = gridDim.y * blockDim.y;
    const int strideZ = gridDim.z * blockDim.z;

    const int b = input.size(0);
    const int h = input.size(1);
    const int w = input.size(2);

    for ( int z = idxZ; z < b; z += strideZ )
    {
        for ( int y = idxY; y < h; y += strideY )
        {
            for ( int x = idxX; x < w; x += strideX )
            {
                output[z][y][x] = d_sigmoid( input[z][y][x] );
            }
        }
    }
}

template <typename scalar_t> 
__global__ void k_sigmoid_cpp_backward(
    const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> grad,
    const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> s,
    torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> output )
{
    const int idxX    = blockIdx.x * blockDim.x + threadIdx.x;
    const int idxY    = blockIdx.y * blockDim.y + threadIdx.y;
    const int idxZ    = blockIdx.z * blockDim.z + threadIdx.z;
    const int strideX = gridDim.x * blockDim.x;
    const int strideY = gridDim.y * blockDim.y;
    const int strideZ = gridDim.z * blockDim.z;

    const int b = s.size(0);
    const int h = s.size(1);
    const int w = s.size(2);

    for (int z = idxZ; z < b; z += strideZ )
    {
        for ( int y = idxY; y < h; y += strideY )
        {
            for ( int x = idxX; x < w; x += strideX )
            {
                output[z][y][x] = 
                    grad[z][y][x] * 
                    ( 1.0 - s[z][y][x] ) * s[z][y][x];
            }
        }
    }
}

// ========== Interface functions. ==========

std::vector<torch::Tensor> sigmoid_cpp_forward_cuda( torch::Tensor input )
{
    // Get the batch size.
    auto b = input.size(0);

    // The 2D tensor dimensions.
    auto h = input.size(1);
    auto w = input.size(2);

    // Prepare output.
    auto output = torch::zeros_like(input);

    const int threadsX = 2;
    const int threadsY = 2;

    // Kernel launch dimensions.
    const dim3 blocks( ( w + threadsX - 1 ) / threadsX, ( h + threadsY - 1 ) / threadsY, b );
    const dim3 thrds( threadsX, threadsY, 1 );

    // Kernel launch.
    AT_DISPATCH_FLOATING_TYPES( input.type(), "sigmoid_cpp_forwrd_cuda", ([&] {
        k_sigmoid_cpp_forward<scalar_t><<<blocks, thrds>>>( 
            input.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
            output.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>() );
    }) );

    return { output };
}

std::vector<torch::Tensor> sigmoid_cpp_backward_cuda( torch::Tensor grad, torch::Tensor s )
{
    // Get the batch size.
    auto b = s.size(0);

    // Get the 2D tensor dimensions.
    auto h = s.size(1);
    auto w = s.size(2);

    // The result.
    auto output = torch::zeros_like(s);

    const int threadsX = 2;
    const int threadsY = 2;

    // Kernel launch dimensions.
    const dim3 blocks( ( w + threadsX - 1 ) / threadsX, ( h + threadsY - 1 ) / threadsY, b );
    const dim3 thrds( threadsX, threadsY, 1 );

    // Kernel launch.
    AT_DISPATCH_FLOATING_TYPES( s.type(), "sigmoid_cpp_backward_cuda", ( [&] {
        k_sigmoid_cpp_backward<scalar_t><<<blocks, thrds>>>( 
            grad.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
            s.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
            output.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>() );
    } ) );

    return { output };
}

Modifying setup.py

Following the official tutorial, setup.py becomes

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
      name="SigmoidCppCUDA",
      ext_modules=[
            CUDAExtension("SigmoidCppCUDA", [
                  'SigmoidCpp.cpp',
                  'SigmoidCpp_Kernel.cu',
                  ] )
            ],
      cmdclass={
            'build_ext': BuildExtension
            }
      )

Compile and link with python setup.py build_ext. With the current configuration on my machine, some compatibility warnings are reported. (The deprecation concerns tensor.type(); newer PyTorch releases prefer tensor.scalar_type() here, but the warning is harmless for this example.)

SigmoidCpp_Kernel.cu: In lambda function:
SigmoidCpp_Kernel.cu:101:98: warning: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated [-Wdeprecated-declarations]
     AT_DISPATCH_FLOATING_TYPES( input.type(), "sigmoid_cpp_forwrd_cuda", ([&] {
                                                                                                  ^
/home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include/ATen/Dispatch.h:78:1: note: declared here
 inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties &t) {
 ^~~~~~~~~~~

Import test

Open ipython and type import SigmoidCppCUDA directly; the following error is raised (the same thing happens when import-testing the non-CUDA version):

In [1]: import SigmoidCppCUDA                                                                     
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-a62db65a4354> in <module>
----> 1 import SigmoidCppCUDA

ImportError: /home/yaoyu/p3pt/lib/python3.6/site-packages/SigmoidCppCUDA-0.0.0-py3.6-linux-x86_64.egg/SigmoidCppCUDA.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail37_typeMetaDataInstance_preallocated_32E

The exact cause is not entirely clear, but the fix is simple: import torch before import SigmoidCppCUDA. The error then disappears, and the generated interface can be inspected with help.

help(SigmoidCppCUDA.forward)

Help on built-in function forward in module SigmoidCppCUDA:

forward(...) method of builtins.PyCapsule instance
    forward(arg0: at::Tensor) -> List[at::Tensor]
    
    SigmoidCpp forward, CUDA version.

Implementing AutoGrad

As with the non-CUDA version, we modify SigmoidCppAG.py:

import torch

import SigmoidCppCUDA

class SigmoidCppFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        s = SigmoidCppCUDA.forward(x)
        ctx.save_for_backward( s[0] )
        return s[0]

    @staticmethod
    def backward(ctx, grad):
        sv = ctx.saved_variables

        output = SigmoidCppCUDA.backward( grad, sv[0] )

        return output[0]

class SigmoidCppM(torch.nn.Module):
    def __init__(self):
        super(SigmoidCppM, self).__init__()
    
    def forward(self, x):
        return SigmoidCppFunction.apply( x )

Testing the CUDA version of SigmoidCpp

To make testing more convenient, I wrote a slightly more elaborate script:


import argparse

import torch
from torch.autograd import Variable

from SigmoidCppAG import SigmoidCppM

def parse_dims(s):
    """
    s is a string in the format "BxHxW", 
    where B, H, and W are positive integers.
    """

    n = s.split("x")

    if (len(n) != 3):
        raise Exception("len(n) = %d. " % (len(n)))

    return int(n[0]), int(n[1]), int(n[2])

if __name__ == "__main__":
    # Handle arguments.
    parser = argparse.ArgumentParser(description="Test SigmoidCppCUDA.")

    parser.add_argument("--dim", type=str, default="2x4x4", \
        help="The dimensions to be tested. Must be 3 integers separated by 'x'")

    parser.add_argument("--show-details", action="store_true", default=False, \
        help="Show the actual values of the tensors. ")
    
    args = parser.parse_args()

    # Get the test dimensions.
    B, H, W = parse_dims(args.dim)

    assert B > 0
    assert H > 0
    assert W > 0

    # Create the new SigmoidCppM layer.
    dc = SigmoidCppM()

    x = Variable( torch.rand((B, H, W)).cuda(), requires_grad=True )  # The input data.
    Y = Variable( torch.rand((B, H, W)).cuda(), requires_grad=False ) # The true data.

    # Forward.
    y = dc(x)

    # Compute the loss.
    L = Y - y

    # Backward.
    L.backward(torch.ones(B, H, W).cuda())

    if ( args.show_details ):
        print("x = {}. ".format(x))
        print("y = {}. ".format(y))
        print("x.grad = {}. ".format( x.grad ))

    # Manually compute the gradient.
    pLpx = -1.0 * (1.0 - y) * y

    if ( args.show_details ):
        print("PartialLPartialX = {}. ".format(pLpx))

    # Compute the error.
    e = pLpx - x.grad
    print("torch.norm(e) = {}. ".format( torch.norm(e) ))

When running the script, --dim and --show-details control the tensor dimensions and how much gets printed. Below is the output for --dim "4x4x4"; the actual numbers are, of course, randomly generated. (Please do not specify overly large dimensions, since each block only has 2x2 threads after all.)

python TestSigmoidCppAGCUDA.py --dim "4x4x4" --show-details
x = tensor([[[0.1786, 0.1014, 0.9765, 0.0538],
         [0.2067, 0.9495, 0.8037, 0.2129],
         [0.7946, 0.3683, 0.7087, 0.8855],
         [0.1146, 0.0669, 0.8119, 0.6183]],

        [[0.0649, 0.2158, 0.5043, 0.9226],
         [0.2660, 0.7767, 0.4650, 0.6800],
         [0.9747, 0.8680, 0.5171, 0.5489],
         [0.4577, 0.9509, 0.3144, 0.6033]],

        [[0.5792, 0.7980, 0.2871, 0.2417],
         [0.0321, 0.0926, 0.3278, 0.4340],
         [0.6546, 0.9834, 0.0917, 0.9409],
         [0.2512, 0.9886, 0.7203, 0.9157]],

        [[0.9405, 0.1793, 0.0754, 0.8796],
         [0.3422, 0.4109, 0.4934, 0.6166],
         [0.1838, 0.4501, 0.1483, 0.3856],
         [0.6885, 0.8647, 0.0581, 0.9128]]], device='cuda:0',
       requires_grad=True). 
y = tensor([[[0.5445, 0.5253, 0.7264, 0.5134],
         [0.5515, 0.7210, 0.6908, 0.5530],
         [0.6888, 0.5911, 0.6701, 0.7080],
         [0.5286, 0.5167, 0.6925, 0.6498]],

        [[0.5162, 0.5538, 0.6235, 0.7156],
         [0.5661, 0.6850, 0.6142, 0.6637],
         [0.7260, 0.7043, 0.6265, 0.6339],
         [0.6125, 0.7213, 0.5780, 0.6464]],

        [[0.6409, 0.6896, 0.5713, 0.5601],
         [0.5080, 0.5231, 0.5812, 0.6068],
         [0.6580, 0.7278, 0.5229, 0.7193],
         [0.5625, 0.7288, 0.6727, 0.7142]],

        [[0.7192, 0.5447, 0.5188, 0.7067],
         [0.5847, 0.6013, 0.6209, 0.6494],
         [0.5458, 0.6107, 0.5370, 0.5952],
         [0.6656, 0.7036, 0.5145, 0.7136]]], device='cuda:0',
       grad_fn=<SigmoidCppFunctionBackward>). 
x.grad = tensor([[[-0.2480, -0.2494, -0.1987, -0.2498],
         [-0.2473, -0.2012, -0.2136, -0.2472],
         [-0.2143, -0.2417, -0.2211, -0.2067],
         [-0.2492, -0.2497, -0.2129, -0.2275]],

        [[-0.2497, -0.2471, -0.2348, -0.2035],
         [-0.2456, -0.2158, -0.2370, -0.2232],
         [-0.1989, -0.2082, -0.2340, -0.2321],
         [-0.2374, -0.2010, -0.2439, -0.2286]],

        [[-0.2302, -0.2141, -0.2449, -0.2464],
         [-0.2499, -0.2495, -0.2434, -0.2386],
         [-0.2250, -0.1981, -0.2495, -0.2019],
         [-0.2461, -0.1976, -0.2202, -0.2041]],

        [[-0.2019, -0.2480, -0.2496, -0.2073],
         [-0.2428, -0.2397, -0.2354, -0.2277],
         [-0.2479, -0.2378, -0.2486, -0.2409],
         [-0.2226, -0.2085, -0.2498, -0.2044]]], device='cuda:0'). 
PartialLPartialX = tensor([[[-0.2480, -0.2494, -0.1987, -0.2498],
         [-0.2473, -0.2012, -0.2136, -0.2472],
         [-0.2143, -0.2417, -0.2211, -0.2067],
         [-0.2492, -0.2497, -0.2129, -0.2275]],

        [[-0.2497, -0.2471, -0.2348, -0.2035],
         [-0.2456, -0.2158, -0.2370, -0.2232],
         [-0.1989, -0.2082, -0.2340, -0.2321],
         [-0.2374, -0.2010, -0.2439, -0.2286]],

        [[-0.2302, -0.2141, -0.2449, -0.2464],
         [-0.2499, -0.2495, -0.2434, -0.2386],
         [-0.2250, -0.1981, -0.2495, -0.2019],
         [-0.2461, -0.1976, -0.2202, -0.2041]],

        [[-0.2019, -0.2480, -0.2496, -0.2073],
         [-0.2428, -0.2397, -0.2354, -0.2277],
         [-0.2479, -0.2378, -0.2486, -0.2409],
         [-0.2226, -0.2085, -0.2498, -0.2044]]], device='cuda:0',
       grad_fn=<MulBackward0>). 
torch.norm(e) = 0.0. 

The final norm is zero, which indicates that the result is correct.

JIT

The official tutorial points out that we can also compile the extension just in time (JIT). Here is a test script, jit.py:

from torch.utils.cpp_extension import load

SigmoidCppCUDA = load( 
    name="DispCorrCUDA", 
    sources=[ "SigmoidCpp.cpp", "SigmoidCpp_Kernel.cu"],
    verbose=True
 )

help(SigmoidCppCUDA)

Running the script produces the following output. (If ninja is not installed on the system, install it via pip first. The ninja version on my machine is 1.9.0.post1.)

Using /tmp/torch_extensions as PyTorch extensions root...
Creating extension directory /tmp/torch_extensions/SigmoidCppCUDA...
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/torch_extensions/SigmoidCppCUDA/build.ninja...
Building extension module SigmoidCppCUDA...
[1/3] c++ -MMD -MF SigmoidCpp.o.d -DTORCH_EXTENSION_NAME=SigmoidCppCUDA -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include -isystem /home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include/TH -isystem /home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda-10.1/include -isystem /home/yaoyu/p3pt/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++11 -c /home/yaoyu/Projects/DeepLearningModels/PyTorchExtensionBlog/SigmoidCppCUDA/SigmoidCpp.cpp -o SigmoidCpp.o
[2/3] /usr/local/cuda-10.1/bin/nvcc -DTORCH_EXTENSION_NAME=SigmoidCppCUDA -DTORCH_API_INCLUDE_EXTENSION_H -isystem /home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include -isystem /home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include/TH -isystem /home/yaoyu/p3pt/lib/python3.6/site-packages/torch/include/THC -isystem /usr/local/cuda-10.1/include -isystem /home/yaoyu/p3pt/include/python3.6m -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -std=c++11 -c /home/yaoyu/Projects/DeepLearningModels/PyTorchExtensionBlog/SigmoidCppCUDA/SigmoidCpp_Kernel.cu -o SigmoidCpp_Kernel.cuda.o
[3/3] c++ SigmoidCpp.o SigmoidCpp_Kernel.cuda.o -shared -L/usr/local/cuda-10.1/lib64 -lcudart -o SigmoidCppCUDA.so
Loading extension module SigmoidCppCUDA...


Help on module SigmoidCppCUDA:

NAME
    SigmoidCppCUDA

FUNCTIONS
    backward(...) method of builtins.PyCapsule instance
        backward(arg0: at::Tensor, arg1: at::Tensor) -> List[at::Tensor]
        
        SigmoidCpp backward, CUDA version.
    
    forward(...) method of builtins.PyCapsule instance
        forward(arg0: at::Tensor) -> List[at::Tensor]
        
        SigmoidCpp forward, CUDA version.

FILE
    /tmp/torch_extensions/SigmoidCppCUDA/SigmoidCppCUDA.so
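
The module returned by load() can be used just like the installed package; here is a minimal sketch (assuming the two source files are in the current directory and a CUDA device is available).

import torch
from torch.utils.cpp_extension import load

# Build the extension just in time (or reuse the cached build) and call it directly.
ext = load(name="SigmoidCppCUDA",
           sources=["SigmoidCpp.cpp", "SigmoidCpp_Kernel.cu"])

x = torch.rand((2, 4, 4), device="cuda")       # the CUDA build expects 3-D input
s = ext.forward(x)[0]
print(torch.allclose(s, torch.sigmoid(x)))     # True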

Postscript

That is all for this note. Regarding PyTorch C++ extensions, version 0.4 and versions after 1.0 are not compatible, which has also been discussed on the PyTorch issue tracker. The API may well change again in the future; I will deal with that when it happens.
