Pytorch中实现CUDA加速【来自AI葵 pytorch-cppcuda-tutorial】

空客A380

于 2024-06-03 12:06:00 发布

阅读量813

点赞数 22

文章标签： pytorch 人工智能 python

本文链接：https://blog.csdn.net/RobinA380/article/details/139411159

版权

【本教程主要参考自kwea123】
GitHub - kwea123/pytorch-cppcuda-tutorial: tutorial for writing custom pytorch cpp+cuda kernel, applied on volume rendering (NeRF)

目的

实现GPU加速计算，场景有非平行运算和大量串行运算两种，本教程主要针对第一种如体素渲染的应用。
非平行运算：像体素渲染那样的，由于每次采样点个数不一样，计算过程有差别，但有相似性又需要大量重复计算。（对比像神经网络那样的，每一次计算过程完全一样，只是数据不一样的叫平行运算。）

准备

CUDA+cudnn环境
VS Code配置C++环境，依然优雅🐶；
Anaconda提供虚拟环境

Part1 Pytorch调用cpp

步骤：
conda创建虚拟环境
pip安装pytorch（已经习惯了本地.whl安装方式，稳定！）
vs code里新建interpolation.cpp

#include <torch/extension.h>

int main() {
    cout<<"hello"<<endl;
}

Ctrl+Shift+P，C/C++编辑配置，添加includePath，会在.vscode目录下新建c_cpp_properties.json
其实，这一步只是为了消除代码中include找不到.h头文件产生红色下划线的问题，虽然我这里还是会有，但其实只是代码提示作用，并不影响。
以下分割线部分，误入歧途，慎入。

此时，调试运行程序还是会报错找不到.h库文件。
其实，正常的添加第三方库是需要写CMakeLists.txt！
其实，pytroch官方也给出了调试运行过程：Installing C++ Distributions of PyTorch — PyTorch main documentation

example-app/
  CMakeLists.txt
  example-app.cpp

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(example-app)

find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

if(NOT Torch_FOUND)
  message(FATAL_ERROR "Pytorch Not Found！")
endif(NOT Torch_FOUND)

message(STATUS "Pytorch status: ")
message(STATUS "  libraries: ${TORCH_LIBRARIES}")
message(STATUS "  prefix: ${TORCH_INSTALL_PREFIX}")

add_executable(example-app example-app.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
set_property(TARGET example-app PROPERTY CXX_STANDARD 17)

# The following code block is suggested to be used on Windows.
# According to https://github.com/pytorch/pytorch/issues/25457,
# the DLLs need to be copied to avoid memory errors.
if (MSVC)
  file(GLOB TORCH_DLLS "${TORCH_INSTALL_PREFIX}/lib/*.dll")
  add_custom_command(TARGET example-app
                     POST_BUILD
                     COMMAND ${CMAKE_COMMAND} -E copy_if_different
                     ${TORCH_DLLS}
                     $<TARGET_FILE_DIR:example-app>)
endif (MSVC)

#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor tensor = torch::rand({2, 3});
  std::cout << tensor << std::endl;
}

mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=/absolute/path/to/libtorch ..
cmake --build . --config Release

备注：这里的绝对路径可以是下载的libtorch（https://pytorch.org/，见下图），也可以是之前下载的pytorch（例如D:\pytorch-cpp-cuda\venv\Lib\site-packages\torch）

运行结果：
PS D:\pytorch-cpp-cuda\build\Release> .\example-app
0.0059 0.7112 0.7168
0.0425 0.4746 0.4270
[ CPUFloatType{2,3} ]
其实，已经跑偏了！！！！
以上做法是进行C++调试项目，而本教程的目的是**C++包装算子，**也就是通过后面的setup.py打包，心急吃不了热豆腐！

言归正传，
interpolation.cpp

#include <torch/extension.h>

torch::Tensor trilinear_interpolation_cpp(
    torch::Tensor feats,
    torch::Tensor point
){
    return feats;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME,m){
    m.def("trilinear_interpolation_py", &trilinear_interpolation_cpp);
}

setup.py

from setuptools import setup
from torch.utils.cpp_extension import CppExtension,BuildExtension

setup(
    name='cppcuda_tutorial',
    version='1.0',
    author='Robin',
    author_email='704805264@qq.com',
    description='cppcuda example',
    long_description='cppcuda example',
    ext_modules=[
        CppExtension(
            name='cppcuda_tutorial',
            sources=['interpolation.cpp']
        )
    ],
    cmdclass={
        'build_ext':BuildExtension
    }
)

先升级pip：python -m pip install pip -U
再编译安装：pip install .
备注：打开终端快捷键Ctrl+~
test.py

import torch
import cppcuda_tutorial

feats=torch.ones(2)
point=torch.zeros(2)

out=cppcuda_tutorial.trilinear_interpolation_py(feats,point)

print(out)

终端运行python test.py，输出结果
PS D:\pytorch-cpp-cuda> python .\test.py
tensor([1., 1.])
没问题！
总结逻辑关系图如下：

Part2 Pytorch经cpp调用CUDA

新建interpolation_kernal.cu

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor trilinear_fw_cu(
    torch::Tensor feats,
    torch::Tensor points
){
    return feats;
}

对应的头文件：新建include目录下utils.h

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor trilinear_fw_cu(
    const torch::Tensor feats,
    const torch::Tensor points
);

修改interpolation.cpp

#include <utils.h>

torch::Tensor trilinear_interpolation(
    torch::Tensor feats,
    torch::Tensor points
){
    CHECK_INPUT(feats);
    CHECK_INPUT(points);
    return trilinear_fw_cu(feats,points);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME,m){
    m.def("trilinear_interpolation", &trilinear_interpolation);
}

注意：#include的是<utils.h>，而不是<include/utils.h>，否则pip安装时报找不到该文件的错！我是要把所有的坑踩完吗？😭
修改setup.py

import glob
import os.path as osp
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension


ROOT_DIR = osp.dirname(osp.abspath(__file__))
include_dirs = [osp.join(ROOT_DIR, "include")]

sources = glob.glob('*.cpp')+glob.glob('*.cu')


setup(
    name='cppcuda_tutorial',
    version='1.0',
    author='kwea123',
    author_email='kwea123@gmail.com',
    description='cppcuda_tutorial',
    long_description='cppcuda_tutorial',
    ext_modules=[
        CUDAExtension(
            name='cppcuda_tutorial',
            sources=sources,
            include_dirs=include_dirs,
            extra_compile_args={'cxx': ['-O2'],
                                'nvcc': ['-O2']}
        )
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)

终端pip install .安装
修改test.py

import torch
import cppcuda_tutorial
import os

if __name__=='__main__':
    feats=torch.ones(2,device='cuda')
    point=torch.zeros(2,device='cuda')

    out=cppcuda_tutorial.trilinear_interpolation(feats,point)

    print(out)

总结逻辑关系图如下：

Part3 CUDA原理

GPU原理：

修改interpolation_kernel.cu

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor trilinear_fw_cu(
    torch::Tensor feats,
    torch::Tensor points
){
    const int N=feats.size(0),F=feats.size(2);
    torch::Tensor feat_interp=torch::zeros({N,F},feats.options());

    const dim3 threads(16,16);
    const dim3 blocks((N+threads.x-1)/threads.x,(F+threads.y-1)/threads.y);

    AT_DISPATCH_FLOATING_TYPES(feats.type(),"trilinear_fw_cu",
    ([&]{
        trilinear_fw_kernel<scalar_t><<<blocks,threads>>>(
            feats.packed_accessor<scalar_t,3,torch::RestrictPtrTraits,size_t>(),
            points.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
            feat_interp.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>()
        );
    }));
    return feat_interp;
}

Part4 CUDA实现

解决三线性插值问题

修改interpolation_kernel.cu

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

template <typename scalar_t>
__global__ void trilinear_fw_kernel(
const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> feat_interp
){
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    const int f = blockIdx.y * blockDim.y + threadIdx.y;
    
    //覆盖的某些部分并没有数据，不需要计算
    if (n>=feats.size(0) || f>=feats.size(2)) return;

    // point -1~1
    const scalar_t u = (points[n][0]+1)/2;
    const scalar_t v = (points[n][1]+1)/2;
    const scalar_t w = (points[n][2]+1)/2;

    const scalar_t a = (1-v)*(1-w);
    const scalar_t b = (1-v)*w;
    const scalar_t c = v*(1-w);
    const scalar_t d = 1-a-b-c;
    feat_interp[n][f] = (1-u)*(a*feats[n][0][f] +
        b*feats[n][1][f] +
        c*feats[n][2][f] +
        d*feats[n][3][f]) + 
        u*(a*feats[n][4][f] +
        b*feats[n][5][f] +
        c*feats[n][6][f] +
        d*feats[n][7][f]);
}

torch::Tensor trilinear_fw_cu(
torch::Tensor feats,
torch::Tensor points
){
    const int N=feats.size(0),F=feats.size(2);
    torch::Tensor feat_interp=torch::zeros({N,F},feats.options());

    const dim3 threads(16,16);
    const dim3 blocks((N+threads.x-1)/threads.x,(F+threads.y-1)/threads.y);

    AT_DISPATCH_FLOATING_TYPES(feats.type(),"trilinear_fw_cu",
    ([&]{
        trilinear_fw_kernel<scalar_t><<<blocks,threads>>>(
            feats.packed_accessor<scalar_t,3,torch::RestrictPtrTraits,size_t>(),
            points.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
            feat_interp.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>()
            );
    }));
    return feat_interp;
}

test.py中用python直接实现，与cuda结果比较是否一致，并比较计算时间。

import torch
import cppcuda_tutorial
import time

def trilinear_interpolation_py_nocuda(feats, points):
    u = (points[:, 0:1]+1)/2
    v = (points[:, 1:2]+1)/2
    w = (points[:, 2:3]+1)/2
    a = (1-v)*(1-w)
    b = (1-v)*w
    c = v*(1-w)
    d = 1-a-b-c

    feats_interp = (1-u)*(a*feats[:, 0] +
                          b*feats[:, 1] +
                          c*feats[:, 2] +
                          d*feats[:, 3]) + \
                       u*(a*feats[:, 4] +
                          b*feats[:, 5] +
                          c*feats[:, 6] +
                          d*feats[:, 7])
    
    return feats_interp

if __name__=='__main__':
    N=80000
    F=256
    feats=torch.rand(N,8,F,device='cuda')
    points=torch.rand(N,3,device='cuda')*2-1

    t = time.time()
    out_py_cuda = cppcuda_tutorial.trilinear_interpolation_py(feats, points)
    torch.cuda.synchronize()
    print('   cuda fw time: ', time.time()-t, 's')

    t = time.time()
    out_py_nocuda = trilinear_interpolation_py_nocuda(feats, points)
    torch.cuda.synchronize()
    print('pytorch fw time: ', time.time()-t, 's')
    
    print('fw all close: ', torch.allclose(out_py_cuda, out_py_nocuda))

备注：torch.cuda.synchronize()-CSDN博客
输出结果：
cuda fw time: 0.021244049072265625 s
pytorch fw time: 0.05333447456359863 s
fw all close: True

Part5 CUDA反向传播

cuda计算的结果不具有梯度（可以打印out_py_cuda.requires_grad查看），不能进行loss的反向传播。
$\frac{\partial f}{\partial f_{1}}=\left(1-u\right)\left(1-v\right)~~~~~~~~~ \frac{\partial f}{\partial f_{3}}=\left(1-u\right)v\\\frac{\partial f}{\partial f_{2}}=u\left(1-v\right)~~~~~~~~~~~~~~~~~~~~\frac{\partial f}{\partial f_{4}}=uv$

$\frac{\partial L}{\partial f_{1}}=\frac{\partial L}{\partial f}\cdot\frac{\partial f}{\partial f_{1}}=\frac{\partial L}{\partial f}\cdot(1-u)(1-v)$ ，……

修改interpolation_kernel.cu：

#include <torch/extension.h>


template <typename scalar_t>
__global__ void trilinear_fw_kernel(
    const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
    const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
    torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> feat_interp
){
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    const int f = blockIdx.y * blockDim.y + threadIdx.y;

    if (n>=feats.size(0) || f>=feats.size(2)) return;

    // point -1~1
    const scalar_t u = (points[n][0]+1)/2;
    const scalar_t v = (points[n][1]+1)/2;
    const scalar_t w = (points[n][2]+1)/2;
    
    const scalar_t a = (1-v)*(1-w);
    const scalar_t b = (1-v)*w;
    const scalar_t c = v*(1-w);
    const scalar_t d = 1-a-b-c;
    feat_interp[n][f] = (1-u)*(a*feats[n][0][f] +
                               b*feats[n][1][f] +
                               c*feats[n][2][f] +
                               d*feats[n][3][f]) + 
                            u*(a*feats[n][4][f] +
                               b*feats[n][5][f] +
                               c*feats[n][6][f] +
                               d*feats[n][7][f]);
}


torch::Tensor trilinear_fw_cu(
    const torch::Tensor feats,
    const torch::Tensor points
){
    const int N = feats.size(0), F = feats.size(2);
    
    torch::Tensor feat_interp = torch::empty({N, F}, feats.options());

    const dim3 threads(16, 16);
    const dim3 blocks((N+threads.x-1)/threads.x, (F+threads.y-1)/threads.y);

    AT_DISPATCH_FLOATING_TYPES(feats.type(), "trilinear_fw_cu", 
    ([&] {
        trilinear_fw_kernel<scalar_t><<<blocks, threads>>>(
            feats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
            points.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
            feat_interp.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>()
        );
    }));

    return feat_interp;
}


template <typename scalar_t>
__global__ void trilinear_bw_kernel(
    const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> dL_dfeat_interp,
    const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
    const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
    torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> dL_dfeats
){
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    const int f = blockIdx.y * blockDim.y + threadIdx.y;

    if (n>=feats.size(0) || f>=feats.size(2)) return;

    // point -1~1
    const scalar_t u = (points[n][0]+1)/2;
    const scalar_t v = (points[n][1]+1)/2;
    const scalar_t w = (points[n][2]+1)/2;
    
    const scalar_t a = (1-v)*(1-w);
    const scalar_t b = (1-v)*w;
    const scalar_t c = v*(1-w);
    const scalar_t d = 1-a-b-c;

    dL_dfeats[n][0][f] = (1-u)*a*dL_dfeat_interp[n][f];
    dL_dfeats[n][1][f] = (1-u)*b*dL_dfeat_interp[n][f];
    dL_dfeats[n][2][f] = (1-u)*c*dL_dfeat_interp[n][f];
    dL_dfeats[n][3][f] = (1-u)*d*dL_dfeat_interp[n][f];
    dL_dfeats[n][4][f] = u*a*dL_dfeat_interp[n][f];
    dL_dfeats[n][5][f] = u*b*dL_dfeat_interp[n][f];
    dL_dfeats[n][6][f] = u*c*dL_dfeat_interp[n][f];
    dL_dfeats[n][7][f] = u*d*dL_dfeat_interp[n][f];
}


torch::Tensor trilinear_bw_cu(
    const torch::Tensor dL_dfeat_interp,
    const torch::Tensor feats,
    const torch::Tensor points
){
    const int N = feats.size(0), F = feats.size(2);
    
    torch::Tensor dL_dfeats = torch::empty({N, 8, F}, feats.options());

    const dim3 threads(16, 16);
    const dim3 blocks((N+threads.x-1)/threads.x, (F+threads.y-1)/threads.y);

    AT_DISPATCH_FLOATING_TYPES(feats.type(), "trilinear_bw_cu", 
    ([&] {
        trilinear_bw_kernel<scalar_t><<<blocks, threads>>>(
            dL_dfeat_interp.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
            feats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
            points.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
            dL_dfeats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>()
        );
    }));

    return dL_dfeats;
}

修改interpolation.cpp和utils.h，添加上backward部分

#include "utils.h"


torch::Tensor trilinear_interpolation_fw(
    const torch::Tensor feats,
    const torch::Tensor points
){
    CHECK_INPUT(feats);
    CHECK_INPUT(points);

    return trilinear_fw_cu(feats, points);
}


torch::Tensor trilinear_interpolation_bw(
    const torch::Tensor dL_dfeat_interp,
    const torch::Tensor feats,
    const torch::Tensor points
){
    CHECK_INPUT(dL_dfeat_interp);
    CHECK_INPUT(feats);
    CHECK_INPUT(points);

    return trilinear_bw_cu(dL_dfeat_interp, feats, points);
}


PYBIND11_MODULE(TORCH_EXTENSION_NAME, m){
    m.def("trilinear_interpolation_fw", &trilinear_interpolation_fw);
    m.def("trilinear_interpolation_bw", &trilinear_interpolation_bw);
}

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor trilinear_fw_cu(
    const torch::Tensor feats,
    const torch::Tensor points
);


torch::Tensor trilinear_bw_cu(
    const torch::Tensor dL_dfeat_interp,
    const torch::Tensor feats,
    const torch::Tensor points
);

修改test.py，这里继承自torch.autograd.Function函数

import torch
import cppcuda_tutorial
import time


def trilinear_interpolation_py_nocuda(feats, points):
    """
    Inputs:
        feats: (N, 8, F)
        points: (N, 3) local coordinates in [-1, 1]
    
    Outputs:
        feats_interp: (N, F)
    """
    u = (points[:, 0:1]+1)/2
    v = (points[:, 1:2]+1)/2
    w = (points[:, 2:3]+1)/2
    a = (1-v)*(1-w)
    b = (1-v)*w
    c = v*(1-w)
    d = 1-a-b-c

    feats_interp = (1-u)*(a*feats[:, 0] +
                          b*feats[:, 1] +
                          c*feats[:, 2] +
                          d*feats[:, 3]) + \
                       u*(a*feats[:, 4] +
                          b*feats[:, 5] +
                          c*feats[:, 6] +
                          d*feats[:, 7])
    
    return feats_interp


class Trilinear_interpolation_cuda(torch.autograd.Function):
    @staticmethod
    def forward(ctx, feats, points):
        feat_interp = cppcuda_tutorial.trilinear_interpolation_fw(feats, points)

        ctx.save_for_backward(feats, points)

        return feat_interp

    @staticmethod
    def backward(ctx, dL_dfeat_interp):
        feats, points = ctx.saved_tensors

        dL_dfeats = cppcuda_tutorial.trilinear_interpolation_bw(dL_dfeat_interp.contiguous(), feats, points)

        return dL_dfeats, None


if __name__ == '__main__':
    N = 10000; F = 256
    rand = torch.rand(N, 8, F, device='cuda')
    feats = rand.clone().requires_grad_()
    feats2 = rand.clone().requires_grad_()
    points = torch.rand(N, 3, device='cuda')*2-1

    t = time.time()
    out_cuda = Trilinear_interpolation_cuda.apply(feats2, points)
    torch.cuda.synchronize()
    print('   cuda fw time', time.time()-t, 's')

    t = time.time()
    out_py = trilinear_interpolation_py_nocuda(feats, points)
    torch.cuda.synchronize()
    print('pytorch fw time', time.time()-t, 's')

    print('fw all close', torch.allclose(out_py, out_cuda))

    t = time.time()
    loss2 = out_cuda.sum()
    loss2.backward()
    torch.cuda.synchronize()
    print('   cuda bw time', time.time()-t, 's')

    t = time.time()
    loss = out_py.sum()
    loss.backward()
    torch.cuda.synchronize()
    print('pytorch bw time', time.time()-t, 's')

    print('bw all close', torch.allclose(feats.grad, feats2.grad))

备注：这里直接用输出out作为loss，实际使用中通常为out与groud truth之间的差值。
运行结果：
cuda fw time 0.0038416385650634766 s
pytorch fw time 0.010390996932983398 s
fw all close True
cuda bw time 0.021845340728759766 s
pytorch bw time 0.06973433494567871 s
bw all close True
如果报错torch.cuda.OutOfMemoryError: CUDA out of memory.说明显存不够，调小N或者F的值即可。
完结，先别撒花！
分割线以下为讨论内容，选入。

一般pytorch神经网络计算过程，需要构建多层网络（对应本教程就是计算三线性插值的部分），需要构建损失函数（对应本教程直接用输出），需要迭代优化函数（本教程省略）。
疑问：
1.本教程中计算三线性插值过程是不是平行运算？（每个点计算过程完全一样？）如果是，那最开始说的不是非平行运算场景吗？
对于NeRF中的光线采样过程，选择不同位置和角度，在coarse过程每条光线采样点位置和数量一样（平行运算？），在fine过程每条光线采样点位置和数量不一样（非平行运算？）
2.pytorch也用到了GPU或者说cuda，那它的实现是怎样的？和本教程有什么不同？

空客A380

关注

22
点赞
踩
29

收藏

觉得还不错? 一键收藏
0
评论
Pytorch中实现CUDA加速【来自AI葵 pytorch-cppcuda-tutorial】

【本教程主要参考自kwea123】GitHub - kwea123/pytorch-cppcuda-tutorial: tutorial for writing custom pytorch cpp+cuda kernel, applied on volume rendering (NeRF)目的实现GPU加速计算，场景有非平行运算和大量串行运算两种，本教程主要针对第一种如体素渲染的应用。非平行运算：像体素渲染那样的，由于每次采样点个数不一样，计算过程有差别，但有相似性又需要大量重复计算。（对比像神
复制链接

扫一扫