Implementing CUDA Acceleration in PyTorch [based on AI葵's pytorch-cppcuda-tutorial]

[This tutorial mainly follows kwea123's work:]
GitHub - kwea123/pytorch-cppcuda-tutorial: tutorial for writing custom pytorch cpp+cuda kernel, applied on volume rendering (NeRF)

Goal

The goal is GPU-accelerated computation. Two scenarios call for custom kernels: non-parallel computation and heavy serial computation; this tutorial targets the first, e.g. volume rendering.
Non-parallel computation: as in volume rendering, the number of sample points differs from ray to ray, so the work is not identical per element, yet it is similar and repeated a huge number of times. (By contrast, in a neural network every element goes through exactly the same computation and only the data differs; that is the fully parallel case.)

Prerequisites

A CUDA + cuDNN environment
VS Code with a C++ toolchain configured (still elegant 🐶)
Anaconda for virtual environments

Part 1: Calling C++ from PyTorch

Steps:
Create a virtual environment with conda
pip-install PyTorch (I am used to installing from a local .whl file; it is reliable)
Create interpolation.cpp in VS Code

#include <torch/extension.h>
#include <iostream>

int main() {
    std::cout << "hello" << std::endl;
    return 0;
}

Press Ctrl+Shift+P, run "C/C++: Edit Configurations", and add the missing include paths; this creates c_cpp_properties.json under the .vscode directory.
This step only exists to silence the red squiggles IntelliSense shows when it cannot locate the included .h headers (one way to find the right paths is sketched below). In my case some squiggles remain, but they only affect code hints and do not break the build.
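Rather than guessing the paths, one hedged option is to ask PyTorch itself where its headers live and paste the printed directories into the includePath of c_cpp_properties.json; torch.utils.cpp_extension.include_paths() is a real helper, but the concrete paths are machine-specific:

# Sketch: print the header directories used when building torch extensions.
# Copy the printed paths into "includePath" in .vscode/c_cpp_properties.json.
# The exact paths depend on your virtual environment.
import torch.utils.cpp_extension as cpp_ext

for p in cpp_ext.include_paths():
    print(p)   # e.g. .../site-packages/torch/include, .../torch/include/torch/csrc/api/include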
The part between the dividers below is a detour I took; enter at your own risk.


At this point, building and running still fails because the headers cannot be found.
The proper way to add a third-party library to a standalone C++ program is to write a CMakeLists.txt.
PyTorch's documentation walks through exactly this: Installing C++ Distributions of PyTorch — PyTorch main documentation

example-app/
  CMakeLists.txt
  example-app.cpp

CMakeLists.txt:

cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(example-app)

find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")

if(NOT Torch_FOUND)
  message(FATAL_ERROR "Pytorch Not Found!")
endif(NOT Torch_FOUND)

message(STATUS "Pytorch status: ")
message(STATUS "  libraries: ${TORCH_LIBRARIES}")
message(STATUS "  prefix: ${TORCH_INSTALL_PREFIX}")

add_executable(example-app example-app.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
set_property(TARGET example-app PROPERTY CXX_STANDARD 17)

# The following code block is suggested to be used on Windows.
# According to https://github.com/pytorch/pytorch/issues/25457,
# the DLLs need to be copied to avoid memory errors.
if (MSVC)
  file(GLOB TORCH_DLLS "${TORCH_INSTALL_PREFIX}/lib/*.dll")
  add_custom_command(TARGET example-app
                     POST_BUILD
                     COMMAND ${CMAKE_COMMAND} -E copy_if_different
                     ${TORCH_DLLS}
                     $<TARGET_FILE_DIR:example-app>)
endif (MSVC)

example-app.cpp:

#include <torch/torch.h>
#include <iostream>

int main() {
  torch::Tensor tensor = torch::rand({2, 3});
  std::cout << tensor << std::endl;
}

Build and run:

mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=/absolute/path/to/libtorch ..
cmake --build . --config Release

Note: the absolute path here can be a downloaded libtorch distribution (from https://pytorch.org/) or an already-installed PyTorch package (e.g. D:\pytorch-cpp-cuda\venv\Lib\site-packages\torch).
Output:
PS D:\pytorch-cpp-cuda\build\Release> .\example-app
0.0059 0.7112 0.7168
0.0425 0.4746 0.4270
[ CPUFloatType{2,3} ]
This, however, is already off track!
The steps above build a standalone C++ program for debugging, whereas the goal of this tutorial is to wrap an operator in C++ and package it with setup.py later on. Haste makes waste.


----------------------------------------

Back on track.
interpolation.cpp

#include <torch/extension.h>

torch::Tensor trilinear_interpolation_cpp(
    torch::Tensor feats,
    torch::Tensor point
){
    return feats;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME,m){
    m.def("trilinear_interpolation_py", &trilinear_interpolation_cpp);
}

setup.py

from setuptools import setup
from torch.utils.cpp_extension import CppExtension,BuildExtension

setup(
    name='cppcuda_tutorial',
    version='1.0',
    author='Robin',
    author_email='704805264@qq.com',
    description='cppcuda example',
    long_description='cppcuda example',
    ext_modules=[
        CppExtension(
            name='cppcuda_tutorial',
            sources=['interpolation.cpp']
        )
    ],
    cmdclass={
        'build_ext':BuildExtension
    }
)

First upgrade pip: python -m pip install pip -U
Then build and install: pip install .
Note: Ctrl+` (the backtick key) opens the VS Code integrated terminal.
test.py

import torch
import cppcuda_tutorial

feats=torch.ones(2)
point=torch.zeros(2)

out=cppcuda_tutorial.trilinear_interpolation_py(feats,point)

print(out)

Run python test.py in the terminal; the output is:
PS D:\pytorch-cpp-cuda> python .\test.py
tensor([1., 1.])
Works!
To summarize the call chain: test.py imports the compiled cppcuda_tutorial extension (built by setup.py), whose pybind11 binding in interpolation.cpp forwards trilinear_interpolation_py to the C++ function trilinear_interpolation_cpp.

Part 2: Calling CUDA from PyTorch through C++

Create interpolation_kernel.cu:

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor trilinear_fw_cu(
    torch::Tensor feats,
    torch::Tensor points
){
    return feats;
}

The matching declaration goes into a header: create utils.h under a new include directory:

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor trilinear_fw_cu(
    const torch::Tensor feats,
    const torch::Tensor points
);

Modify interpolation.cpp:

#include <utils.h>

torch::Tensor trilinear_interpolation(
    torch::Tensor feats,
    torch::Tensor points
){
    CHECK_INPUT(feats);
    CHECK_INPUT(points);
    return trilinear_fw_cu(feats,points);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME,m){
    m.def("trilinear_interpolation", &trilinear_interpolation);
}

Note: the include must be <utils.h>, not <include/utils.h>, otherwise pip install fails with a file-not-found error; the include directory itself is added to the compiler's search path via include_dirs in setup.py. Am I destined to step into every single pitfall? 😭
Modify setup.py:

import glob
import os.path as osp
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension


ROOT_DIR = osp.dirname(osp.abspath(__file__))
include_dirs = [osp.join(ROOT_DIR, "include")]

sources = glob.glob('*.cpp')+glob.glob('*.cu')


setup(
    name='cppcuda_tutorial',
    version='1.0',
    author='kwea123',
    author_email='kwea123@gmail.com',
    description='cppcuda_tutorial',
    long_description='cppcuda_tutorial',
    ext_modules=[
        CUDAExtension(
            name='cppcuda_tutorial',
            sources=sources,
            include_dirs=include_dirs,
            extra_compile_args={'cxx': ['-O2'],
                                'nvcc': ['-O2']}
        )
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
)

Install from the terminal with pip install .
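As a side note, if rebuilding with pip install . after every change gets tedious, PyTorch's JIT route torch.utils.cpp_extension.load can compile the same sources on the fly at import time. A minimal sketch under this tutorial's file layout (the file names are assumptions; adjust to your project):

# Sketch: JIT-compile the extension instead of installing it with pip.
# Assumes interpolation.cpp, interpolation_kernel.cu and include/ sit next to this script.
from torch.utils.cpp_extension import load

cppcuda_tutorial = load(
    name='cppcuda_tutorial',
    sources=['interpolation.cpp', 'interpolation_kernel.cu'],
    extra_include_paths=['include'],
    verbose=True,   # print the generated compiler commands
)

The first call compiles and caches the build; later imports reuse the cache unless the sources change.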
Modify test.py:

import torch
import cppcuda_tutorial

if __name__=='__main__':
    feats=torch.ones(2,device='cuda')
    point=torch.zeros(2,device='cuda')

    out=cppcuda_tutorial.trilinear_interpolation(feats,point)

    print(out)

To summarize the call chain: test.py calls the compiled extension, interpolation.cpp checks the inputs (CHECK_INPUT) and forwards to trilinear_fw_cu in interpolation_kernel.cu, which will launch the CUDA kernel on the GPU.

Part 3: How CUDA executes kernels

How the GPU executes a kernel: a launch creates a grid of thread blocks, and every thread runs the same kernel function, working out which element it owns from blockIdx, blockDim and threadIdx. (The original post illustrates this with two figures of the grid/block/thread hierarchy.)
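To make the indexing concrete, here is a small pure-Python sketch of how the 16×16 thread blocks in the launcher below tile an N×F output; each CUDA thread recovers its own (n, f) pair from blockIdx, blockDim and threadIdx:

# Sketch: how 16x16 thread blocks tile an (N, F) output (illustration only).
N, F = 20, 40                                  # toy sizes
threads = (16, 16)
blocks = ((N + threads[0] - 1) // threads[0],  # ceil(N / 16)
          (F + threads[1] - 1) // threads[1])  # ceil(F / 16)
print(blocks)                                  # (2, 3): 6 blocks of 16*16 threads each

# Inside the kernel each thread computes its element as
#   n = blockIdx.x * blockDim.x + threadIdx.x
#   f = blockIdx.y * blockDim.y + threadIdx.y
# and threads with n >= N or f >= F return immediately (the bounds check in Part 4's kernel).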

Modify interpolation_kernel.cu:

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor trilinear_fw_cu(
    torch::Tensor feats,
    torch::Tensor points
){
    const int N=feats.size(0),F=feats.size(2);
    torch::Tensor feat_interp=torch::zeros({N,F},feats.options());

    const dim3 threads(16,16);
    const dim3 blocks((N+threads.x-1)/threads.x,(F+threads.y-1)/threads.y);

    // dispatch on the dtype of feats; trilinear_fw_kernel itself is implemented in Part 4
    AT_DISPATCH_FLOATING_TYPES(feats.type(),"trilinear_fw_cu",
    ([&]{
        trilinear_fw_kernel<scalar_t><<<blocks,threads>>>(
            feats.packed_accessor<scalar_t,3,torch::RestrictPtrTraits,size_t>(),
            points.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
            feat_interp.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>()
        );
    }));
    return feat_interp;
}

Part 4: Implementing the CUDA kernel

Here we actually solve the trilinear interpolation problem.

Modify interpolation_kernel.cu:

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

template <typename scalar_t>
__global__ void trilinear_fw_kernel(
const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> feat_interp
){
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    const int f = blockIdx.y * blockDim.y + threadIdx.y;
    
    // threads that land outside the tensor bounds have no data to process; return early
    if (n>=feats.size(0) || f>=feats.size(2)) return;

    // point -1~1
    const scalar_t u = (points[n][0]+1)/2;
    const scalar_t v = (points[n][1]+1)/2;
    const scalar_t w = (points[n][2]+1)/2;

    const scalar_t a = (1-v)*(1-w);
    const scalar_t b = (1-v)*w;
    const scalar_t c = v*(1-w);
    const scalar_t d = 1-a-b-c;
    feat_interp[n][f] = (1-u)*(a*feats[n][0][f] +
        b*feats[n][1][f] +
        c*feats[n][2][f] +
        d*feats[n][3][f]) + 
        u*(a*feats[n][4][f] +
        b*feats[n][5][f] +
        c*feats[n][6][f] +
        d*feats[n][7][f]);
}

torch::Tensor trilinear_fw_cu(
torch::Tensor feats,
torch::Tensor points
){
    const int N=feats.size(0),F=feats.size(2);
    torch::Tensor feat_interp=torch::zeros({N,F},feats.options());

    const dim3 threads(16,16);
    const dim3 blocks((N+threads.x-1)/threads.x,(F+threads.y-1)/threads.y);

    AT_DISPATCH_FLOATING_TYPES(feats.type(),"trilinear_fw_cu",
    ([&]{
        trilinear_fw_kernel<scalar_t><<<blocks,threads>>>(
            feats.packed_accessor<scalar_t,3,torch::RestrictPtrTraits,size_t>(),
            points.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
            feat_interp.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>()
            );
    }));
    return feat_interp;
}

In test.py, implement the same interpolation directly in PyTorch, check that the result matches the CUDA version, and compare the run times.

import torch
import cppcuda_tutorial
import time

def trilinear_interpolation_py_nocuda(feats, points):
    u = (points[:, 0:1]+1)/2
    v = (points[:, 1:2]+1)/2
    w = (points[:, 2:3]+1)/2
    a = (1-v)*(1-w)
    b = (1-v)*w
    c = v*(1-w)
    d = 1-a-b-c

    feats_interp = (1-u)*(a*feats[:, 0] +
                          b*feats[:, 1] +
                          c*feats[:, 2] +
                          d*feats[:, 3]) + \
                       u*(a*feats[:, 4] +
                          b*feats[:, 5] +
                          c*feats[:, 6] +
                          d*feats[:, 7])
    
    return feats_interp

if __name__=='__main__':
    N=80000
    F=256
    feats=torch.rand(N,8,F,device='cuda')
    points=torch.rand(N,3,device='cuda')*2-1

    t = time.time()
    out_py_cuda = cppcuda_tutorial.trilinear_interpolation(feats, points)
    torch.cuda.synchronize()
    print('   cuda fw time: ', time.time()-t, 's')

    t = time.time()
    out_py_nocuda = trilinear_interpolation_py_nocuda(feats, points)
    torch.cuda.synchronize()
    print('pytorch fw time: ', time.time()-t, 's')
    
    print('fw all close: ', torch.allclose(out_py_cuda, out_py_nocuda))

Note: CUDA kernel launches are asynchronous, so torch.cuda.synchronize() is called to wait until all queued GPU work has finished before reading the timer; otherwise the measurement would only cover the launch overhead.
Output:
cuda fw time: 0.021244049072265625 s
pytorch fw time: 0.05333447456359863 s
fw all close: True
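One caveat: a single timed call like this also includes launch and warm-up overhead. A slightly more careful pattern (still only a sketch) synchronizes before starting the clock and averages over several runs:

# Sketch: warm up once, synchronize around the timed region, average over repeats.
import time
import torch

def bench(fn, *args, repeats=20):
    fn(*args)                       # warm-up call (the first launch can be slower)
    torch.cuda.synchronize()
    t = time.time()
    for _ in range(repeats):
        fn(*args)
    torch.cuda.synchronize()
    return (time.time() - t) / repeats

# e.g. bench(cppcuda_tutorial.trilinear_interpolation, feats, points)
#      bench(trilinear_interpolation_py_nocuda, feats, points)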

Part 5: Backpropagation through the CUDA kernel

The tensor returned by the CUDA kernel carries no gradient (print out_py_cuda.requires_grad to confirm), so a loss built on it cannot be backpropagated; the backward pass has to be derived by hand. Taking the 2D (bilinear) case for clarity, the interpolated feature $f$ depends on the four corner features $f_1,\dots,f_4$ via

$$\frac{\partial f}{\partial f_{1}}=(1-u)(1-v) \qquad \frac{\partial f}{\partial f_{3}}=(1-u)\,v$$
$$\frac{\partial f}{\partial f_{2}}=u\,(1-v) \qquad\;\; \frac{\partial f}{\partial f_{4}}=u\,v$$

and by the chain rule

$$\frac{\partial L}{\partial f_{1}}=\frac{\partial L}{\partial f}\cdot\frac{\partial f}{\partial f_{1}}=\frac{\partial L}{\partial f}\cdot(1-u)(1-v),\ \ldots$$
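These hand-derived weights are easy to sanity-check against autograd on a tiny bilinear example (a sketch; f1..f4 are scalar corner features and u, v are fixed interpolation weights):

# Sketch: verify the bilinear gradients d f / d f_i against autograd.
import torch

u, v = 0.3, 0.7
f = torch.tensor([0.1, 0.2, 0.3, 0.4], requires_grad=True)       # f1, f2, f3, f4
out = (1-u)*(1-v)*f[0] + u*(1-v)*f[1] + (1-u)*v*f[2] + u*v*f[3]  # bilinear interpolation
out.backward()

expected = torch.tensor([(1-u)*(1-v), u*(1-v), (1-u)*v, u*v])
print(torch.allclose(f.grad, expected))                           # True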

Modify interpolation_kernel.cu:

#include <torch/extension.h>


template <typename scalar_t>
__global__ void trilinear_fw_kernel(
    const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
    const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
    torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> feat_interp
){
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    const int f = blockIdx.y * blockDim.y + threadIdx.y;

    if (n>=feats.size(0) || f>=feats.size(2)) return;

    // point -1~1
    const scalar_t u = (points[n][0]+1)/2;
    const scalar_t v = (points[n][1]+1)/2;
    const scalar_t w = (points[n][2]+1)/2;
    
    const scalar_t a = (1-v)*(1-w);
    const scalar_t b = (1-v)*w;
    const scalar_t c = v*(1-w);
    const scalar_t d = 1-a-b-c;
    feat_interp[n][f] = (1-u)*(a*feats[n][0][f] +
                               b*feats[n][1][f] +
                               c*feats[n][2][f] +
                               d*feats[n][3][f]) + 
                            u*(a*feats[n][4][f] +
                               b*feats[n][5][f] +
                               c*feats[n][6][f] +
                               d*feats[n][7][f]);
}


torch::Tensor trilinear_fw_cu(
    const torch::Tensor feats,
    const torch::Tensor points
){
    const int N = feats.size(0), F = feats.size(2);
    
    torch::Tensor feat_interp = torch::empty({N, F}, feats.options());

    const dim3 threads(16, 16);
    const dim3 blocks((N+threads.x-1)/threads.x, (F+threads.y-1)/threads.y);

    AT_DISPATCH_FLOATING_TYPES(feats.type(), "trilinear_fw_cu", 
    ([&] {
        trilinear_fw_kernel<scalar_t><<<blocks, threads>>>(
            feats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
            points.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
            feat_interp.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>()
        );
    }));

    return feat_interp;
}


template <typename scalar_t>
__global__ void trilinear_bw_kernel(
    const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> dL_dfeat_interp,
    const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
    const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
    torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> dL_dfeats
){
    const int n = blockIdx.x * blockDim.x + threadIdx.x;
    const int f = blockIdx.y * blockDim.y + threadIdx.y;

    if (n>=feats.size(0) || f>=feats.size(2)) return;

    // point -1~1
    const scalar_t u = (points[n][0]+1)/2;
    const scalar_t v = (points[n][1]+1)/2;
    const scalar_t w = (points[n][2]+1)/2;
    
    const scalar_t a = (1-v)*(1-w);
    const scalar_t b = (1-v)*w;
    const scalar_t c = v*(1-w);
    const scalar_t d = 1-a-b-c;

    dL_dfeats[n][0][f] = (1-u)*a*dL_dfeat_interp[n][f];
    dL_dfeats[n][1][f] = (1-u)*b*dL_dfeat_interp[n][f];
    dL_dfeats[n][2][f] = (1-u)*c*dL_dfeat_interp[n][f];
    dL_dfeats[n][3][f] = (1-u)*d*dL_dfeat_interp[n][f];
    dL_dfeats[n][4][f] = u*a*dL_dfeat_interp[n][f];
    dL_dfeats[n][5][f] = u*b*dL_dfeat_interp[n][f];
    dL_dfeats[n][6][f] = u*c*dL_dfeat_interp[n][f];
    dL_dfeats[n][7][f] = u*d*dL_dfeat_interp[n][f];
}


torch::Tensor trilinear_bw_cu(
    const torch::Tensor dL_dfeat_interp,
    const torch::Tensor feats,
    const torch::Tensor points
){
    const int N = feats.size(0), F = feats.size(2);
    
    torch::Tensor dL_dfeats = torch::empty({N, 8, F}, feats.options());

    const dim3 threads(16, 16);
    const dim3 blocks((N+threads.x-1)/threads.x, (F+threads.y-1)/threads.y);

    AT_DISPATCH_FLOATING_TYPES(feats.type(), "trilinear_bw_cu", 
    ([&] {
        trilinear_bw_kernel<scalar_t><<<blocks, threads>>>(
            dL_dfeat_interp.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
            feats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
            points.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
            dL_dfeats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>()
        );
    }));

    return dL_dfeats;
}

Modify interpolation.cpp and utils.h to add the backward interface.

interpolation.cpp:

#include "utils.h"


torch::Tensor trilinear_interpolation_fw(
    const torch::Tensor feats,
    const torch::Tensor points
){
    CHECK_INPUT(feats);
    CHECK_INPUT(points);

    return trilinear_fw_cu(feats, points);
}


torch::Tensor trilinear_interpolation_bw(
    const torch::Tensor dL_dfeat_interp,
    const torch::Tensor feats,
    const torch::Tensor points
){
    CHECK_INPUT(dL_dfeat_interp);
    CHECK_INPUT(feats);
    CHECK_INPUT(points);

    return trilinear_bw_cu(dL_dfeat_interp, feats, points);
}


PYBIND11_MODULE(TORCH_EXTENSION_NAME, m){
    m.def("trilinear_interpolation_fw", &trilinear_interpolation_fw);
    m.def("trilinear_interpolation_bw", &trilinear_interpolation_bw);
}

utils.h:

#include <torch/extension.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

torch::Tensor trilinear_fw_cu(
    const torch::Tensor feats,
    const torch::Tensor points
);


torch::Tensor trilinear_bw_cu(
    const torch::Tensor dL_dfeat_interp,
    const torch::Tensor feats,
    const torch::Tensor points
);

Modify test.py; the forward and backward are wrapped in a subclass of torch.autograd.Function:

import torch
import cppcuda_tutorial
import time


def trilinear_interpolation_py_nocuda(feats, points):
    """
    Inputs:
        feats: (N, 8, F)
        points: (N, 3) local coordinates in [-1, 1]
    
    Outputs:
        feats_interp: (N, F)
    """
    u = (points[:, 0:1]+1)/2
    v = (points[:, 1:2]+1)/2
    w = (points[:, 2:3]+1)/2
    a = (1-v)*(1-w)
    b = (1-v)*w
    c = v*(1-w)
    d = 1-a-b-c

    feats_interp = (1-u)*(a*feats[:, 0] +
                          b*feats[:, 1] +
                          c*feats[:, 2] +
                          d*feats[:, 3]) + \
                       u*(a*feats[:, 4] +
                          b*feats[:, 5] +
                          c*feats[:, 6] +
                          d*feats[:, 7])
    
    return feats_interp


class Trilinear_interpolation_cuda(torch.autograd.Function):
    @staticmethod
    def forward(ctx, feats, points):
        feat_interp = cppcuda_tutorial.trilinear_interpolation_fw(feats, points)

        ctx.save_for_backward(feats, points)

        return feat_interp

    @staticmethod
    def backward(ctx, dL_dfeat_interp):
        feats, points = ctx.saved_tensors

        dL_dfeats = cppcuda_tutorial.trilinear_interpolation_bw(dL_dfeat_interp.contiguous(), feats, points)

        return dL_dfeats, None


if __name__ == '__main__':
    N = 10000; F = 256
    rand = torch.rand(N, 8, F, device='cuda')
    feats = rand.clone().requires_grad_()
    feats2 = rand.clone().requires_grad_()
    points = torch.rand(N, 3, device='cuda')*2-1

    t = time.time()
    out_cuda = Trilinear_interpolation_cuda.apply(feats2, points)
    torch.cuda.synchronize()
    print('   cuda fw time', time.time()-t, 's')

    t = time.time()
    out_py = trilinear_interpolation_py_nocuda(feats, points)
    torch.cuda.synchronize()
    print('pytorch fw time', time.time()-t, 's')

    print('fw all close', torch.allclose(out_py, out_cuda))

    t = time.time()
    loss2 = out_cuda.sum()
    loss2.backward()
    torch.cuda.synchronize()
    print('   cuda bw time', time.time()-t, 's')

    t = time.time()
    loss = out_py.sum()
    loss.backward()
    torch.cuda.synchronize()
    print('pytorch bw time', time.time()-t, 's')

    print('bw all close', torch.allclose(feats.grad, feats2.grad))

Note: here the sum of the output is used directly as the loss; in real use the loss is usually some discrepancy between the output and the ground truth.
Output:
cuda fw time 0.0038416385650634766 s
pytorch fw time 0.010390996932983398 s
fw all close True
cuda bw time 0.021845340728759766 s
pytorch bw time 0.06973433494567871 s
bw all close True
If you get torch.cuda.OutOfMemoryError: CUDA out of memory, the GPU does not have enough memory; simply reduce N or F.
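Besides comparing against the pure-PyTorch implementation, torch.autograd.gradcheck can verify the custom backward numerically. It needs double precision (covered by AT_DISPATCH_FLOATING_TYPES) and very small tensors, so treat this as a sketch appended to test.py above:

# Sketch: numerical gradient check of the custom op (double precision, tiny sizes).
feats_small = torch.rand(4, 8, 3, dtype=torch.float64, device='cuda', requires_grad=True)
points_small = torch.rand(4, 3, dtype=torch.float64, device='cuda')*2 - 1

ok = torch.autograd.gradcheck(
    Trilinear_interpolation_cuda.apply,
    (feats_small, points_small),
    eps=1e-6, atol=1e-4,
)
print('gradcheck passed:', ok)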
That's the end of the walkthrough, but hold the confetti!
Below the divider is some open discussion; read on if you are interested.


----------------------------------------

A typical PyTorch training setup needs a multi-layer network (here: the trilinear interpolation op), a loss function (here: simply the summed output), and an optimizer loop (omitted in this tutorial).
Open questions:
1. Is the trilinear interpolation computed here actually parallel computation (every point runs through the exact same steps)? If so, doesn't that contradict the non-parallel scenario described at the beginning?
For ray sampling in NeRF, rays start from different positions and directions; in the coarse stage every ray has the same number and placement of samples (parallel?), while in the fine stage the sample count and positions differ per ray (non-parallel?).
2. PyTorch itself also runs on the GPU/CUDA. How is its implementation organized, and how does it differ from what this tutorial does?
