【本教程主要参考自kwea123】
GitHub - kwea123/pytorch-cppcuda-tutorial: tutorial for writing custom pytorch cpp+cuda kernel, applied on volume rendering (NeRF)
目的
实现GPU加速计算,场景有非平行运算和大量串行运算两种,本教程主要针对第一种如体素渲染的应用。
非平行运算:像体素渲染那样的,由于每次采样点个数不一样,计算过程有差别,但有相似性又需要大量重复计算。(对比像神经网络那样的,每一次计算过程完全一样,只是数据不一样的叫平行运算。)
准备
CUDA+cudnn环境
VS Code配置C++环境,依然优雅🐶;
Anaconda提供虚拟环境
Part1 Pytorch调用cpp
步骤:
conda创建虚拟环境
pip安装pytorch(已经习惯了本地.whl安装方式,稳定!)
vs code里新建interpolation.cpp
#include <torch/extension.h>
int main() {
cout<<"hello"<<endl;
}
Ctrl+Shift+P,C/C++编辑配置,添加includePath,会在.vscode目录下新建c_cpp_properties.json
其实,这一步只是为了消除代码中include找不到.h头文件产生红色下划线的问题,虽然我这里还是会有,但其实只是代码提示作用,并不影响。
以下分割线部分,误入歧途,慎入。
此时,调试运行程序还是会报错找不到.h库文件。
其实,正常的添加第三方库是需要写CMakeLists.txt!
其实,pytroch官方也给出了调试运行过程:Installing C++ Distributions of PyTorch — PyTorch main documentation
example-app/
CMakeLists.txt
example-app.cpp
cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(example-app)
find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${TORCH_CXX_FLAGS}")
if(NOT Torch_FOUND)
message(FATAL_ERROR "Pytorch Not Found!")
endif(NOT Torch_FOUND)
message(STATUS "Pytorch status: ")
message(STATUS " libraries: ${TORCH_LIBRARIES}")
message(STATUS " prefix: ${TORCH_INSTALL_PREFIX}")
add_executable(example-app example-app.cpp)
target_link_libraries(example-app "${TORCH_LIBRARIES}")
set_property(TARGET example-app PROPERTY CXX_STANDARD 17)
# The following code block is suggested to be used on Windows.
# According to https://github.com/pytorch/pytorch/issues/25457,
# the DLLs need to be copied to avoid memory errors.
if (MSVC)
file(GLOB TORCH_DLLS "${TORCH_INSTALL_PREFIX}/lib/*.dll")
add_custom_command(TARGET example-app
POST_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_if_different
${TORCH_DLLS}
$<TARGET_FILE_DIR:example-app>)
endif (MSVC)
#include <torch/torch.h>
#include <iostream>
int main() {
torch::Tensor tensor = torch::rand({2, 3});
std::cout << tensor << std::endl;
}
mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=/absolute/path/to/libtorch ..
cmake --build . --config Release
备注:这里的绝对路径可以是下载的libtorch(https://pytorch.org/,见下图),也可以是之前下载的pytorch(例如D:\pytorch-cpp-cuda\venv\Lib\site-packages\torch)
运行结果:
PS D:\pytorch-cpp-cuda\build\Release> .\example-app
0.0059 0.7112 0.7168
0.0425 0.4746 0.4270
[ CPUFloatType{2,3} ]
其实,已经跑偏了!!!!
以上做法是进行C++调试项目,而本教程的目的是**C++包装算子,**也就是通过后面的setup.py打包,心急吃不了热豆腐!
言归正传,
interpolation.cpp
#include <torch/extension.h>
torch::Tensor trilinear_interpolation_cpp(
torch::Tensor feats,
torch::Tensor point
){
return feats;
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME,m){
m.def("trilinear_interpolation_py", &trilinear_interpolation_cpp);
}
setup.py
from setuptools import setup
from torch.utils.cpp_extension import CppExtension,BuildExtension
setup(
name='cppcuda_tutorial',
version='1.0',
author='Robin',
author_email='704805264@qq.com',
description='cppcuda example',
long_description='cppcuda example',
ext_modules=[
CppExtension(
name='cppcuda_tutorial',
sources=['interpolation.cpp']
)
],
cmdclass={
'build_ext':BuildExtension
}
)
先升级pip:python -m pip install pip -U
再编译安装:pip install .
备注:打开终端快捷键Ctrl+~
test.py
import torch
import cppcuda_tutorial
feats=torch.ones(2)
point=torch.zeros(2)
out=cppcuda_tutorial.trilinear_interpolation_py(feats,point)
print(out)
终端运行python test.py,输出结果
PS D:\pytorch-cpp-cuda> python .\test.py
tensor([1., 1.])
没问题!
总结逻辑关系图如下:
Part2 Pytorch经cpp调用CUDA
新建interpolation_kernal.cu
#include <torch/extension.h>
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
torch::Tensor trilinear_fw_cu(
torch::Tensor feats,
torch::Tensor points
){
return feats;
}
对应的头文件:新建include目录下utils.h
#include <torch/extension.h>
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
torch::Tensor trilinear_fw_cu(
const torch::Tensor feats,
const torch::Tensor points
);
修改interpolation.cpp
#include <utils.h>
torch::Tensor trilinear_interpolation(
torch::Tensor feats,
torch::Tensor points
){
CHECK_INPUT(feats);
CHECK_INPUT(points);
return trilinear_fw_cu(feats,points);
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME,m){
m.def("trilinear_interpolation", &trilinear_interpolation);
}
注意:#include的是<utils.h>,而不是<include/utils.h>,否则pip安装时报找不到该文件的错!我是要把所有的坑踩完吗?😭
修改setup.py
import glob
import os.path as osp
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension
ROOT_DIR = osp.dirname(osp.abspath(__file__))
include_dirs = [osp.join(ROOT_DIR, "include")]
sources = glob.glob('*.cpp')+glob.glob('*.cu')
setup(
name='cppcuda_tutorial',
version='1.0',
author='kwea123',
author_email='kwea123@gmail.com',
description='cppcuda_tutorial',
long_description='cppcuda_tutorial',
ext_modules=[
CUDAExtension(
name='cppcuda_tutorial',
sources=sources,
include_dirs=include_dirs,
extra_compile_args={'cxx': ['-O2'],
'nvcc': ['-O2']}
)
],
cmdclass={
'build_ext': BuildExtension
}
)
终端pip install .安装
修改test.py
import torch
import cppcuda_tutorial
import os
if __name__=='__main__':
feats=torch.ones(2,device='cuda')
point=torch.zeros(2,device='cuda')
out=cppcuda_tutorial.trilinear_interpolation(feats,point)
print(out)
总结逻辑关系图如下:
Part3 CUDA原理
GPU原理:
修改interpolation_kernel.cu
#include <torch/extension.h>
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
torch::Tensor trilinear_fw_cu(
torch::Tensor feats,
torch::Tensor points
){
const int N=feats.size(0),F=feats.size(2);
torch::Tensor feat_interp=torch::zeros({N,F},feats.options());
const dim3 threads(16,16);
const dim3 blocks((N+threads.x-1)/threads.x,(F+threads.y-1)/threads.y);
AT_DISPATCH_FLOATING_TYPES(feats.type(),"trilinear_fw_cu",
([&]{
trilinear_fw_kernel<scalar_t><<<blocks,threads>>>(
feats.packed_accessor<scalar_t,3,torch::RestrictPtrTraits,size_t>(),
points.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
feat_interp.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>()
);
}));
return feat_interp;
}
Part4 CUDA实现
解决三线性插值问题
修改interpolation_kernel.cu
#include <torch/extension.h>
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
template <typename scalar_t>
__global__ void trilinear_fw_kernel(
const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> feat_interp
){
const int n = blockIdx.x * blockDim.x + threadIdx.x;
const int f = blockIdx.y * blockDim.y + threadIdx.y;
//覆盖的某些部分并没有数据,不需要计算
if (n>=feats.size(0) || f>=feats.size(2)) return;
// point -1~1
const scalar_t u = (points[n][0]+1)/2;
const scalar_t v = (points[n][1]+1)/2;
const scalar_t w = (points[n][2]+1)/2;
const scalar_t a = (1-v)*(1-w);
const scalar_t b = (1-v)*w;
const scalar_t c = v*(1-w);
const scalar_t d = 1-a-b-c;
feat_interp[n][f] = (1-u)*(a*feats[n][0][f] +
b*feats[n][1][f] +
c*feats[n][2][f] +
d*feats[n][3][f]) +
u*(a*feats[n][4][f] +
b*feats[n][5][f] +
c*feats[n][6][f] +
d*feats[n][7][f]);
}
torch::Tensor trilinear_fw_cu(
torch::Tensor feats,
torch::Tensor points
){
const int N=feats.size(0),F=feats.size(2);
torch::Tensor feat_interp=torch::zeros({N,F},feats.options());
const dim3 threads(16,16);
const dim3 blocks((N+threads.x-1)/threads.x,(F+threads.y-1)/threads.y);
AT_DISPATCH_FLOATING_TYPES(feats.type(),"trilinear_fw_cu",
([&]{
trilinear_fw_kernel<scalar_t><<<blocks,threads>>>(
feats.packed_accessor<scalar_t,3,torch::RestrictPtrTraits,size_t>(),
points.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>(),
feat_interp.packed_accessor<scalar_t,2,torch::RestrictPtrTraits,size_t>()
);
}));
return feat_interp;
}
test.py中用python直接实现,与cuda结果比较是否一致,并比较计算时间。
import torch
import cppcuda_tutorial
import time
def trilinear_interpolation_py_nocuda(feats, points):
u = (points[:, 0:1]+1)/2
v = (points[:, 1:2]+1)/2
w = (points[:, 2:3]+1)/2
a = (1-v)*(1-w)
b = (1-v)*w
c = v*(1-w)
d = 1-a-b-c
feats_interp = (1-u)*(a*feats[:, 0] +
b*feats[:, 1] +
c*feats[:, 2] +
d*feats[:, 3]) + \
u*(a*feats[:, 4] +
b*feats[:, 5] +
c*feats[:, 6] +
d*feats[:, 7])
return feats_interp
if __name__=='__main__':
N=80000
F=256
feats=torch.rand(N,8,F,device='cuda')
points=torch.rand(N,3,device='cuda')*2-1
t = time.time()
out_py_cuda = cppcuda_tutorial.trilinear_interpolation_py(feats, points)
torch.cuda.synchronize()
print(' cuda fw time: ', time.time()-t, 's')
t = time.time()
out_py_nocuda = trilinear_interpolation_py_nocuda(feats, points)
torch.cuda.synchronize()
print('pytorch fw time: ', time.time()-t, 's')
print('fw all close: ', torch.allclose(out_py_cuda, out_py_nocuda))
备注:torch.cuda.synchronize()-CSDN博客
输出结果:
cuda fw time: 0.021244049072265625 s
pytorch fw time: 0.05333447456359863 s
fw all close: True
Part5 CUDA反向传播
cuda计算的结果不具有梯度(可以打印out_py_cuda.requires_grad查看),不能进行loss的反向传播。
∂
f
∂
f
1
=
(
1
−
u
)
(
1
−
v
)
∂
f
∂
f
3
=
(
1
−
u
)
v
∂
f
∂
f
2
=
u
(
1
−
v
)
∂
f
∂
f
4
=
u
v
\frac{\partial f}{\partial f_{1}}=\left(1-u\right)\left(1-v\right)~~~~~~~~~ \frac{\partial f}{\partial f_{3}}=\left(1-u\right)v\\\frac{\partial f}{\partial f_{2}}=u\left(1-v\right)~~~~~~~~~~~~~~~~~~~~\frac{\partial f}{\partial f_{4}}=uv
∂f1∂f=(1−u)(1−v) ∂f3∂f=(1−u)v∂f2∂f=u(1−v) ∂f4∂f=uv
∂ L ∂ f 1 = ∂ L ∂ f ⋅ ∂ f ∂ f 1 = ∂ L ∂ f ⋅ ( 1 − u ) ( 1 − v ) \frac{\partial L}{\partial f_{1}}=\frac{\partial L}{\partial f}\cdot\frac{\partial f}{\partial f_{1}}=\frac{\partial L}{\partial f}\cdot(1-u)(1-v) ∂f1∂L=∂f∂L⋅∂f1∂f=∂f∂L⋅(1−u)(1−v),……
修改interpolation_kernel.cu:
#include <torch/extension.h>
template <typename scalar_t>
__global__ void trilinear_fw_kernel(
const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> feat_interp
){
const int n = blockIdx.x * blockDim.x + threadIdx.x;
const int f = blockIdx.y * blockDim.y + threadIdx.y;
if (n>=feats.size(0) || f>=feats.size(2)) return;
// point -1~1
const scalar_t u = (points[n][0]+1)/2;
const scalar_t v = (points[n][1]+1)/2;
const scalar_t w = (points[n][2]+1)/2;
const scalar_t a = (1-v)*(1-w);
const scalar_t b = (1-v)*w;
const scalar_t c = v*(1-w);
const scalar_t d = 1-a-b-c;
feat_interp[n][f] = (1-u)*(a*feats[n][0][f] +
b*feats[n][1][f] +
c*feats[n][2][f] +
d*feats[n][3][f]) +
u*(a*feats[n][4][f] +
b*feats[n][5][f] +
c*feats[n][6][f] +
d*feats[n][7][f]);
}
torch::Tensor trilinear_fw_cu(
const torch::Tensor feats,
const torch::Tensor points
){
const int N = feats.size(0), F = feats.size(2);
torch::Tensor feat_interp = torch::empty({N, F}, feats.options());
const dim3 threads(16, 16);
const dim3 blocks((N+threads.x-1)/threads.x, (F+threads.y-1)/threads.y);
AT_DISPATCH_FLOATING_TYPES(feats.type(), "trilinear_fw_cu",
([&] {
trilinear_fw_kernel<scalar_t><<<blocks, threads>>>(
feats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
points.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
feat_interp.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>()
);
}));
return feat_interp;
}
template <typename scalar_t>
__global__ void trilinear_bw_kernel(
const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> dL_dfeat_interp,
const torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> feats,
const torch::PackedTensorAccessor<scalar_t, 2, torch::RestrictPtrTraits, size_t> points,
torch::PackedTensorAccessor<scalar_t, 3, torch::RestrictPtrTraits, size_t> dL_dfeats
){
const int n = blockIdx.x * blockDim.x + threadIdx.x;
const int f = blockIdx.y * blockDim.y + threadIdx.y;
if (n>=feats.size(0) || f>=feats.size(2)) return;
// point -1~1
const scalar_t u = (points[n][0]+1)/2;
const scalar_t v = (points[n][1]+1)/2;
const scalar_t w = (points[n][2]+1)/2;
const scalar_t a = (1-v)*(1-w);
const scalar_t b = (1-v)*w;
const scalar_t c = v*(1-w);
const scalar_t d = 1-a-b-c;
dL_dfeats[n][0][f] = (1-u)*a*dL_dfeat_interp[n][f];
dL_dfeats[n][1][f] = (1-u)*b*dL_dfeat_interp[n][f];
dL_dfeats[n][2][f] = (1-u)*c*dL_dfeat_interp[n][f];
dL_dfeats[n][3][f] = (1-u)*d*dL_dfeat_interp[n][f];
dL_dfeats[n][4][f] = u*a*dL_dfeat_interp[n][f];
dL_dfeats[n][5][f] = u*b*dL_dfeat_interp[n][f];
dL_dfeats[n][6][f] = u*c*dL_dfeat_interp[n][f];
dL_dfeats[n][7][f] = u*d*dL_dfeat_interp[n][f];
}
torch::Tensor trilinear_bw_cu(
const torch::Tensor dL_dfeat_interp,
const torch::Tensor feats,
const torch::Tensor points
){
const int N = feats.size(0), F = feats.size(2);
torch::Tensor dL_dfeats = torch::empty({N, 8, F}, feats.options());
const dim3 threads(16, 16);
const dim3 blocks((N+threads.x-1)/threads.x, (F+threads.y-1)/threads.y);
AT_DISPATCH_FLOATING_TYPES(feats.type(), "trilinear_bw_cu",
([&] {
trilinear_bw_kernel<scalar_t><<<blocks, threads>>>(
dL_dfeat_interp.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
feats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>(),
points.packed_accessor<scalar_t, 2, torch::RestrictPtrTraits, size_t>(),
dL_dfeats.packed_accessor<scalar_t, 3, torch::RestrictPtrTraits, size_t>()
);
}));
return dL_dfeats;
}
修改interpolation.cpp和utils.h,添加上backward部分
#include "utils.h"
torch::Tensor trilinear_interpolation_fw(
const torch::Tensor feats,
const torch::Tensor points
){
CHECK_INPUT(feats);
CHECK_INPUT(points);
return trilinear_fw_cu(feats, points);
}
torch::Tensor trilinear_interpolation_bw(
const torch::Tensor dL_dfeat_interp,
const torch::Tensor feats,
const torch::Tensor points
){
CHECK_INPUT(dL_dfeat_interp);
CHECK_INPUT(feats);
CHECK_INPUT(points);
return trilinear_bw_cu(dL_dfeat_interp, feats, points);
}
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m){
m.def("trilinear_interpolation_fw", &trilinear_interpolation_fw);
m.def("trilinear_interpolation_bw", &trilinear_interpolation_bw);
}
#include <torch/extension.h>
#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)
torch::Tensor trilinear_fw_cu(
const torch::Tensor feats,
const torch::Tensor points
);
torch::Tensor trilinear_bw_cu(
const torch::Tensor dL_dfeat_interp,
const torch::Tensor feats,
const torch::Tensor points
);
修改test.py,这里继承自torch.autograd.Function函数
import torch
import cppcuda_tutorial
import time
def trilinear_interpolation_py_nocuda(feats, points):
"""
Inputs:
feats: (N, 8, F)
points: (N, 3) local coordinates in [-1, 1]
Outputs:
feats_interp: (N, F)
"""
u = (points[:, 0:1]+1)/2
v = (points[:, 1:2]+1)/2
w = (points[:, 2:3]+1)/2
a = (1-v)*(1-w)
b = (1-v)*w
c = v*(1-w)
d = 1-a-b-c
feats_interp = (1-u)*(a*feats[:, 0] +
b*feats[:, 1] +
c*feats[:, 2] +
d*feats[:, 3]) + \
u*(a*feats[:, 4] +
b*feats[:, 5] +
c*feats[:, 6] +
d*feats[:, 7])
return feats_interp
class Trilinear_interpolation_cuda(torch.autograd.Function):
@staticmethod
def forward(ctx, feats, points):
feat_interp = cppcuda_tutorial.trilinear_interpolation_fw(feats, points)
ctx.save_for_backward(feats, points)
return feat_interp
@staticmethod
def backward(ctx, dL_dfeat_interp):
feats, points = ctx.saved_tensors
dL_dfeats = cppcuda_tutorial.trilinear_interpolation_bw(dL_dfeat_interp.contiguous(), feats, points)
return dL_dfeats, None
if __name__ == '__main__':
N = 10000; F = 256
rand = torch.rand(N, 8, F, device='cuda')
feats = rand.clone().requires_grad_()
feats2 = rand.clone().requires_grad_()
points = torch.rand(N, 3, device='cuda')*2-1
t = time.time()
out_cuda = Trilinear_interpolation_cuda.apply(feats2, points)
torch.cuda.synchronize()
print(' cuda fw time', time.time()-t, 's')
t = time.time()
out_py = trilinear_interpolation_py_nocuda(feats, points)
torch.cuda.synchronize()
print('pytorch fw time', time.time()-t, 's')
print('fw all close', torch.allclose(out_py, out_cuda))
t = time.time()
loss2 = out_cuda.sum()
loss2.backward()
torch.cuda.synchronize()
print(' cuda bw time', time.time()-t, 's')
t = time.time()
loss = out_py.sum()
loss.backward()
torch.cuda.synchronize()
print('pytorch bw time', time.time()-t, 's')
print('bw all close', torch.allclose(feats.grad, feats2.grad))
备注:这里直接用输出out作为loss,实际使用中通常为out与groud truth之间的差值。
运行结果:
cuda fw time 0.0038416385650634766 s
pytorch fw time 0.010390996932983398 s
fw all close True
cuda bw time 0.021845340728759766 s
pytorch bw time 0.06973433494567871 s
bw all close True
如果报错torch.cuda.OutOfMemoryError: CUDA out of memory.说明显存不够,调小N或者F的值即可。
完结,先别撒花!
分割线以下为讨论内容,选入。
一般pytorch神经网络计算过程,需要构建多层网络(对应本教程就是计算三线性插值的部分),需要构建损失函数(对应本教程直接用输出),需要迭代优化函数(本教程省略)。
疑问:
1.本教程中计算三线性插值过程是不是平行运算?(每个点计算过程完全一样?)如果是,那最开始说的不是非平行运算场景吗?
对于NeRF中的光线采样过程,选择不同位置和角度,在coarse过程每条光线采样点位置和数量一样(平行运算?),在fine过程每条光线采样点位置和数量不一样(非平行运算?)
2.pytorch也用到了GPU或者说cuda,那它的实现是怎样的?和本教程有什么不同?