[Model Acceleration] TensorRT Acceleration Experiments on the PointPillars Model (6)

        The results at the end of [Model Acceleration] TensorRT Acceleration Experiments on the PointPillars Model (5) showed that the current TensorRT scheme for the PFN is flawed: far from delivering the hoped-for speedup over direct PyTorch GPU inference, it is actually slower. Once I print the time spent in the three core steps of the PFN inference code, one fact becomes obvious: copying data between host memory and device memory costs much more than you might expect.

# sync: time the three core steps of PFN inference
t1 = time.time()
for hdm in inputs_hdm:
    cuda.memcpy_htod(hdm.device, hdm.host)   # host -> device copy per input
print("pfn memcpy host->device time: {:.2f} ms.".format((time.time() - t1) * 1000))

t1 = time.time()
self.pfn_context.execute_v2(bindings=bindings_m)  # synchronous TensorRT execution
print("pfn execute_v2 time: {:.2f} ms.".format((time.time() - t1) * 1000))

t1 = time.time()
for hdm in outputs_hdm:
    cuda.memcpy_dtoh(hdm.host, hdm.device)   # device -> host copy per output
print("pfn memcpy device->host time: {:.2f} ms.".format((time.time() - t1) * 1000))

########################
points_path:  /data/sets/kitti_second/training/velodyne/000001.bin
pfn memcpy host->device time: 6.46 ms.
pfn execute_v2 time: 1.50 ms.
pfn memcpy device->host time: 0.25 ms.
########################
points_path:  /data/sets/kitti_second/training/velodyne/000001.bin
pfn memcpy host->device time: 6.47 ms.
pfn execute_v2 time: 1.52 ms.
pfn memcpy device->host time: 0.26 ms.
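
As an aside, one standard way to shave the host-to-device cost is page-locked (pinned) host memory, which the driver can DMA from directly; pycuda exposes it as cuda.pagelocked_empty. A minimal sketch (the buffer shape is a hypothetical placeholder, and this is not the fix this post ends up using):

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401, creates a default CUDA context

# Hypothetical PFN input shape: (max_pillars, points_per_pillar, features).
shape, dtype = (12000, 100, 9), np.float32

host_buf = cuda.pagelocked_empty(shape, dtype)  # pinned host memory
dev_buf = cuda.mem_alloc(host_buf.nbytes)       # plain device allocation

host_buf[...] = 0.0                  # fill with real voxel features in practice
cuda.memcpy_htod(dev_buf, host_buf)  # faster than copying from pageable memory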

At this point in the experiment, you may well feel that on the current setup (x86 + RTX 2080), beating PyTorch (GPU) inference speed with the planned approach is all but hopeless, and that the acceleration experiment has quietly turned into a "remove PyTorch" experiment. First the PFN stage falls behind PyTorch (GPU), and next comes the MFN. Because TensorRT does not yet support the Scatter operation, a substitute is needed; this is in fact why the network was manually split into three segments in the first place. But no rush. Let's keep following the plan: as long as morale doesn't slip, there are always more solutions than difficulties.

MFN Inference

        Setting PyTorch aside, there are at least two straightforward ways to run the MFN: (1) implement it on the CPU with numpy indexing; (2) implement it on the GPU with CUDA.

Method 1:

def mfn_inference(self, voxel_features, coords, nchannels=64, nx=432, ny=496):
    indices = coords[:, 1] * nx + coords[:, 2]  # coord => (z, y, x), flat index = y*nx + x
    voxel_features = voxel_features.squeeze()   # shape [1, 64, N, 1] => [64, N]
    canvas = np.zeros(shape=(nchannels, nx * ny), dtype=voxel_features.dtype)

    canvas[:, indices] = voxel_features         # scatter pillar features onto the canvas
    canvas = canvas.reshape((1, nchannels, ny, nx))
    return canvas
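
To see why this one-line fancy-indexing assignment is a correct scatter, here is a tiny self-contained check (all sizes made up for illustration):

import numpy as np

nx, ny, nchannels = 4, 3, 2
coords = np.array([[0, 1, 2], [0, 2, 3]])            # two pillars, coord = (z, y, x)
feats = np.random.rand(nchannels, len(coords)).astype(np.float32)

indices = coords[:, 1] * nx + coords[:, 2]           # flat index = y * nx + x
canvas = np.zeros((nchannels, nx * ny), dtype=feats.dtype)
canvas[:, indices] = feats                           # one write per (channel, pillar)
canvas = canvas.reshape(1, nchannels, ny, nx)
assert np.allclose(canvas[0, :, 1, 2], feats[:, 0])  # pillar 0 landed at (y=1, x=2)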

Timing results:

mfn infer time: 16.88 ms.
mfn infer time: 18.57 ms.
mfn infer time: 19.00 ms.
mfn infer time: 18.35 ms.
mfn infer time: 18.39 ms.
mfn infer time: 18.44 ms.
mfn infer time: 18.38 ms.
mfn infer time: 18.71 ms.
mfn infer time: 18.41 ms.
mfn infer time: 18.39 ms.

Method 2:

def mfn_inference_by_cuda(self, voxel_features, coords, nchannels=64, nx=432, ny=496):
    t1 = time.time()
    indices = coords[:, 1] * nx + coords[:, 2]      # flat canvas index per pillar
    indices_d = gpuarray.to_gpu(indices)
    canvas_d = gpuarray.zeros(shape=(nchannels, nx * ny), dtype=voxel_features.dtype)
    voxel_features = voxel_features.squeeze()       # [1, 64, N, 1] => [64, N]
    voxel_features_d = gpuarray.to_gpu(voxel_features)

    # g_module holds the compiled scatter kernel (see the sketch below)
    scatter = g_module.get_function("scatter")
    t2 = time.time()
    scatter(indices_d,
            np.int32(nchannels),
            canvas_d,
            np.int32(nx * ny),
            voxel_features_d,
            np.int32(voxel_features.shape[-1]),     # number of pillars
            grid=(64, nx * ny // 512),
            block=(1, 512, 1))
    # NOTE: the launch is asynchronous, so this mostly measures launch overhead
    print("mfn infer scatter time: {:.3f} ms.".format((time.time() - t2) * 1000.))

    canvas_d = canvas_d.reshape(1, nchannels, ny, nx)
    canvas = canvas_d.get()                         # device -> host copy dominates
    print("mfn infer time: {:.3f} ms.".format((time.time() - t1) * 1000.))

    return canvas
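
The kernel source behind g_module.get_function("scatter") is not shown in this series. For completeness, here is a sketch of a kernel consistent with the argument list and launch configuration above, built with pycuda's SourceModule; the actual kernel may differ (int32 indices and float32 features are assumptions):

from pycuda.compiler import SourceModule

# Sketch only. Each thread scatters one (channel, pillar) pair; the grid is
# sized to nx*ny in y, so the nvoxels bound check discards the excess threads.
g_module = SourceModule("""
__global__ void scatter(const int *indices, int nchannels,
                        float *canvas, int canvas_size,
                        const float *features, int nvoxels)
{
    int c = blockIdx.x;                             // channel index, 0..nchannels-1
    int v = blockIdx.y * blockDim.y + threadIdx.y;  // pillar (voxel) index
    if (v >= nvoxels) return;
    canvas[c * canvas_size + indices[v]] = features[c * nvoxels + v];
}
""")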

Timing results:

mfn infer scatter time: 0.070 ms.
mfn infer time: 19.826 ms.
mfn infer scatter time: 0.089 ms.
mfn infer time: 20.053 ms.
mfn infer scatter time: 0.061 ms.
mfn infer time: 20.396 ms.
mfn infer scatter time: 0.062 ms.
mfn infer time: 20.017 ms.
mfn infer scatter time: 0.060 ms.
mfn infer time: 20.498 ms.
mfn infer scatter time: 0.072 ms.
mfn infer time: 19.972 ms.
mfn infer scatter time: 0.061 ms.
mfn infer time: 20.596 ms.
mfn infer scatter time: 0.064 ms.
mfn infer time: 19.953 ms.
mfn infer scatter time: 0.062 ms.
mfn infer time: 20.493 ms.
mfn infer scatter time: 0.062 ms.
mfn infer time: 19.976 ms.

Method 1 runs entirely on the CPU with numpy, so there is no data movement between CPU memory and GPU memory, yet a single frame still averages around 18 ms, which is clearly too slow. Method 2's core Scatter kernel is fast (<1 ms), but once the copy of the canvas from GPU memory back to host memory is included, it has no advantage at all. So the direction for acceleration is obvious: optimize the data flow. During inference, keep the data on the GPU as much as possible, so that each stage's output, after light processing, feeds straight into the next stage as input. Here is a result after this pipeline optimization.

pfn infer time: 1.89 ms.
mfn infer time: 0.84 ms.
rpn infer time: 2.38 ms.
pfn infer time: 1.07 ms.
mfn infer time: 0.75 ms.
rpn infer time: 2.37 ms.
pfn infer time: 1.07 ms.
mfn infer time: 0.74 ms.
rpn infer time: 2.37 ms.
pfn infer time: 1.89 ms.
mfn infer time: 0.82 ms.
rpn infer time: 2.38 ms.
pfn infer time: 1.78 ms.
mfn infer time: 0.79 ms.
rpn infer time: 2.38 ms.
pfn infer time: 2.26 ms.
mfn infer time: 0.96 ms.
rpn infer time: 2.46 ms.
pfn infer time: 1.89 ms.
mfn infer time: 0.84 ms.
rpn infer time: 2.37 ms.
pfn infer time: 1.30 ms.
mfn infer time: 0.77 ms.
rpn infer time: 2.37 ms.
pfn infer time: 1.88 ms.
mfn infer time: 0.80 ms.
rpn infer time: 2.38 ms.

The network portion of the pipeline now takes only about 5 ms in total. On my mighty RTX 2080, of course.
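
What does "keep the data on the GPU" look like concretely? Roughly this: every intermediate buffer lives in device memory, the TensorRT contexts are given raw device pointers as bindings, and the scatter kernel writes straight into the RPN's input canvas. A minimal sketch, reusing names from the snippets above (pfn_context, rpn_context, scatter); the buffer shapes, binding order, and rpn_out_shape are placeholders I'm assuming, not the exact implementation:

# Allocated once and reused across frames (shapes are placeholders).
pfn_in_d = gpuarray.empty((1, 12000, 100, 9), dtype=np.float32)   # PFN input
pfn_out_d = gpuarray.empty((1, 64, 12000, 1), dtype=np.float32)   # PFN output
canvas_d = gpuarray.zeros((1, 64, ny, nx), dtype=np.float32)      # MFN canvas / RPN input
rpn_out_d = gpuarray.empty(rpn_out_shape, dtype=np.float32)       # RPN output

# Per frame: one upload at the front, one download at the back,
# everything in between stays in device memory.
cuda.memcpy_htod(pfn_in_d.gpudata, pfn_in_host)
self.pfn_context.execute_v2(bindings=[int(pfn_in_d.gpudata),
                                      int(pfn_out_d.gpudata)])

indices_d = gpuarray.to_gpu((coords[:, 1] * nx + coords[:, 2]).astype(np.int32))
canvas_d.fill(0)                                  # clear the previous frame's canvas
scatter(indices_d, np.int32(64),                  # MFN: scatter straight into canvas_d
        canvas_d, np.int32(nx * ny),
        pfn_out_d, np.int32(nvoxels),
        grid=(64, nx * ny // 512), block=(1, 512, 1))

self.rpn_context.execute_v2(bindings=[int(canvas_d.gpudata),
                                      int(rpn_out_d.gpudata)])
rpn_out = rpn_out_d.get()                         # the only device -> host copy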
