[Model Acceleration] TensorRT Acceleration Experiments on the PointPillars Model (6)

        The results at the end of [Model Acceleration] TensorRT Acceleration Experiments on the PointPillars Model (5) showed that the current TensorRT scheme for the PFN is flawed: far from delivering the hoped-for speedup over direct PyTorch GPU inference, it is actually slower. Once I print the time spent in the three core steps of the PFN inference code, one fact becomes obvious: copying data between host memory and device memory costs much more than you might expect.

# sync: time the three core steps of PFN inference
t1 = time.time()
for hdm in inputs_hdm:
    cuda.memcpy_htod(hdm.device, hdm.host)   # host -> device copy per input
print("pfn memcpy host->device time: {:.2f} ms.".format((time.time() - t1) * 1000))

t1 = time.time()
self.pfn_context.execute_v2(bindings=bindings_m)  # synchronous TensorRT execution
print("pfn execute_v2 time: {:.2f} ms.".format((time.time() - t1) * 1000))

t1 = time.time()
for hdm in outputs_hdm:
    cuda.memcpy_dtoh(hdm.host, hdm.device)   # device -> host copy per output
print("pfn memcpy device->host time: {:.2f} ms.".format((time.time() - t1) * 1000))

########################
points_path:  /data/sets/kitti_second/training/velodyne/000001.bin
pfn memcpy host->device time: 6.46 ms.
pfn execute_v2 time: 1.50 ms.
pfn memcpy device->host time: 0.25 ms.
########################
points_path:  /data/sets/kitti_second/training/velodyne/000001.bin
pfn memcpy host->device time: 6.47 ms.
pfn execute_v2 time: 1.52 ms.
pfn memcpy device->host time: 0.26 ms.
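
As an aside, one standard way to shave the host-to-device cost is page-locked (pinned) host memory, which the driver can DMA from directly; pycuda exposes it as cuda.pagelocked_empty. A minimal sketch (the buffer shape is a hypothetical placeholder, and this is not the fix this post ends up using):

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401, creates a default CUDA context

# Hypothetical PFN input shape: (max_pillars, points_per_pillar, features).
shape, dtype = (12000, 100, 9), np.float32

host_buf = cuda.pagelocked_empty(shape, dtype)  # pinned host memory
dev_buf = cuda.mem_alloc(host_buf.nbytes)       # plain device allocation

host_buf[...] = 0.0                  # fill with real voxel features in practice
cuda.memcpy_htod(dev_buf, host_buf)  # faster than copying from pageable memory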

At this point in the experiment, you may well feel that on the current setup (x86 + RTX 2080), beating PyTorch (GPU) inference speed with the planned approach is all but hopeless, and that the acceleration experiment has quietly turned into a "remove PyTorch" experiment. First the PFN stage falls behind PyTorch (GPU), and next comes the MFN. Because TensorRT does not yet support the Scatter operation, a substitute is needed; this is in fact why the network was manually split into three segments in the first place. But no rush. Let's keep following the plan: as long as morale doesn't slip, there are always more solutions than difficulties.

MFN Inference

        Setting PyTorch aside, there are at least two straightforward ways to run the MFN: (1) implement it on the CPU with numpy indexing; (2) implement it on the GPU with CUDA.

Method 1:

def mfn_inference(self, voxel_features, coords, nchannels=64, nx=432, ny=496):
    indices = coords[:, 1] * nx + coords[:, 2]  # coord => (z, y, x), flat index = y*nx + x
    voxel_features = voxel_features.squeeze()   # shape [1, 64, N, 1] => [64, N]
    canvas = np.zeros(shape=(nchannels, nx * ny), dtype=voxel_features.dtype)

    canvas[:, indices] = voxel_features         # scatter pillar features onto the canvas
    canvas = canvas.reshape((1, nchannels, ny, nx))
    return canvas
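
To see why this one-line fancy-indexing assignment is a correct scatter, here is a tiny self-contained check (all sizes made up for illustration):

import numpy as np

nx, ny, nchannels = 4, 3, 2
coords = np.array([[0, 1, 2], [0, 2, 3]])            # two pillars, coord = (z, y, x)
feats = np.random.rand(nchannels, len(coords)).astype(np.float32)

indices = coords[:, 1] * nx + coords[:, 2]           # flat index = y * nx + x
canvas = np.zeros((nchannels, nx * ny), dtype=feats.dtype)
canvas[:, indices] = feats                           # one write per (channel, pillar)
canvas = canvas.reshape(1, nchannels, ny, nx)
assert np.allclose(canvas[0, :, 1, 2], feats[:, 0])  # pillar 0 landed at (y=1, x=2)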

Timing results:

mfn infer time: 16.88 ms.
mfn infer time: 18.57 ms.
mfn infer time: 19.00 ms.
mfn infer time: 18.35 ms.
mfn infer time: 18.39 ms.
mfn infer time: 18.44 ms.
mfn infer time: 18.38 ms.
mfn infer time: 18.71 ms.
mfn infer time: 18.41 ms.
mfn infer time: 18.39 ms.

Method 2:

def mfn_inference_by_cuda(self, voxel_features, coords, nchannels=64, nx=432, ny=496):
    t1 = time.time()
    indices = coords[:, 1] * nx + coords[:, 2]      # flat canvas index per pillar
    indices_d = gpuarray.to_gpu(indices)
    canvas_d = gpuarray.zeros(shape=(nchannels, nx * ny), dtype=voxel_features.dtype)
    voxel_features = voxel_features.squeeze()       # [1, 64, N, 1] => [64, N]
    voxel_features_d = gpuarray.to_gpu(voxel_features)

    # g_module holds the compiled scatter kernel (see the sketch below)
    scatter = g_module.get_function("scatter")
    t2 = time.time()
    scatter(indices_d,
            np.int32(nchannels),
            canvas_d,
            np.int32(nx * ny),
            voxel_features_d,
            np.int32(voxel_features.shape[-1]),     # number of pillars
            grid=(64, nx * ny // 512),
            block=(1, 512, 1))
    # NOTE: the launch is asynchronous, so this mostly measures launch overhead
    print("mfn infer scatter time: {:.3f} ms.".format((time.time() - t2) * 1000.))

    canvas_d = canvas_d.reshape(1, nchannels, ny, nx)
    canvas = canvas_d.get()                         # device -> host copy dominates
    print("mfn infer time: {:.3f} ms.".format((time.time() - t1) * 1000.))

    return canvas
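
The kernel source behind g_module.get_function("scatter") is not shown in this series. For completeness, here is a sketch of a kernel consistent with the argument list and launch configuration above, built with pycuda's SourceModule; the actual kernel may differ (int32 indices and float32 features are assumptions):

from pycuda.compiler import SourceModule

# Sketch only. Each thread scatters one (channel, pillar) pair; the grid is
# sized to nx*ny in y, so the nvoxels bound check discards the excess threads.
g_module = SourceModule("""
__global__ void scatter(const int *indices, int nchannels,
                        float *canvas, int canvas_size,
                        const float *features, int nvoxels)
{
    int c = blockIdx.x;                             // channel index, 0..nchannels-1
    int v = blockIdx.y * blockDim.y + threadIdx.y;  // pillar (voxel) index
    if (v >= nvoxels) return;
    canvas[c * canvas_size + indices[v]] = features[c * nvoxels + v];
}
""")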

Timing results:

mfn infer scatter time: 0.070 ms.
mfn infer time: 19.826 ms.
mfn infer scatter time: 0.089 ms.
mfn infer time: 20.053 ms.
mfn infer scatter time: 0.061 ms.
mfn infer time: 20.396 ms.
mfn infer scatter time: 0.062 ms.
mfn infer time: 20.017 ms.
mfn infer scatter time: 0.060 ms.
mfn infer time: 20.498 ms.
mfn infer scatter time: 0.072 ms.
mfn infer time: 19.972 ms.
mfn infer scatter time: 0.061 ms.
mfn infer time: 20.596 ms.
mfn infer scatter time: 0.064 ms.
mfn infer time: 19.953 ms.
mfn infer scatter time: 0.062 ms.
mfn infer time: 20.493 ms.
mfn infer scatter time: 0.062 ms.
mfn infer time: 19.976 ms.

Method 1 runs entirely on the CPU with numpy, so there is no data movement between CPU memory and GPU memory, yet a single frame still averages around 18 ms, which is clearly too slow. Method 2's core Scatter kernel is fast (<1 ms), but once the copy of the canvas from GPU memory back to host memory is included, it has no advantage at all. So the direction for acceleration is obvious: optimize the data flow. During inference, keep the data on the GPU as much as possible, so that each stage's output, after light processing, feeds straight into the next stage as input. Here is a result after this pipeline optimization.

pfn infer time: 1.89 ms.
mfn infer time: 0.84 ms.
rpn infer time: 2.38 ms.
pfn infer time: 1.07 ms.
mfn infer time: 0.75 ms.
rpn infer time: 2.37 ms.
pfn infer time: 1.07 ms.
mfn infer time: 0.74 ms.
rpn infer time: 2.37 ms.
pfn infer time: 1.89 ms.
mfn infer time: 0.82 ms.
rpn infer time: 2.38 ms.
pfn infer time: 1.78 ms.
mfn infer time: 0.79 ms.
rpn infer time: 2.38 ms.
pfn infer time: 2.26 ms.
mfn infer time: 0.96 ms.
rpn infer time: 2.46 ms.
pfn infer time: 1.89 ms.
mfn infer time: 0.84 ms.
rpn infer time: 2.37 ms.
pfn infer time: 1.30 ms.
mfn infer time: 0.77 ms.
rpn infer time: 2.37 ms.
pfn infer time: 1.88 ms.
mfn infer time: 0.80 ms.
rpn infer time: 2.38 ms.

The network portion of the pipeline now takes only about 5 ms in total. On my mighty RTX 2080, of course.
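
What does "keep the data on the GPU" look like concretely? Roughly this: every intermediate buffer lives in device memory, the TensorRT contexts are given raw device pointers as bindings, and the scatter kernel writes straight into the RPN's input canvas. A minimal sketch, reusing names from the snippets above (pfn_context, rpn_context, scatter); the buffer shapes, binding order, and rpn_out_shape are placeholders I'm assuming, not the exact implementation:

# Allocated once and reused across frames (shapes are placeholders).
pfn_in_d = gpuarray.empty((1, 12000, 100, 9), dtype=np.float32)   # PFN input
pfn_out_d = gpuarray.empty((1, 64, 12000, 1), dtype=np.float32)   # PFN output
canvas_d = gpuarray.zeros((1, 64, ny, nx), dtype=np.float32)      # MFN canvas / RPN input
rpn_out_d = gpuarray.empty(rpn_out_shape, dtype=np.float32)       # RPN output

# Per frame: one upload at the front, one download at the back,
# everything in between stays in device memory.
cuda.memcpy_htod(pfn_in_d.gpudata, pfn_in_host)
self.pfn_context.execute_v2(bindings=[int(pfn_in_d.gpudata),
                                      int(pfn_out_d.gpudata)])

indices_d = gpuarray.to_gpu((coords[:, 1] * nx + coords[:, 2]).astype(np.int32))
canvas_d.fill(0)                                  # clear the previous frame's canvas
scatter(indices_d, np.int32(64),                  # MFN: scatter straight into canvas_d
        canvas_d, np.int32(nx * ny),
        pfn_out_d, np.int32(nvoxels),
        grid=(64, nx * ny // 512), block=(1, 512, 1))

self.rpn_context.execute_v2(bindings=[int(canvas_d.gpudata),
                                      int(rpn_out_d.gpudata)])
rpn_out = rpn_out_d.get()                         # the only device -> host copy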
