[Model Acceleration] PointPillars TensorRT Acceleration Experiments (5)

Latest recommended article published on 2023-08-19 12:34:50

昌山小屋

Categories: Point Cloud Processing, tensorrt. Tags: PointPillars

Article link: https://blog.csdn.net/chuigedaqiqiu/article/details/119083315


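We already have TensorRT engines converted for the PFN and RPN parts; the MFN part was not converted because TensorRT does not yet support the Scatter operation. The plan is therefore to run PFN and RPN with TensorRT and, since MFN itself is not complicated, either replace its Scatter operation with other operations or rewrite MFN in CUDA. With that settled, we work through the inference of each part in turn, starting with TensorRT inference for PFN.

PFN TensorRT Inference

TensorRT inference on the GPU requires allocating device memory by hand: the input data is first copied from host memory to device memory, the inference result is written to the allocated output device memory, and it can then be copied back to host memory. Because the inputs have dynamic shapes, my first attempt allocates buffers on demand for every inference, implemented by the alloc_dynamic_buffer function.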

    # Assumes: import tensorrt as trt; import pycuda.driver as cuda
    def alloc_dynamic_buffer(self, engine, pillar_num=None):
        """Allocate host/device buffers sized for the current pillar count."""
        inputs_ = []
        outputs_ = []
        bindings_ = []
        for binding in range(engine.num_bindings):
            # Copy to a list so the dynamic (-1) dimension can be overwritten
            shape = list(engine.get_binding_shape(binding))
            if len(shape) == 4 and shape[2] == -1:
                assert pillar_num is not None
                shape[2] = pillar_num
            elif len(shape) == 2 and shape[1] == -1:
                assert pillar_num is not None
                shape[1] = pillar_num
            # Shapes without a dynamic dimension are used as they are
            size = trt.volume(shape)  # number of elements
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)  # => np.ndarray
            device_mem = cuda.mem_alloc(host_mem.nbytes)   # => pycuda.driver.DeviceAllocation
            bindings_.append(int(device_mem))
            if engine.binding_is_input(binding):
                inputs_.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs_.append(HostDeviceMem(host_mem, device_mem))
        return inputs_, outputs_, bindings_
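
HostDeviceMem is used above but is not defined in this post. NVIDIA's TensorRT Python samples ship a small container of the same name that pairs a host buffer with its device allocation. The following is a minimal sketch that assumes the author uses something equivalent; the dataclass form and the field types are my assumption, not the author's code.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class HostDeviceMem:
    """Pairs one binding's host buffer with its device allocation (assumed layout)."""
    host: Any    # in real code: page-locked np.ndarray from cuda.pagelocked_empty
    device: Any  # in real code: pycuda.driver.DeviceAllocation from cuda.mem_alloc


# Stand-in values only, so the sketch runs without a GPU.
pair = HostDeviceMem(host=[0.0] * 4, device=0x7F00)
print(len(pair.host), int(pair.device))
```

The positional order (host first, device second) matches the calls HostDeviceMem(host_mem, device_mem) in the allocation function.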

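Each input/output needs two buffers, one in host memory and one in device memory. cuda.pagelocked_empty allocates the page-locked host buffer; because element size depends on precision, it needs the data type, which trt.nptype derives from the binding's TensorRT dtype. Device memory is allocated with cuda.mem_alloc, where host_mem.nbytes is the buffer size in bytes. For reference, trt.nptype maps TensorRT data types to NumPy types roughly as follows: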

def nptype(trt_type):
    import numpy as np
    if trt_type == float32:
        return np.float32
    elif trt_type == float16:
        return np.float16
    elif trt_type == int8:
        return np.int8
    elif trt_type == int32:
        return np.int32
    raise TypeError("Could not resolve TensorRT datatype to an equivalent numpy datatype.")
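
To make the sizing concrete, the sketch below reproduces the arithmetic in plain Python: replace the dynamic -1 with the pillar count, multiply the dimensions (the element count that trt.volume returns), then multiply by the element size. The shapes, the pillar count of 12000, and the helper names are illustrative assumptions, not values taken from the author's engine.

```python
from math import prod

# Bytes per element for the data types handled by nptype
ITEMSIZE = {"float32": 4, "float16": 2, "int32": 4, "int8": 1}


def resolve_shape(shape, pillar_num):
    """Replace every dynamic (-1) dimension with the actual pillar count."""
    return tuple(pillar_num if d == -1 else d for d in shape)


def buffer_nbytes(shape, pillar_num, dtype="float32"):
    """Bytes needed for one binding: element count times element size."""
    return prod(resolve_shape(shape, pillar_num)) * ITEMSIZE[dtype]


# Hypothetical 4-D and 2-D bindings with a dynamic pillar dimension
print(buffer_nbytes((1, 1, -1, 100), 12000))    # 1*1*12000*100 elements * 4 bytes
print(buffer_nbytes((1, -1), 12000, "int32"))   # 12000 elements * 4 bytes
```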

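cuda.Stream() creates a CUDA stream. A few points about CUDA streams are worth summarizing:

A CUDA stream is a sequence of CUDA operations (operations that involve CUDA, such as host-device data transfers and kernel launches) issued by the host and executed on one device;
Besides streams issued from the host, there are also streams issued from the device. The order of operations within one CUDA stream is controlled by the host: they execute in the order the host issued them. Operations in the same stream are therefore ordered (FIFO) and cannot overlap, whereas operations in different streams have no defined order and may overlap;
Every CUDA operation belongs to some CUDA stream, either the default stream or an explicitly specified non-null stream.

At inference time, the input data is first copied from host memory to the input device memory, the GPU then runs inference, and the result is written to the output device memory, from which we copy it back to the output host memory. The copies and the inference call are all issued asynchronously on the stream, so a final stream.synchronize() is needed to wait until they have all finished.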
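
The ordering rules above can be illustrated without a GPU. The sketch below is only a CPU analogy, not CUDA: each hypothetical ToyStream is backed by a single worker thread, so operations queued on it run in FIFO order, while two ToyStream objects run independently and may interleave. All names here are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyStream:
    """CPU-only stand-in for a CUDA stream: one worker, so queued ops run in FIFO order."""

    def __init__(self, name):
        self.name = name
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def enqueue(self, op, log):
        # Returns immediately, like an *_async call issued by the host
        self._pending.append(self._pool.submit(op, self.name, log))

    def synchronize(self):
        # Block until everything queued on this stream has finished
        for future in self._pending:
            future.result()
        self._pending.clear()


def make_op(label, seconds=0.01):
    def op(stream_name, log):
        time.sleep(seconds)               # pretend to do some work
        log.append((stream_name, label))  # record completion order
    return op


log = []
s1, s2 = ToyStream("s1"), ToyStream("s2")
for label in ("h2d", "kernel", "d2h"):
    s1.enqueue(make_op(label), log)
    s2.enqueue(make_op(label), log)
s1.synchronize()
s2.synchronize()

# Order is fixed within each stream; the interleaving between s1 and s2 is not
per_stream = {n: [lab for name, lab in log if name == n] for n in ("s1", "s2")}
print(per_stream)
```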

    # Assumes: import numpy as np; import pycuda.driver as cuda
    def pfn_inference(self, engine, inputs):
        if self.pfn_context is None:
            self.pfn_context = engine.create_execution_context()
            self.pfn_context.active_optimization_profile = 0
        self.pfn_stream = cuda.Stream()

        pillar_num = inputs[0].shape[2]
        inputs_hdm, outputs_hdm, bindings_m = self.alloc_dynamic_buffer(engine, pillar_num)

        input_cnt = 0
        out_shape = None
        for n in range(engine.num_bindings):
            if engine.binding_is_input(n):
                # Look up the shape by binding index n, not by the input counter
                shape = list(engine.get_binding_shape(n))
                if len(shape) == 4 and shape[2] == -1:
                    shape[2] = pillar_num
                elif len(shape) == 2 and shape[1] == -1:
                    shape[1] = pillar_num
                else:
                    raise Exception("invalid shape:", shape)
                self.pfn_context.set_binding_shape(n, shape)
                # Flatten to contiguous memory and copy into the page-locked
                # buffer, which memcpy_htod_async expects as its source
                data = np.ascontiguousarray(inputs[input_cnt]).reshape(-1)
                np.copyto(inputs_hdm[input_cnt].host, data)
                input_cnt += 1
            else:
                out_shape = list(engine.get_binding_shape(n))
                assert len(out_shape) == 4 and out_shape[2] == -1
                out_shape[2] = pillar_num

        # Host -> device copies
        for hdm in inputs_hdm:
            cuda.memcpy_htod_async(hdm.device, hdm.host, self.pfn_stream)

        # Dynamic-shape (explicit-batch) engines are run with execute_async_v2
        self.pfn_context.execute_async_v2(bindings=bindings_m, stream_handle=self.pfn_stream.handle)

        # Device -> host copies, then wait for everything queued on the stream
        for hdm in outputs_hdm:
            cuda.memcpy_dtoh_async(hdm.host, hdm.device, self.pfn_stream)
        self.pfn_stream.synchronize()

        pillar_features = np.array(outputs_hdm[0].host).reshape(out_shape)
        return pillar_features

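This completes TensorRT inference for PFN. There is one obvious problem: host and device memory is allocated dynamically on every inference, which costs too much time. On reflection there is no need to allocate every time: allocate buffers large enough for the worst case on the first call and reuse them afterwards. The only extra work is computing the actual size when producing the returned output.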

    def alloc_pfn_max_buffer(self, engine):
        # Reuse the buffers if they were already allocated
        if self.pfn_allocated:
            return self.pfn_alloc_inputs, self.pfn_alloc_outputs, self.pfn_alloc_bindings

        for binding in engine:  # iterates over binding names
            shape = list(engine.get_binding_shape(binding))
            # Size every dynamic dimension for the largest pillar count
            if len(shape) == 4 and shape[2] == -1:
                shape[2] = self.max_num_pillars
            elif len(shape) == 2 and shape[1] == -1:
                shape[1] = self.max_num_pillars
            else:
                raise Exception("invalid shape:", shape)
            size = trt.volume(shape)
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            # host_mem.nbytes / size == 4 for 4-byte element types
            host_mem = cuda.pagelocked_empty(size, dtype)  # => np.ndarray
            device_mem = cuda.mem_alloc(host_mem.nbytes)   # => pycuda.driver.DeviceAllocation
            self.pfn_alloc_bindings.append(int(device_mem))
            if engine.binding_is_input(binding):
                self.pfn_alloc_inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                self.pfn_alloc_outputs.append(HostDeviceMem(host_mem, device_mem))
        # Mark as allocated only after every binding has its buffers
        self.pfn_allocated = True

        return self.pfn_alloc_inputs, self.pfn_alloc_outputs, self.pfn_alloc_bindings
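
The post does not show how the actual size is worked out once the buffers are sized for max_num_pillars. Here is a minimal sketch of that bookkeeping, using plain Python lists as stand-ins for the page-locked NumPy buffers. It assumes the tensor for the actual shape is packed densely at the front of the oversized buffer; the shape template, the capacity of 8, and the helper names are hypothetical.

```python
from math import prod

MAX_NUM_PILLARS = 8         # tiny made-up capacity for the demo
OUT_TEMPLATE = (1, 4, -1, 1)  # hypothetical output shape with a dynamic pillar axis


def actual_shape(template, pillar_num):
    """Shape of this frame's tensor: the dynamic axis becomes pillar_num."""
    return tuple(pillar_num if d == -1 else d for d in template)


def load_input(host_buf, data):
    """Copy this frame's flattened input into the front of the reusable buffer."""
    if len(data) > len(host_buf):
        raise ValueError("input exceeds the preallocated capacity")
    host_buf[:len(data)] = data
    return len(data)


def valid_output(host_buf, template, pillar_num):
    """Return only the elements written for this frame, plus their shape."""
    shape = actual_shape(template, pillar_num)
    return host_buf[:prod(shape)], shape


# Buffers sized once for the worst case
in_buf = [0.0] * MAX_NUM_PILLARS
out_buf = list(range(prod(actual_shape(OUT_TEMPLATE, MAX_NUM_PILLARS))))

used = load_input(in_buf, [1.0, 2.0, 3.0])       # a frame with 3 pillars
flat, shape = valid_output(out_buf, OUT_TEMPLATE, 3)
print(used, len(flat), shape)
```

With NumPy buffers the same idea would read host[:n].reshape(shape), which avoids copying the unused tail of the buffer.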

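So how much does this speed things up compared with running PyTorch directly on the GPU? For this component alone, the result is somewhat disappointing. I did a simple comparison of the TensorRT GPU inference above against PyTorch GPU inference: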

Backend	PFN inference time
PyTorch (GPU)	0.53 ms
TensorRT	6.50 ms

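Inference directly in PyTorch (GPU) is about 12 times faster than the PFN acceleration approach in the experiment above (0.53 ms versus 6.50 ms). I think there are several main reasons:

The PFN network structure itself is extremely simple;
With the approach above, the data goes through cpu -> gpu -> inference -> gpu -> cpu, and moving data between cpu and gpu takes a large share of the time;
The hardware and software in my environment (graphics card, CUDA, TensorRT, ONNX, Python versions, and so on) may not be a well-matched combination.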
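
One way to check the second point is to time each stage separately instead of the whole call. The helper below is a hedged sketch, not the author's benchmark: time_stages and its parameters are hypothetical, and the stages here are CPU stand-ins built from time.sleep. In real use, each stage would issue the host-to-device copy, the inference call, or the device-to-host copy, and sync would be a blocking call such as the stream's synchronize() method or torch.cuda.synchronize.

```python
import time


def time_stages(stages, sync=lambda: None, warmup=3, repeats=20):
    """Return mean milliseconds per named stage.

    `sync` should block until queued GPU work is done, so that asynchronous
    calls are not timed as if they were free.
    """
    totals = {name: 0.0 for name, _ in stages}
    for i in range(warmup + repeats):
        for name, fn in stages:
            sync()
            start = time.perf_counter()
            fn()
            sync()
            elapsed = time.perf_counter() - start
            if i >= warmup:  # discard warm-up iterations
                totals[name] += elapsed
    return {name: 1000.0 * total / repeats for name, total in totals.items()}


# CPU stand-ins for the three stages of the pipeline
stages = [
    ("h2d", lambda: time.sleep(0.002)),
    ("infer", lambda: time.sleep(0.001)),
    ("d2h", lambda: time.sleep(0.002)),
]
report = time_stages(stages, warmup=1, repeats=5)
for name, ms in report.items():
    print(f"{name}: {ms:.2f} ms")
```

If the copy stages dominate such a breakdown, keeping the data on the GPU between stages is likely to matter more than the engine itself.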

