YOLOV5 TensorRT BatchedNMS

In the post "Modifying YOLOv5's Detect layer", we introduced a lightweight rework of the Detect layer to improve model-serving efficiency. In "Deploying a yolov5 service with Triton Pipelines", we used the reworked model file to deploy Triton Pipelines via both BLS and Ensemble. In those pipelines, however, the infer engine and NMS remain two relatively independent steps, and because NMS is implemented in the Python backend, both BLS and Ensemble are subject to some data-transfer limitations.

This post uses onnx_graphsurgeon to rework the output tensors of the native Detect layer so that they plug directly into the CUDA-based TensorRT batchedNMSPlugin, integrating yolov5's NMS into the TensorRT engine. This avoids device-to-host data copies in some scenarios and improves overall compute performance.

Link to the related code

0. Prerequisites

# clone ultralytics repo
git clone -b v6.1 https://github.com/ultralytics/yolov5.git
# clone this repo
git clone <this repo>
cp -r <this repo>/* yolov5/

1. Steps

The steps are roughly the same as in "Modifying YOLOv5's Detect layer" §3, namely:

  • modify the Detect layer's forward function
  • export the .onnx file
  • convert it to a TRT engine

The difference is that the ONNX export function needs some extra changes: create a BatchedNMSDynamic_TRT node, append it to the end of the original graph, and set the node's attributes to match the input format of the TensorRT batchedNMSPlugin.

1.1 Before and after

  • Original output format of the forward function in inference mode

    • squeezed boxes and classes:

      [batch_size, number_boxes, box_xywh + c + number_classes] = [batch_size, 25200, 85]

      For a 640x640 input, the three detection heads (strides 8, 16, 32) produce 80x80, 40x40 and 20x20 grids with 3 anchors each, so number_boxes = 3 x (6400 + 1600 + 400) = 25200; the 85 channels are 4 box coordinates + 1 objectness score + 80 class scores.
      

  • Output format after the modification

    • boxes

      [batch_size, number_boxes, 1, x1y1x2y2]
      
    • cls_conf

      [batch_size, number_boxes, number_classes]
      

      According to the comments in the batchedNMSPlugin.cpp source, the boxes input shape is [batch_size, num_boxes, num_classes, 4] or [batch_size, num_boxes, 1, 4], but the batchedNMSPlugin documentation does not explain the difference between the two. The efficientNMSPlugin documentation does:

      The boxes input can have 3 dimensions in case a single box prediction is produced for all classes (such as in EfficientDet or SSD), or 4 dimensions when separate box predictions are generated for each class (such as in FasterRCNN), in which case number_classes >= 1 and must match the number of classes in the scores input. The final dimension represents the four coordinates that define the bounding box prediction.

      Since we are using yolov5, bounding boxes are not generated per class, so the boxes input shape should be [batch_size, num_boxes, 1, 4].
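To make the two accepted boxes layouts concrete, here is a minimal sketch (shapes only; the variable names are illustrative, not from the repo):

import torch

num_boxes, num_classes = 25200, 80

# shared boxes (yolov5 / SSD style): one box prediction shared by all classes
boxes_shared = torch.rand(1, num_boxes, 1, 4)

# per-class boxes (FasterRCNN style): one box prediction per class
boxes_per_class = torch.rand(1, num_boxes, num_classes, 4)

# in both cases the scores input is [batch_size, num_boxes, num_classes]
scores = torch.rand(1, num_boxes, num_classes)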

1.2 Modifying the Detect layer

Change the output of the yolov5 Detect layer's forward function to match the input format of the TensorRT batchedNMSPlugin:

def forward(self, x):
    z = []  # inference output
    for i in range(self.nl):
        x[i] = self.m[i](x[i])  # conv
        bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
        x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

        if not self.training:  # inference
            if self.onnx_dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

            y = x[i].sigmoid()
            if self.inplace:
                y[..., 0:2] = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
                y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
            else:  # for YOLOv5 on AWS Inferentia https://github.com/ultralytics/yolov5/pull/2953
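                # NOTE: the batchedNMS-compatible output below is built only in this
                # non-inplace branch; the export therefore assumes inplace=False
                # (the default in export.py).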
                xy = (y[..., 0:2] * 2 - 0.5 + self.grid[i]) * self.stride[i]  # xy
                wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh
                # custom output >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
                conf = y[..., 4:]
                xmin = xy[..., 0:1] - wh[..., 0:1] / 2
                ymin = xy[..., 1:2] - wh[..., 1:2] / 2
                xmax = xy[..., 0:1] + wh[..., 0:1] / 2
                ymax = xy[..., 1:2] + wh[..., 1:2] / 2
                obj_conf = conf[..., 0:1]
                cls_conf = conf[..., 1:]
                cls_conf *= obj_conf  # fold objectness into the per-class scores
                # y = torch.cat((xy, wh, y[..., 4:]), -1)
                y = torch.cat((xmin, ymin, xmax, ymax, cls_conf), 4)
            # z.append(y.view(bs, -1, self.no))
            z.append(y.view(bs, -1, self.no - 1))
    
    z = torch.cat(z, 1)
    bbox = z[..., 0:4].view(bs, -1, 1, 4)
    cls_conf = z[..., 4:]
    
    return bbox, cls_conf
    # return x if self.training else (torch.cat(z, 1), x)
    # custom output >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
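As a quick sanity check of the new return values (a hypothetical snippet, not part of the repo), the tail of the function maps the concatenated predictions onto exactly the shapes described in 1.1:

import torch

# stand-in for z after torch.cat(z, 1): 4 box coordinates + 80 class scores
bs = 1
z = torch.rand(bs, 25200, 84)

bbox = z[..., 0:4].view(bs, -1, 1, 4)  # plugin "boxes" input
cls_conf = z[..., 4:]                  # plugin "scores" input

assert bbox.shape == (bs, 25200, 1, 4)
assert cls_conf.shape == (bs, 25200, 80)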

1.3 Modifying export_onnx

When exporting the ONNX model, modify the outputs to satisfy the input format of the TensorRT batchedNMSPlugin.

The key points are described below; for the full code see the export_onnx function in export.py.

  • Avoid exporting static shapes when running onnx simplify

    model_onnx, check = onnxsim.simplify(
        model_onnx,
        dynamic_input_shape=dynamic
        # must stay commented out, otherwise the input shape is frozen
        #input_shapes={'images': list(im.shape)} if dynamic else None
        )
    
  • Use onnx-graphsurgeon to create a BatchedNMSDynamic_TRT node and append it to the end of the original graph (the attribute values elided in the snippet are sketched after this list)

    # add batch NMS:
    yolo_graph = onnx_gs.import_onnx(model_onnx)
    box_data = yolo_graph.outputs[0]
    cls_data = yolo_graph.outputs[1]
    nms_out_0 = onnx_gs.Variable(
        "BatchedNMS",
        dtype=np.int32
    )
    nms_out_1 = onnx_gs.Variable(
        "BatchedNMS_1",
        dtype=np.float32
    )
    nms_out_2 = onnx_gs.Variable(
        "BatchedNMS_2",
        dtype=np.float32
    )
    nms_out_3 = onnx_gs.Variable(
        "BatchedNMS_3",
        dtype=np.float32
    )
    nms_attrs = dict()
    # ........
    
    nms_plugin = onnx_gs.Node(
        op="BatchedNMSDynamic_TRT",
        name="BatchedNMS_N",
        inputs=[box_data, cls_data],
        outputs=[nms_out_0, nms_out_1, nms_out_2, nms_out_3],
        attrs=nms_attrs
    )
    yolo_graph.nodes.append(nms_plugin)
    yolo_graph.outputs = nms_plugin.outputs
    yolo_graph.cleanup().toposort()
    model_onnx = onnx_gs.export_onnx(yolo_graph)
    
  • Export the ONNX model, then the TensorRT engine

    # export onnx
    python export.py --weights yolov5s.pt --include onnx --simplify --dynamic 
    
    # export trt engine
    /usr/src/tensorrt/bin/trtexec  \
    --onnx=yolov5s.onnx \
    --minShapes=images:1x3x640x640 \
    --optShapes=images:1x3x640x640 \
    --maxShapes=images:1x3x640x640 \
    --workspace=4096 \
    --saveEngine=yolov5s_opt1_max1_fp16.engine \
    --shapes=images:1x3x640x640 \
    --verbose \
    --fp16 \
    > result-FP16-BatchedNMS.txt
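The nms_attrs values elided above ("# ........") are model- and deployment-specific. As a reference, here is a hedged sketch of typical settings for yolov5; the field names come from the batchedNMSPlugin documentation, while the values are illustrative and should be tuned:

# illustrative values only; tune topK / keepTopK / thresholds for your use case
nms_attrs = dict()
nms_attrs["shareLocation"] = True      # boxes input is [batch_size, num_boxes, 1, 4]
nms_attrs["backgroundLabelId"] = -1    # no background class in yolov5/COCO
nms_attrs["numClasses"] = 80
nms_attrs["topK"] = 1024               # boxes considered per class before NMS
nms_attrs["keepTopK"] = 100            # final detections kept per image
nms_attrs["scoreThreshold"] = 0.25
nms_attrs["iouThreshold"] = 0.45
nms_attrs["isNormalized"] = False      # box coordinates are in pixels, not [0, 1]
nms_attrs["clipBoxes"] = False
nms_attrs["scoreBits"] = 16            # score precision, supported since TensorRT 7.2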
    

2. Performance tests

2.1 Test on the COCO17 validation set

Compare the combined latency of inference + NMS between the two engines:

  • original yolo

    python detect.py --weights original-yolov5s-fp16.engine --half --img 640 --source </path/to/coco/images/val2017/> --device 0
    

    Speed: 0.8ms pre-process, 4.4ms inference, 2.2ms NMS per image at shape (1, 3, 640, 640)

  • batchedNMSPlugin

    python trt_infer.py --model yolov5s_opt1_max1_fp16.engine --input_images_folder </path/to/coco/images/val2017/> --output_images_folder <output_tempfolder> --input_size 640
    

    infer + nms:
    Inference: 5.4 ms per image at shape (1, 3, 640, 640)
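
trt_infer.py is not reproduced here, but one step is worth highlighting (a hedged sketch; the engine file name is taken from the export above): the TensorRT plugin registry must be initialized before deserializing an engine that contains BatchedNMSDynamic_TRT.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
# register the built-in plugins (including BatchedNMSDynamic_TRT)
# before deserializing, otherwise the engine fails to load
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

with open("yolov5s_opt1_max1_fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())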

2.2 trtexec tests

The trtexec numbers below come from a local run with batch size 1. Overall the difference is small: with NMS integrated into the TRT engine, the output tensors become much smaller, which reduces the device-to-host transfer time, at the cost of extra GPU compute time.

| Metric | BatchedNMSDynamic_TRT engine (infer + NMS) | ultralytics engine (infer only) |
| --- | --- | --- |
| Latency | 3.97021 ms | 4.08145 ms |
| End-to-End Host Latency | 6.70715 ms | 4.73285 ms |
| Enqueue Time | 1.27597 ms | 0.95929 ms |
| H2D Latency | 0.563791 ms | 0.316406 ms |
| GPU Compute Time | 3.45068 ms | 2.41992 ms |
| D2H Latency | 0.0100889 ms | 1.34198 ms |
