A 40,000-Character Deep Dive | Sparse4Dv3 TensorRT Deployment and Tuning Guide

Author | Thomas Wilson  Editor | 自动驾驶Daily

Original link: https://zhuanlan.zhihu.com/p/715179777

This article is shared for academic purposes only. If it infringes any rights, contact us and it will be removed.

Brief

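Topic of this article: how to deploy the Sparse4Dv3 end-to-end perception stack on a personal workstation and on the in-vehicle NVIDIA ORIN. Sparse4D is a high-performance, high-efficiency perception algorithm with long-term temporal fusion, built on the sparse Transformer paradigm. As of September 2024 it ranks first on the nuScenes camera-only leaderboard with mAP = 0.668:

Figure 1: nuScenes camera-only leaderboard

The general architecture of the algorithm is shown below [1]:

Figure 2: Architecture of the temporal-fusion sparse Transformer algorithm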

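I have tested how well the algorithm generalizes across several datasets, and the results look good. On that basis I refactored the source code. The main change is removing the dependencies on MMDetection3D and mmcv-full, which makes the algorithm lighter and friendlier for training, inference, and deployment. So far the repository has been tested on personal workstations and remote clusters (mainly NVIDIA Ampere, Volta, and Hopper GPUs). Installation is simple with no complicated dependencies (to upgrade, just update the CUDA and torch component versions), and large-scale training and inference run without trouble.

For most companies shipping autonomous driving to production, once the model architecture is settled, ramping up data collection and labeling takes time. Apart from data processing and iterative model tuning, most day-to-day effort goes into model deployment, high-performance inference, and on-vehicle C++ development. After all, a model only counts once it runs in the car.

Deploying Sparse4Dv3 and writing its C++ code was not smooth sailing for me. In this article I want to share every pitfall I hit and what I learned, so that you can avoid as many detours as possible. If you are interested, clone my GitHub repo: SparseEnd2End. The deployment code lives in the deploy folder and is already open source. The on-vehicle C++ code is still being filled in; see the onboard folder. Writing things down beats a good memory, so this article will be updated over time with further tuning and acceleration work.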

github.com/ThomasVonWu/SparseEnd2End

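If you run into problems while using the repository, please share them with me so we can learn together.

Before reading on, you should have the following background and tool experience, which will help you grasp the core content quickly (sorry, time is limited, so I will not explain basic tool usage later). Note also that the deployment approach in this article does not depend on any other deployment repository such as MMDeploy:

Exporting a PyTorch model to the ONNX intermediate format, and using netron to visualize ONNX nodes;
Installing and using the ONNX Runtime and TensorRT inference engines, including the TensorRT Python API and C++ API;
The Python API and basic command-line usage of Polygraphy, the TensorRT tool;
Basic CUDA programming: writing and launching kernels, and the common memory model (global memory, shared memory, etc.);
Makefile syntax, C++ compilation rules, and how nvcc compiles CUDA programs;

From this article you will learn the following more advanced topics:

After converting a model, how to make sure data flows through without loss and how to verify that inference results are consistent, i.e. how to debug problems and check accuracy (I suspect this is what most readers care about; even paid courses costing several hundred yuan may not explain it in detail);
How to register a custom operator in ONNX, how to register that operator as a TensorRT plugin, and finally how to load the custom plugin into the model's engine, running the Python inference path with the TensorRT Python API and the on-vehicle inference path with the TensorRT C++ API;

Everything is ready, so let's get started.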

Sparse4Dv3 Model Deployment Pipeline

The deployment environment and tool versions used in this article are:

====================================================================================================================
||  Config Environment Below:
||  UBUNTU              : 20.04
||  TensorRT LIB        : /mnt/env/tensorrt/TensorRT-8.5.1.7/lib
||  TensorRT INC        : /mnt/env/tensorrt/TensorRT-8.5.1.7/include
||  TensorRT BIN        : /mnt/env/tensorrt/TensorRT-8.5.1.7/bin
||  CUDA_LIB            : /usr/local/cuda-11.6/lib64
||  CUDA_INC            : /usr/local/cuda-11.6/include
||  CUDA_BIN            : /usr/local/cuda-11.6/bin
||  CUDNN_LIB           : /mnt/env/tensorrt/cudnn-linux-x86_64-8.6.0.163_cuda11-archive/lib
||  CUDASM              : sm_86
||  PYTORCH             : 1.13.0
||  ONNX                : 1.14.1
||  ONNXRUNTIME         : 1.15.0
||  ONNXSIM             : 0.4.33
||  CUDA-PYTHON         : 1.15.0
||  NETRON              : 7.7.8
||  POLYGRAPHY          : 0.49.9
====================================================================================================================

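The overall deployment approach is:

Deploying Sparse4Dv3 means converting the trained .pth file to an intermediate .onnx file and finally to an *.engine file. Two needs must be met along the way: operator compatibility between the PyTorch model and ONNX, and faster model execution. Download link for the model .pth: https://drive.google.com/file/d/1sSMNB7T7LPKSr8nD9S_tSiu1mJrFMZ1I/view?usp=sharing
The Sparse4Dv3 PyTorch model is essentially a computation graph. Deployment usually requires a static graph, i.e. one without control flow (branches, loops). This matters: we will hit exactly this problem when converting the Sparse4Dv3 head;
PyTorch has built-in ONNX support. We only need to construct a set of random inputs and call torch.onnx.export on the model to convert PyTorch to ONNX;
The ONNX Runtime inference engine supports ONNX models natively and provides a Python API and a C++ API. Given an .onnx file, a few calls to the ONNX Runtime Python API are enough to run inference. With these tools we can verify inference consistency between PyTorch and ONNX;
The TensorRT inference engine provides a Python API and a C++ API. Given an .engine file, a few calls to the TensorRT Python API are enough to run inference in a Python script. With this we can verify inference consistency between PyTorch and TensorRT. Finally, calling the C++ API deploys the code in the on-vehicle repository (put simply, map the TensorRT Python API calls onto their C++ API counterparts).

Before starting, let's analyze the structure of the Sparse4Dv3 PyTorch model. Taking the configuration input img_size: 256x704, backbone: ResNet50, precision: fp32 as an example, the model splits roughly into two parts: imgBackbone and sparseTransformerHead.

1) imgBackbone consists of ResNet50 + FPN. Its detailed structure is: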

(imgBackbone): ResNet(
    (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ReLU(inplace=True)
    (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (layer1): ResLayer(
      (0): Bottleneck(
        (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (downsample): Sequential(
          (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (1): Bottleneck(...)
      ...
    )
    ***********************************************************(repeated ResNet50 blocks omitted here)
  )
  init_cfg={'type': 'Pretrained', 'checkpoint': 'ckpt/resnet50-19c8e357.pth'}
  (img_neck): FPN(
    (lateral_convs): ModuleList(...)
    (fpn_convs): ModuleList(
      (0): ConvModule(...)
  )
  init_cfg={'type': 'Xavier', 'layer': 'Conv2d', 'distribution': 'uniform'})

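2) sparseTransformerHead consists of anchor_encoder plus a ModuleList of 39 ops. It does not include the instance_bank layer, loss_cls, loss_reg, loss_box, loss_cns, etc., for these reasons:

instance_bank layer: this module caches instances from past frames, projects past-frame anchors into the current frame, and finally updates track_id and confidence. It contains no effective nn.Module part (effective meaning tensors whose gradients are updated during training, so it has no weights or biases). To export ONNX successfully, it is separated from the head later on;
Deployment follows the inference path, so all training-only modules are dropped, including loss_cls, loss_reg, loss_box, loss_cns, depth_branch, grid_mask, etc.;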

(sparseTransformerHead): Sparse4DHead(
    (anchor_encoder): SparseBox3DEncoder(
      (pos_fc): Sequential(
        (0): Linear(in_features=3, out_features=128, bias=True)
        (1): ReLU(inplace=True)
        (2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (3): Linear(in_features=128, out_features=128, bias=True)
        (4): ReLU(inplace=True)
        (5): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (6): Linear(in_features=128, out_features=128, bias=True)
        (7): ReLU(inplace=True)
        (8): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (9): Linear(in_features=128, out_features=128, bias=True)
        (10): ReLU(inplace=True)
        (11): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      )
*********************************************************(size_fc/yaw_fc/vel_fc omitted here)
    )
    (layers): ModuleList(
      (0): DeformableAttentionAggr(...)
      (1): AsymmetricFFN(...)
      (2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (3): SparseBox3DRefinementModule(...)
        (cls_layers): Sequential(...)
        (quality_layers): Sequential(...)
      (4): MultiheadAttention(...)
      (5): MultiheadAttention(...)
      ***********************************************************(repeated head ops omitted here)
    )
    (fc_before): Linear(in_features=256, out_features=512, bias=False)
    (fc_after): Linear(in_features=512, out_features=256, bias=False))

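Sparse4Dv3 SparseTransformerHead Deployment Solution

This is the core content. I introduce the head deployment first because, on the one hand, the head is the heart of the algorithm and is tightly coupled to business requirements; on the other hand, its structure is complex and it accounts for most of the deployment work. Compared with imgBackbone, the forward pass requires a clear understanding of the head's data flow and of how the algorithm works.

I) Deployment scheme for the custom DFA CUDA operator

When converting the head to ONNX, we quickly find a custom operator, DFA (Deformable Feature Aggregation), invoked through deformable_aggregation_function(). The operator is a C++/CUDA extension built into a shared library with pybind11. It addresses the lack of parallelism in the native PyTorch ops, which reduces training/inference memory and speeds both up. The usual workflow for a PyTorch C++/CUDA custom operator is:

Step 1: write the operator's forward and backward functions in C++;
Step 2: bind the forward and backward functions to Python with pybind11;
Step 3: build and package the C++ project into a *.so file with setuptools/JIT/CMake;
Step 4: on the Python side, subclass torch.autograd.Function, implement the static methods forward and backward, and call into the *.so built above;

Q: What is the DFA operator and what does it do? The paper describes it as follows [2]:

Figure 3: The DFA operator aligns the original 3D keypoints (fixed keypoints and learnable keypoints) with the image features in order to extract useful image features

DFA operator workflow:

Step 1: project the 13 3D keypoints (6 KFP + 7 KLP) of each of the 900 query instances onto the feature maps of the current timestamp, then sample features from the multi-view feature maps by bilinear interpolation;
Step 2: repeat the bilinear interpolation on each multi-scale level of the multi-view features (i.e. feature maps of different resolutions) to capture coarse-to-fine features, which helps the model understand object structure and detail at different scales;
Step 3: finally, the network weights the samples with predicted weights (computed by a linear layer) and aggregates them. This lets the algorithm fuse features from different views and scales according to how much each contributes to the detection task;

P.S. KFP denotes fixed keypoints and KLP learnable keypoints. In this article DeformableAttentionAggr, DeformableFeatureAggregation, and DFA all refer to the same custom operator.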
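
The three steps above can be sketched in plain Python for a single instance and a single channel. This is only an illustration, not the repository's implementation: `bilinear_sample` and `aggregate` are hypothetical helper names, the pixel convention (`x = u * w - 0.5`) is borrowed from the CUDA kernel quoted later in this article, and treating samples outside the map as zero is my assumption.

```python
import math


def bilinear_sample(fmap, u, v):
    """Sample a 2D map (list of rows) at normalized (u, v); zero outside the map."""
    h, w = len(fmap), len(fmap[0])
    x, y = u * w - 0.5, v * h - 0.5  # normalized -> pixel coordinates
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += fmap[yy][xx] * wy * wx
    return value


def aggregate(feature_maps, points, weights):
    """Weighted sum over keypoints, cameras and scales for one instance and channel.

    feature_maps[cam][scale] is a 2D map, points[k][cam] is (u, v) or None when
    keypoint k is not visible in that camera, weights[k][cam][scale] is a float.
    """
    out = 0.0
    for k, per_cam in enumerate(points):
        for cam, uv in enumerate(per_cam):
            if uv is None:
                continue  # this keypoint does not project into this camera
            for scale, fmap in enumerate(feature_maps[cam]):
                out += weights[k][cam][scale] * bilinear_sample(fmap, uv[0], uv[1])
    return out


# One camera with two scales (2x2 and 1x1), one keypoint at the image centre.
maps = [[[[1.0, 2.0], [3.0, 4.0]], [[10.0]]]]
print(aggregate(maps, [[(0.5, 0.5)]], [[[0.5, 0.1]]]))
```

In the real operator the same arithmetic runs for 900 instances, 13 keypoints, and every channel at once, which is why the memory traffic discussed next matters.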

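Q: Why does the DFA operator need CUDA acceleration?

HBM (High Bandwidth Memory) [3] is a common GPU memory medium which, as the name says, offers high bandwidth. During multi-view, multi-scale feature aggregation, the DFA op performs many HBM I/O passes and therefore stores a large number of temporaries. Gradient backpropagation in training and frequent HBM access in inference consume a lot of GPU memory on the one hand, and the frequent I/O slows inference on the other. The original code calls torch.nn.functional.grid_sample [4]. Because this built-in PyTorch path is not optimized for this kind of parallel aggregation, in my training and inference runs even an RTX 3090 could not cope: GPU memory simply blew up. The core call is: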

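features = []
for fm in feature_maps:
    # Sample every camera view of this scale at the projected 2D points.
    features.append(nn.functional.grid_sample(fm.flatten(end_dim=1), points_2d))
# Stack the per-scale samples only after the loop has finished.
features = torch.stack(features, dim=1)
features = features.reshape(
    bs, num_cams, num_levels, -1, num_anchor, num_pts).permute(0, 4, 1, 2, 5, 3)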

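To address this, the authors called for acceleration and proposed a scheme:

Figure 4: Accelerating DFA feature aggregation with parallel CUDA threads

DFA acceleration workflow:

Step 1: treat the multi-scale/multi-view inputs of the basic deformable aggregation, together with the weighting by learnable coefficients, as an atomic operation of a CUDA thread;
Step 2: assign threads along the K×C dimensions so that computation over K and C is fully parallel;

The core code is in modules/ops/src/deformable_aggregation_cuda.cu. An excerpt: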

__global__ void deformable_aggregation_kernel(
    const int num_kernels,
    float* output,
    const float* mc_ms_feat,
    const int* spatial_shape,
    const int* scale_start_index,
    const float* sample_location,
    const float* weights,
    int batch_size,
    int num_cams,
    int num_feat,
    int num_embeds,
    int num_scale,
    int num_anchors,
    int num_pts,
    int num_groups
) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_kernels) return;

    const float weight = *(weights + idx / (num_embeds / num_groups));
    const int channel_index = idx % num_embeds;
    idx /= num_embeds;
    const int scale_index = idx % num_scale;
    idx /= num_scale;

    const int cam_index = idx % num_cams;
    idx /= num_cams;
    const int pts_index = idx % num_pts;
    idx /= num_pts;

    int anchor_index = idx % num_anchors;
    idx /= num_anchors;
    const int batch_index = idx % batch_size;
    idx /= batch_size;

    anchor_index = batch_index * num_anchors + anchor_index;
    const int loc_offset = ((anchor_index * num_pts + pts_index) * num_cams + cam_index) << 1;

    const float loc_w = sample_location[loc_offset];
    if (loc_w <= 0 || loc_w >= 1) return;
    const float loc_h = sample_location[loc_offset + 1];
    if (loc_h <= 0 || loc_h >= 1) return;
    
    int cam_scale_index = cam_index * num_scale + scale_index;
    const int value_offset = (batch_index * num_feat + scale_start_index[cam_scale_index]) * num_embeds + channel_index;

    cam_scale_index = cam_scale_index << 1;
    const int h = spatial_shape[cam_scale_index];
    const int w = spatial_shape[cam_scale_index + 1];

    const float h_im = loc_h * h - 0.5;
    const float w_im = loc_w * w - 0.5;

    atomicAdd(
        output + anchor_index * num_embeds + channel_index,
        bilinear_sampling(mc_ms_feat, h, w, num_embeds, h_im, w_im, value_offset) * weight
    );
}
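
To make the indexing in the kernel excerpt easier to follow, here is a small Python sketch of how the flat thread index is decoded and where each thread accumulates its result. It mirrors only the modulo/divide chain and the atomicAdd offset shown above; the function names and the toy sizes are mine, not the repository's.

```python
from collections import Counter


def decode_thread_index(idx, num_embeds, num_scale, num_cams, num_pts, num_anchors):
    """Mirror the kernel's modulo/divide chain: channel varies fastest, batch slowest."""
    idx, channel = divmod(idx, num_embeds)
    idx, scale = divmod(idx, num_scale)
    idx, cam = divmod(idx, num_cams)
    idx, pts = divmod(idx, num_pts)
    batch, anchor = divmod(idx, num_anchors)
    return batch, anchor, pts, cam, scale, channel


def output_offset(batch, anchor, channel, num_anchors, num_embeds):
    """Offset passed to atomicAdd: one output slot per (batch, anchor, channel)."""
    return (batch * num_anchors + anchor) * num_embeds + channel


# Toy sizes (illustrative only): 2 batches, 5 anchors, 2 points, 3 cameras, 2 scales, 4 channels.
sizes = dict(num_embeds=4, num_scale=2, num_cams=3, num_pts=2, num_anchors=5)
total_threads = 2 * 5 * 2 * 3 * 2 * 4
writers = Counter()
for i in range(total_threads):
    b, a, p, cam, s, ch = decode_thread_index(i, **sizes)
    writers[output_offset(b, a, ch, sizes["num_anchors"], sizes["num_embeds"])] += 1

# Every output slot is written by num_pts * num_cams * num_scale threads,
# which is why the kernel has to accumulate with atomicAdd.
print(len(writers), set(writers.values()))
```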

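Each thread's workload is N × S, but each point projects into at most two views, so the complexity is at most 2S. We now know why DFA needs CUDA acceleration and how it is accelerated, but what does this mean for actual deployment? Building on the authors' scheme, the operator can be optimized further, for example by reducing the number of HBM I/O passes, introducing shared memory, or using half-precision CUDA fp16. I will cover these in detail in my next article; if you are interested, follow the update log in my repository.

P.S. N is the number of surround-view cameras, S the number of scale levels, and K the number of keypoints.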

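Q: How is the DFA operator deployed? What is the development workflow?

The conclusion first:

First, use a symbolic function (symbolic) in the PyTorch code to define the PyTorch-to-ONNX mapping for the custom DFA operator;
Second, convert the *.pth model, which now carries that mapping, into an ONNX intermediate file;
Third, build a custom DFA plugin against the plugin-registration C++ API of the chosen TensorRT version, write the build rules, and compile it, producing a local *.so shared library;
Fourth, compare PyTorch and TensorRT inference results to confirm they are consistent;
Finally, load the shared library through the trtexec command line when building the target TensorRT engine, producing an engine that contains the plugin;

I should mention several approaches I had hoped would work, and why I finally settled on the scheme above:

Rejected approach 1: splitting the model

DFA is a custom operator, which means PyTorch's ATen [5] base library has no official implementation, ONNX has no node op for it, and TensorRT (including the official plugins) does not support it. Converting to ONNX is therefore bound to fail, let alone building a TensorRT engine. An obvious idea is to split the model at every place the custom operator appears and chain the pieces together in the on-vehicle C++ code. This generally works, but I do not recommend it for Sparse4Dv3. See below:

Figure 5: Number of DFA occurrences in one inference pass of the Sparse4D head

The DFA op appears 6 times in a single-frame inference. Splitting would turn the head alone into 7 models, which only makes things more complicated. Approach 1 is clearly too clumsy.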

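Rejected approach 2: integrating the custom DFA plugin into TensorRT OSS and rebuilding everything

TensorRT OSS [6] (TensorRT Open Source Software) is the open-source repository of NVIDIA's high-performance inference SDK. We pick a TensorRT version, clone it, and find the official plugins in the plugin folder. At this point many blogs and video tutorials tell us to implement our plugin by following the official ones, register the new operator name in the corresponding API, and rebuild everything to produce a new libnvinfer_plugin.so that replaces the original library. It sounds logical and unproblematic, but in practice there are far too many problems:

You painstakingly register the DFA operator by the rules and start the build, and then all kinds of dependency problems appear. After a whole morning there are still environment issues. You stop and think: the actual operator development and the check of whether inference is correct have not even started. What have I been doing all morning?
In the afternoon you are getting anxious. You search Google for the error and find nothing. Am I the only person in the world who has hit this? You open an issue, run your question through Google Translate, check the grammar, submit it with excitement, and wait for the maintainers to answer. Days pass. You grow irritated: are they too busy, in another time zone, on holiday? Why has nobody replied? You try again on your own, and then a notification pops up: the issue has been updated. But what you see is that familiar, infuriating reply: How about using the latest version TensorRT?

I do not know how often this happens to you, but it happens to me a lot. Even if these problems are solved on the local workstation, deploying to NVIDIA ORIN still carries risk. Even if we build successfully locally, or upgrade TensorRT to dodge the bugs of the old version, and end up with a new libnvinfer_plugin.so, it cannot be used on NVIDIA ORIN because the architecture is different. Either we cross-compile an ARM shared library locally, or we build directly on the ORIN board. Leaving aside the slow build, who can guarantee no further environment problems? The workload looks like a bottomless pit, while the actual plugin development and validation has not even begun, so the priorities are clearly wrong. Sorry, this approach does not suit me either.

Rejected approach 3: borrowing other open-source repositories

Examples are tensorRT_Pro, tensorrtx, and torch2trt. First, you have to learn how to use them. Second, there is no guarantee that their maintainers keep updating them and fix bugs promptly. Most importantly, the custom operators used in autonomous driving are very likely missing from these repositories (they mostly cover deployment of 2D detection, classification, and segmentation). The preparation is time-consuming and the payoff is low.

After weighing all this, the scheme given at the start is still the most suitable. Back to the main topic.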

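Step 1: define the mapping between the PyTorch operator and the ONNX node:

Figure 6: Building the operator mapping with a symbolic function and g.op()

Step 2: generate the ONNX intermediate file. With the mapping in place we can export the ONNX file. Visualized in netron:

Figure 7: Custom op: DeformableAttentionAggrPlugin ONNX node

This ONNX file cannot yet be converted to a TensorRT engine, because the DeformableAttentionAggrPlugin node in ONNX is only an interface with no corresponding TensorRT op implementation. The next step is to register the operator in TensorRT and fill in its core implementation.

Step 3: register the custom operator plugin DeformableAttentionAggrPlugin (explained as simply as I can):

Find the TensorRT repository on GitHub, select version 8.5, and open the plugin folder. Take the first official plugin, batchTilePlugin, as an example:

Figure 8: Modeling DeformableAttentionAggrPlugin on batchTilePlugin

We need to create four files: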

deformableAttentionAggrPlugin.h
deformableAttentionAggrPlugin.cpp
deformableAttentionAggrPlugin.cu
Makefile

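P.S. Compared with the template there is one extra file, deformableAttentionAggrPlugin.cu. It holds the core CUDA implementation and has nothing to do with plugin registration; it is the logic of the custom operator DFA. Whether to build with a Makefile or CMakeLists is a matter of habit; there is no particular requirement.

Start with the first file. A header usually defines interfaces, and registering a custom plugin requires three parts:

part1: register the plugin name DeformableAttentionAggrPlugin, which must match the ONNX node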

static const char* PLUGIN_NAME{"DeformableAttentionAggrPlugin"};
static const char* PLUGIN_VERSION{"1"};

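part2: define the plugin class DeformableAttentionAggrPlugin. It contains the core function enqueue; apart from this one, which we must implement ourselves, the other functions mostly follow the template (simple, right?). The DeformableAttentionAggrPlugin class itself falls into three groups: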

/// @brief PART1: Custom Plugin Class: DeformableAttentionAggrPlugin -> nvinfer1::IPluginV2DynamicExt Methods
    /*
     * clone()
     * getOutputDimensions()
     * supportsFormatCombination()
     * configurePlugin()
     * getWorkspaceSize()
     * enqueue() /// the core method
     * attachToContext()
     * detachFromContext()
     */

/// @brief PART2: Custom Plugin Class: DeformableAttentionAggrPlugin -> nvinfer1::IPluginV2Ext Methods
    /*
     * getOutputDataType()
     */

/// @brief PART3: Custom Plugin Class: DeformableAttentionAggrPlugin -> nvinfer1::IPluginV2 Methods
    /*
     * getPluginType()
     * getPluginVersion()
     * getNbOutputs()
     * initialize()
     * getSerializationSize()
     * serialize()
     * destroy()
     * terminate()
     * setPluginNamespace()
     * getPluginNamespace()
     */

part3: define the plugin creator class DeformableAttentionAggrPluginCreator. Nothing special here: reuse the template and rename the variables:

/// @brief Second define a PluginCreator Class: DeformableAttentionAggrPluginCreator -> IPluginCreator
/*
 * DeformableAttentionAggrPluginCreator()
 * ~DeformableAttentionAggrPluginCreator()
 * getPluginName()
 * getPluginVersion()
 * getFieldNames()
 * createPlugin()
 * deserializePlugin()
 * setPluginNamespace()
 * getPluginNamespace()
 */

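With the header defined, the source file can mostly reuse the batchTilePlugin implementation, updating variable names and usage as needed. The main work is implementing the core function DeformableAttentionAggrPlugin::enqueue(); the parallel CUDA code goes in the *.cu file (entry function: thomas_deform_attn_cuda_forward()). Once this step is done, the job is essentially finished.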

int32_t DeformableAttentionAggrPlugin::enqueue(const nvinfer1::PluginTensorDesc* inputDesc,
                                               const nvinfer1::PluginTensorDesc* outputDesc,
                                               const void* const* inputs,
                                               void* const* outputs,
                                               void* workspace,
                                               cudaStream_t stream) noexcept
{
    int32_t const batch = inputDesc[0].dims.d[0];
    int32_t spatial_size = inputDesc[0].dims.d[1];
    int32_t channels = inputDesc[0].dims.d[2];
    int32_t num_cams = inputDesc[1].dims.d[0];
    int32_t num_levels = inputDesc[1].dims.d[1];
    int32_t num_query = inputDesc[3].dims.d[1];
    int32_t num_point = inputDesc[3].dims.d[2];
    int32_t num_groups = inputDesc[4].dims.d[5];
    int32_t rc = 0;

    const float* value = static_cast<const float*>(inputs[0]);
    const int32_t* spatialShapes = static_cast<const int32_t*>(inputs[1]);
    const int32_t* levelStartIndex = static_cast<const int32_t*>(inputs[2]);
    const float* samplingLoc = static_cast<const float*>(inputs[3]);
    const float* attnWeight = static_cast<const float*>(inputs[4]);

    float* output = static_cast<float*>(outputs[0]);

    rc = thomas_deform_attn_cuda_forward(stream,
                                        value,
                                        spatialShapes,
                                        levelStartIndex,
                                        samplingLoc,
                                        attnWeight,
                                        output,
                                        batch,
                                        spatial_size,
                                        channels,
                                        num_cams,
                                        num_levels,
                                        num_query,
                                        num_point,
                                        num_groups);

    return rc;
}

Step 4: DFA operator inference consistency check, PyTorch vs. TensorRT:

See deploy/dfa_plugin/unit_test/deformable_feature_aggregation_infer-consistency-val_pytorch_vs_trt_unit_test.py for the code. Before building the consistency check we need to load the plugin shared library: import ctypes and call ctypes.cdll.LoadLibrary:

ctypes.cdll.LoadLibrary(soFile)
def getPlugin(plugin_name) -> trt.tensorrt.IPluginV2:
    for i, c in enumerate(trt.get_plugin_registry().plugin_creator_list):
        logger.debug(f"We have plugin{i} : {c.name}")
        if c.name == plugin_name:
            return c.create_plugin(c.name, trt.PluginFieldCollection([]))

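With the shared library in hand, developing the inference path requires familiarity with the TensorRT Python API of the chosen version. The interfaces differ slightly between versions: v8.3 and v8.4 are essentially the same, as are v8.5 and v8.6. Taking v8.4 (old) as an example, here are a few commonly used methods:

Common methods of the ICudaEngine class:

Figure 9: An ICudaEngine instance is the engine object obtained after deserialization

Figure 10: engine.get_binding_dtype returns the dtype of a bound tensor, which matches the ONNX input/output dtype

Figure 11: engine.get_binding_name returns the name of a bound tensor, which matches the ONNX input/output name

Common methods of the IExecutionContext class:

Figure 12: Instantiating IExecutionContext: context = engine.create_execution_context()

Figure 13: get_binding_shape returns the shape of a binding; the execute_v2 interface runs model inference

Core code below: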

def inference(
    feature: np.ndarray,
    spatial_shapes: np.ndarray,
    level_start_index: np.ndarray,
    instance_feature: np.ndarray,
    anchor: np.ndarray,
    time_interval: np.ndarray,
    image_wh: np.ndarray,
    lidar2img: np.ndarray,
    engine: trt.ICudaEngine,
    trt_old: bool,
    logger,
):
    bufferH = []
    bufferH.append(feature)
    bufferH.append(spatial_shapes)
    bufferH.append(level_start_index)
    bufferH.append(instance_feature)
    bufferH.append(anchor)
    bufferH.append(time_interval)
    bufferH.append(image_wh)
    bufferH.append(lidar2img)

    if trt_old:
        # Older TensorRT versions: binding-based API.
        nIO = engine.num_bindings
        lTensorName = [engine.get_binding_name(i) for i in range(nIO)]
        nInput = sum([engine.binding_is_input(lTensorName[i]) for i in range(nIO)])

        # The execution context is needed later by execute_v2.
        context = engine.create_execution_context()
        for i in range(nInput, nIO):
            bufferH.append(
                np.zeros(
                    engine.get_binding_shape(lTensorName[i]),
                    dtype=trt.nptype(engine.get_binding_dtype(lTensorName[i])),
                )
            )

        for i in range(nInput):
            logger.debug(
                f"LoadEngine: Input{i}={lTensorName[i]}:\tshape:{engine.get_binding_shape(lTensorName[i])}\ttype:{str(trt.nptype(engine.get_binding_dtype(lTensorName[i])))} ."
            )
        for i in range(nInput, nIO):
            logger.debug(
                f"LoadEngine: Output{i}={lTensorName[i]}:\tshape:{engine.get_binding_shape(lTensorName[i])}\ttype:{str(trt.nptype(engine.get_binding_dtype(lTensorName[i])))} ."
            )

    else:
        nIO = engine.num_io_tensors
        lTensorName = [engine.get_tensor_name(i) for i in range(nIO)]
        nInput = [engine.get_tensor_mode(lTensorName[i]) for i in range(nIO)].count(
            trt.TensorIOMode.INPUT
        )

        context = engine.create_execution_context()
        for i in range(nInput, nIO):
            bufferH.append(
                np.zeros(
                    context.get_tensor_shape(lTensorName[i]),
                    dtype=trt.nptype(engine.get_tensor_dtype(lTensorName[i])),
                )
            )

        for i in range(nInput):
            logger.debug(
                f"LoadEngine: BindingInput{i}={lTensorName[i]} :\tshape:{context.get_tensor_shape(lTensorName[i])},\ttype:{str(trt.nptype(engine.get_tensor_dtype(lTensorName[i])))}"
            )
        for i in range(nInput, nIO):
            logger.debug(
                f"LoadEngine: BindingOutput{i}={lTensorName[i]}:\tshape:{context.get_tensor_shape(lTensorName[i])},\ttype:{str(trt.nptype(engine.get_tensor_dtype(lTensorName[i])))}"
            )

    bufferD = []
    for i in range(nIO):
        bufferD.append(cudart.cudaMalloc(bufferH[i].nbytes)[1])

    for i in range(nInput):
        cudart.cudaMemcpy(
            bufferD[i],
            bufferH[i].ctypes.data,
            bufferH[i].nbytes,
            cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,
        )

    if trt_old:
        binding_addrs = [int(bufferD[i]) for i in range(nIO)]
        context.execute_v2(binding_addrs)
    else:
        for i in range(nIO):
            context.set_tensor_address(lTensorName[i], int(bufferD[i]))
        # execute_async_v3 belongs to the newer tensor-address API only; 0 is the default CUDA stream.
        context.execute_async_v3(0)

    for i in range(nInput, nIO):
        cudart.cudaMemcpy(
            bufferH[i].ctypes.data,
            bufferD[i],
            bufferH[i].nbytes,
            cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost,
        )

    for b in bufferD:
        cudart.cudaFree(b)

    return nInput, nIO, bufferH

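Two metrics are commonly used:

First: subtract element-wise and take the largest absolute value, max(abs()). The figure below shows an error of only about one in ten thousand (1e-4);
Second: compute cosine_distance, the cosine distance between the high-dimensional vectors. Here the error is on the order of 1e-7;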
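
Both metrics are easy to write down. The sketch below uses plain Python lists so it has no dependencies; the function names are mine, and in the real unit test the inputs would be the flattened PyTorch and TensorRT output tensors.

```python
import math


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


reference = [1.0, 2.0, 3.0]        # stand-in for the PyTorch output
candidate = [1.0, 2.0, 3.0001]     # stand-in for the TensorRT output
print(max_abs_error(reference, candidate), cosine_distance(reference, candidate))
```

The two numbers answer different questions: max(abs()) catches a single badly wrong element, while the cosine distance tells you whether the output as a whole still points in the same direction.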

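Figure 14: DeformableAttentionAggrPlugin: PyTorch vs. TensorRT API inference consistency results

In summary, the DFA operator passes the PyTorch vs. TensorRT Python API consistency check.

Step 5: use the trtexec option --plugins to load the custom operator plugin DeformableAttentionAggrPlugin and convert the model to a TensorRT engine that includes it.

Figure 15: --plugins can load several externally built plugin libraries at once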

trtexec --onnx=head1.onnx \
    --plugins=deploy/dfa_plugin/lib/deformableAttentionAggr.so \
    --memPoolSize=workspace:2048 \
    --saveEngine=head1.engine \
    --verbose \
    --warmUp=200 \
    --iterations=50 \
    --dumpOutput \
    --dumpProfile \
    --dumpLayerInfo \
    --exportOutput=${ENVTRTDIR}/buildOutput_head1.json \
    --exportProfile=${ENVTRTDIR}/buildProfile_head1.json \
    --exportLayerInfo=${ENVTRTDIR}/buildLayerInfo_head1.json \
    --profilingVerbosity=detailed

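Figure 16: Success flag of the TensorRT trtexec model conversion

II) Converting sparseTransformerHead to ONNX

Above we solved the thorny part of the head: mapping the custom operator DFA to an ONNX node and registering it as a TensorRT plugin. Even so, converting the head directly to ONNX still runs into several problems, mostly to do with understanding the data flow of the temporal head.

The conclusion first: the head must be converted to ONNX in two parts, the head for the first frame and the head for the second and subsequent frames.

The first-frame head takes 8 input tensors: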

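feature : image features encoded by the backbone;
spatial_shapes : spatial size (w/h) of each feature map output by the backbone;
level_start_index : start index of each feature map after the backbone outputs are flattened;
instance_feature : instance features of the current frame;
anchor : the 900 query anchors of the current frame: x,y,z,w,l,h,sin_yaw,cos_yaw,vx,vy,vz;
time_interval : time interval, 0.5 s by default;
image_wh : w/h of the backbone input image;
lidar2img : projection matrix from lidar to image;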

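The first-frame head has 4 output tensors, as shown below:

Figure 17: ONNX input/output tensor names of the Sparse4D head for the first frame

The head for the second and subsequent frames takes 12 input tensors. The 4 additional ones are:

temp_instance_feature : instance features from the previous frame at time t-1;
temp_anchor : the 600 query anchors cached by instance_bank at time t-1: x,y,z,w,l,h,sin_yaw,cos_yaw,vx,vy,vz;
mask : bool; tells whether the gap between time t and time t-1 exceeds the maximum interval of 2 s (in the code below the cached instances are used where mask is true);
track_id : track_id at time t-1 cached by instance_bank;

P.S. track_id and mask could in fact be stripped out and not passed as model inputs. To stay as close as possible to the original code logic, this article keeps them in the model.

The head for the second and subsequent frames has 5 output tensors, as shown below:

Figure 18: ONNX input/output tensor names of the Sparse4D head for the second and subsequent frames

Q: Why must sparseTransformerHead be split into two parts, and on what basis?

The Sparse4Dv3 head contains not only the model but also a tightly coupled InstanceBank. The main functions of InstanceBank are:

Before the head starts inference, instance_bank.get() produces five variables: instance_feature, anchor, time_interval, temp_instance_feature, and temp_anchor;
These variables are inputs to the head's 39 operators. The head's refine op takes the predicted instance_feature, anchor, and confidence and fuses them with the instance_feature and anchor of the past frame at t-1 maintained inside instance_bank, updating the instance_feature and anchor predicted for the current frame;
After the 39 ops have run, the head calls instance_bank.cache() to cache the instance_feature, anchor, and confidence predicted for the current frame at time t;

TensorRT forward inference works on a static graph: it needs well-defined input and output tensors, and the Python if branches inside the head cannot be carried into it. It follows that the InstanceBank module must be decoupled. We split the head along the time dimension into two parts: the first-frame head with 8 input tensors, and the head for the second and subsequent frames with 12 input tensors.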
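
One way to picture the split: the "is there history yet?" decision moves out of the graph and into host code, which then picks one of two engines. The sketch below is hypothetical glue code, not taken from the repository. The engine keys `head_first_frame` and `head_temporal` and the class name are made up; the tensor names and their order follow the lists above and the export function signatures shown below.

```python
FIRST_FRAME_INPUTS = [
    "feature", "spatial_shapes", "level_start_index", "instance_feature",
    "anchor", "time_interval", "image_wh", "lidar2img",
]

TEMPORAL_FRAME_INPUTS = [
    "feature", "spatial_shapes", "level_start_index", "instance_feature",
    "anchor", "time_interval", "temp_instance_feature", "temp_anchor",
    "mask", "track_id", "image_wh", "lidar2img",
]


class HeadDispatcher:
    """Chooses the engine on the host, so neither exported graph needs a branch."""

    def __init__(self):
        self.has_history = False

    def next_frame(self):
        """Return (engine_key, input_names) for the frame about to be processed."""
        if self.has_history:
            selection = ("head_temporal", TEMPORAL_FRAME_INPUTS)
        else:
            selection = ("head_first_frame", FIRST_FRAME_INPUTS)
        self.has_history = True  # after any frame, cached instances exist
        return selection

    def reset(self):
        """Call when the sequence restarts and the cached instances are discarded."""
        self.has_history = False


dispatcher = HeadDispatcher()
for _ in range(3):
    key, names = dispatcher.next_frame()
    print(key, len(names))
```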

Core code for exporting the first-frame head to ONNX: see deploy/export_head_onnx.py

@staticmethod
def head_forward(
    self,
    feature,
    spatial_shapes,
    level_start_index,
    instance_feature,
    anchor,
    time_interval,
    image_wh,
    lidar2img,
):

    # Instance bank get inputs
    temp_instance_feature = None
    temp_anchor_embed = None

    # DFA inputs
    metas = {
        "image_wh": image_wh,
        "lidar2img": lidar2img,
    }

    anchor_embed = self.anchor_encoder(anchor)

    feature_maps = [feature, spatial_shapes, level_start_index]
    prediction = []
    for i, op in enumerate(self.operation_order):
        print("i: ", i, "\top: ", op)
        if self.layers[i] is None:
            continue
        elif op == "temp_gnn":
            instance_feature = self.graph_model(
                i,
                instance_feature,
                temp_instance_feature,
                temp_instance_feature,
                query_pos=anchor_embed,
                key_pos=temp_anchor_embed,
            )
        elif op == "gnn":
            instance_feature = self.graph_model(
                i,
                instance_feature,
                value=instance_feature,
                query_pos=anchor_embed,
            )
        elif op == "norm" or op == "ffn":
            instance_feature = self.layers[i](instance_feature)
        elif op == "deformable":
            instance_feature = self.layers[i](
                instance_feature,
                anchor,
                anchor_embed,
                feature_maps,
                metas,
            )
        elif op == "refine":
            anchor, cls, qt = self.layers[i](
                instance_feature,
                anchor,
                anchor_embed,
                time_interval=time_interval,
                return_cls=(
                    len(prediction) == self.num_single_frame_decoder - 1
                    or i == len(self.operation_order) - 1
                ),
            )
            prediction.append(anchor)
            if i != len(self.operation_order) - 1:
                anchor_embed = self.anchor_encoder(anchor)
    return instance_feature, anchor, cls, qt

Core code for exporting the head for the second and subsequent frames to ONNX: see deploy/export_head_onnx.py

    @staticmethod
    def head_forward(
        self,
        feature,
        spatial_shapes,
        level_start_index,
        instance_feature,
        anchor,
        time_interval,
        temp_instance_feature,
        temp_anchor,
        mask,
        track_id,
        image_wh,
        lidar2img,
    ):
        mask = mask.bool()  # TensorRT binding type for bool input is NoneType.
        anchor_embed = self.anchor_encoder(anchor)
        temp_anchor_embed = self.anchor_encoder(temp_anchor)

        # DFA inputs
        metas = {
            "lidar2img": lidar2img,
            "image_wh": image_wh,
        }

        feature_maps = [feature, spatial_shapes, level_start_index]
        prediction = []
        for i, op in enumerate(self.operation_order):
            print("op:  ", op)
            if self.layers[i] is None:
                continue
            elif op == "temp_gnn":
                instance_feature = self.graph_model(
                    i,
                    instance_feature,
                    temp_instance_feature,
                    temp_instance_feature,
                    query_pos=anchor_embed,
                    key_pos=temp_anchor_embed,
                )
            elif op == "gnn":
                instance_feature = self.graph_model(
                    i,
                    instance_feature,
                    value=instance_feature,
                    query_pos=anchor_embed,
                )
            elif op == "norm" or op == "ffn":
                instance_feature = self.layers[i](instance_feature)
            elif op == "deformable":
                instance_feature = self.layers[i](
                    instance_feature,
                    anchor,
                    anchor_embed,
                    feature_maps,
                    metas,
                )
            elif op == "refine":
                anchor, cls, qt = self.layers[i](
                    instance_feature,
                    anchor,
                    anchor_embed,
                    time_interval=time_interval,
                    return_cls=(
                        len(prediction) == self.num_single_frame_decoder - 1
                        or i == len(self.operation_order) - 1
                    ),
                )
                prediction.append(anchor)

                # update in head refine
                if len(prediction) == self.num_single_frame_decoder:
                    N = (
                        self.instance_bank.num_anchor
                        - self.instance_bank.num_temp_instances
                    )
                    cls = cls.max(dim=-1).values
                    _, (selected_feature, selected_anchor) = topk(
                        cls, N, instance_feature, anchor
                    )
                    selected_feature = torch.cat(
                        [temp_instance_feature, selected_feature], dim=1
                    )
                    selected_anchor = torch.cat([temp_anchor, selected_anchor], dim=1)
                    instance_feature = torch.where(
                        mask[:, None, None], selected_feature, instance_feature
                    )
                    anchor = torch.where(mask[:, None, None], selected_anchor, anchor)
                    track_id = torch.where(
                        mask[:, None],
                        track_id,
                        track_id.new_tensor(-1),
                    )

                if i != len(self.operation_order) - 1:
                    anchor_embed = self.anchor_encoder(anchor)
                if len(prediction) > self.num_single_frame_decoder:
                    temp_anchor_embed = anchor_embed[
                        :, : self.instance_bank.num_temp_instances
                    ]
        return instance_feature, anchor, cls, qt, track_id
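
The "update in head refine" step is the densest part of this function, so here is a NumPy sketch of the same data flow on toy shapes: keep the N most confident current instances, prepend the cached temporal ones, and let mask decide per batch element whether the merged set replaces the current one. `np.argsort` stands in for the repository's `topk` helper, and the function name and shapes are my own choices for illustration.

```python
import numpy as np


def update_instances(confidence, instance_feature, anchor, temp_feature, temp_anchor, mask):
    """confidence: (bs, A); instance_feature: (bs, A, C); anchor: (bs, A, D);
    temp_feature: (bs, T, C); temp_anchor: (bs, T, D); mask: (bs,) bool."""
    num_anchor = confidence.shape[1]
    n_keep = num_anchor - temp_feature.shape[1]
    # Indices of the n_keep most confident current instances (stand-in for topk).
    top = np.argsort(-confidence, axis=1)[:, :n_keep]
    sel_feature = np.take_along_axis(instance_feature, top[..., None], axis=1)
    sel_anchor = np.take_along_axis(anchor, top[..., None], axis=1)
    # Cached temporal instances go first, selected current instances after them.
    merged_feature = np.concatenate([temp_feature, sel_feature], axis=1)
    merged_anchor = np.concatenate([temp_anchor, sel_anchor], axis=1)
    m = mask[:, None, None]
    return (
        np.where(m, merged_feature, instance_feature),
        np.where(m, merged_anchor, anchor),
    )


bs, A, T, C, D = 2, 4, 1, 2, 3
conf = np.array([[0.1, 0.9, 0.5, 0.3], [0.4, 0.2, 0.8, 0.6]])
feat = np.arange(bs * A * C, dtype=np.float64).reshape(bs, A, C)
anc = np.arange(bs * A * D, dtype=np.float64).reshape(bs, A, D)
new_feat, new_anc = update_instances(
    conf, feat, anc, -np.ones((bs, T, C)), -np.ones((bs, T, D)), np.array([True, False])
)
print(new_feat[0].tolist())
```

Note that the output always has the same shape as the input, whatever mask contains. That is what lets torch.where replace a Python if in the exported graph.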

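After splitting the head into two models and converting each to ONNX, we ran into some odd problems, like the ones below:

Figure 19: RuntimeError: r INTERNAL ASSERT FAILED, ATen Err

Figure 20: Failed: Squeeze Op Err

Figure 21: UserWarning: The shape inference of prim::Constant type is missing

The conclusions first:

The ONNX export does not support index-list operations such as output[..., [X, Y, Z, W, L, H]]; rewrite them as a Slice, output[..., :6];
The ONNX export does not support if statements. data.squeeze() in PyTorch often makes the exporter emit an If node, so replace data.squeeze(-1) with an indexing operation such as data[..., 0];
ONNX itself supports bool input tensors, but TensorRT does not, and saving bin files for the consistency check also runs into trouble. To avoid converting back and forth between bool and int, I recommend feeding an int tensor and converting it to bool inside the model;
Constant tensors created on the fly in PyTorch usually only cause a warning, which can generally be ignored. But if we want ONNX to trace that data flow, losing it is fatal; the fix is to pass the tensor as one of the ONNX inputs;
ONNX inputs and outputs must be torch.Tensor; types such as str, numpy.ndarray, or dict make the conversion fail outright;

1) On the first point: for the PyTorch ops and types ONNX supports, see the official repositories onnx-tensorrt and onnx. Generally supported operations include data[..., a:b], data[..., a], data[..., a:], data[..., :b], and data[:,:, a:b]; an operation like data[..., []] is generally not supported.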
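
Before rewriting an index list as a slice, it is worth checking that the two really select the same elements. The helper below is a hypothetical convenience, not part of the repository; the index constants are assumed to be the consecutive integers 0..10 in the order implied by the slices used later in this article (X:W, W:SIN_YAW, SIN_YAW:VX).

```python
# Assumed layout of one anchor vector, inferred from the slices used in the fix.
X, Y, Z, W, L, H, SIN_YAW, COS_YAW, VX, VY, VZ = range(11)


def as_slice(indices):
    """Return the slice equivalent to a contiguous ascending index list, else None."""
    indices = list(indices)
    if indices and indices == list(range(indices[0], indices[0] + len(indices))):
        return slice(indices[0], indices[0] + len(indices))
    return None


box = [float(i) for i in range(11)]  # one toy anchor
for index_list in ([X, Y, Z], [W, L, H], [SIN_YAW, COS_YAW]):
    s = as_slice(index_list)
    # The slice picks exactly the elements the index list picked.
    assert [box[i] for i in index_list] == box[s]
    print(index_list, "->", s)

# A non-contiguous list has no slice equivalent and needs a different rewrite.
print(as_slice([X, Z]))
```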

Figure 22: Data types supported by Slice

Figure 23: ONNX versions that support Slice
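
The slice rewrite from point 1 is only valid when the selected indices form a contiguous ascending run. Below is a minimal sketch of that check in plain Python. The helper as_contiguous_slice is hypothetical (it is not part of the repository), and the index layout X..VZ = 0..10 is an assumption based on the box encoding used in the code quoted in this article.

```python
from typing import Optional, Sequence

# Assumed box-state layout (not shown explicitly in this article).
X, Y, Z, W, L, H, SIN_YAW, COS_YAW, VX, VY, VZ = range(11)


def as_contiguous_slice(indices: Sequence[int]) -> Optional[slice]:
    """Return slice(start, stop) if `indices` is an ascending step-1 run, else None."""
    if not indices:
        return None
    start = indices[0]
    if list(indices) != list(range(start, start + len(indices))):
        return None
    return slice(start, start + len(indices))


# A plain list stands in for the last axis of an anchor tensor.
box = list(range(100, 111))

groups = {"pos": [X, Y, Z], "size": [W, L, H], "yaw": [SIN_YAW, COS_YAW]}
for name, idx in groups.items():
    s = as_contiguous_slice(idx)
    assert s is not None
    # The slice selects exactly the same elements as the index list.
    assert box[s] == [box[i] for i in idx]
    print(name, idx, "->", f"[{s.start}:{s.stop}]")

# A non-contiguous selection has no slice equivalent and needs a different rewrite.
print(as_contiguous_slice([X, Z, H]))
```

If the helper returns None for an index list, that selection cannot be expressed as a single slice and has to be restructured some other way, for example by concatenating several slices.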

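2) On the second point: ONNX export is trace-based. It follows the PyTorch data flow and records every operation that is executed, and each high-level function is ultimately composed of basic ATen operations. If ATen does not support an operation, ONNX cannot convert it either. In addition, when an if appears, ONNX can trace only one of its branches, so information is lost logically. In this article, data.squeeze() is decomposed by ONNX into Equal and If nodes, and because If is not an operation ATen supports, the ONNX conversion fails.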

Figure 24: ONNX conversion fails because an If node appears

Figure 25: ONNX conversion succeeds once the If node is eliminated

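3) On the third point: mask, one of the input tensors of the Head model for the second and subsequent frames, would normally be a bool tensor in PyTorch. To make the TensorRT conversion easier, and to simplify saving binary bin files for inference consistency validation, we use int instead: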

# Dummy input tensor for the ONNX export
dummy_mask = torch.randint(0, 2, size=(bs,)).int().cuda()
# Format used when saving the bin file
mask = self._mask.int().detach().cpu().numpy()
mask.tofile("mask.bin")
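
The int-mask round trip can be checked offline with NumPy alone. The sketch below is an illustration under assumptions: the mask values and the temporary file location are made up, and int32 is assumed as the on-disk dtype (matching torch's .int()).

```python
import os
import tempfile

import numpy as np

# Hypothetical per-sample mask; in the article it comes from the model state.
mask_bool = np.array([True, False, True, True])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mask.bin")
    # Store as int32 so the raw bytes match the int input expected by the engine.
    mask_bool.astype(np.int32).tofile(path)
    # tofile() writes raw bytes only: dtype and shape must be supplied when loading.
    restored = np.fromfile(path, dtype=np.int32).reshape(mask_bool.shape)

# This is the cast the model performs internally on its int input.
restored_bool = restored.astype(bool)

assert restored.nbytes == 4 * mask_bool.size
assert np.array_equal(restored_bool, mask_bool)
print(restored, restored_bool)
```

Because the bin file carries no dtype or shape metadata, reading it back with the wrong dtype silently produces a tensor of the wrong length, which is worth ruling out before blaming the engine for a mismatch.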

Finally, the problematic code is collected below for reference:

### Head forward: the following three parts cause ONNX export errors:

### Part1: SparseBox3DEncoder:
anchor_embed = self.anchor_encoder(anchor)
# Relevant implementation inside the function:
def forward(self, box_3d: torch.Tensor):
    pos_feat = self.pos_fc(box_3d[..., [X, Y, Z]])
    size_feat = self.size_fc(box_3d[..., [W, L, H]])
    yaw_feat = self.yaw_fc(box_3d[..., [SIN_YAW, COS_YAW]])


### Part2: SparseBox3DKeyPointsGenerator:
elif op == "deformable":
    instance_feature = self.layers[i](
        instance_feature,
        anchor,
        anchor_embed,
        feature_maps,
        metas)
# Relevant implementation inside the function:
key_points = key_points + anchor[..., None, [X, Y, Z]]

### Part3: SparseBox3DRefinementModule:
elif op == "refine":
    anchor, cls, qt = self.layers[i](
        instance_feature,
        anchor,
        anchor_embed,
        time_interval=time_interval,
        return_cls=(
            self.training
            or len(prediction) == self.num_single_frame_decoder - 1
            or i == len(self.operation_order) - 1
        ),
    )
# Relevant implementation inside the function:
self.refine_state = [X, Y, Z, W, L, H]
if self.refine_yaw:
    self.refine_state += [SIN_YAW, COS_YAW]
output[..., self.refine_state] = (
    output[..., self.refine_state] + anchor[..., self.refine_state])

The fixes are as follows:

### Part1: SparseBox3DEncoder, modified as follows:
def forward(self, box_3d: torch.Tensor):
    pos_feat = self.pos_fc(box_3d[..., X:W])
    size_feat = self.size_fc(box_3d[..., W:SIN_YAW])
    yaw_feat = self.yaw_fc(box_3d[..., SIN_YAW:VX])

### Part2: SparseBox3DKeyPointsGenerator, modified as follows:
key_points = key_points + anchor[..., None, :3]  # deploy friendly

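### Part3: SparseBox3DRefinementModule, modified as follows:
# refine_state is [X, Y, Z, W, L, H], plus [SIN_YAW, COS_YAW] when refine_yaw is set
if self.refine_yaw:
    output[..., :VX] = output[..., :VX] + anchor[..., :VX]
else:
    output[..., :SIN_YAW] = output[..., :SIN_YAW] + anchor[..., :SIN_YAW]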

3) Sparse Transformer Head: PyTorch vs. TensorRT vs. Polygraphy inference consistency validation

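First, we build an end-to-end inference script for the first-frame Head with the DeformableAttentionAggrPlugin plugin; see deploy/unit_test/sparse4d_head_first_frame_infer-consistency-val_pytorch_vs_trt_unit_test.py. The figures below show that the PyTorch vs. TensorRT errors, measured by max(abs()) and cosine_distance, are very small and within expectations (on the order of 1e-3). The results confirm once more that the model conversion logic and the plugin registration logic described above are correct.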

Figure 26: Sparse transformer head, first frame: PyTorch inference results

Figure 27: Sparse transformer head, first frame: TensorRT inference results

Figure 28: Sparse transformer head, first frame: Polygraphy inference results
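
The two error metrics quoted above can be computed as in the sketch below. This is an assumed formulation (cosine distance taken as 1 minus the cosine similarity of the flattened tensors) and not necessarily the exact code in the unit test; the function names and the random test data are hypothetical.

```python
import numpy as np


def max_abs_error(a: np.ndarray, b: np.ndarray) -> float:
    """Largest element-wise absolute difference between two tensors."""
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.max(np.abs(diff)))


def cosine_distance(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """1 - cosine similarity of the flattened tensors (0 means same direction)."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(1.0 - cos)


# Stand-ins for a PyTorch output and the matching TensorRT output.
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 8)).astype(np.float32)
candidate = reference + 1e-4 * rng.standard_normal((2, 8)).astype(np.float32)

print("max(abs()):", max_abs_error(reference, candidate))
print("cosine_distance:", cosine_distance(reference, candidate))
```

The two metrics complement each other: max(abs()) catches a single badly wrong element, while cosine distance shows whether the output as a whole still points in the same direction.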

Finally, we use Polygraphy to quickly check that the inference pipeline built on the TensorRT Python API is reliable:

polygraphy run deploy/onnx/sparse4dhead1st.onnx --trt --verbose --load-inputs=deploy/utils/first_frame_head_inputs.json --trt-outputs mark all --save-results=deploy/utils/first_frame_head_outputs.json  --plugins deploy/dfa_plugin/lib/deformableAttentionAggr.so

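As the figure above shows, the metrics we care about (Max, Min, SumAbs, and the first five and last five elements, first5 and last5) show no visible difference and match the expected results. Good! That completes deployment of the first-frame head. Deploying the head for the second and subsequent frames follows similar logic, so it is not covered here.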

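Attention: the Polygraphy input JSON file deploy/utils/first_frame_head_inputs.json and the generated result JSON file first_frame_head_outputs.json must both be saved and parsed with the function interfaces that Polygraphy provides. For how to call them, see my script: SparseEnd2End/deploy/utils/polygraphy_save_json_parser.py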

Core code below:

# save json
save_json(
    [
        {
            "feature": feature,
            "spatial_shapes": spatial_shapes,
            "level_start_index": level_start_index,
            "instance_feature": instance_feature,
            "anchor": anchor,
            "time_interval": time_interval,
            "image_wh": image_wh,
            "lidar2img": lidar2img,
        },
    ],
    input_json_path,
)

# parse json
onnx_outputs, trt_outputs = info_onnx["lst"][0][1][0], info_trt["lst"][0][1][0]
onnx_layers_outputs, trt_layers_outputs = (
    onnx_outputs["outputs"],
    trt_outputs["outputs"],
)
trouble_layers, ok_layers = [], []
for layer, value in onnx_layers_outputs.items():
    if layer in trt_layers_outputs.keys():
        onnx_out = pjson.from_json(json.dumps(value)).arr
        # decode the TensorRT value for this layer, not the ONNX value again
        trt_out = pjson.from_json(json.dumps(trt_layers_outputs[layer])).arr
        print(np.size(onnx_out), np.size(trt_out), layer)

Figure 29: Usage of polygraphy.json.from_json()/save_json()
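
The trouble_layers and ok_layers lists in the snippet above are initialized but never filled in the excerpt. One possible way to populate them is sketched below, assuming the per-layer outputs have already been decoded into NumPy arrays. The helper name split_layers, the tolerances, and the sample data are hypothetical.

```python
import numpy as np


def split_layers(onnx_layers, trt_layers, atol=1e-3, rtol=1e-3):
    """Sort the layers present in both results into matching and mismatching."""
    ok_layers, trouble_layers = [], []
    for name, onnx_out in onnx_layers.items():
        if name not in trt_layers:
            continue  # layer fused away or renamed by TensorRT: nothing to compare
        trt_out = trt_layers[name]
        same_shape = onnx_out.shape == trt_out.shape
        if same_shape and np.allclose(onnx_out, trt_out, atol=atol, rtol=rtol):
            ok_layers.append(name)
        else:
            trouble_layers.append(name)
    return ok_layers, trouble_layers


# Made-up layer outputs for illustration only.
onnx_layers = {
    "feature": np.ones((2, 3), dtype=np.float32),
    "anchor": np.zeros((2, 11), dtype=np.float32),
    "onnx_only": np.ones(4, dtype=np.float32),
}
trt_layers = {
    "feature": np.ones((2, 3), dtype=np.float32) + 1e-5,
    "anchor": np.ones((2, 11), dtype=np.float32),
}

ok_layers, trouble_layers = split_layers(onnx_layers, trt_layers)
print("ok:", ok_layers)
print("trouble:", trouble_layers)
```

Walking the layers in graph order and looking at the first entry in trouble_layers is usually the quickest way to find where the two backends start to diverge.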

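Finally, a note on the Polygraphy command-line interface; see the official documentation, Polygraphy Docs. Commonly used options:

--validate : quickly check whether the inference results of the ONNX-to-engine conversion contain NaN or Inf
--trt : run inference with the TensorRT backend
--onnxrt : run inference with the ONNX Runtime backend
--plugins : load plugin shared libraries
--load-inputs : run inference on the specified input data
--atol : absolute error tolerance
--rtol : relative error tolerance
--fp16 : use fp16 precision
--save-engine : where to save the engine; yes, you read that right, this command can also convert the model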
Figure 30: Polygraphy Docs explanation of --validate

SparseImgBackbone Deployment Pipeline

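Deploying imgBackbone is routine and has no particular pitfalls: the Sparse4Dv3 imgBackbone is ResNet50 + FPN, which everyone knows well. I tested three random samples; the validation results are shown directly in the figures below: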

Figure 31: PyTorch vs. ONNX Runtime inference consistency validation results

Figure 32: PyTorch vs. TensorRT API inference consistency validation results

Figure 33: imgBackbone passes Polygraphy validation

imgBackbone inference used the ONNX Runtime backend; the ONNX Runtime call is as follows:

import onnxruntime as ort


def onnx_infer(
    onnx_model,
    dummy_img,
):
    # Build a session from the serialized model and run it on one input
    session = ort.InferenceSession(onnx_model.SerializeToString())
    ort_inputs = {session.get_inputs()[0].name: dummy_img}
    ort_outs = session.run(["feature"], ort_inputs)
    return ort_outs[0]

Summary

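This article focuses on using ONNX | ONNX Runtime | TensorRT | Polygraphy in model deployment, and summarizes common methods, practical commands, and fixes for real pitfalls hit along the way. It ties together each stage of the deployment process and validates, as a closed loop, the inference consistency of the deployed models. My expertise is limited; if I have misunderstood anything, corrections are welcome.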

Portal

ONNX operators and their version support:

github.com/onnx/onnx/blob/main/docs/Operators.md

Supported ONNX operator types (docs/operators.md in onnx-tensorrt):

github.com/onnx/onnx-tensorrt/tree/10.4-GA

TensorRT docs, covering the TensorRT Python API, the TensorRT C++ API, and Polygraphy usage:

docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#overview

ONNX Runtime API: https://onnxruntime.ai/docs/api/

Polygraphy docs: Polygraphy 0.49.13 documentation

TensorRT Python API docs: https://developer.nvidia.com/docs/drive/drive-os/6.0.6/public/drive-os-tensorrt/api-reference/docs/python/index.html

Q&A

Below are some questions and answers about algorithm details that I have collected; I hope they are useful.

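Q: Hi, I see the code adds a decay parameter so that temporal anchors are propagated to the next frame as far as possible. But what guarantees that they keep propagating in the next frame, rather than the output coming from that frame's top 300 anchors? The non-temporal part also produces all detections as usual, and after top-k it contains essentially every detection of the current frame. How does temp_gnn ensure that the final output comes from the 600 temporal anchors rather than from the current top-k 300 anchors? My understanding is that cross attention with the temporal anchors removes them implicitly, but that part has no supervision, so how does it work?
A: This is an optimization question. Temporal anchors are of much higher quality than current anchors, so during training a temporal instance is far more likely to be matched than a current instance, which lets it track stably. Temporal features and current features also differ markedly, so the network can easily tell which instances were passed in from the temporal side. Cross attention provides some temporal information, but it is not the main reason.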

Q: I would like to extract BEV lanes from images, and I was thinking about including a 2D lane detector like YOLO-pv2 along with your model. I was wondering whether you have some transformation matrices that can be exploited to extract 3D/BEV information from lane segmentation on the image. I would like to detect lanes on the image, and then project them into BEV/3D space if possible?
A: To project from 2D to 3D, you need to know the pixel depth, and then use the formula T_cam2lidar * T_img2cam * [u * d, v * d, d], where [u, v] are pixel coordinates, T_img2cam is the inverse of the intrinsic matrix, and d is the depth. By the way, Sparse4D uses the inverse projection sampling method: it first sets 3D points and then projects them onto 2D to sample features, rather than projecting 2D onto 3D directly.
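
The back-projection formula in the answer above can be sketched with NumPy as follows. The intrinsic matrix K and the extrinsic T_cam2lidar are made-up values for illustration, and the helper names are hypothetical; this is not code from Sparse4D.

```python
import numpy as np


def pixel_to_lidar(u, v, d, K, T_cam2lidar):
    """Back-project pixel (u, v) with depth d into the lidar frame."""
    T_img2cam = np.linalg.inv(K)  # inverse of the 3x3 intrinsic matrix
    p_cam = T_img2cam @ np.array([u * d, v * d, d], dtype=np.float64)
    p_lidar_h = T_cam2lidar @ np.append(p_cam, 1.0)  # homogeneous 4-vector
    return p_lidar_h[:3]


def lidar_to_pixel(p_lidar, K, T_cam2lidar):
    """Forward projection (3D -> 2D), the direction used for feature sampling."""
    p_cam = (np.linalg.inv(T_cam2lidar) @ np.append(p_lidar, 1.0))[:3]
    ud, vd, d = K @ p_cam
    return ud / d, vd / d, d


# Assumed pinhole intrinsics: focal length 1000 px, principal point (800, 450).
K = np.array([[1000.0, 0.0, 800.0], [0.0, 1000.0, 450.0], [0.0, 0.0, 1.0]])
# Assumed extrinsics: no rotation, translation only (kept simple on purpose).
T_cam2lidar = np.eye(4)
T_cam2lidar[:3, 3] = [1.0, 0.0, 1.5]

p = pixel_to_lidar(900.0, 500.0, 10.0, K, T_cam2lidar)
u, v, d = lidar_to_pixel(p, K, T_cam2lidar)
# Projecting the recovered 3D point lands on the original pixel and depth.
assert np.allclose([u, v, d], [900.0, 500.0, 10.0])
print(p)
```

The round trip only closes because the depth d is known; without a depth estimate per pixel, a 2D lane mask defines rays rather than 3D points.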

Q: I found you set the instance_feature gradient to False in the code. Is that intentional? If it is, may I know the reason? Is it because you always want the instance features to be initialized to zeros? I also saw a few other places where you set the gradient to False, for instance in the KeyPointGenerator. Just to confirm with you, the reason for this False is that these are fixed-scale key points, right?
A: The input instance feature actually has no physical meaning, and it will not be retained in the computation graph during tracing. The computation graph retains only the key points from the first layer, so we consider the gradient of this instance feature to be unimportant. Disabling the gradient of fix_scale follows from our original design intention to fix the key points, ensuring that these points can always sample features.

References

Algorithm architecture analysis: https://zhuanlan.zhihu.com/p/637096473
Sparse4Dv2: https://arxiv.org/pdf/2305.14018
High Bandwidth Memory: https://en.wikipedia.org/wiki/High_Bandwidth_Memory
PyTorch compute grid sample: https://pytorch.org/docs/stable/generated/torch.nn.functional.grid_sample.html
Pytorch C++ API: https://pytorch.org/cppdocs/
TensorRT OSS: https://github.com/NVIDIA/TensorRT
