Sparse4Dv3 Paper Notes (Ⅰ): Abstract, Introduction, Related Work

Abstract

In autonomous driving perception systems, 3D detection and tracking are the two fundamental tasks. This paper delves deeper into this field, building upon the Sparse4D framework. We introduce two auxiliary training tasks (Temporal Instance Denoising and Quality Estimation) and propose decoupled attention to make structural improvements, leading to significant enhancements in detection performance. Additionally, we extend the detector into a tracker using a straightforward approach that assigns instance IDs during inference, further highlighting the advantages of query-based algorithms. Extensive experiments conducted on the nuScenes benchmark validate the effectiveness of the proposed improvements. With ResNet50 as the backbone, we witnessed enhancements of 3.0%, 2.2%, and 7.6% in mAP, NDS, and AMOTA, achieving 46.9%, 56.1%, and 49.0%, respectively. Our best model achieved 71.9% NDS and 67.7% AMOTA on the nuScenes test set. Code will be released at https://github.com/linxuewu/Sparse4D.


1 Introduction

In the field of temporal multi-view perception research, sparse-based algorithms have seen significant advancements [41, 6, 5, 43, 26, 27], reaching perception performance comparable to dense-BEV-based algorithms [21, 13, 11, 19, 18, 35, 44, 8] while offering several advantages: 1) Free view transform: these sparse methods eliminate the need for converting image space to 3D vector space. 2) Constant computational load in the detection head, independent of perception distance and image resolution. 3) Easier end-to-end integration of downstream tasks. In this study, we select the sparse-based algorithm Sparse4Dv2 [26, 27] as our baseline for implementing improvements. The overall structure of the algorithm is illustrated in Figure 1. The image encoder transforms multi-view images into multi-scale feature maps, while the decoder blocks leverage these image features to refine instances and generate perception outcomes.


To begin with, we observe that sparse-based algorithms encounter greater challenges in convergence compared to dense-based counterparts, ultimately impacting their final performance. This issue has been thoroughly investigated in the realm of 2D detection [17, 48, 53], and is primarily attributed to the use of one-to-one positive sample matching. This matching approach is unstable during the initial stages of training, and also results in a limited number of positive samples compared to one-to-many matching, thus reducing the efficiency of decoder training. Moreover, Sparse4D utilizes sparse feature sampling instead of global cross-attention, which further hampers encoder convergence due to the scarce positive samples. In Sparse4Dv2 [27], dense depth supervision has been introduced to partially mitigate these convergence issues faced by the image encoder. This paper primarily aims at enhancing model performance by improving the stability of decoder training. We incorporate the denoising task as auxiliary supervision and extend denoising techniques from 2D single-frame detection to 3D temporal detection. This not only ensures stable positive sample matching but also significantly increases the number of positive samples. Moreover, we introduce quality estimation as an additional auxiliary supervision task. This renders the output confidence scores more reasonable, refining the accuracy of detection result ranking and resulting in higher evaluation metrics. Furthermore, we enhance the structure of the instance self-attention and temporal cross-attention modules in Sparse4D, introducing a decoupled attention mechanism designed to reduce feature interference during the calculation of attention weights. As depicted in Figure 3, when the anchor embedding and instance feature are added together as the input for attention calculation, outlier values appear in the resulting attention weights. These fail to accurately reflect the inter-correlation among target features, preventing the model from aggregating the correct features. By replacing addition with concatenation, we significantly mitigate this incorrect behavior. This enhancement shares similarities with Conditional DETR [33]. However, the crucial difference lies in our emphasis on attention among queries, as opposed to Conditional DETR, which concentrates on cross-attention between queries and image features. Additionally, our approach involves a distinct encoding methodology.
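To see why the paper swaps addition for concatenation, the NumPy toy below (random features and assumed shapes, not the actual Sparse4D implementation) computes attention weights both ways: summing the instance feature and anchor embedding leaks feature-position cross terms into the dot product, whereas concatenation keeps feature-feature and position-position similarities in separate channel groups.

```python
import numpy as np

def attn_weights(q, k):
    # Scaled dot-product attention weights (softmax over keys).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, C = 4, 8
feat = rng.normal(size=(N, C))   # instance features (illustrative)
pos = rng.normal(size=(N, C))    # anchor (position) embeddings (illustrative)

# "Add" variant: feature and anchor embedding are summed before the dot
# product, so cross terms feat_i . pos_j contribute to the scores and
# can dominate, producing outlier weights.
w_add = attn_weights(feat + pos, feat + pos)

# Decoupled variant: concatenation keeps the feature-feature and
# position-position similarity terms in separate channel groups.
qk = np.concatenate([feat, pos], axis=-1)
w_cat = attn_weights(qk, qk)
```

In both cases each row of the weight matrix is a valid distribution over keys; the difference is only in which similarity terms the scores are built from.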


Finally, to advance the end-to-end capabilities of the perception system, we explore the integration of the 3D multi-object tracking task into the Sparse4D framework, enabling the direct output of object motion trajectories. Unlike tracking-by-detection methods, we eliminate the need for data association and filtering, integrating all tracking functionalities into the detector. Moreover, distinct from existing joint detection and tracking methods, our tracker requires no modification to the training process or loss functions. It does not necessitate providing ground truth IDs, yet achieves predefined instance-to-tracking regression. Our tracking implementation maximally integrates the detector and tracker, requiring no modifications to the training process of the detector and no additional fine-tuning. Our contributions can be summarized as follows:

(1) We propose Sparse4D-v3, a potent 3D perception framework with three effective strategies: temporal instance denoising, quality estimation and decoupled attention.

(2) We extend Sparse4D into an end-to-end tracking model.

(3) We demonstrate the effectiveness of our improvements on nuScenes, achieving state-of-the-art performance in both detection and tracking tasks.
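The inference-time ID assignment described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the class name, field names, and threshold are assumptions. The idea is that temporal instances carried over from the previous frame keep their ID, while a newly detected instance whose confidence clears a threshold receives a fresh ID, so no separate association step is needed.

```python
class InstanceIDAssigner:
    """Illustrative sketch of inference-time track ID assignment:
    temporal instances keep their ID across frames; new high-confidence
    instances get a fresh ID. Threshold value is an assumption."""

    def __init__(self, new_track_threshold=0.35):
        self.threshold = new_track_threshold
        self.next_id = 0

    def update(self, instances):
        # `instances`: dicts with a 'score'; instances propagated from the
        # previous frame already carry a 'track_id'.
        for inst in instances:
            if inst.get("track_id") is None and inst["score"] >= self.threshold:
                inst["track_id"] = self.next_id
                self.next_id += 1
        # Only instances with an assigned ID are reported as tracks.
        return [i for i in instances if i.get("track_id") is not None]

tracker = InstanceIDAssigner()
# Frame 1: one confident detection starts a track; the low-score one does not.
frame1 = tracker.update([{"score": 0.9}, {"score": 0.1}])
# Frame 2: the propagated instance keeps its ID; a new detection gets a new one.
frame2 = tracker.update([{"score": 0.8, "track_id": frame1[0]["track_id"]},
                         {"score": 0.7}])
```

Because IDs ride along with the temporal instances the detector already propagates, this adds no training-time machinery, which is the point the paragraph above makes.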


2 Related Works

2.1 Improvements for End-to-End Detection

DETR [3] leverages the Transformer architecture [38], along with a one-to-one matching training approach, to eliminate the need for NMS and achieve end-to-end detection. DETR has led to a series of subsequent improvements. Deformable DETR [51] changes global attention into local attention based on reference points, significantly narrowing the model's training search space and enhancing convergence speed. It also reduces the computational complexity of attention, facilitating the use of high-resolution inputs and multi-scale features within DETR's framework. Conditional-DETR [33] introduces conditional cross-attention, separating content and spatial information in the query and independently calculating attention weights through dot products, thereby accelerating model convergence. Building upon Conditional-DETR, Anchor-DETR [42] explicitly initializes reference points, serving as anchors. DAB-DETR [28] further includes bounding box dimensions in the initialization of anchors and the encoding of spatial queries. Moreover, many methods aim to enhance the convergence stability and detection performance of DETR from the perspective of training matching. DN-DETR [17] encodes ground truth with added noise as query input to the decoder, employing a denoising task for auxiliary supervision. Building upon DN-DETR, DINO [48] introduces noisy negative samples and proposes the use of Mixed Query Selection for query initialization, further improving the performance of the DETR framework. Group-DETR [4] replicates queries into multiple groups during training, providing more training samples. Co-DETR [53] incorporates dense heads during training, serving a dual purpose: it enables more comprehensive training of the backbone and enhances the training of the decoder by using the dense head output as a query.
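The DN-DETR-style construction mentioned above can be sketched in a few lines. This is an illustrative NumPy version with an assumed noise model (relative uniform perturbation) and assumed function and parameter names, not the exact recipe of any cited paper: ground-truth boxes are replicated into groups and perturbed, so each noisy query is matched to its source box by construction, sidestepping unstable bipartite matching for these auxiliary queries.

```python
import numpy as np

def make_denoising_queries(gt_boxes, num_groups=2, noise_scale=0.1, seed=0):
    """Replicate ground-truth boxes into `num_groups` noisy groups to use
    as extra decoder queries. Noisy query i matches GT box i % M by
    construction. Noise model and names are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    gt = np.asarray(gt_boxes, dtype=float)   # (M, 4): cx, cy, w, h
    groups = np.tile(gt, (num_groups, 1))    # (num_groups * M, 4)
    noise = rng.uniform(-noise_scale, noise_scale, size=groups.shape)
    # Perturbation scaled relative to each box's own magnitude.
    return groups + noise * np.abs(groups)

noisy = make_denoising_queries([[0.5, 0.5, 0.2, 0.1],
                                [0.3, 0.7, 0.1, 0.1]], num_groups=3)
```

The fixed query-to-target assignment is what makes the denoising branch a stable source of positive samples during early training.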


DETR3D [41] applies deformable attention to multi-view 3D detection, achieving end-to-end 3D detection with spatial feature fusion. The PETR series [29, 30, 39] introduces 3D position encoding, leveraging global attention for direct multi-view feature fusion and conducting temporal optimization. The Sparse4D series [26, 27] enhances DETR3D in aspects such as instance feature decoupling, multi-point feature sampling, and temporal fusion, resulting in improved perceptual performance.


2.2 Multi-Object Tracking

Most multi-object tracking (MOT) methods use the tracking-by-detection framework. They rely on detector outputs to perform post-processing tasks like data association and trajectory filtering, leading to a complex pipeline with numerous hyperparameters that need tuning. These approaches do not fully leverage the capabilities of neural networks. To integrate the tracking functionality directly into the detector, GCNet [25], TransTrack [37] and TrackFormer [32] utilize the DETR framework. They implement inter-frame transfer of detected targets based on track queries, significantly reducing post-processing reliance. MOTR [47] advances tracking to a fully end-to-end process. MOTRv3 [46] addresses the limitations in detection query training of MOTR, resulting in a substantial improvement in tracking performance. MUTR3D [49] applies this query-based tracking framework to the field of 3D multi-object tracking. These end-to-end tracking methods share some common characteristics: (1) During training, they constrain matching based on tracking objectives, ensuring consistent ID matching for tracking queries and matching only new targets for detection queries. (2) They use a high threshold to transmit temporal features, passing only high-confidence queries to the next frame. Our approach diverges from existing methods. We do not need to modify detector training or inference strategies, nor do we require ground truth for tracking IDs.
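Characteristic (2), the high-threshold temporal transfer used by these MOTR-style trackers, amounts to a simple filter; the sketch below is illustrative, with assumed field names and threshold, rather than any specific implementation.

```python
def propagate_track_queries(queries, threshold=0.5):
    """Illustrative sketch: only queries whose confidence clears
    `threshold` are carried to the next frame as track queries;
    the rest are discarded. Names and threshold are assumptions."""
    return [q for q in queries if q["score"] >= threshold]

# The low-confidence query is dropped; only the confident one persists.
kept = propagate_track_queries([{"id": 1, "score": 0.9},
                                {"id": 2, "score": 0.2}])
```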

