Sparse4Dv3 Paper Notes (II): Methodology

Continued from the previous post: https://blog.csdn.net/m0_60857098?spm=1011.2423.3001.5343

3 Methodology

The network structure and inference pipeline are depicted in Figure 1, mirroring those of Sparse4Dv2 [27]. In this section, we will first introduce two auxiliary tasks: Temporal Instance Denoising (Sec 3.1) and Quality Estimation (Sec 3.2). Following this, we present a straightforward enhancement to the attention module, termed decoupled attention (Sec 3.3). Finally, we outline how to leverage Sparse4D to achieve 3D MOT (Sec 3.4).

网络结构和推理管道如图1所示,与Sparse4Dv2[27]一致。在本节中,我们将首先介绍两个辅助任务:时序实例去噪(第3.1节)和质量估计(第3.2节)。随后,我们提出了对注意力模块的一个简单增强,称为解耦注意力(第3.3节)。最后,我们概述了如何利用Sparse4D实现3D MOT(第3.4节)。

3.1 Temporal Instance Denoising

In 2D detection, introducing a denoising task proves to be an effective approach for improving both model convergence stability and detection performance. In this paper, we extend the fundamental 2D single-frame denoising to 3D temporal denoising. Within Sparse4D, instances (referred to as queries) are decoupled into implicit instance features and explicit anchors. During the training process, we initialize two sets of anchors. One set comprises anchors uniformly distributed in the detection space, initialized using the k-means method, and these anchors serve as learnable parameters. The other set of anchors is generated by adding noise to ground truth (GT), as illustrated in Equations (1) and (2), specifically tailored for 3D detection tasks.

Here, Z_X represents the set of integers between 1 and X. N denotes the number of GT, while M represents the group number of noising instances. In this context, ΔA signifies random noise, where ΔA_{i,j,1} and ΔA_{i,j,2} follow uniform random distributions within the ranges (−x, x) and (−2x, −x) ∪ (x, 2x), respectively. DINO-DETR [48] categorizes samples generated by ΔA_{i,j,1} as positive and those from ΔA_{i,j,2} as negative, which carries a potential risk of mis-assignment, since a sample drawn from ΔA_{i,j,2} may actually lie closer to the ground truth. To entirely mitigate this ambiguity, we employ bipartite graph matching between each group of A_noise and A_gt to determine positive and negative samples.

这里,Z_X表示1到X之间的整数集。N表示GT的数量,M表示噪声实例的组数。ΔA表示随机噪声,其中ΔA_{i,j,1}和ΔA_{i,j,2}分别在(−x, x)和(−2x, −x) ∪ (x, 2x)范围内服从均匀分布。DINO-DETR[48]将由ΔA_{i,j,1}生成的样本归为正样本,将由ΔA_{i,j,2}生成的样本归为负样本,这存在误分配的潜在风险,因为由ΔA_{i,j,2}生成的样本可能更接近真值。为了完全消除歧义,我们对每组A_noise和A_gt采用二分图匹配来确定正负样本。
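As a rough illustration of the noising scheme described above, the Python sketch below builds M groups of noisy anchors around the GT anchors, drawing ΔA_{i,j,1} from (−x, x) and ΔA_{i,j,2} from (−2x, −x) ∪ (x, 2x). All names and the flat anchor layout are illustrative, not from the paper's code; assigning positives and negatives is left to the bipartite matching step described above.

```python
import random


def make_noisy_anchors(gt_anchors, num_groups, scale):
    """Generate `num_groups` groups of noisy anchors around GT anchors.

    Each group holds two noisy copies per GT anchor: one perturbed with
    noise drawn uniformly from (-scale, scale), and one from the union
    (-2*scale, -scale) U (scale, 2*scale).
    """
    def small_noise():
        # Noise for the "near" copies: uniform in (-scale, scale).
        return random.uniform(-scale, scale)

    def large_noise():
        # Noise for the "far" copies: sample a magnitude in (scale, 2*scale)
        # and flip its sign with probability 0.5 to cover both intervals.
        mag = random.uniform(scale, 2 * scale)
        return mag if random.random() < 0.5 else -mag

    groups = []
    for _ in range(num_groups):
        near = [[a + small_noise() for a in anchor] for anchor in gt_anchors]
        far = [[a + large_noise() for a in anchor] for anchor in gt_anchors]
        groups.append((near, far))
    return groups
```

Note that, unlike DINO-DETR, the "near"/"far" split here does not directly decide positives and negatives; per the paper, each group is matched against the GT set independently.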

Furthermore, we extend the aforementioned single-frame noisy instances through temporal propagation to better align with the sparse recurrent training process. During each frame's training, we randomly select M′ groups from the noisy instances to project onto the next frame. The temporal propagation strategy aligns with that of non-noisy instances: anchors undergo ego pose and velocity compensation, while instance features serve as direct initializations for the features of the subsequent frame.

此外,我们通过时间传播扩展了上述单帧噪声实例,以更好地与稀疏递归训练过程对齐。在每一帧的训练过程中,我们从噪声实例中随机选择M′组投影到下一帧。时间传播策略与非噪声实例一致:锚点进行自车姿态和速度补偿,而实例特征则作为后续帧特征的直接初始化。
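A minimal sketch of the propagation step for an anchor, simplified to a 2D center with a yaw-only ego rotation (the real anchors are full 3D boxes with complete ego poses; all names here are hypothetical):

```python
import math


def propagate_anchor(center, velocity, dt, ego_yaw, ego_translation):
    """Project an anchor center from frame t to frame t+1.

    Velocity compensation moves the center by v*dt in the old ego frame;
    the ego-pose step then maps old-frame coordinates into the new ego
    frame (here a 2D yaw rotation plus translation for brevity).
    """
    # Velocity compensation in the previous ego frame.
    x = center[0] + velocity[0] * dt
    y = center[1] + velocity[1] * dt
    # Ego-pose compensation: the world stays fixed while the ego moved,
    # so apply the inverse of the ego motion to the point.
    dx, dy = x - ego_translation[0], y - ego_translation[1]
    cos_y, sin_y = math.cos(-ego_yaw), math.sin(-ego_yaw)
    return (cos_y * dx - sin_y * dy, sin_y * dx + cos_y * dy)
```

The same transform is applied to noisy and non-noisy anchors alike, which is what keeps the denoising task consistent with the recurrent pipeline.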

It's important to note that we maintain the mutual independence of each group of instances, and no feature interaction occurs between noisy instances and normal instances. This is different from DN-DETR [17], as shown in Figure 4(b). This approach ensures that within each group, a ground truth is matched to at most one positive sample, effectively avoiding any potential ambiguity.

值得注意的是,我们保持了每组实例之间的相互独立性,噪声实例和正常实例之间不会发生特征交互。这与DN-DETR[17]不同,如图4(b)所示。这种方法确保在每个组内,一个真值(GT)最多匹配一个正样本,有效避免了潜在的歧义。
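The group independence described here can be sketched as an attention mask: normal instances attend only to normal instances, and each noisy group attends only to itself. A hedged pure-Python version (a real implementation would build an equivalent boolean tensor and pass it to the attention layers):

```python
def build_group_mask(num_normal, group_sizes):
    """Boolean attention mask where True means attention is allowed.

    Normal instances attend only to normal instances; each noisy group
    attends only to itself, so no feature interaction leaks between
    groups or between noisy and normal instances.
    """
    total = num_normal + sum(group_sizes)
    # Assign a group id per query: -1 for normal, 0..K-1 for noisy groups.
    ids = [-1] * num_normal
    for g, size in enumerate(group_sizes):
        ids.extend([g] * size)
    # Attention is allowed only within the same group id.
    return [[ids[i] == ids[j] for j in range(total)] for i in range(total)]
```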

3.2 Quality Estimation

Existing sparse-based methods primarily estimate classification confidence for positive and negative samples to measure alignment with ground truth. The optimization goal is to maximize the classification confidence of all positive samples. However, there is significant variation in matching quality among different positive samples. Consequently, classification confidence is not an ideal metric for evaluating the quality of predicted bounding boxes. To facilitate the network in understanding the quality of positive samples, accelerating convergence on one hand and rationalizing the prediction ranking on the other, we introduce the task of prediction quality estimation. For the 3D detection task, we define two quality metrics: centerness and yawness, with the following formulas.

现有的基于稀疏的方法主要估计正负样本的分类置信度,以衡量其与真值的对齐程度。优化目标是最大化所有正样本的分类置信度。然而,不同正样本之间的匹配质量存在显著差异。因此,分类置信度并不是评估预测边界框质量的理想指标。为了帮助网络理解正样本的质量,一方面加速收敛,另一方面使预测排序更合理,我们引入了预测质量估计任务。对于3D检测任务,我们定义了两个质量指标:中心度(centerness)和偏航度(yawness),公式如下。
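The equations themselves are not reproduced above, so the sketch below encodes only plausible definitions consistent with the stated intent: centerness decays with the distance between predicted and GT centers, and yawness is the cosine similarity of the [sin, cos] yaw encodings. Treat both as assumptions rather than the paper's exact formulas.

```python
import math


def centerness(pred_center, gt_center):
    # Decays from 1 toward 0 as the predicted center drifts from the GT
    # center (exponential of the negative Euclidean distance).
    dist = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_center, gt_center)))
    return math.exp(-dist)


def yawness(pred_yaw, gt_yaw):
    # Dot product of the [sin, cos] encodings of the two yaw angles:
    # 1 means perfectly aligned heading, -1 means opposite heading.
    return (math.sin(pred_yaw) * math.sin(gt_yaw)
            + math.cos(pred_yaw) * math.cos(gt_yaw))
```

Both metrics live in a bounded range, which makes them convenient targets for confidence-style regression heads.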

While the network outputs classification confidence, it also estimates centerness and yawness. Their respective loss functions are defined as cross-entropy loss and focal loss [24], as depicted in the following equation.

在网络输出分类置信度的同时,它还估计中心度和偏航度。二者的损失函数分别定义为交叉熵损失和焦点损失(focal loss)[24],如下式所示。
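For reference, here are minimal scalar versions of the two losses named above; the exact pairing with centerness/yawness and all hyperparameters (gamma, clamping epsilon) are illustrative, not taken from the paper.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    # Standard BCE on a probability in (0, 1) against a (possibly soft) target.
    p = min(max(pred, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def focal_loss(pred, target, gamma=2.0, eps=1e-7):
    # Focal loss down-weights easy examples via the (1 - p_t)^gamma factor,
    # so confident correct predictions contribute little to the total loss.
    p = min(max(pred, eps), 1 - eps)
    p_t = p if target == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)
```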

3.3 Decoupled Attention

As mentioned in the introduction, we make simple improvements to the anchor encoder, self-attention, and temporal cross-attention in Sparse4Dv2. The architecture is illustrated in Figure 5. The design principle is to combine features from different modalities in a concatenated manner, as opposed to using an additive approach. There are some differences compared to Conditional DETR [33]. Firstly, we make improvements in the attention among queries instead of the cross-attention between query and image features; the cross-attention still utilizes deformable aggregation from Sparse4D. Additionally, instead of concatenating position embedding and query feature at the single-head attention level, we make modifications externally at the multi-head attention level, providing the neural network with greater flexibility.

正如引言中提到的,我们对Sparse4Dv2中的锚编码器、自注意力和时序交叉注意力进行了简单的改进。该架构如图5所示。设计原则是以拼接(concatenation)的方式组合不同模态的特征,而不是相加。与Conditional DETR[33]相比存在一些差异。首先,我们改进的是查询之间的注意力,而不是查询和图像特征之间的交叉注意力;交叉注意力仍然使用Sparse4D中的可变形聚合。此外,我们不是在单头注意力层面拼接位置嵌入和查询特征,而是在多头注意力层面进行外部修改,为神经网络提供了更大的灵活性。
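The concatenation-versus-addition principle can be sketched as below; the real module projects the concatenated vector back to the model dimension before multi-head attention, which this toy version omits, and the function names are invented for illustration.

```python
def additive_qk(feature, pos_embed):
    """Standard fusion: the position embedding is added element-wise to
    the query feature, entangling content and position channels."""
    fused = [f + p for f, p in zip(feature, pos_embed)]
    return fused, fused  # used as both query and key


def decoupled_qk(feature, pos_embed):
    """Decoupled fusion: concatenate instead of add, so feature and
    position channels stay separate and the downstream projection can
    weight them independently (applied at the multi-head level)."""
    fused = list(feature) + list(pos_embed)
    return fused, fused
```

The concatenated variant doubles the input width, which is why a projection back to the model dimension is needed in practice before the attention computation.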

3.4 Extend to Tracking

In the framework of Sparse4Dv2, the temporal modeling adopts a recurrent form, projecting instances from the previous frame onto the current frame as input. The temporal instances are similar to the tracking queries in query-based trackers, with the distinction that the tracking queries are constrained by a higher threshold, representing highly confident detection results. In contrast, our temporal instances are numerous, and most of them may not accurately represent detected objects in previous frames.

在Sparse4Dv2框架中,时间建模采用循环形式,将前一帧的实例投影到当前帧作为输入。时间实例类似于基于查询的跟踪器中的跟踪查询,区别在于跟踪查询受到更高阈值的约束,代表置信度很高的检测结果。相比之下,我们的时间实例数量众多,其中大多数可能并不对应先前帧中被准确检测到的目标。

To extend from detection to multi-object tracking within the Sparse4Dv2 framework, we directly redefine the instance from a detection bounding box to a trajectory. A trajectory includes an ID and bounding boxes for each frame. Due to the setting of a large number of redundant instances, many instances may not be associated with a precise target and are not assigned a definite ID. Nevertheless, they can still be propagated to the next frame. Once the detection confidence of an instance surpasses the threshold T, it is considered locked onto a target and is assigned an ID, which remains unchanged throughout temporal propagation. Therefore, achieving multi-object tracking is as simple as applying an ID assignment process to the output perception results. The lifecycle management during tracking is seamlessly handled by the top-k strategy in Sparse4Dv2, requiring no additional modifications. Specifics can be referred to in Algorithm 1. In our experiments, we observe that the trained temporal model demonstrates excellent tracking characteristics without the need for fine-tuning with tracking constraints.

为了在Sparse4Dv2框架内从检测扩展到多目标跟踪,我们直接将实例从检测边界框重新定义为轨迹。轨迹包括一个ID以及每帧的边界框。由于设置了大量冗余实例,许多实例可能没有与精确的目标相关联,也不会被分配明确的ID。然而,它们仍然可以传播到下一帧。一旦实例的检测置信度超过阈值T,则认为它锁定了某个目标并被分配一个ID,该ID在整个时间传播过程中保持不变。因此,实现多目标跟踪只需对输出的感知结果应用一个ID分配过程。跟踪期间的生命周期管理由Sparse4Dv2中的top-k策略无缝处理,不需要额外的修改。具体可参考算法1。在我们的实验中,我们观察到训练好的时序模型无需使用跟踪约束进行微调即可表现出出色的跟踪特性。
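A hedged sketch of the ID-assignment idea (Algorithm 1 itself is not reproduced here; the function name and the list-based mapping are illustrative):

```python
def assign_ids(instance_ids, confidences, threshold, next_id):
    """Lock an ID onto any instance whose detection confidence passes
    the threshold; already-assigned IDs are kept unchanged across frames.

    `instance_ids` maps instance index -> current ID (None if unassigned).
    Returns the updated mapping and the next unused ID.
    """
    for idx, conf in enumerate(confidences):
        if instance_ids[idx] is None and conf >= threshold:
            instance_ids[idx] = next_id
            next_id += 1
    return instance_ids, next_id
```

Instances that never cross the threshold simply keep propagating without an ID, and lifecycle pruning is left to the existing top-k selection, matching the description above.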
