Far3D: Expanding the Horizon for Surround-view 3D Object Detection 论文翻译

Far3D: Expanding the Horizon for Surround-view 3D Object Detection
论文翻译,有遗落、错误处烦请指正,博主会尽快修改。

Xiaohui Jiang ∗1† Shuailin Li ∗2 Yingfei Liu2 Shihao Wang1† Fan Jia2 Tiancai Wang2 Lijin Han1 Xiangyu Zhang2

论文地址:https://arxiv.org/pdf/2308.09616.pdf

0. Abstract

Recently, 3D object detection from surround-view images has made notable advancements with its low deployment cost.

最近,环视图像的 3D 对象检测因其低部署成本而取得了显著进展。

However, most works have primarily focused on close perception range while leaving long-range detection less explored.

然而,大部分研究主要集中近距离感知范围,而远距离检测较少被探索。

Expanding existing methods directly to cover long distances poses challenges such as heavy computation costs and unstable convergence.

直接扩展现有方法以覆盖远距离,会带来计算成本高和收敛不稳定等挑战。

To address these limitations, this paper proposes a novel sparse query-based framework, dubbed Far3D.

为了解决这些限制,本文提出了一种新颖的基于稀疏查询的框架,称为 Far3D。

By utilizing high-quality 2D object priors, we generate 3D adaptive queries that complement the 3D global queries.

通过利用高质量的 2D 对象先验,我们生成了 3D 自适应查询,作为 3D 全局查询的补充。

To efficiently capture discriminative features across different views and scales for long-range objects, we introduce a perspective-aware aggregation module.

为了有效捕获远距离对象在不同视图和尺度下的判别特征,我们引入了一个透视感知聚合模块。

Additionally, we propose a range-modulated 3D denoising approach to address query error propagation and mitigate convergence issues in long-range tasks.

此外,我们提出了一种范围调制的 3D 去噪方法,以解决查询误差传播问题,并缓解远距离任务中的收敛问题。

Significantly, Far3D demonstrates SoTA (State of the Art) performance on the challenging Argoverse 2 dataset, covering a wide range of 150 meters, surpassing several LiDAR-based approaches.

值得注意的是,Far3D 在挑战性的 Argoverse 2 数据集上展示了 SoTA(最先进)性能,覆盖了 150 米的广泛范围,超过了几种基于 LiDAR 的方法。

Meanwhile, Far3D exhibits superior performance compared to previous methods on the nuScenes dataset.

与此同时,Far3D 在 nuScenes 数据集上的性能相比于先前的方法显示出优越性。

The code will be available soon.

代码即将发布。

1.0 Introduction

1 引言

1.1

3D object detection plays an important role in understanding 3D scenes for autonomous driving, aiming to provide accurate object localization and category around the ego vehicle.

三维物体检测在理解自动驾驶的三维场景中扮演着重要角色,其目的是为自车周围提供精确的物体定位和分类。

Surround-view methods (Huang and Huang 2022; Li et al. 2023; Liu et al. 2022b; Li et al. 2022c; Yang et al. 2023; Park et al. 2022; Wang et al. 2023a), with their advantages of low cost and wide applicability, have achieved remarkable progress.

环视方法(Huang and Huang 2022; Li et al. 2023; Liu et al. 2022b; Li et al. 2022c; Yang et al. 2023; Park et al. 2022; Wang et al. 2023a)以其低成本和广泛的适用性优势,取得了显著的进展。

However, most of them focus on close-range perception (e.g., ∼50 meters on nuScenes (Caesar et al. 2020)), leaving the long-range detection field less explored.

然而,它们中的大多数集中于近距离感知(例如,nuScenes (Caesar et al. 2020)上的约 50 米),导致远距离检测领域的研究不足。

Detecting distant objects is essential for real-world driving to maintain a safe distance, especially at high speeds or complex road conditions.

检测远处的物体对于现实世界的驾驶至关重要,以保持安全距离,尤其是在高速或复杂道路条件下。

  • Figure 1
    Figure 1: Performance comparisons on Argoverse 2 between 3D detection and 2D detection. (a) and (b) demonstrate predicted boxes of StreamPETR and YOLOX, respectively.
    图 1:Argoverse 2 数据集上 3D 检测与 2D 检测的性能比较。(a)和(b)分别展示了 StreamPETR 和 YOLOX 的预测框。

    (c) implies that 2D recall is notably better than 3D recall and can act as a bridge to achieve high-quality 3D detection.
    (c)表明 2D 的召回率明显优于 3D 的召回率,并且可以作为实现高质量 3D 检测的桥梁。

    Note that 2D recall does not represent 3D upper bound due to different recall criteria.
    需要注意的是,由于召回标准不同,2D 召回率并不代表 3D 召回率的上限。

1.2

Existing surround-view methods can be broadly categorized into two groups based on the intermediate representation, dense Bird’s-Eye-View (BEV) based methods and sparse query-based methods.

现有的环视方法大致可以基于中间表示分为两大类,即基于密集鸟瞰图(BEV)的方法和基于稀疏查询的方法。

BEV based methods (Huang et al. 2021; Huang and Huang 2022; Li et al. 2023, 2022c; Yang et al. 2023) usually convert perspective features to BEV features by employing a view transformer (Philion and Fidler 2020), then utilizing a 3D detector head to produce the 3D bounding boxes.

基于 BEV 的方法(Huang et al. 2021; Huang and Huang 2022; Li et al. 2023, 2022c; Yang et al. 2023)通常通过使用视图变换器(Philion and Fidler 2020)将透视特征转换为 BEV 特征,然后利用 3D 检测器头来生成 3D 边界框。

However, dense BEV features come at the cost of high computation even for the close-range perception, making it more difficult to scale up to long-range scenarios.

然而,即便是对于近距离感知,密集 BEV 特征也需要高计算成本,这使得其难以扩展到远距离场景。

Instead, sparse query-based methods (Wang et al. 2022; Liu et al. 2022a,b; Wang et al. 2023a) intend to learn 3D global object queries from the representative training data and generate detection results by aggregating image features following DETR (Carion et al. 2020) style.

相反,基于稀疏查询的方法(Wang et al. 2022; Liu et al. 2022a,b; Wang et al. 2023a)旨在从具有代表性的训练数据中学习 3D 全局物体查询,并通过聚合图像特征来生成检测结果,这一过程遵循 DETR(Carion et al. 2020)的风格。

Although sparse design can avoid the squared growth of query numbers, its global fixed queries cannot adapt to dynamic scenarios and usually miss targets in long-range detection.

尽管稀疏设计可以避免查询数量的平方增长,但其全局固定查询无法适应动态场景,并且通常会在远距离检测中漏掉目标。

We adopt the sparse query design to maintain detection efficiency and introduce 3D adaptive queries to address the inflexibility weaknesses.

我们采用稀疏查询设计以保持检测效率,并引入 3D 自适应查询来解决不灵活性的弱点。

1.3

To employ the sparse query-based paradigm for long-range detection, the primary challenge lies in poor recall performance.

为了采用基于稀疏查询的范式进行远程检测,主要挑战在于召回率表现不佳。

Due to the query sparsity in 3D space, assignments between predictions and ground-truth objects are affected, generating only a small amount of matched positive queries.

由于 3D 空间中的查询稀疏性,预测和真实对象之间的匹配受到影响,仅生成少量匹配的正面查询。

As illustrated in Fig. 1, 3D detector recalls are pretty low, yet recalls from the existing 2D detector are much higher, showing a significant performance gap between them.

如图 1 所示,3D 探测器的召回率相当低,而现有的 2D 探测器的召回率则高得多,显示出它们之间有显著的性能差距。

Motivated by this, leveraging high-quality 2D object priors to improve 3D proposals is a promising approach, for enabling accurate localization and comprehensive coverage.

由此启发,利用高质量的 2D 物体先验信息来改进 3D 提案是一种很有前景的方法,能够实现精确定位和全面覆盖。

Although previous methods like SimMOD (Zhang et al. 2023) and MV2D (Wang et al. 2023b) have explored using 2D predictions to initialize 3D object proposals, they primarily focus on close-range tasks and discard learnable object queries.

尽管像 SimMOD(Zhang et al. 2023)和 MV2D(Wang et al. 2023b)这样的先前方法探索了使用 2D 预测来初始化 3D 物体提案,但它们主要关注近距离任务,并放弃了可学习的物体查询。

  • Figure 2
    Figure 2: Different cases of transforming 2D points into 3D space. The blue dots indicate the centers of 3D objects in images.
    图 2:将 2D 点转换到 3D 空间的不同情况。蓝点表示图像中 3D 对象的中心。

    (a) shows the redundant prediction with the wrong depth, which is in yellow.
    (a)显示了错误深度的多余预测,用黄色表示。

    (b) illustrates the error propagation problem dominated by different ranges.
    (b)展示了由不同范围主导的误差传播问题。

Moreover, as depicted in Fig. 2, directly introducing 3D queries derived from 2D proposals for long-range tasks encounters two issues: 1) inferior redundant predictions due to uncertain depth distribution along the object rays, and 2) larger deviations in 3D space as the range increases due to frustum transformation.

此外,如图 2 所示,直接将从 2D 提案派生的 3D 查询用于远程任务会遇到两个问题:1)由于物体射线上深度分布的不确定性导致次优的冗余预测;以及 2)由于棱锥变换,随着范围增加在 3D 空间中出现更大的偏差。

These noisy queries can impact the training stability, requiring effective denoising ways to optimize.

这些噪声查询可能影响训练的稳定性,需要有效的去噪方法来优化。

Furthermore, within the training process, the model exhibits a tendency to overfit on densely populated close objects while disregarding sparsely distributed distant objects.

此外,在训练过程中,模型倾向于对密集分布的近处物体过拟合,而忽视稀疏分布的远处物体。

1.4

To address the aforementioned challenges, we design a novel 3D detection paradigm to expand the perception horizon.

为了解决上述挑战,我们设计了一种新型的 3D 检测范式以扩展感知视野。

In addition to the 3D global queries learned from the dataset, our approach incorporates auxiliary 2D proposals into 3D adaptive query generation.

除了从数据集中学到的 3D 全局查询之外,我们的方法还将辅助的 2D 提案纳入 3D 自适应查询的生成中。

Specifically, we first produce reliable pairs of 2D object proposals and corresponding depths then project them to 3D proposals via spatial transformation.

具体来说,我们首先生成可靠的 2D 物体提案和相应深度的配对,然后通过空间转换将它们投影到 3D 提案中。

We compose 3D adaptive queries with the projected positional embedding and semantic context, which would be refined in the subsequent decoder.

我们使用投影的位置嵌入和语义上下文来构建 3D 自适应查询,这些将在随后的解码器中进行精细化。

In the decoder layers, perspective-aware aggregation is employed across different image scales and views.

在解码器层中,采用了透视感知聚合,以处理不同图像尺度和视角。

It learns sampling offsets for each query and dynamically enables interactions with favorable features.

它学习每个查询的采样偏移,并动态地使其与有利特征进行交互。

For instance, distant object queries benefit from attending to high-resolution features, while close objects are better served by low-resolution features that capture high-level context.

例如,远处物体查询有利于关注大分辨率特征,而相反的则更适合近处物体以捕捉高层次上下文。

Lastly, we design a range-modulated 3D denoising technique to mitigate query error propagation and slow convergence.

最后,我们设计了一种范围调制的 3D 去噪技术,以减轻查询误差传播和缓慢的收敛问题。

Considering the different regression difficulties for various ranges, noisy queries are constructed based on ground-truth (GT) as well as referring to their distances and scales.

考虑到不同范围的回归难度不同,噪声查询是基于真值(GT)构建的,同时也参考了它们的距离和尺度。

Our method feeds multi-group noisy proposals around GT into the decoder and trains the model to a) recover 3D GT for positive ones and b) reject negative ones, respectively.

我们的方法将多组围绕 GT 的噪声提案输入到解码器,并训练模型分别进行 a) 恢复正面 3D GT 以及 b) 拒绝负面提案。

The inclusion of query denoising also alleviates the problem of range-level unbalanced distribution.

查询去噪的引入也减轻了范围级别不平衡分布的问题。

1.5

Our proposed method achieves remarkable performance advancements over state-of-the-art (SoTA) approaches in the challenging long-range Argoverse 2 dataset, as well as surpassing the prior arts of LiDAR-based methods.

我们提出的方法在具有挑战性的长距离 Argoverse 2 数据集上取得了显著的性能进步,超越了现有的基于激光雷达的方法。

To evaluate the generalization capability, we further validate its results on the nuScenes dataset and demonstrate SoTA metrics.

为了评估其泛化能力,我们进一步在 nuScenes 数据集上验证了其结果,并展示了最先进的性能指标。

In summary, our contributions are:

总结来说,我们的贡献包括:

  • We propose a novel sparse query-based framework to expand the perception range in 3D detection, by incorporating high-quality 2D object priors into 3D adaptive queries.
  • 我们提出了一个新颖的基于稀疏查询的框架,通过将高质量的 2D 物体先验纳入 3D 自适应查询中来扩展 3D 检测的感知范围。
  • We develop perspective-aware aggregation that captures informative features from diverse scales and views, as well as a range-modulated 3D denoising technique to address query error propagation and convergence problems.
  • 我们开发了透视感知聚合来捕捉来自不同尺度和视图的信息特征,并提出了范围调制的 3D 去噪技术来解决查询误差传播和收敛问题。
  • On the challenging long-range Argoverse 2 dataset, our method surpasses surround-view methods and outperforms several LiDAR-based methods. The generalization of our method is validated on the nuScenes dataset.
  • 在具有挑战性的长距离 Argoverse 2 数据集上,我们的方法超越了环视方法,并且比多个基于激光雷达的方法表现更好。我们的方法在 nuScenes 数据集上的泛化能力得到了验证。

2.0 Related Work

Surround-view 3D Object Detection

2.1 全景视角 3D 对象检测

Recently, 3D object detection from surround-view images has attracted much attention and achieved great progress, due to its advantages of low deployment cost and rich semantic information.

最近,由于低部署成本和丰富的语义信息优势,全景视角图像的 3D 对象检测引起了广泛关注,并取得了显著进展。

Based on feature representation, existing methods (Wang et al. 2021, 2022; Liu et al. 2022a; Huang and Huang 2022; Li et al. 2023, 2022b; Jiang et al. 2023; Liu et al. 2022b; Li et al. 2022c; Yang et al. 2023; Park et al. 2022; Wang et al. 2023a; Zong et al. 2023; Liu et al. 2023) can be largely classified into BEV-based methods and sparse-query based methods.

根据特征表示,现有方法(Wang et al. 2021, 2022; Liu et al. 2022a; Huang and Huang 2022; Li et al. 2023, 2022b; Jiang et al. 2023; Liu et al. 2022b; Li et al. 2022c; Yang et al. 2023; Park et al. 2022; Wang et al. 2023a; Zong et al. 2023; Liu et al. 2023)主要可以分为基于 BEV 的方法和基于稀疏查询的方法。

Extracting image features from surround views, BEV-based methods (Huang et al. 2021; Huang and Huang 2022; Li et al. 2023, 2022c) transform features into BEV space by leveraging estimated depths or attention layers, then a 3D detector head is employed to predict localization and other properties of 3D objects.

从全景视角提取图像特征,基于 BEV 的方法(Huang 等人 2021;Huang 和 Huang 2022;Li 等人 2023, 2022c)通过利用估计的深度或注意力层将特征转换到 BEV 空间,然后使用 3D 检测器头部来预测 3D 对象的定位和其他属性。

For instance, BEVFormer (Li et al. 2022c) leverages both spatial and temporal features by interacting with spatial and temporal space through predefined grid-shaped BEV queries.

例如,BEVFormer(Li 等人 2022c)通过与预定义的网格形状 BEV 查询交互,利用空间和时间特征。

BEVDepth (Li et al. 2023) proposes a 3D detector with a trustworthy depth estimation, by introducing a camera-aware depth estimation module.

BEVDepth(Li 等人 2023)提出了一个具有可靠深度估计的 3D 检测器,方法是引入了一个感知相机的深度估计模块。

On the other hand, sparse query-based paradigms (Wang et al. 2022; Liu et al. 2022a) learn global object queries from the representative data, then feed them into the decoder to predict 3D bounding boxes during inference.

另一方面,基于稀疏查询的范式(Wang 等人 2022;Liu 等人 2022a)从代表性数据中学习全局对象查询,然后将它们输入到解码器中,在推理过程中预测 3D 边界框。

This line of work has the advantage of lightweight computing.

这一系列工作的优势在于计算量轻。

  • Figure 3
    Figure 3: The overview of our proposed Far3D. Feeding surround-view images into the backbone and FPN neck, we obtain 2D image features and encode them with camera parameters for perspective-aware transformation.
    图 3:我们提出的 Far3D 概览。将全景视图图像输入到主干网络和 FPN 颈部,我们获得 2D 图像特征,并用摄像机参数对其进行编码,以进行感知透视的转换。
    Utilizing a 2D detector and DepthNet, we generate reliable 2D box proposals and their corresponding depths, which are then concatenated and projected into 3D space.
    利用 2D 检测器和 DepthNet,我们生成可靠的 2D 框提议及其对应的深度,然后将它们连接起来并投影到 3D 空间中。
    The generated 3D adaptive queries, combined with the initial 3D global queries, are iteratively refined by the decoder layers to predict 3D bounding boxes.
    生成的 3D 自适应查询与初始的 3D 全局查询结合起来,通过解码器层进行迭代细化,以预测 3D 边界框。
    Furthermore, temporal modeling is equipped through long-term query propagation.
    此外,通过长期查询传播实现了时间建模。

Furthermore, temporal modeling for surround-view 3D detection can improve detection performance and decrease velocity errors significantly, and many works (Huang and Huang 2022; Liu et al. 2022b; Park et al. 2022; Wang et al. 2023a; Lin et al. 2022, 2023) aim to extend a single-frame framework to multi-frame design.

此外,环视 3D 检测的时间建模可以显著提高检测性能并大幅减少速度误差,许多研究(Huang 和 Huang 2022; Liu 等人 2022b; Park 等人 2022; Wang 等人 2023a; Lin 等人 2022, 2023)旨在将单帧框架扩展到多帧设计。

BEVDet4D (Huang and Huang 2022) lifts the BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space, via fusing features with the previous frame.

BEVDet4D(Huang 和 Huang 2022)通过与前一帧融合特征,将 BEVDet 范式从仅空间的 3D 空间提升到时空四维空间。

PETRv2 (Liu et al. 2022b) extends the 3D position embedding in PETR for temporal modeling through the temporal alignment of different frames.

PETRv2(Liu 等人 2022b)通过不同帧的时间对齐,扩展了 PETR 中的 3D 位置嵌入进行时间建模。

However, they use only limited history.

然而,他们只使用了有限的历史信息。

To leverage both short-term and long-term history, SOLOFusion (Park et al. 2022) balances the impacts of spatial resolution and temporal difference on localization potential, then uses it to design a powerful temporal 3D detector.

为了充分利用短期和长期历史,SOLOFusion(Park 等人 2022)平衡了空间分辨率和时间差异对定位潜力的影响,然后使用它来设计一个强大的时间 3D 检测器。

StreamPETR (Wang et al. 2023a) develops an object-centric temporal mechanism in an online manner, where long-term historical information is propagated through object queries.

StreamPETR(Wang 等人 2023a)以在线方式开发了一个以物体为中心的时间机制,其中通过物体查询传播长期历史信息。

2D Auxiliary Tasks for 3D Detection

2.2 用于 3D 检测的 2D 辅助任务

3D detection from surround-view images can be improved through 2D auxiliary tasks, and some works (Xie et al. 2022; Zhang et al. 2023; Wang, Jiang, and Li 2022; Yang et al. 2023; Wang et al. 2023b) aim to exploit its potential.

环视图像的 3D 检测可以通过 2D 辅助任务得到提升,一些研究(Xie 等人 2022; Zhang 等人 2023; Wang, Jiang, 和 Li 2022; Yang 等人 2023; Wang 等人 2023b)旨在开发其潜力。

There are several approaches including 2D pretraining, auxiliary supervision, and proposal generation.

其中的方法包括 2D 预训练、辅助监督和提议生成。

SimMOD (Zhang et al. 2023) exploits sample-wise object proposals and designs a two-stage training manner, where perspective object proposals are generated and followed by iterative refinement in DETR3D-style.

SimMOD(Zhang 等人 2023)利用样本级的物体提议,并设计了一个两阶段的训练方式,生成透视物体提议,然后以 DETR3D 风格进行迭代细化。

Focal-PETR (Wang, Jiang, and Li 2022) performs 2D object supervision to adaptively focus the attention of 3D queries on discriminative foreground regions.

Focal-PETR(Wang, Jiang, 和 Li 2022)执行 2D 物体监督,以适应性地聚焦 3D 查询对区分性前景区域的注意力。

BEVFormerV2 (Yang et al. 2023) presents a two-stage BEV detector where perspective proposals are fed into the BEV head for final predictions.

BEVFormerV2(Yang 等人 2023)提出了一个两阶段的 BEV 检测器,将透视提议送入 BEV 头部以进行最终预测。

MV2D (Wang et al. 2023b) designs a 3D detector head that is initialized by RoI regions of 2D predicted proposals.

MV2D(Wang 等人 2023b)设计了一个 3D 检测器头部,它由 2D 预测提议的 RoI 区域初始化。

Compared to the above methods, our framework differs in the following aspects.

与上述方法相比,我们的框架在以下几个方面有所不同。

Firstly, we aim to resolve the challenges of long-range detection with surrounding views, which are less explored in previous methods.

首先,我们旨在解决环绕视图下长距离检测的挑战,这在以前的方法中探索较少。

Besides learning 3D global queries, we explicitly leverage 2D predicted boxes and depths to build 3D adaptive queries, utilizing positional prior and semantic context simultaneously.

除了学习 3D 全局查询外,我们明确利用 2D 预测框和深度信息构建 3D 自适应查询,同时利用位置先验和语义上下文。

Furthermore, the designs of perspective-aware aggregation and 3D denoising are integrated to address task issues.

此外,集成了透视感知聚合和 3D 去噪设计来解决任务问题。

3.0 Method

3 方法

Overview

3.1 概览

Fig. 3 shows the overall pipeline of our sparse query-based framework.

图 3 展示了我们基于稀疏查询的框架的整体流程。

Given surround-view images $I = \{I^1, \dots, I^n\}$, we extract multi-level image features $F = \{F^1, \dots, F^n\}$ using a backbone network (e.g. ResNet, ViT) and an FPN (Lin et al. 2017) neck.

输入环绕视图图像 $I = \{I^1, \dots, I^n\}$,我们使用主干网络(例如 ResNet、ViT)和一个 FPN(Lin 等人,2017 年)颈部提取多级图像特征 $F = \{F^1, \dots, F^n\}$。

To generate 3D adaptive queries, we first obtain 2D proposals and depths using a 2D detector head and depth network, then filter reliable ones and transform them into 3D space to generate 3D object queries.

为了生成 3D 自适应查询,我们首先使用 2D 检测器头部和深度网络获得 2D 提案和深度,然后筛选可靠的提案,并将它们转换到 3D 空间中以生成 3D 物体查询。

In this way, informative object priors from 2D detections are encoded into the 3D adaptive queries.

通过这种方式,2D 检测中的信息丰富的物体先验被编码进 3D 自适应查询中。

In the 3D detector head, we concatenate 3D adaptive queries and 3D global queries, then input them to transformer decoder layers including self-attention among queries and perspective-aware aggregation between queries and features.

在 3D 检测器头部,我们将 3D 自适应查询和 3D 全局查询拼接起来,然后将它们输入到 Transformer 解码器层,其中包括查询之间的自注意力以及查询与特征之间的透视感知聚合。

We propose perspective-aware aggregation to efficiently capture rich features in multiple views and scales by considering the projection of 3D objects.

我们提出透视感知聚合,通过考虑 3D 物体的投影来高效捕捉多个视图和比例中的丰富特征。

Besides, range-modulated 3D denoising is introduced to alleviate query error propagation and stabilize the convergence, when training with long-range and imbalanced distributed objects.

此外,引入范围调制的 3D 去噪来减轻查询错误传播和稳定收敛性,用于长距离和分布不均的物体的训练。

Sec 3.4 depicts the denoising technique in detail.

第 3.4 节详细描述了去噪技术。

Adaptive Query Generation

3.2 自适应查询生成

Directly extending existing 3D detectors from short range (e.g., ~50m) to long range (e.g., ~150m) suffers from several problems: heavy computation costs, inefficient convergence and declining localization ability.

直接将现有的 3D 检测器从短距离(例如,约 50 米)扩展到远距离(例如,约 150 米)会遇到几个问题:计算成本高,收敛效率低,以及定位能力下降。

For instance, the number of queries would have to grow at least quadratically to cover possible objects in a larger range, yet such a computational burden is unacceptable in realistic scenarios.

例如,查询数量至少要平方增长以覆盖更大范围内的可能物体,但这样的计算灾难在现实场景中是不可接受的。
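As a rough illustration of this growth, assume queries are spread with uniform density over the ground plane (an assumption made here for the estimate, not a statement from the paper). The number of queries $N$ then scales with the covered area, so extending the range $R$ from 50 m to 150 m gives:

$$\frac{N_{150}}{N_{50}} = \left(\frac{R_{150}}{R_{50}}\right)^2 = \left(\frac{150}{50}\right)^2 = 9$$

Because self-attention among queries costs $O(N^2)$, a 9-fold increase in queries would raise that term by roughly $9^2 = 81$ times under the same assumption.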

Besides that, small and sparse distant objects would hinder the convergence and even hurt the localization of close objects.

除此之外,小而稀疏的远距离物体会阻碍收敛,甚至损害近处物体的定位。

Motivated by the high performance of 2D proposals, we propose to generate adaptive queries as object priors to assist 3D localization.

受到 2D 提案高性能的激励,我们提出生成自适应查询作为物体的先验来协助 3D 定位。

This paradigm compensates for the weakness of global fixed query design and allows the detector to generate adaptive queries near the ground-truth (GT) boxes for different images.

这种模式弥补了全局固定查询设计的弱点,并允许检测器为不同图像生成接近真值(GT)框的自适应查询。

In this way, the model is equipped with better generalization and practicality.

通过这种方式,模型具备了更好的泛化性和实用性。

Specifically, given image features after FPN neck, we feed them into the anchor-free detector head from YOLOX (Ge et al. 2021) and a light-weighted depth estimation net, outputting 2D box coordinates, scores and depth map.

具体来说,在获得 FPN 颈部后的图像特征后,我们将它们输入到 YOLOX(Ge 等人,2021 年)的无锚点检测器头和一个轻量级的深度估计网络,输出 2D 框坐标、分数和深度图。

The 2D detector head follows the original design, while depth estimation is treated as a classification task by discretizing the depth into bins (Reading et al. 2021; Zhang et al. 2022).

2D 检测器头遵循原始设计,而深度估计则通过将深度离散化为若干区间(bin)被视为分类任务(Reading 等人,2021 年;Zhang 等人,2022 年)。
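The depth-as-classification idea can be sketched as follows. This is a minimal illustration that assumes uniform bins between `d_min` and `d_max`; the bin layout, the value range, and the function names are assumptions for this example, and the cited works may use a different discretization.

```python
import numpy as np

def depth_to_bin(depth, d_min=0.0, d_max=150.0, num_bins=150):
    """Map metric depths to uniform bin indices (the classification targets)."""
    depth = np.clip(np.asarray(depth, dtype=float), d_min, d_max - 1e-6)
    bin_size = (d_max - d_min) / num_bins
    return np.floor((depth - d_min) / bin_size).astype(int)

def bins_to_depth(logits, d_min=0.0, d_max=150.0):
    """Decode bin logits of shape (..., num_bins) into an expected depth."""
    num_bins = logits.shape[-1]
    bin_size = (d_max - d_min) / num_bins
    centers = d_min + (np.arange(num_bins) + 0.5) * bin_size
    # Numerically stable softmax over the bin axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Expected depth is the probability-weighted mean of the bin centers.
    return (probs * centers).sum(axis=-1)
```

Taking the expectation over bins instead of the arg-max is one common decoding choice; it yields a continuous depth value even though the network is trained as a classifier.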

We then make pairs of 2D boxes and corresponding depths.

然后,我们将 2D 框和对应的深度成对匹配。

To avoid interference from low-quality proposals, we set a score threshold τ (e.g., 0.1) and keep only the reliable ones.

为了避免低质量提议的干扰,我们设定了一个分数阈值 τ(例如,0.1),仅保留可靠的提议。
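The filtering and pairing step might look like the sketch below. It assumes boxes are given as `(x1, y1, x2, y2)` in the pixel coordinates of the depth map and that the depth is read at the pixel nearest to the box center; both conventions, and the function name, are assumptions for illustration.

```python
import numpy as np

def pair_boxes_with_depth(boxes, scores, depth_map, tau=0.1):
    """Keep proposals with score > tau and look up the depth at each box center.

    boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,); depth_map: (H, W).
    Returns centers (M, 2) as (c_w, c_h), depths (M,), and scores (M,).
    """
    keep = scores > tau
    boxes, scores = boxes[keep], scores[keep]
    centers = np.stack(
        [(boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2], axis=1
    )
    h, w = depth_map.shape
    # Nearest-pixel lookup, clamped to the image bounds.
    cols = np.clip(np.round(centers[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.round(centers[:, 1]).astype(int), 0, h - 1)
    return centers, depth_map[rows, cols], scores
```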

For each view $i$, box centers $(c_w, c_h)$ from 2D predictions and depth $d_{wh}$ from the depth map are combined and projected to 3D proposal centers $c_{3d}$.

对于每一个视图 $i$,我们将来自 2D 预测的框中心 $(c_w, c_h)$ 和深度图的深度 $d_{wh}$ 结合起来,并投影到 3D 提案中心 $c_{3d}$。

$$c_{3d} = K_i^{-1} I_i^{-1} [c_w \cdot d_{wh},\ c_h \cdot d_{wh},\ d_{wh},\ 1]^T \tag{1}$$

where $K_i, I_i$ denote the camera extrinsic and intrinsic matrices, respectively.

其中 $K_i, I_i$ 分别表示相机的外参和内参矩阵。
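Eq. (1) can be sketched in NumPy as below. The sketch assumes both matrices are 4×4 homogeneous matrices, with the intrinsic matrix padded from 3×3 and the extrinsic matrix mapping the 3D (ego) frame to the camera frame; these conventions and the function name are assumptions, since the text does not spell them out.

```python
import numpy as np

def backproject_centers(centers, depths, intrinsic, extrinsic):
    """Lift 2D box centers with depths to 3D proposal centers, as in Eq. (1).

    centers: (N, 2) pixel coordinates (c_w, c_h); depths: (N,).
    intrinsic (I_i) and extrinsic (K_i): 4x4 homogeneous matrices (assumed).
    Returns (N, 3) points in the 3D frame.
    """
    cw, ch = centers[:, 0], centers[:, 1]
    # Homogeneous image-space vector [c_w * d, c_h * d, d, 1]^T for each proposal.
    homo = np.stack([cw * depths, ch * depths, depths, np.ones_like(depths)], axis=0)
    points = np.linalg.inv(extrinsic) @ np.linalg.inv(intrinsic) @ homo
    return points[:3].T
```

A quick sanity check is to project a known 3D point with `intrinsic @ extrinsic`, divide by depth to get the pixel, and confirm that the function returns the original point.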

After obtaining projected 3D proposals, we encode them into 3D adaptive queries as follows,

获得投影的 3D 提案后,我们将它们编码为 3D 自适应查询,如下所示,

$$Q_{pos} = PosEmbed(c_{3d}) \tag{2}$$

$$Q_{sem} = SemEmbed(z_{2d}, s_{2d}) \tag{3}$$

$$Q = Q_{pos} + Q_{sem} \tag{4}$$

where $Q_{pos}, Q_{sem}$ denote the positional embedding and the semantic embedding, respectively.

其中 $Q_{pos}, Q_{sem}$ 分别表示位置嵌入和语义嵌入。

$z_{2d}$, sampled from $F$, corresponds to the semantic context at position $(c_w, c_h)$, and $s_{2d}$ is the confidence score of the 2D boxes.

$z_{2d}$ 从 $F$ 中采样,对应于位置 $(c_w, c_h)$ 的语义上下文,$s_{2d}$ 是 2D 框的置信度评分。

$PosEmbed(\cdot)$ consists of a sinusoidal transformation (Vaswani et al. 2017) and an MLP, while $SemEmbed(\cdot)$ is another MLP.

$PosEmbed(\cdot)$ 包含一个正弦变换(Vaswani et al. 2017)和一个多层感知机(MLP),而 $SemEmbed(\cdot)$ 是另一个多层感知机。
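Eqs. (2) to (4) can be sketched as follows. The layer sizes, the two-layer ReLU MLPs, the random weights, and the choice to concatenate $z_{2d}$ with $s_{2d}$ before the semantic MLP are all assumptions for illustration; the text only states that a sinusoidal transform and MLPs are used. The 3D centers are assumed to be normalized to the detection range beforehand.

```python
import numpy as np

def sinusoidal_embed(points, num_feats=32, temperature=10000.0):
    """Sinusoidal encoding of (N, 3) normalized coordinates -> (N, 3 * num_feats)."""
    idx = np.arange(num_feats)
    dim_t = temperature ** (2 * (idx // 2) / num_feats)
    angles = points[..., None] * 2 * np.pi / dim_t            # (N, 3, num_feats)
    emb = np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))
    return emb.reshape(points.shape[0], -1)

def init_mlp(rng, d_in, d_hidden, d_out):
    """Random weights for a two-layer MLP (stand-in for learned parameters)."""
    return (rng.normal(0, 0.02, (d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(0, 0.02, (d_hidden, d_out)), np.zeros(d_out))

def mlp(x, w1, b1, w2, b2):
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def build_adaptive_queries(c3d, z2d, s2d, params):
    q_pos = mlp(sinusoidal_embed(c3d), *params["pos"])                         # Eq. (2)
    q_sem = mlp(np.concatenate([z2d, s2d[:, None]], axis=1), *params["sem"])   # Eq. (3)
    return q_pos + q_sem                                                       # Eq. (4)

rng = np.random.default_rng(0)
embed_dim, feat_dim = 256, 64
params = {
    "pos": init_mlp(rng, 3 * 32, embed_dim, embed_dim),
    "sem": init_mlp(rng, feat_dim + 1, embed_dim, embed_dim),
}
queries = build_adaptive_queries(
    rng.uniform(0, 1, (5, 3)),        # normalized 3D proposal centers c_3d
    rng.normal(size=(5, feat_dim)),   # sampled semantic context z_2d
    rng.uniform(0.1, 1, 5),           # 2D confidence scores s_2d
    params,
)
```

Summing the two embeddings keeps the query dimension fixed, so adaptive queries can be concatenated with the global queries without any change to the decoder.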

Perspective-aware Aggregation

3.3 视角感知聚合

Existing sparse query-based approaches usually adopt one single-level feature map for computation effectiveness (e.g. StreamPETR).

现有的基于稀疏查询的方法通常采用单一层次的特征图以提高计算效率(例如 StreamPETR)。

However, the single feature level is not optimal for all object queries of different ranges.

然而,单一特征层次并不适用于所有不同范围的对象查询。

For example, small distant objects require large-resolution features for precise localization, while high-level features are better suited for large close objects.

例如,小而远的物体需要高分辨率的特征以实现精确定位,而高层次的特征更适合于大型近距离物体。

To overcome the limitation, we propose perspective-aware aggregation, enabling efficient feature interactions on different scales and views.

为了克服这一限制,我们提出了视角感知聚合方法,实现不同尺度和视角下的高效特征交互。

Inspired by the deformable attention mechanism (Zhu et al. 2020), we apply a 3D spatial deformable attention consisting of 3D offsets sampling followed by view transformation.

受到可变形注意力机制的启发(Zhu et al. 2020),我们应用了一种 3D 空间可变形注意力机制,包括 3D 偏移采样和视图变换。

Formally, we first equip image features F with the camera information including intrinsic I and extrinsic parameters K.

在形式上,我们首先为图像特征 F 配备相机信息,包括内参 I 和外参 K。

A squeeze-and-excitation block (Hu, Shen, and Sun 2018) is used to explicitly enrich the features.

一个挤压激励模块(Hu, Shen, and Sun 2018)被用来明确增强特征。

Given enhanced feature F′, we employ 3D deformable attention instead of global attention in PETR series (Liu et al. 2022a,b; Wang et al. 2023a).

在给定增强特征 F’的情况下,我们采用 3D 可变形注意力而不是 PETR 系列中的全局注意力(Liu et al. 2022a,b; Wang et al. 2023a)。

For each query reference point in 3D space, the model learns M sampling offsets around it and projects the resulting sampling points into different 2D scales and views.

对于 3D 空间中的每一个查询参考点,模型学习 M 个采样偏移量,并将这些参考点投影到不同的 2D 尺度和视图上。

P q 2 d = I ⋅ K ⋅ ( P q 3 d + Δ P q 3 d ) P^{2d}_q = I \cdot K \cdot (P^{3d}_q + \Delta P^{3d}_q) Pq2d=IK(Pq3d+ΔPq3d) (5)

where P q 3 d P^{3d}_q Pq3d, Δ P q 3 d \Delta P^{3d}_q ΔPq3dare 3D reference point and learned offsets for query q, respectively.

其中, P q 3 d P^{3d}_q Pq3d, Δ P q 3 d \Delta P^{3d}_q ΔPq3d分别是查询 q 的 3D 参考点和学习得到的偏移量。

P q 2 d P^{2d}_q Pq2d stands for the projected 2d reference point of different scales and views. For simplicity, we omit the subscripts of scales and views.

P q 2 d P^{2d}_q Pq2d代表不同尺度和视角的投影 2d 参考点。为了简化,我们省略了尺度和视角的下标。

Next, 3D object queries interact with multi-scale sampled features from ( F’ ), according to the above 2D reference points P q 2 d P^{2d}_q Pq2d

接下来,3D 对象查询根据上述 2D 参考点 P q 2 d P^{2d}_q Pq2d与来自 F ′ F' F的多尺度采样特征进行交互。

In this way, diverse features from various vis and scales are aggregated into 3D queries by considering their relative importance.

通过这种方式,将不同视角和尺度的多样化特征根据它们相对的重要性聚合到 3D 查询中。
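The projection in Eq. (5) and the weighted aggregation can be sketched for a single query as below. Several simplifications are assumed for illustration: nearest-neighbour sampling replaces the bilinear sampling typical of deformable attention, the offsets and attention weights are passed in instead of being predicted from the query, and the output is normalized by the total weight of valid samples. The function names and argument layout are hypothetical.

```python
import numpy as np

def project_points(points3d, intrinsic, extrinsic):
    """Eq. (5) for one view: (M, 3) 3D points -> (M, 2) pixels and (M,) depths."""
    homo = np.concatenate([points3d, np.ones((len(points3d), 1))], axis=1).T
    cam = intrinsic @ extrinsic @ homo
    depth = cam[2]
    uv = cam[:2] / np.maximum(depth, 1e-5)
    return uv.T, depth

def aggregate_query(ref3d, offsets3d, weights, feats, intrinsics, extrinsics, strides):
    """Aggregate features for one query across views and scales.

    ref3d: (3,) reference point; offsets3d: (M, 3) sampling offsets.
    weights: (V, L, M) attention weights; feats[v][l]: (C, H, W) feature map.
    intrinsics/extrinsics: per-view 4x4 matrices; strides[l]: downsampling factor.
    """
    samples = ref3d[None, :] + offsets3d
    out, total = 0.0, 0.0
    for v, (intr, extr) in enumerate(zip(intrinsics, extrinsics)):
        uv, depth = project_points(samples, intr, extr)
        for l, stride in enumerate(strides):
            fmap = feats[v][l]
            _, h, w = fmap.shape
            cols = np.floor(uv[:, 0] / stride).astype(int)
            rows = np.floor(uv[:, 1] / stride).astype(int)
            # Ignore samples behind the camera or outside this feature map.
            valid = (depth > 0) & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
            wv = weights[v, l] * valid
            sampled = fmap[:, rows.clip(0, h - 1), cols.clip(0, w - 1)]   # (C, M)
            out = out + sampled @ wv
            total += wv.sum()
    return out / max(total, 1e-6)
```

Since each 3D sampling point is visible in only a few views, the validity mask keeps the cost proportional to the number of samples and not to the full image area.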

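Range-modulated 3D Denoising

3.4 范围调制 3D 去噪

3D object queries at different distances have different regression difficulties, unlike 2D queries, which existing 2D denoising methods such as DN-DETR (Li et al. 2022a) usually treat equally.

3D 物体查询在不同距离的回归难度不同,这与现有的 2D 去噪方法(如 DN-DETR,Li et al. 2022a)通常平等对待的 2D 查询是不同的。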

The difficulty discrepancy comes from query density and error propagation.

难度的差异来源于查询的密度和错误的传播。

On the one hand, queries corresponding to distant objects are less matched compared to close ones.

一方面,与近处的物体相比,对应于远处物体的查询匹配度较低。

On the other hand, small errors in 2D proposals can be amplified when introducing 2D priors into 3D adaptive queries, as illustrated in Fig. 2, and this effect grows with object distance.

另一方面,当将 2D 先验引入到 3D 自适应查询中时,2D 提案的小错误可能会被放大,如图 2 所示,更不用说这种效果会随着物体距离的增加而增加。

As a result, some query proposals near GT boxes can be regarded as noisy candidates, whereas others with notable deviation should be negative ones.

因此,一些接近 GT 框的查询提案可以被视为噪声候选,而其他偏差明显的应该被视为负面的。

Therefore we aim to recall those potential positive ones and directly reject solid negative ones, by developing a method called range-modulated 3D denoising.

因此,我们的目标是通过开发一种称为范围调制 3D 去噪的方法,回收那些潜在的正面查询并直接拒绝确定的负面查询。

Concretely, we construct noisy queries based on GT objects by simultaneously adding positive and negative groups.

具体来说,我们通过同时添加正面和负面组,基于 GT 物体构建噪声查询。

For both types, random noises are applied according to object positions and sizes to facilitate denoising learning in long-range perception.

对于这两种类型,根据物体的位置和大小应用随机噪声,以便于在长距离感知中进行去噪学习。

Formally, we define the position of noisy queries as:

正式地,我们将噪声查询的位置定义为:

$$\hat{P} = P_{GT} + \alpha f_p(S_{GT}) + (1 - \alpha) f_n(P_{GT}) \tag{6}$$

where $\alpha \in \{0, 1\}$ corresponds to the generation of negative and positive queries, respectively.

这里的 $\alpha \in \{0, 1\}$ 分别对应于负面和正面查询的生成。

$P_{GT}, S_{GT} \in \mathbb{R}^3$ represent the 3D center $(x, y, z)$ and box scale $(w, l, h)$ of the GT, and $\hat{P}$ denotes the noisy coordinates.

$P_{GT}, S_{GT} \in \mathbb{R}^3$ 代表真值(GT)的 3D 中心 $(x, y, z)$ 和框尺寸 $(w, l, h)$,而 $\hat{P}$ 是噪声坐标。

We use functions $f_p$ and $f_n$ to encode position-aware noise for positive and negative samples.

我们使用函数 $f_p$ 和 $f_n$ 为正面和负面样本编码位置感知的噪声。

For positive noisy samples, we set $f_p(S_{GT})$ as a linear function of the 3D box scale with a random variable.

对于正面噪声样本,我们将 $f_p(S_{GT})$ 设定为一个与 3D 框尺度相关的线性函数,该函数含有一个随机变量。

We incorporate the offset constraint within GT boxes to guide the model in accurately reconstructing the GT from positive queries, while ensuring clear distinction from surrounding adjacent boxes.

我们在 GT 盒子内部纳入偏移约束,以指导模型准确地从正面查询中重构 GT,同时确保与周围相邻盒子有清晰的区分。

For negative samples, the offsets are supposed to be relevant to their position range, thus we propose several implementations.

对于负面样本,偏移量应与它们的位置范围相关,因此我们提出了几种实现方法。

For example, $f_n(P_{GT})$ can take the form of $\log(P_{GT})$, $\lambda^2 P_{GT}$ or $\sqrt{P_{GT}}$.

例如,$f_n(P_{GT})$ 可以是 $\log(P_{GT})$、$\lambda^2 P_{GT}$ 或者 $\sqrt{P_{GT}}$ 的形式。

We show these attempts in Sec. 4.4.

我们将这些尝试展示在第 4.4 节中。

Moreover, multi-group samples are generated for each GT object to enhance query diversity.

此外,为了增强查询的多样性,我们为每个 GT 对象生成了多组样本。

Each group comprises one positive sample and $K$ negative samples.

每组包括一个正面样本和 $K$ 个负面样本。

This approach serves as an imitation of noisy positive candidates and false positive candidates during training.

这种方法在训练期间作为噪声正面候选者和假正面候选者的模仿。
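The noisy-query construction of Eq. (6) can be sketched as below. The specifics are assumptions for illustration: positives use a uniform random factor in $[-0.5, 0.5]$ times the box scale so that they stay inside the GT box, and negatives apply $f_n$ to the absolute coordinates with a random sign and a random factor in $[0.5, 1.5]$. The text does not specify these details, and the function name is hypothetical.

```python
import numpy as np

def make_noisy_queries(p_gt, s_gt, num_groups=2, num_neg=3, f_n=np.sqrt, rng=None):
    """Build noisy query positions around GT boxes following Eq. (6).

    p_gt: (N, 3) GT centers; s_gt: (N, 3) GT box scales.
    Returns positions of shape (num_groups, N, 1 + num_neg, 3) and a boolean
    mask of shape (1 + num_neg,) marking the positive slot in each group.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(p_gt)
    center = p_gt[None, :, None, :]   # (1, N, 1, 3)
    scale = s_gt[None, :, None, :]

    # alpha = 1: f_p is linear in the box scale, multiplied by a random variable.
    u = rng.uniform(-0.5, 0.5, size=(num_groups, n, 1, 3))
    positives = center + u * scale

    # alpha = 0: f_n depends on the position, so farther objects get larger offsets.
    sign = rng.choice([-1.0, 1.0], size=(num_groups, n, num_neg, 3))
    jitter = rng.uniform(0.5, 1.5, size=(num_groups, n, num_neg, 3))
    negatives = center + sign * jitter * f_n(np.abs(center))

    queries = np.concatenate([positives, negatives], axis=2)
    is_positive = np.arange(1 + num_neg) == 0
    return queries, is_positive
```

During training, positive slots would be supervised to recover the GT box and negative slots to predict background. With `f_n=np.sqrt`, an object at 100 m along an axis receives negative offsets of about 5 to 15 m along it, whereas an object at 1 m receives offsets of about 0.5 to 1.5 m.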
