MIT-BEVFusion Series 4-1: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation — paper translation and notes (first two chapters)

This post presents BEVFusion, a framework that breaks with conventional point-level fusion by unifying multi-modal features in a shared bird's-eye view. It sidesteps the problems of camera-to-LiDAR projection, significantly improves 3D object detection and BEV map segmentation, and at the same time reduces computation cost.

https://arxiv.org/pdf/2205.13542.pdf

0. Abstract


Multi-sensor fusion is essential for an accurate and reliable autonomous driving system.

Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features.

However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation).

In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework.

It unifies multi-modal features in the shared bird’s-eye view (BEV) representation space, which nicely preserves both geometric and semantic information.

To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40×.

BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes.

It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9× lower computation cost.

Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.

1. Introduction

[Figure 1 of the paper: (a) LiDAR-to-camera projection, (b) camera-to-LiDAR projection, (c) shared BEV representation]

Summary 1:
Autonomous driving systems are equipped with diverse sensors that provide complementary signals, so multi-sensor fusion is essential for accurate and reliable perception.

Autonomous driving systems are equipped with diverse sensors.

For instance, Waymo’s self-driving vehicles have 29 cameras, 6 radars, and 5 LiDARs.

Different sensors provide complementary signals.

E.g., cameras capture rich semantic information, LiDARs provide accurate spatial information, while radars offer instant velocity estimation.

Therefore, multi-sensor fusion is of great importance for accurate and reliable perception.

Summary 2:
With multiple sensors, it is important to find a unified representation suitable for multi-task multi-modal feature fusion; LiDAR-to-camera projection introduces severe geometric distortion.

Data from different sensors are expressed in fundamentally different modalities.

E.g., cameras capture data in perspective view and LiDAR in 3D view.

To resolve this view discrepancy, we have to find a unified representation that is suitable for multi-task multi-modal feature fusion.

Due to the tremendous success in 2D perception, the natural idea is to project the LiDAR point cloud onto the camera and process the RGB-D data with 2D CNNs.

However, this LiDAR-to-camera projection introduces severe geometric distortion (see Figure 1a), which makes it less effective for geometric-oriented tasks, such as 3D object recognition.
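As a concrete illustration of this LiDAR-to-camera direction, below is a minimal NumPy sketch (my own toy code, not from the paper) that projects LiDAR points into one camera and rasterizes a sparse depth map; stacking it with the RGB image gives the RGB-D input mentioned above. The intrinsics `K`, the extrinsic `T_cam_from_lidar`, and the image size are assumed, hypothetical inputs.

```python
import numpy as np

def lidar_to_camera_depth(points_xyz, K, T_cam_from_lidar, img_h, img_w):
    """Project LiDAR points into the image and rasterize a sparse depth map.

    points_xyz:       (N, 3) points in the LiDAR frame.
    K:                (3, 3) camera intrinsic matrix.
    T_cam_from_lidar: (4, 4) extrinsic transform from LiDAR to camera frame.
    Returns an (img_h, img_w) float32 depth map; unmatched pixels stay 0.
    """
    # Move points into the camera frame with a homogeneous transform.
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Perspective projection: u = fx*x/z + cx, v = fy*y/z + cy.
    uv = (K @ pts_cam.T).T
    u = np.floor(uv[:, 0] / uv[:, 2]).astype(np.int64)
    v = np.floor(uv[:, 1] / uv[:, 2]).astype(np.int64)
    depth = pts_cam[:, 2]

    # Discard projections that fall outside the image.
    valid = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    u, v, depth = u[valid], v[valid], depth[valid]

    # When several points hit the same pixel, keep the nearest one.
    depth_map = np.zeros((img_h, img_w), dtype=np.float32)
    order = np.argsort(-depth)          # write far points first, near points last
    depth_map[v[order], u[order]] = depth[order]
    return depth_map
```

The geometric distortion referred to in Figure 1a comes from this view itself: pixels that are adjacent in the depth map can be far apart in 3D, so 2D convolutions over the RGB-D grid mix geometrically unrelated points.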

Summary 3:
Augmenting the point cloud with other inputs works well for detection, but not for semantic tasks such as map segmentation, because the camera-to-LiDAR projection is semantically lossy.

Recent sensor fusion methods follow the other direction.

They augment the LiDAR point cloud with semantic labels [54], CNN features [55, 23] or virtual points from 2D images [68], and then apply an existing LiDAR-based detector to predict 3D bounding boxes.

Although they have demonstrated remarkable performance on large-scale detection benchmarks, these point-level fusion methods barely work on semantic-oriented tasks, such as BEV map segmentation [37, 39, 22, 70].

This is because the camera-to-LiDAR projection is semantically lossy (see Figure 1b): for a typical 32-beam LiDAR scanner, only 5% camera features will be matched to a LiDAR point while all others will be dropped.
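Below is a rough sketch of this opposite, camera-to-LiDAR direction in the style of point-level decoration (again toy code, not any specific method's implementation), reusing the same hypothetical `K` and `T_cam_from_lidar` as in the sketch above. The returned `coverage` is the fraction of camera pixels that at least one LiDAR point lands on, which is the quantity behind the ~5% figure quoted here.

```python
import numpy as np

def paint_points(points_xyz, cam_feats, K, T_cam_from_lidar):
    """Camera-to-LiDAR decoration: copy per-pixel camera features onto the
    LiDAR points that project into the image, and measure how many camera
    pixels are actually used.

    cam_feats: (H, W, C) camera feature map (e.g. per-pixel semantic scores).
    Returns (decorated_points, coverage): points concatenated with their
    painted features, and the fraction of pixels hit by at least one point.
    """
    H, W, C = cam_feats.shape
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    uv = (K @ pts_cam.T).T
    z = np.maximum(uv[:, 2], 1e-6)                 # avoid division by zero
    u = np.floor(uv[:, 0] / z).astype(np.int64)
    v = np.floor(uv[:, 1] / z).astype(np.int64)
    valid = (pts_cam[:, 2] > 0.1) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Points with no image match keep zero features; matched points get painted.
    painted = np.zeros((len(points_xyz), C), dtype=cam_feats.dtype)
    painted[valid] = cam_feats[v[valid], u[valid]]
    decorated_points = np.concatenate([points_xyz, painted], axis=1)

    # Every camera pixel that no point lands on is simply dropped by this scheme.
    hit = np.zeros((H, W), dtype=bool)
    hit[v[valid], u[valid]] = True
    return decorated_points, hit.mean()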


1) First, there are networks that perform 3D object detection directly on the 3D point cloud, e.g. PointPillars and PointRCNN.
[Figure: PointPillars (2019)]

[Figure: PointRCNN (2019)]

2) It is therefore natural to think of augmenting the LiDAR features with other features; projecting the LiDAR point cloud onto 2D images to extract features and strengthen the LiDAR features is a common sensor-fusion technique.
3) However, the resolution gap between the LiDAR point cloud and the image, feature mismatch, and environmental factors all have an impact.
4) Although the camera-to-LiDAR case (Figure 1b) looks as if it also projects onto the image, just like (a), the essential difference is that (b) performs 3D detection while (a) performs 2D detection.
5) As an aside, FUTR3D is an algorithm that uses 3D reference points to sample features from different modalities.
[Figure: FUTR3D (2023)]

Such density differences will become even more drastic for sparser LiDARs (or imaging radars).


Summary 4:
This part is relatively important; read it in full.

In this paper, we propose BEVFusion to unify multi-modal features in a shared bird’s-eye view (BEV) representation space for task-agnostic learning.

We maintain both geometric structure and semantic density (see Figure 1c) and naturally support most 3D perception tasks (since their output space can be naturally captured in BEV).

While converting all features to BEV, we identify the major prohibitive efficiency bottleneck in the view transformation: i.e., the BEV pooling operation alone takes more than 80% of the model’s runtime.

Then, we propose a specialized kernel with precomputation and interval reduction to eliminate this bottleneck, achieving more than 40× speedup.

    • Aside: the authors posted an issue about this (screenshot not reproduced here).

      • According to that issue, the precomputation does not appear to be used in the released code.
      • The precomputation idea itself is sound, though, and saves a lot of time: the BEV pooling in Lidar_AI_Solution and BEVPoolV2 both build on it, which is why they are so much faster.
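To make the precomputation and interval-reduction idea concrete, here is a minimal NumPy sketch; the real implementation is a CUDA kernel, and the function names below (`precompute_bev_intervals`, `bev_pool`) are illustrative, not the repo's API. Because the assignment of lifted camera-feature points to BEV cells depends only on the calibration, the sorted order and per-cell intervals can be computed once and cached; at runtime each occupied BEV cell then sums its own contiguous slice instead of every point performing a scattered atomic add.

```python
import numpy as np

def precompute_bev_intervals(bev_cell_index):
    """Offline step (the 'precomputation'): sort the lifted camera-feature
    points by their flattened BEV cell index and record, for every occupied
    cell, the [start, end) interval it owns in the sorted order.  The mapping
    depends only on the camera calibration, so it can be cached across frames.
    """
    order = np.argsort(bev_cell_index)
    sorted_idx = bev_cell_index[order]
    # A new interval starts wherever the sorted cell index changes.
    starts = np.flatnonzero(np.r_[True, sorted_idx[1:] != sorted_idx[:-1]])
    ends = np.r_[starts[1:], len(sorted_idx)]
    return order, sorted_idx[starts], starts, ends

def bev_pool(point_feats, order, cells, starts, ends, bev_h, bev_w):
    """Interval reduction: each occupied BEV cell sums the features of its own
    contiguous slice, instead of every point doing a scattered atomic add."""
    bev = np.zeros((bev_h * bev_w, point_feats.shape[1]), dtype=point_feats.dtype)
    feats_sorted = point_feats[order]
    for cell, s, e in zip(cells, starts, ends):   # in the CUDA kernel: one thread per interval
        bev[cell] = feats_sorted[s:e].sum(axis=0)
    return bev.reshape(bev_h, bev_w, -1)
```

Usage would be `order, cells, starts, ends = precompute_bev_intervals(flat_idx)` once per calibration, then one `bev_pool(...)` call per frame; the grid size and feature dimension here are whatever the view transformer produces.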

Finally, we apply the fully-convolutional BEV encoder to fuse the unified BEV features and append a few task-specific heads to support different target tasks.
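A toy PyTorch sketch of this stage (not the released BEVFusion architecture; the channel sizes and the `det_head`/`seg_head` layouts are made-up placeholders), just to show how the camera-BEV and LiDAR-BEV maps are fused once and then shared by several task heads:

```python
import torch
import torch.nn as nn

class BEVFuserWithHeads(nn.Module):
    """Toy sketch of the fusion stage: concatenate camera-BEV and LiDAR-BEV
    feature maps, fuse them with a small fully-convolutional encoder, and
    attach task-specific heads that all share the fused BEV features.
    Channel sizes and head layouts here are illustrative placeholders."""

    def __init__(self, cam_ch=80, lidar_ch=256, fused_ch=256,
                 num_det_classes=10, num_map_classes=6):
        super().__init__()
        self.encoder = nn.Sequential(            # fully-convolutional BEV encoder
            nn.Conv2d(cam_ch + lidar_ch, fused_ch, 3, padding=1),
            nn.BatchNorm2d(fused_ch), nn.ReLU(inplace=True),
            nn.Conv2d(fused_ch, fused_ch, 3, padding=1),
            nn.BatchNorm2d(fused_ch), nn.ReLU(inplace=True),
        )
        # One lightweight head per task on top of the shared fused features.
        self.det_head = nn.Conv2d(fused_ch, num_det_classes + 7, 1)  # scores + box params
        self.seg_head = nn.Conv2d(fused_ch, num_map_classes, 1)      # per-cell map classes

    def forward(self, cam_bev, lidar_bev):
        fused = self.encoder(torch.cat([cam_bev, lidar_bev], dim=1))
        return {"detection": self.det_head(fused), "segmentation": self.seg_head(fused)}
```

In the real framework the detection branch is a full 3D detection head rather than a single convolution; the point here is only that every task reads from the same fused BEV features, which is what makes the framework task-agnostic.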

Summary 5:
Performance gains.

BEVFusion sets the new state-of-the-art performance on the nuScenes benchmark.

On 3D object detection, it ranks 1st on the leaderboard among all solutions.

BEVFusion demonstrates even more significant improvements on BEV map segmentation.

It achieves 6% higher mIoU than camera-only models and 13.6% higher mIoU than LiDAR-only models, while existing fusion methods hardly work.
https://www.nuscenes.org/object-detection?externalData=all&mapData=all&modalities=Any

BEVFusion is efficient, delivering all these results with 1.9× lower computation cost.

Summary 6

BEVFusion breaks the long-standing belief that point-level fusion is the best solution to multi-sensor fusion.

Simplicity is also its key strength.

We hope this work will serve as a simple yet strong baseline for future sensor fusion research and inspire the researchers to rethink the design and paradigm for generic multi-task multi-sensor fusion.

2. Related Work


Literature review, methodological reference, demonstrating novelty, avoiding duplication, and providing directions for follow-up research.

  • LiDAR-Based 3D Perception.

      • Researchers have designed single-stage 3D object detectors [72, 21, 65, 73, 66, 71] that extract flattened point cloud features using PointNets [41] or SparseConvNet [17] and perform detection in the BEV space.
      • Later, Yin et al. [67] and others [15, 5, 42, 14, 6, 60] have explored anchor-free 3D object detection.
      • Another stream of research [49, 10, 50, 47, 48, 24] focuses on two-stage object detection, which adds an RCNN network to existing one-stage object detectors.
      • There are also U-Net like models specialized for 3D semantic segmentation [17, 13, 52, 33, 75], an important task for offline HD map construction.
  • Camera-Based 3D Perception.

      • Due to the high cost of LiDAR sensors, researchers have spent significant efforts on camera-only 3D perception.
      • FCOS3D [57] extends image detectors [53] with additional 3D regression branches, which is later improved in terms of depth modeling [58, 4].
      • Instead of performing object detection in the perspective view, DETR3D [59], PETR [30] and GraphDETR3D [11] design DETR [74, 61]-based detection heads with learnable object queries in the 3D space.
      • Inspired by the design of LiDAR-based detectors, another type of camera-only 3D perception models explicitly converts the camera features from perspective view to the bird’s-eye view using a view transformer [37, 46, 45, 39].
      • BEVDet [20] and M2BEV [63] effectively extend LSS [39] and OFT [46] to 3D object detection, achieving state-of-the-art performance upon release.
      • CaDDN [43] adds explicit depth estimation supervision to the view transformer.
      • BEVDet4D [19], BEVFormer [25] and PETRv2 [31] exploit temporal cues in multi-camera 3D object detection, achieving significant improvement over single-frame methods.
      • BEVFormer [25], CVT [70] and Ego3RT [35] also study using multi-head attention to perform the view transformation.
  • Multi-Sensor Fusion.

      • Recently, multi-sensor fusion arouses increased interest in the 3D detection community.
      • Existing approaches can be classified into proposal-level and point-level fusion methods.
      • MV3D [7] creates object proposals in 3D and projects the proposals to images to extract RoI features.
      • F-PointNet [40], F-ConvNet [62] and CenterFusion [36] all lift image proposals into a 3D frustum.
      • Lately, FUTR3D [8] and TransFusion [1] define object queries in the 3D space and fuse image features onto these proposals.
      • Proposal-level fusion methods are object-centric and cannot trivially generalize to other tasks such as BEV map segmentation.
      • Point-level fusion methods, on the other hand, usually paint image semantic features onto foreground LiDAR points and perform LiDAR-based detection on the decorated point cloud inputs.
      • As such, they are both object-centric and geometric-centric.
      • Among all these methods, PointPainting [54], PointAugmenting [55], MVP [68], FusionPainting [64], AutoAlign [12] and FocalSparseCNN [9] are (LiDAR) input-level decoration, while Deep Continuous Fusion [27] and DeepFusion [23] are feature-level decoration.
  • Multi-Task Learning.

      • Multi-task learning have been well-studied in the computer vision community.
      • Researchers have studied to jointly perform object detection and instance segmentation [44, 3] and have extended to pose estimation and human-object interaction [18, 51, 56, 16].
      • A few concurrent works including M2BEV [63], BEVFormer [25] and BEVerse [69] jointly perform object detection and BEV segmentation in 3D.
      • None of the above methods considers multi-sensor fusion.
      • MMF [26] simultaneously works on depth completion and object detection with both camera and LiDAR inputs, but is still object-centric and not applicable to BEV map segmentation.

In contrast to all existing methods, BEVFusion performs sensor fusion in a shared BEV space and treats foreground and background, geometric and semantic information equally.

BEVFusion is a generic multi-task multi-sensor perception framework.
