BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation (Paper Reading Notes)

A paragraph-by-paragraph reading of the original paper.

Disclaimer: these notes are a personal study record only.

Paper Information

Title: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation (ICRA 2023)
Authors: Zhijian Liu*, Haotian Tang*, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han (* indicates equal contribution)
Paper link: https://arxiv.org/pdf/2205.13542.pdf
Code: https://github.com/mit-han-lab/bevfusion

Contents

Abstract
1. Introduction
2. Related Work
3. Method
  3.1 Unified Representation
  3.2 Efficient Camera-to-BEV Transformation
  3.3 Fully-Convolutional Fusion
  3.4 Multi-Task Heads
4. Experiments
  4.1 3D Object Detection
  4.2 BEV Map Segmentation
5. Analysis
6. Conclusion
References


Abstract

Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40×. BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9× lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.

1. Introduction

Autonomous driving systems are equipped with diverse sensors. For instance, Waymo's self-driving vehicles have 29 cameras, 6 radars, and 5 LiDARs. Different sensors provide complementary signals: e.g., cameras capture rich semantic information, LiDARs provide accurate spatial information, while radars offer instant velocity estimation. Therefore, multi-sensor fusion is of great importance for accurate and reliable perception.

Data from different sensors are expressed in fundamentally different modalities: e.g., cameras capture data in perspective view and LiDAR in 3D view. To resolve this view discrepancy, we have to find a unified representation that is suitable for multi-task multi-modal feature fusion. Due to the tremendous success of 2D perception, a natural idea is to project the LiDAR point cloud onto the camera image and process the RGB-D data with 2D CNNs. However, this LiDAR-to-camera projection introduces severe geometric distortion (see Figure 1a), which makes it less effective for geometric-oriented tasks, such as 3D object recognition.

(a) To camera: geometrically lossy

(b) To LiDAR: semantically lossy

(c) Shared BEV space

Figure 1: BEVFusion unifies camera and LiDAR features in a shared BEV space instead of mapping one modality onto the other. It preserves both the semantic density of the cameras and the geometric structure of the LiDAR.

Recent sensor fusion methods follow the other direction. They augment the LiDAR point cloud with semantic labels [54], CNN features [55, 23] or virtual points from 2D images [68], and then apply an existing LiDAR-based detector to predict 3D bounding boxes. Although they have demonstrated remarkable performance on large-scale detection benchmarks, these point-level fusion methods barely work on semantic-oriented tasks, such as BEV map segmentation [37, 39, 22, 70]. This is because the camera-to-LiDAR projection is semantically lossy (see Figure 1b): for a typical 32-beam LiDAR scanner, only 5% of camera features will be matched to a LiDAR point while all others will be dropped. Such density differences become even more drastic for sparser LiDARs (or imaging radars).

In this paper, we propose BEVFusion to unify multi-modal features in a shared bird's-eye view (BEV) representation space for task-agnostic learning. We maintain both geometric structure and semantic density (see Figure 1c) and naturally support most 3D perception tasks (since their output space can be naturally captured in BEV). While converting all features to BEV, we identify the major prohibitive efficiency bottleneck in the view transformation: i.e., the BEV pooling operation alone takes more than 80% of the model's runtime. Then, we propose a specialized kernel with precomputation and interval reduction to eliminate this bottleneck, achieving more than 40× speedup. Finally, we apply a fully-convolutional BEV encoder to fuse the unified BEV features and append a few task-specific heads to support different target tasks.

BEVFusion sets the new state-of-the-art performance on the nuScenes benchmark. On 3D object detection, it ranks 1st on the leaderboard among all solutions. BEVFusion demonstrates even more significant improvements on BEV map segmentation. It achieves 6% higher mIoU than camera-only models and 13.6% higher mIoU than LiDAR-only models, while existing fusion methods hardly work. BEVFusion is efficient, delivering all these results with 1.9× lower computation cost.

BEVFusion breaks the long-standing belief that point-level fusion is the best solution to multi-sensor fusion. Simplicity is also its key strength. We hope this work will serve as a simple yet strong baseline for future sensor fusion research and inspire researchers to rethink the design and paradigm of generic multi-task multi-sensor fusion.

2. Related Work

LiDAR-Based 3D Perception. Researchers have designed single-stage 3D object detectors [72, 21, 65, 73, 66, 71] that extract flattened point cloud features using PointNets [41] or SparseConvNet [17] and perform detection in the BEV space. Later, Yin et al. [67] and others [15, 5, 42, 14, 6, 60] explored anchor-free 3D object detection. Another stream of research [49, 10, 50, 47, 48, 24] focuses on two-stage object detection, which adds an RCNN network to existing one-stage object detectors. There are also U-Net-like models specialized for 3D semantic segmentation [17, 13, 52, 33, 75], an important task for offline HD map construction.

Camera-Based 3D Perception. Due to the high cost of LiDAR sensors, researchers have spent significant effort on camera-only 3D perception. FCOS3D [57] extends image detectors [53] with additional 3D regression branches, which is later improved in terms of depth modeling [58, 4]. Instead of performing object detection in the perspective view, DETR3D [59], PETR [30] and Graph-DETR3D [11] design DETR [74, 61]-based detection heads with learnable object queries in the 3D space. Inspired by the design of LiDAR-based detectors, another type of camera-only 3D perception model explicitly converts the camera features from the perspective view to the bird's-eye view using a view transformer [37, 46, 45, 39]. BEVDet [20] and M2BEV [63] effectively extend LSS [39] and OFT [46] to 3D object detection, achieving state-of-the-art performance upon release. CaDDN [43] adds explicit depth estimation supervision to the view transformer. BEVDet4D [19], BEVFormer [25] and PETRv2 [31] exploit temporal cues in multi-camera 3D object detection, achieving significant improvements over single-frame methods. BEVFormer [25], CVT [70] and Ego3RT [35] also study using multi-head attention to perform the view transformation.

Figure 2: BEVFusion extracts features from multi-modal inputs and efficiently transforms them into a shared bird's-eye view (BEV) space via view transformation. It fuses the unified BEV features with a fully-convolutional BEV encoder and supports different tasks with task-specific heads.

Multi-Sensor Fusion. Recently, multi-sensor fusion has aroused increasing interest in the 3D detection community. Existing approaches can be classified into proposal-level and point-level fusion methods. MV3D [7] creates object proposals in 3D and projects the proposals to images to extract RoI features. F-PointNet [40], F-ConvNet [62] and CenterFusion [36] all lift image proposals into a 3D frustum. Lately, FUTR3D [8] and TransFusion [1] define object queries in the 3D space and fuse image features onto these proposals. Proposal-level fusion methods are object-centric and cannot trivially generalize to other tasks such as BEV map segmentation. Point-level fusion methods, on the other hand, usually paint image semantic features onto foreground LiDAR points and perform LiDAR-based detection on the decorated point cloud inputs. As such, they are both object-centric and geometric-centric. Among all these methods, PointPainting [54], PointAugmenting [55], MVP [68], FusionPainting [64], AutoAlign [12] and FocalSparseCNN [9] are (LiDAR) input-level decoration, while Deep Continuous Fusion [27] and DeepFusion [23] are feature-level decoration.

Multi-Task Learning. Multi-task learning has been well studied in the computer vision community. Researchers have studied jointly performing object detection and instance segmentation [44, 3] and have extended this to pose estimation and human-object interaction [18, 51, 56, 16]. A few concurrent works, including M2BEV [63], BEVFormer [25] and BEVerse [69], jointly perform object detection and BEV segmentation in 3D. None of the above methods considers multi-sensor fusion. MMF [26] simultaneously works on depth completion and object detection with both camera and LiDAR inputs, but is still object-centric and not applicable to BEV map segmentation. In contrast to all existing methods, BEVFusion performs sensor fusion in a shared BEV space and treats foreground and background, geometric and semantic information equally. BEVFusion is a generic multi-task multi-sensor perception framework.

3. Method

BEVFusion focuses on multi-sensor fusion (i.e., multi-view cameras and LiDAR) for multi-task 3D perception (i.e., detection and segmentation). We provide an overview of our framework in Figure 2. Given different sensory inputs, we first apply modality-specific encoders to extract their features. We transform the multi-modal features into a unified BEV representation that preserves both geometric and semantic information. We identify the efficiency bottleneck of the view transformation and accelerate BEV pooling with precomputation and interval reduction. We then apply a convolution-based BEV encoder to the unified BEV features to alleviate the local misalignment between different features. Finally, we append a few task-specific heads to support different 3D tasks.

3.1 Unified Representation

Different features can live in different views. For instance, camera features are in the perspective view, while LiDAR/radar features are typically in the 3D/bird's-eye view. Even for camera features, each of them has a distinct viewing angle (i.e., front, back, left, right). This view discrepancy makes feature fusion difficult since the same element in different feature tensors might correspond to completely different spatial locations (and naïve elementwise feature fusion will not work in this case). Therefore, it is crucial to find a shared representation, such that (1) all sensor features can be easily converted to it without information loss, and (2) it is suitable for different types of tasks.

Figure 3: Camera-to-BEV transformation (a) is the key step for sensor fusion in the unified BEV space. However, existing implementations are extremely slow and can take up to 2 seconds for a single scene. We propose efficient BEV pooling (b) with interval reduction and fast grid association via precomputation, speeding up the view transformation module by 40× (c, d).

To Camera. Motivated by RGB-D data, one choice is to project the LiDAR point cloud onto the camera plane and render the 2.5D sparse depth. However, this conversion is geometrically lossy: two neighbors on the depth map can be far away from each other in 3D space. This makes the camera view less effective for tasks that focus on the object/scene geometry, such as 3D object detection.

To LiDAR. Most state-of-the-art sensor fusion methods [54, 68, 23] decorate LiDAR points with their corresponding camera features (e.g., semantic labels, CNN features or virtual points). However, this camera-to-LiDAR projection is semantically lossy. Camera and LiDAR features have drastically different densities, resulting in less than 5% of camera features being matched to a LiDAR point (for a 32-channel LiDAR scanner). Giving up the semantic density of camera features severely hurts the model's performance on semantic-oriented tasks (such as BEV map segmentation). Similar drawbacks also apply to more recent fusion methods in the latent space (e.g., object queries) [8, 1].

To Bird's-Eye View. We adopt the bird's-eye view (BEV) as the unified representation for fusion. This view is friendly to almost all perception tasks since their output space is also in BEV. More importantly, the transformation to BEV keeps both the geometric structure (from LiDAR features) and the semantic density (from camera features). On the one hand, the LiDAR-to-BEV projection flattens the sparse LiDAR features along the height dimension and thus does not create the geometric distortion shown in Figure 1a. On the other hand, the camera-to-BEV projection casts each camera feature pixel back into a ray in 3D space (detailed in the next section), which results in a dense BEV feature map (Figure 1c) that retains the full semantic information from the cameras.

3.2 Efficient Camera-to-BEV Transformation

Camera-to-BEV transformation is non-trivial because the depth associated with each camera feature pixel is inherently ambiguous. Following LSS [39] and BEVDet [20, 19], we explicitly predict a discrete depth distribution for each pixel. We then scatter each feature pixel into D discrete points along the camera ray and rescale the associated features by their corresponding depth probabilities (Figure 3a). This generates a camera feature point cloud of size N·H·W·D, where N is the number of cameras and (H, W) is the camera feature map size. The 3D feature point cloud is quantized along the x, y axes with a step size of r (e.g., 0.4m). We use the BEV pooling operation to aggregate all features within each r × r BEV grid cell and flatten the features along the z-axis.
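
To make the lifting step concrete, below is a minimal PyTorch-style sketch of the view transformation described above (a per-pixel depth distribution, scattering features along the camera ray, and a naive BEV pooling by summation). All tensor shapes, helper names, and the sum aggregation are illustrative assumptions, not the actual BEVFusion implementation.

```python
# Minimal sketch of the LSS-style camera-to-BEV lifting described above.
# Shapes and helper names are illustrative, not the actual BEVFusion code.
import torch

def lift_camera_features(feats, depth_logits, frustum_xyz, grid_size, voxel_size):
    """
    feats:        (N, C, H, W)    per-camera feature maps
    depth_logits: (N, D, H, W)    predicted logits over D discrete depth bins
    frustum_xyz:  (N, D, H, W, 3) 3D coordinates of every (pixel, depth-bin) point
    grid_size:    (X, Y)          number of BEV cells along x and y
    voxel_size:   float           BEV cell size r in meters (e.g. 0.4)
    """
    N, C, H, W = feats.shape

    # Per-pixel categorical depth distribution.
    depth_prob = depth_logits.softmax(dim=1)                       # (N, D, H, W)

    # Outer product: scatter each feature pixel along its camera ray,
    # rescaled by the depth probabilities -> (N, D, H, W, C) point features.
    point_feats = depth_prob.unsqueeze(-1) * feats.permute(0, 2, 3, 1).unsqueeze(1)

    # Quantize the 3D feature points into BEV cells along x, y.
    coords = (frustum_xyz[..., :2] / voxel_size).long()            # (N, D, H, W, 2)
    point_feats = point_feats.reshape(-1, C)
    coords = coords.reshape(-1, 2)

    # Keep only points that fall inside the BEV grid.
    X, Y = grid_size
    mask = (coords[:, 0] >= 0) & (coords[:, 0] < X) & (coords[:, 1] >= 0) & (coords[:, 1] < Y)
    point_feats, coords = point_feats[mask], coords[mask]

    # BEV pooling: sum all features landing in the same r x r cell.
    # This is a slow reference version; the rest of Sec. 3.2 replaces it
    # with precomputation and an optimized interval-reduction kernel.
    bev = feats.new_zeros(X * Y, C)
    flat_idx = coords[:, 0] * Y + coords[:, 1]
    bev.index_add_(0, flat_idx, point_feats)
    return bev.reshape(X, Y, C).permute(2, 0, 1)                   # (C, X, Y)
```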

Though simple, BEV pooling is surprisingly inefficient and slow, taking more than 500ms on an RTX 3090 GPU (while the rest of our model only takes around 100ms). This is because the camera feature point cloud is very large: for a typical workload, around 2 million points may be generated per frame, two orders of magnitude denser than a LiDAR feature point cloud. To lift this efficiency bottleneck, we propose to optimize BEV pooling with precomputation and interval reduction.

Precomputation. The first step of BEV pooling is to associate each point in the camera feature point cloud with a BEV grid cell. Unlike LiDAR point clouds, the coordinates of the camera feature point cloud are fixed (as long as the camera intrinsics and extrinsics stay the same, which is usually the case after proper calibration). Motivated by this, we precompute the 3D coordinates and the BEV grid index of each point. We also sort all points according to their grid indices and record the rank of each point. During inference, we only need to reorder all feature points based on the precomputed ranks. This caching mechanism reduces the latency of grid association from 17ms to 4ms.
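
A rough sketch of this caching idea is shown below, under the assumption that the frustum geometry is available offline. The function names and the dummy-cell handling of out-of-range points are illustrative, not the paper's code.

```python
# Sketch of the precomputation step: because the frustum geometry is fixed,
# the BEV cell index of every (camera, pixel, depth-bin) point can be cached
# once, together with a sorting order. Names are illustrative.
import torch

def precompute_bev_ranks(frustum_xyz, voxel_size, grid_size):
    """Run once after calibration; returns the cached sort order and cell ids."""
    X, Y = grid_size
    coords = (frustum_xyz[..., :2] / voxel_size).long().reshape(-1, 2)
    valid = (coords[:, 0] >= 0) & (coords[:, 0] < X) & (coords[:, 1] >= 0) & (coords[:, 1] < Y)
    cell_id = coords[:, 0] * Y + coords[:, 1]
    cell_id[~valid] = X * Y                       # park invalid points in a dummy cell
    ranks = torch.argsort(cell_id)                # cached: points sorted by BEV cell
    return ranks, cell_id[ranks], valid[ranks]

def reorder_at_inference(point_feats, ranks):
    """At inference time, only a gather by the cached ranks is needed."""
    return point_feats.reshape(-1, point_feats.shape[-1])[ranks]
```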

Interval Reduction. After grid association, all points within the same BEV grid cell are consecutive in the tensor representation. The next step of BEV pooling is to aggregate the features within each BEV grid cell by some symmetric function (e.g., mean, max, or sum). As shown in Figure 3b, the existing implementation [39] first computes the prefix sum over all points and then subtracts the values at the boundaries where indices change. However, the prefix-sum operation requires tree reduction on the GPU and produces many unused partial sums (since we only need the values at the boundaries), both of which are inefficient. To accelerate feature aggregation, we implement a specialized GPU kernel that parallelizes directly over BEV grid cells: we assign a GPU thread to each cell, which computes its interval sum and writes the result back. This kernel removes the dependency between outputs (thus requiring no multi-level tree reduction) and avoids writing the partial sums to DRAM, reducing the latency of feature aggregation from 500ms to 2ms (Figure 3c).

Takeaways. With our optimized BEV pooling, the camera-to-BEV transformation is 40× faster: the latency is reduced from more than 500ms to 12ms (only 10% of our model's end-to-end runtime) and scales well across different feature resolutions (Figure 3d). This is the key enabler for unifying multi-modal sensory features in the shared BEV representation. Two concurrent works also identify this efficiency bottleneck in camera-only 3D detection. They approximate the view transformer by assuming a uniform depth distribution [63] or truncating the points within each BEV grid cell [20]. In contrast, our technique is exact, without any approximation, while still being faster.

3.3 Fully-Convolutional Fusion

With all sensory features converted to the shared BEV representation, we can easily fuse them together with an elementwise operator (such as concatenation). Although they are in the same space, LiDAR BEV features and camera BEV features can still be spatially misaligned to some extent due to inaccurate depth in the view transformer. To this end, we apply a convolution-based BEV encoder (with a few residual blocks) to compensate for such local misalignment. Our method could potentially benefit from more accurate depth estimation (e.g., supervising the view transformer with ground-truth depth [43, 38]), which we leave to future work.

3.4 Multi-Task Heads

We apply multiple task-specific heads to the fused BEV feature map. Our method is applicable to most 3D perception tasks; we showcase two examples: 3D object detection and BEV map segmentation.

Detection. We use a class-specific center heatmap head to predict the center locations of all objects and a few regression heads to estimate object size, rotation, and velocity. We refer the reader to previous 3D detection papers [1, 67, 68] for more details.

Segmentation. Different map categories may overlap (e.g., crosswalk is a subset of drivable space). Therefore, we formulate this problem as multiple binary semantic segmentation tasks, one per class. We follow CVT [70] and train the segmentation head with the standard focal loss [29].

4. Experiments

We evaluate BEVFusion for camera-LiDAR fusion on 3D object detection and BEV map segmentation, covering both geometric- and semantic-oriented tasks. Our framework can be easily extended to support other types of sensors (such as radars and event-based cameras) and other 3D perception tasks (such as 3D object tracking and motion forecasting).

Model. We use Swin-T [32] as our image backbone and VoxelNet [65] as our LiDAR backbone. We apply FPN [28] to fuse multi-scale camera features and produce a feature map at 1/8 of the input size. We downsample camera images to 256×704 and voxelize the LiDAR point cloud with a 0.075m voxel size for detection and 0.1m for segmentation. As the detection and segmentation tasks require BEV feature maps with different spatial ranges and sizes, we apply grid sampling with bilinear interpolation before each task-specific head to explicitly transform between the different BEV feature maps.

Training. Unlike existing approaches [54, 55, 1] that freeze the camera encoder, we train the entire model in an end-to-end manner. We apply both image and LiDAR data augmentation to prevent overfitting. Optimization is carried out using AdamW [34] with a weight decay of 10^-2.

Dataset. We evaluate our method on nuScenes [2], a large-scale outdoor dataset released under the CC BY-NC-SA 4.0 license. It has diverse annotations to support a variety of tasks (such as 3D object detection/tracking and BEV map segmentation). Each of the 40,157 annotated samples contains six monocular camera images with a combined 360-degree FoV and a 32-beam LiDAR scan.

Figure 4: Qualitative results of BEVFusion on 3D object detection and BEV map segmentation. It accurately recognizes distant and small objects (top) and parses a crowded nighttime scene (bottom).

4.1 3D Object Detection

We first experiment on the geometric-centric 3D object detection benchmark, where BEVFusion achieves superior performance with lower computation cost and measured latency.

Setting. We use the mean average precision (mAP) across 10 foreground classes and the nuScenes detection score (NDS) as our detection metrics. We also measure the single-inference #MACs and latency on an RTX 3090 GPU for all open-source methods. We use a single model without any test-time augmentation for both val and test results.

Results. As shown in Table 1, BEVFusion achieves state-of-the-art results on the nuScenes detection benchmark, with close-to-real-time (8.4 FPS) inference speed on a desktop GPU. Compared with TransFusion [1], BEVFusion achieves a 1.3% improvement in test-split mAP and NDS, while significantly reducing MACs by 1.9× and measured latency by 1.3×. BEVFusion also compares favorably against the representative point-level fusion methods PointPainting [54] and MVP [68], with a 1.6× speedup, 1.5× MACs reduction, and 3.8% higher mAP on the test set. We argue that the efficiency gain of BEVFusion comes from choosing the BEV space as the shared fusion space, which fully utilizes all camera features instead of just a 5% sparse subset. As a result, BEVFusion can achieve the same performance with far fewer MACs. Combined with the efficient BEV pooling operator in Section 3.2, BEVFusion translates the MACs reduction into measured speedup.

4.2 BEV Map Segmentation

We further compare BEVFusion with state-of-the-art 3D perception models on the semantic-centric BEV map segmentation task, where BEVFusion achieves an even larger performance boost.

Setting. We report the Intersection-over-Union (IoU) on 6 background classes (drivable space, pedestrian crossing, walkway, stop line, car-parking area, and lane divider) and the class-averaged mean IoU as our evaluation metrics. As different classes may overlap (e.g., the car-parking area is also drivable), we evaluate the binary segmentation performance of each class separately and select the highest IoU across different thresholds [70]. For each frame, we only perform the evaluation in the [-50m, 50m]×[-50m, 50m] region around the ego car, following [39, 70, 63, 25]. In BEVFusion, we use a single model that jointly performs binary segmentation for all classes instead of following the conventional approach of training a separate model for each class. This results in 6× faster inference and training. We reproduced the results of all open-source competing methods.

Results. We report the BEV map segmentation results in Table 2. In contrast to 3D object detection, which is a geometric-oriented task, map segmentation is semantic-oriented. As a result, our camera-only BEVFusion model outperforms LiDAR-only baselines by 8-13%. This observation is the exact opposite of the results in Table 1, where state-of-the-art camera-only 3D detectors are outperformed by LiDAR-only detectors by almost 20 mAP. Our camera-only model also boosts the performance of existing monocular BEV map segmentation methods by at least 12%. In the multi-modality setting, we further improve over the monocular BEVFusion by 6 mIoU and achieve a >13% improvement over state-of-the-art sensor fusion methods [54, 68]. This is because both baseline methods are object-centric and geometric-oriented: PointPainting [54] only decorates foreground LiDAR points, and MVP only densifies foreground 3D objects, so neither helps with segmenting map components. Worse still, both methods assume that LiDAR should be the more effective modality in sensor fusion, which is not true according to our observations in Table 2.

 

Figure 5: BEVFusion consistently outperforms state-of-the-art single-modality and multi-modality detectors under different LiDAR sparsities, object sizes, and object distances, especially under more challenging settings (i.e., sparser point clouds, small or distant objects).

5. Analysis

We present in-depth analyses of BEVFusion against single-modality models and state-of-the-art multi-modality models under different circumstances.

Weather and Lighting. We systematically analyze the performance of BEVFusion under different weather and lighting conditions in Table 3. Detecting objects in rainy weather is challenging for LiDAR-only models due to significant sensor noise. Thanks to the robustness of camera sensors under different weather conditions, BEVFusion improves CenterPoint by 10.7 mAP, closing the performance gap between sunny and rainy scenarios. Poor lighting conditions are challenging for both detection and segmentation models. For detection, MVP achieves a much smaller improvement than BEVFusion since it requires accurate 2D instance segmentation to generate multi-modal virtual points (MVPs), which is very challenging in dark or overexposed scenes (e.g., the second scene in Figure 4). For segmentation, even though camera-only BEVFusion greatly outperforms CenterPoint on the entire dataset in Table 2, its performance is much worse at nighttime. Our full BEVFusion significantly boosts its performance by 12.8 mIoU, an even larger gain than in the daytime, demonstrating the importance of geometric cues when the camera sensors fail.

Sizes and Distances. We also analyze the performance under different object sizes and distances. From Figure 5a, BEVFusion achieves consistent improvements over its LiDAR-only counterpart for both small and large objects, while MVP shows only negligible improvements for objects larger than 4m. This is because larger objects are typically much denser and benefit less from the augmented multi-modal virtual points (MVPs). Besides, BEVFusion brings larger improvements over the LiDAR-only model for smaller objects (Figure 5a) and more distant objects (Figure 5b), both of which are poorly covered by LiDAR and can therefore benefit more from the dense camera information.

Sparser LiDARs. We show the performance of the LiDAR-only detector CenterPoint [67], the multi-modality detector MVP [68], and our BEVFusion under different LiDAR sparsities in Figure 5c. BEVFusion consistently outperforms MVP under all sparsity levels with a 1.6× MACs reduction and achieves a 12% improvement in the 1-beam LiDAR scenario. MVP decorates the input point cloud and directly applies CenterPoint to the painted and densified LiDAR input. Thus, it naturally requires the LiDAR-only CenterPoint detector to perform well, which does not hold under sparse LiDAR settings (35.8 NDS with 1-beam input in Figure 5c). BEVFusion, in contrast, fuses multi-sensory information in a shared BEV space and therefore does not assume a strong LiDAR-only detector.

Multi-Task Learning. This paper focuses on the setting where different tasks are trained separately. Here, we present a pilot study of joint 3D detection and segmentation training. We re-scale the losses of the different tasks to the same magnitude and apply a separate BEV encoder for each task to provide the capacity to learn more task-specific features. As shown in Table 5, jointly training different tasks has a negative impact on the performance of each individual task, which is widely known as "negative transfer". Separating the BEV encoders partially alleviates this problem. A more sophisticated training scheme could further close this gap, which we leave for future work.
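
A toy sketch of this joint-training setup is shown below: the shared fused BEV map feeds two separate BEV encoders and task heads, and the two losses are re-weighted before summation. Module shapes, head sizes, and weights are all illustrative assumptions.

```python
# Sketch of the joint-training pilot: separate BEV encoders per task on top
# of the shared fused features, with the task losses rescaled before summing.
import torch
import torch.nn as nn

class MultiTaskBEV(nn.Module):
    def __init__(self, fused_channels=256):
        super().__init__()
        # Separate BEV encoders partially alleviate negative transfer.
        self.det_encoder = nn.Conv2d(fused_channels, fused_channels, 3, padding=1)
        self.seg_encoder = nn.Conv2d(fused_channels, fused_channels, 3, padding=1)
        self.det_head = nn.Conv2d(fused_channels, 10, 1)   # e.g. center heatmaps
        self.seg_head = nn.Conv2d(fused_channels, 6, 1)    # 6 binary map classes

    def forward(self, fused_bev):
        det_out = self.det_head(self.det_encoder(fused_bev))
        seg_out = self.seg_head(self.seg_encoder(fused_bev))
        return det_out, seg_out

def joint_loss(det_loss, seg_loss, det_weight=1.0, seg_weight=1.0):
    # Rescale the two losses to a comparable magnitude before summing.
    return det_weight * det_loss + seg_weight * seg_loss
```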

Ablation Studies. We present ablation studies in Table 4 to justify our design choices, using a shorter training schedule for the detectors. In Table 4a, we observe that BEVFusion brings large improvements to both LiDAR-only detection (+8.8%) and camera-only segmentation (+6.1%). This indicates that sensor fusion in a shared BEV space is beneficial for both geometric- and semantic-oriented tasks. Table 4b, Table 4c and Table 4d suggest that the detection variant of BEVFusion scales well with both voxel and image resolution, while the BEV segmentation performance plateaus when the image resolution grows above 256×704. We also notice in Table 4d that using FPN features at 1/8 of the input resolution provides the best performance for both detection and segmentation, and further increasing the computation is not helpful. Table 4f indicates that BEVFusion is general and works well with different backbones. It is also noteworthy that the common practice of freezing the image backbone in existing multi-sensor 3D object detection research [54, 55, 1] does not exploit the full potential of the camera feature extractor even for detection, and causes a drastic performance drop (10%) in BEV segmentation. We further demonstrate in Table 4e that augmentations on both image and LiDAR inputs help improve the performance of BEVFusion.

6. Conclusion

We present BEVFusion, an efficient and generic framework for multi-task multi-sensor 3D perception. BEVFusion unifies camera and LiDAR features in a shared BEV space that fully preserves both geometric and semantic information. To achieve this, we accelerate the slow camera-to-BEV transformation by more than 40×. BEVFusion breaks the long-lasting common practice that point-level fusion is the golden choice for multi-sensor perception systems. BEVFusion achieves state-of-the-art performance on both 3D detection and BEV map segmentation tasks with 1.5-1.9× less computation and a 1.3-1.6× measured speedup over existing solutions. We hope that BEVFusion can serve as a simple but powerful baseline and inspire future research on multi-task multi-sensor fusion.

Limitations. At present, BEVFusion still suffers performance degradation in joint multi-task training and has not yet unlocked the potential for a larger inference speedup in the multi-task setting. More accurate depth estimation [43, 38] is also an under-explored direction in this paper that could further boost the performance of BEVFusion.

Societal Impacts. Efficient and accurate multi-sensor perception is crucial for the safety of autonomous vehicles. BEVFusion cuts the computation cost of state-of-the-art multi-sensor fusion models in half and achieves large accuracy improvements on small and distant objects, as well as in rainy and nighttime conditions. It paves the way for safe and robust autonomous driving.

Acknowledgements. We would like to thank Xuanyao Chen and Brady Zhou for their guidance on detection and segmentation evaluation, and Yingfei Liu and Tiancai Wang for their helpful discussions. This work was supported by the National Science Foundation, Hyundai Motor, Qualcomm, NVIDIA and Apple. Zhijian Liu was partially supported by the Qualcomm Innovation Fellowship.

References

(The full reference list is not reproduced in this post.)
