https://arxiv.org/pdf/2205.13542.pdf
0. Abstract
Multi-sensor fusion is essential for an accurate and reliable autonomous driving system.
Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features.
However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation).
In this paper, we break this deeply-rooted convention with BEVFusion, an efficient and generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird’s-eye view (BEV) representation space, which nicely preserves both geometric and semantic information.
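To make the "shared BEV representation" idea concrete, below is a minimal sketch assuming both the camera branch and the LiDAR branch have already produced BEV feature maps of the same spatial resolution. The `BEVFuser` module and its channel sizes are illustrative assumptions, not the authors' actual code; the paper's fusion step is likewise convolution-based, which lets the network compensate for local misalignment between the two maps.

```python
# Illustrative sketch only: `BEVFuser` is a hypothetical stand-in for a
# convolution-based BEV fusion module, not BEVFusion's exact implementation.
import torch
import torch.nn as nn

class BEVFuser(nn.Module):
    def __init__(self, cam_channels: int, lidar_channels: int, out_channels: int):
        super().__init__()
        # Concatenate along channels, then convolve so the network can
        # compensate for local misalignment between the two BEV maps.
        self.encoder = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev: (B, C_cam, H, W); lidar_bev: (B, C_lidar, H, W)
        return self.encoder(torch.cat([cam_bev, lidar_bev], dim=1))
```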
To achieve this, we diagnose and lift key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than 40×.
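To see what is being optimized here, consider a naive reference version of BEV pooling: each lifted camera feature point is assigned to a BEV grid cell, and all features falling into the same cell are sum-reduced. The function name `bev_pool_naive` and the tensor shapes below are assumptions for illustration. The paper's reported 40× speedup comes from precomputing the point-to-cell association (the camera calibration is fixed) and using a dedicated interval-reduction GPU kernel instead of generic scatter operations, not from this naive form.

```python
# Naive reference sketch of BEV pooling (assumed shapes, not the paper's
# optimized CUDA kernel): sum-reduce lifted camera features per BEV cell.
import torch

def bev_pool_naive(feats: torch.Tensor, coords: torch.Tensor,
                   bev_h: int, bev_w: int) -> torch.Tensor:
    # feats:  (N, C) features of the lifted camera frustum points
    # coords: (N, 2) integer (row, col) BEV cell index of each point
    c = feats.shape[1]
    bev = torch.zeros(bev_h * bev_w, c, dtype=feats.dtype, device=feats.device)
    flat = coords[:, 0] * bev_w + coords[:, 1]   # linearized cell index
    bev.index_add_(0, flat, feats)               # scatter-sum per cell
    return bev.view(bev_h, bev_w, c)
```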
BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes.
It establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9× lower computation cost.
Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.
1. Introduction
Summary 1:
Autonomous driving systems are equipped with diverse sensors that provide complementary signals, so multi-sensor fusion is crucial for accurate and reliable perception.
Autonomous driving systems are equipped with diverse sensors.
For instance, Waymo’s self-driving vehicles have 29 cameras, 6 radars, and 5 LiDARs.
Different sensors provide complementary signals.
E.g., cameras capture rich semantic information, LiDARs provide accurate spatial information, while radars offer instant velocity estimation.