MIT-BEVFusion Series 4-2: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation — Paper Notes (Section 3)

3. Method

BEVFusion focuses on multi-sensor fusion (i.e., multi-view cameras and LiDAR) for multi-task 3D perception (i.e., detection and segmentation).

We provide an overview of our framework in Figure 2.

Given different sensory inputs, we first apply modality-specific encoders to extract their features.

We transform multi-modal features into a unified BEV representation that preserves both geometric and semantic information.

We identify the efficiency bottleneck of the view transformation and accelerate BEV pooling with precomputation and interval reduction.

We then apply the convolution-based BEV encoder to the unified BEV features to alleviate the local misalignment between different features.

Finally, we append a few task-specific heads to support different 3D tasks.

3.1 Unified Representation

Different features can exist in different views. For instance, camera features are in the perspective view, while LiDAR/radar features are typically in the 3D/bird's-eye view.

Even for camera features, each one of them has a distinct viewing angle (i.e., front, back, left, right).

This view discrepancy makes the feature fusion difficult since the same element in different feature tensors might correspond to completely different spatial locations (and the naïve elementwise feature fusion will not work in this case).

Therefore, it is crucial to find a shared representation, such that (1) all sensor features can be easily converted to it without information loss, and (2) it is suitable for different types of tasks.
To Camera. Motivated by RGB-D data, one choice is to project the LiDAR point cloud to the camera plane and render the 2.5D sparse depth. However, this conversion is geometrically lossy.

Two neighbors on the depth map can be far away from each other in the 3D space. This makes the camera view less effective for tasks that focus on the object/scene geometry, such as 3D object detection.

To LiDAR. Most state-of-the-art sensor fusion methods [54, 68, 23] decorate LiDAR points with their corresponding camera features (e.g., semantic labels, CNN features or virtual points). However, this camera-to-LiDAR projection is semantically lossy.

Camera and LiDAR features have drastically different densities, resulting in less than 5% of camera features being matched to a LiDAR point (for a 32-channel LiDAR scanner). Giving up the semantic density of camera features severely hurts the model's performance on semantic-oriented tasks (such as BEV map segmentation).

Similar drawbacks also apply to more recent fusion methods in the latent space (e.g., object query) [8, 1].

To Bird's-Eye View. We adopt the bird's-eye view (BEV) as the unified representation for fusion. This view is friendly to almost all perception tasks since the output space is also in BEV.

More importantly, the transformation to BEV keeps both geometric structure (from LiDAR features) and semantic density (from camera features).

On the one hand, the LiDAR-to-BEV projection flattens the sparse LiDAR features along the height dimension and thus does not introduce the geometric distortion shown in Figure 1a.

On the other hand, the camera-to-BEV projection casts each camera feature pixel back into a ray in the 3D space (detailed in the next section), resulting in a dense BEV feature map (Figure 1c) that retains the full semantic information from the cameras.

3.2 Efficient Camera-to-BEV Transformation

  • Camera-to-BEV transformation is non-trivial because the depth associated with each camera feature pixel is inherently ambiguous.

Converting the LiDAR point cloud to BEV is relatively easy:

  • The points are already 3D data.
  • The input space and the output space coincide, so they are spatially aligned.

Converting camera images to BEV is harder:

  • Cameras provide no direct depth information.
  • Lifting camera features into a shared 3D space is therefore difficult.
  • The input and output modalities do not match: the input is six images from six cameras, while the output is a single BEV map.

Following LSS [39] and BEVDet [20, 19], we explicitly predict the discrete depth distribution of each pixel.

Diagram (boardmix): https://boardmix.cn/app/share/CAE.CL230gwgASoQcyGisv_mQIuJj-XzAp77pzAGQAE/Nbd3k2
(Figure from LSS, 2008.05711.pdf, arxiv.org, 2020. Keywords: discrete depth distribution, outer product, frustum.)

  • Outer-product illustration: (118, 32, 88) combined with (32, 88, 80) via an outer product becomes (118, 32, 88, 80).
    In the code: (6, 1, 118, 32, 88) * (6, 80, 1, 32, 88)
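The broadcasted multiplication above can be reproduced in a few lines of PyTorch. This is only a minimal sketch of the LSS-style "lift" step, assuming the illustrative shapes quoted above (6 cameras, 118 depth bins, a 32×88 feature map, 80 channels); the tensor names are hypothetical, not identifiers from the repository.

```python
# Minimal sketch of the depth-distribution x feature outer product.
import torch

N, D, H, W, C = 6, 118, 32, 88, 80            # cameras, depth bins, feature map, channels

depth_logits = torch.randn(N, D, H, W)        # per-pixel depth scores
context      = torch.randn(N, C, H, W)        # per-pixel image features

depth_prob = depth_logits.softmax(dim=1)      # discrete depth distribution per pixel

# Outer product over the depth and channel axes via broadcasting:
# (N, 1, D, H, W) * (N, C, 1, H, W) -> (N, C, D, H, W)
frustum_feats = depth_prob.unsqueeze(1) * context.unsqueeze(2)
print(frustum_feats.shape)                    # torch.Size([6, 80, 118, 32, 88])
```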

(Figure from CaDDN, 2103.01100.pdf, arxiv.org, 2021.)

We then scatter each feature pixel into D discrete points along the camera ray and rescale the associated features by their corresponding depth probabilities (Figure 3a).

This generates a camera feature point cloud of size NHWD, where N is the number of cameras and (H, W) is the camera feature map size.

Such a 3D feature point cloud is quantized along the x, y axes with a step size of r (e.g., 0.4m).

We use the BEV pooling operation to aggregate all features within each r × r BEV grid and flatten the features along the z-axis.
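As a reference for what BEV pooling computes (not how the optimized kernel is implemented), here is a naive PyTorch sketch that quantizes frustum points into r × r cells and sum-pools their features; the grid range and variable names are assumptions for illustration.

```python
# Naive reference implementation of BEV pooling (sum aggregation).
import torch

def bev_pool_naive(coords_xy, feats, r=0.4, x_range=(-54.0, 54.0), y_range=(-54.0, 54.0)):
    """coords_xy: (P, 2) xy positions of frustum points; feats: (P, C) features."""
    nx = int((x_range[1] - x_range[0]) / r)
    ny = int((y_range[1] - y_range[0]) / r)

    ix = ((coords_xy[:, 0] - x_range[0]) / r).long().clamp_(0, nx - 1)
    iy = ((coords_xy[:, 1] - y_range[0]) / r).long().clamp_(0, ny - 1)
    grid_idx = iy * nx + ix                     # flat BEV cell index for every point

    bev = feats.new_zeros(nx * ny, feats.shape[1])
    bev.index_add_(0, grid_idx, feats)          # sum features falling into the same cell
    return bev.view(ny, nx, -1)                 # (ny, nx, C) BEV feature map
```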

  • Though simple, BEV pooling is surprisingly inefficient and slow, taking more than 500ms on an RTX 3090 GPU (while the rest of our model only takes around 100ms).

This is because the camera feature point cloud is very large: for a typical workload*, around 2 million points are generated for each frame (e.g., 6 cameras × 118 depth bins × a 32 × 88 feature map ≈ 2.0M points), two orders of magnitude denser than a LiDAR feature point cloud.


To lift this efficiency bottleneck, we propose to optimize the BEV pooling with precomputation and interval reduction.

Precomputation

    • Precomputation. The first step of BEV pooling is to associate each point in the camera feature point cloud with a BEV grid.
    • Different from LiDAR point clouds, the coordinates of the camera feature point cloud are fixed (as long as the camera intrinsics and extrinsics stay the same, which is usually the case after proper calibration).
    • Motivated by this, we precompute the 3D coordinate and the BEV grid index of each point.
    • We also sort all points according to grid indices and record the rank of each point.
    • During inference, we only need to reorder all feature points based on the precomputed ranks (see the sketch after this list).
    • This caching mechanism can reduce the latency of grid association from 17ms to 4ms.
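Under the assumption that calibration is fixed, all of the cached quantities can be computed once offline. A minimal PyTorch sketch of the idea (function and variable names are hypothetical, not the repository's API):

```python
# Precompute the sorting order and interval boundaries once per calibration.
import torch

def precompute_ranks(grid_idx):
    """grid_idx: (P,) flat BEV cell index of every frustum point (fixed per calibration)."""
    ranks = grid_idx.argsort()                        # order that groups points by BEV cell
    sorted_idx = grid_idx[ranks]
    # Interval boundaries: positions where the cell index changes.
    kept = torch.ones_like(sorted_idx, dtype=torch.bool)
    kept[1:] = sorted_idx[1:] != sorted_idx[:-1]
    interval_starts = kept.nonzero(as_tuple=False).squeeze(1)
    interval_lengths = torch.diff(
        torch.cat([interval_starts, sorted_idx.new_tensor([len(sorted_idx)])]))
    return ranks, sorted_idx, interval_starts, interval_lengths

# At inference time, only a cheap gather is needed before aggregation:
#   feats_sorted = feats[ranks]
```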

Interval Reduction.

(Figure: the corresponding code in the project.)

(Figure: a visual walkthrough of the operation.)

    • Interval Reduction. After grid association, all points within the same BEV grid will be consecutive in the tensor representation.
    • The next step of BEV pooling is then to aggregate the features within each BEV grid by some symmetric function (e.g., mean, max, and sum).
    • As in Figure 3b, the existing implementation [39] first computes the prefix sum over all points and then subtracts the values at the boundaries where indices change.
    • However, the prefix sum operation requires tree reduction on the GPU and produces many unused partial sums, both of which are inefficient.
    • To accelerate feature aggregation, we implement a specialized GPU kernel that parallelizes directly over BEV grids: we assign a GPU thread to each grid that calculates its interval sum and writes the result back (a Python mock of this logic follows the list).
    • This kernel removes the dependency between outputs and avoids writing the partial sums to the DRAM, reducing the latency of feature aggregation from 500ms to 2ms.
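To make the per-grid parallelization concrete, here is a hedged CPU-side Python mock of what each GPU thread does, reusing the interval_starts / interval_lengths from the precomputation sketch above. The real implementation is a CUDA kernel; this is an illustration of the idea only.

```python
# Each (hypothetical) "thread" reduces one contiguous interval of sorted
# features and writes a single output row -- no prefix sums, no partial
# results written to DRAM.
import torch

def interval_sum(feats_sorted, sorted_idx, interval_starts, interval_lengths, num_cells):
    """feats_sorted: (P, C) features already reordered by BEV cell index."""
    out = feats_sorted.new_zeros(num_cells, feats_sorted.shape[1])
    for start, length in zip(interval_starts.tolist(), interval_lengths.tolist()):
        cell = int(sorted_idx[start])                 # BEV cell this interval belongs to
        out[cell] = feats_sorted[start:start + length].sum(dim=0)
    return out
```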

Takeaways.

    • Takeaways. The camera-to-BEV transformation is 40× faster with our optimized BEV pooling: the latency is reduced from more than 500ms to 12ms, and it scales well across different feature resolutions.
    • This is a key enabler for unifying multi-modal sensory features in the shared BEV representation.
    • Two concurrent works also identify this efficiency bottleneck in camera-only 3D detection.
    • They approximate the view transformer by assuming a uniform depth distribution [63] or truncating the points within each BEV grid [20].
    • In contrast, our techniques are exact without any approximation, while still being faster.

3.3 Fully-Convolutional Fusion

With all sensory features converted to the shared BEV representation, we can easily fuse them together with an elementwise operator (such as concatenation).

Though in the same space, LiDAR BEV features and camera BEV features can still be spatially misaligned to some extent due to the inaccurate depth in the view transformer.

To this end, we apply a convolution-based BEV encoder (with a few residual blocks) to compensate for such local misalignments.
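As a rough illustration of this design (concatenate the two BEV maps, then let a few convolutional/residual layers absorb small spatial misalignments), a minimal PyTorch sketch; the channel counts and module names are assumptions, not the exact fuser from the BEVFusion codebase.

```python
# Hedged sketch of concatenation fusion + a small convolutional BEV encoder.
import torch
import torch.nn as nn

class ConvFuser(nn.Module):
    def __init__(self, cam_channels=80, lidar_channels=256, out_channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + lidar_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # A residual block helps compensate for local misalignment between modalities.
        self.block = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, cam_bev, lidar_bev):
        x = self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))  # elementwise fusion by concat
        return torch.relu(x + self.block(x))                   # residual refinement
```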

Our method could potentially benefit from more accurate depth estimation (e.g., supervising the view transformer with ground-truth depth [43, 38]), which we leave for future work.

3.4 Multi-Task Heads

We apply multiple task-specific heads to the fused BEV feature map. Our method is applicable to most 3D perception tasks.

We showcase two examples: 3D object detection and BEV map segmentation.

Detection. We use a class-specific center heatmap head to predict the center location of all objects and a few regression heads to estimate the object size, rotation, and velocity.

We refer the readers to previous 3D detection papers [1, 67, 68] for more details.
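For readers who want a concrete picture, here is a hedged sketch of a CenterPoint-style head on the fused BEV map: one class-specific heatmap branch plus simple regression branches for size, rotation, and velocity. Channel numbers and branch names are illustrative only, not the repository's modules.

```python
# Hedged sketch of a center-heatmap detection head over BEV features.
import torch.nn as nn

class CenterHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=10):
        super().__init__()
        def branch(out_c):
            return nn.Sequential(
                nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, out_c, 1))
        self.heatmap = branch(num_classes)   # per-class center heatmap
        self.reg     = branch(2)             # sub-pixel center offset
        self.height  = branch(1)             # z of the box center
        self.dim     = branch(3)             # box size (l, w, h)
        self.rot     = branch(2)             # sin/cos of yaw
        self.vel     = branch(2)             # velocity (vx, vy)

    def forward(self, bev_feats):
        return {name: head(bev_feats) for name, head in [
            ("heatmap", self.heatmap), ("reg", self.reg), ("height", self.height),
            ("dim", self.dim), ("rot", self.rot), ("vel", self.vel)]}
```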

Segmentation. Different map categories may overlap (e.g., crosswalk is a subset of drivable space). Therefore, we formulate this problem as multiple binary semantic segmentation, one for each class.

We follow CVT [70] to train the segmentation head with the standard focal loss [29].
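"Multiple binary segmentations, one per class" simply means each map class gets its own logit channel trained with a sigmoid focal loss, so overlapping classes do not conflict. A minimal sketch follows; using torchvision's sigmoid_focal_loss and the class count shown here are assumptions for brevity.

```python
# Hedged sketch of a per-class binary BEV segmentation head with focal loss.
import torch.nn as nn
from torchvision.ops import sigmoid_focal_loss

class BEVSegHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=6):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, 1))       # one binary logit map per map class

    def forward(self, bev_feats):
        return self.classifier(bev_feats)         # (B, num_classes, H, W) logits

def seg_loss(logits, targets):
    # targets: (B, num_classes, H, W) in {0, 1}; classes may overlap freely.
    return sigmoid_focal_loss(logits, targets.float(), reduction="mean")
```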

Summary

  1. Characteristics of each sensor
  • Cameras capture rich semantic information.
  • LiDARs provide accurate spatial information.
  • Radars offer instant velocity estimation.
  2. Why fuse
  • Multi-sensor fusion is of great importance for accurate and reliable perception.
  • Data from different sensors are expressed in fundamentally different ways: cameras capture data in the perspective view, while LiDAR captures data in the 3D view.
  • To resolve this view discrepancy, we have to find a unified representation that is suitable for multi-task multi-modal feature fusion,
  • such that (1) all sensor features can be easily converted to it without information loss, and (2) it is suitable for different types of tasks.
  3. Key efficiency bottleneck in the view transformation
  • BEV pooling: addressed with a specialized GPU kernel that parallelizes directly over BEV grids.
  4. To Camera
  • Projecting the LiDAR point cloud onto the camera plane and working with 2D detection is geometrically lossy; tasks that focus on object/scene geometry perform poorly.
  5. To LiDAR
  • Decorating LiDAR points with 2D image information and running 3D detection is semantically lossy: the two modalities have very different semantic densities, so few camera features are matched to a LiDAR point. Giving up the semantic density of camera features severely hurts the model on semantic-oriented tasks.
  6. BEV
  • The transformation to BEV keeps both geometric structure (from LiDAR features) and semantic density (from camera features).
  • It naturally supports most 3D perception tasks (since their output space can be naturally captured in BEV).
  • The LiDAR-to-BEV projection flattens the sparse LiDAR features along the height dimension and thus does not create geometric distortion (Figure 1a).
  • The camera-to-BEV projection casts each camera feature pixel back into a ray in the 3D space (detailed in the next section), producing a dense BEV feature map (Figure 1c) that retains the full semantic information from the cameras.
  7. BEVFusion characteristics
  • Task-agnostic.
  • Seamlessly supports different 3D perception tasks with almost no architectural changes.
  8. Fully-Convolutional Fusion
  • Though in the same space, LiDAR BEV features and camera BEV features can still be spatially misaligned to some extent due to the inaccurate depth in the view transformer.
  • A convolution-based BEV encoder (with a few residual blocks) is applied to compensate for such local misalignments.
  9. Multi-Task Heads
  • Two task-specific heads are showcased: 3D object detection and BEV map segmentation.