Temporal Methods for Point Cloud + Image Fusion 3D Object Detection

BEVFusion4D

https://arxiv.org/abs/2303.17099

Integrating LiDAR and Camera information into Bird's-Eye-View (BEV) has become an essential topic for 3D object detection in autonomous driving. Existing methods mostly adopt an independent dual-branch framework to generate LiDAR and camera BEV, then perform an adaptive modality fusion. Since point clouds provide more accurate localization and geometry information, they could serve as a reliable spatial prior for acquiring relevant semantic information from the images. Therefore, we design a LiDAR-Guided View Transformer (LGVT) to effectively obtain the camera representation in BEV space and thus benefit the whole dual-branch fusion system. LGVT takes camera BEV as the primitive semantic query, repeatedly leveraging the spatial cue of LiDAR BEV for extracting image features across multiple camera views. Moreover, we extend our framework into the temporal domain with our proposed Temporal Deformable Alignment (TDA) module, which aims to aggregate BEV features from multiple historical frames. With these two modules, our framework, dubbed BEVFusion4D, achieves state-of-the-art results in 3D object detection, with 72.0% mAP and 73.5% NDS on the nuScenes validation set, and 73.3% mAP and 74.7% NDS on the nuScenes test set, respectively.

3.3. Temporal BEV Features Fusion

The target's historical location and orientation information is beneficial for estimating its current motion. Besides, temporal information can also help detect distant, nearby, or occluded objects at the current time. Thus, BEVFusion4D employs temporal fusion on the historical BEV features. In detail, working recurrently, the previous BEV feature is first calibrated to the current one according to ego-motion information; moving objects between the two frames are then further aligned in the Temporal Deformable Alignment module to obtain the temporally fused BEV feature.
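A minimal sketch of this recurrent fusion loop may help make the data flow concrete. The helpers `warp_by_ego_motion` and `temporal_deformable_align` are hypothetical stand-ins for the ego-motion calibration (Eq. 7) and the TDA module described below; this is an illustrative sketch, not the authors' implementation.

```python
from typing import Callable, List, Sequence

import torch


def fuse_temporal_bev(
    bev_feats: List[torch.Tensor],        # [F^{T-n}, ..., F^T], oldest first, each (B, C, H, W)
    ego_transforms: Sequence,             # ego_transforms[t] maps frame t into frame t+1
    warp_by_ego_motion: Callable,         # hypothetical: applies the ego-motion calibration (Eq. 7)
    temporal_deformable_align: Callable,  # hypothetical: the TDA module (Sec. 3.3)
) -> torch.Tensor:
    """Recurrently fuse historical BEV features into the current frame."""
    fused = bev_feats[0]
    for t in range(1, len(bev_feats)):
        # 1) calibrate the previously fused BEV feature into the current frame (static scene)
        prev_aligned = warp_by_ego_motion(fused, ego_transforms[t - 1])
        # 2) further align moving objects and fuse with the current frame via TDA
        fused = temporal_deformable_align(prev_aligned, bev_feats[t])
    return fused  # temporally fused BEV feature at the current time T
```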

Ego Motion Calibration. Fig. 5 shows the architecture of our temporal BEV feature fusion module. Fused BEV features stored in chronological order are denoted as $\{F^{T-n}, \dots, F^{T}\}$. BEVFusion4D aligns each BEV feature $F^{t}$ with the next BEV feature $F^{t+1}$ based on the ego vehicle's motion between the two frames. This calibration process can be described by the following formula:
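The equation itself did not survive extraction; a plausible reconstruction based on the surrounding description (where $\operatorname{Warp}$ denotes resampling the BEV grid with the ego-motion transform) is:

$$
\tilde{F}^{\,t} \;=\; \operatorname{Warp}\!\big(F^{\,t},\; K^{t \to t+1}\big), \qquad t = T-n, \dots, T-1 \tag{7}
$$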

The matrix $K^{t \to t+1}$ is termed the ego-motion transformation matrix that aligns the BEV feature at time $t$ with $F^{t+1}$.

The calibration process described by Eq. 7 is limited to aligning stationary objects due to the lack of object motion information. Consequently, moving objects, which receive no such calibration, may exhibit motion smear, and the effect becomes more pronounced as more historical BEV features are accumulated.

Temporal Deformable Alignment. To alleviate this challenge, we recurrently adopt the deformable attention mechanism to establish correspondence between consecutive frames. Deformable attention adaptively learns its receptive field, enabling it to effectively capture the salient features of moving objects in the aligned BEV features. By exploiting this mechanism, we can significantly reduce motion smear and enhance the accuracy of the alignment process. We therefore propose the Temporal Deformable Alignment (TDA) module to conduct temporal fusion specific to moving objects.
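For readers unfamiliar with the mechanism, below is a simplified single-scale deformable attention sketch (a generic illustration, not the paper's exact module): each BEV query predicts a few sampling offsets around its reference location, bilinearly samples the value map at those points, and aggregates the samples with learned attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleDeformableAttention(nn.Module):
    """Minimal single-scale deformable attention (illustrative, not the paper's module)."""

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, num_points * 2)  # (dx, dy) per sampling point
        self.weight_proj = nn.Linear(dim, num_points)      # attention weight per sampling point
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query: torch.Tensor, value: torch.Tensor, ref_xy: torch.Tensor) -> torch.Tensor:
        # query:  (B, N, C)   N = H*W flattened BEV cells
        # value:  (B, C, H, W) feature map to sample from
        # ref_xy: (B, N, 2)   reference points in [-1, 1] (grid_sample convention)
        B, N, C = query.shape
        offsets = self.offset_proj(query).view(B, N, self.num_points, 2)
        weights = self.weight_proj(query).softmax(dim=-1)            # (B, N, K)
        # keep learned offsets within a small neighborhood (illustrative scale of 0.05)
        grid = ref_xy.unsqueeze(2) + 0.05 * offsets.tanh()           # (B, N, K, 2)
        grid = grid.view(B, N * self.num_points, 1, 2)
        sampled = F.grid_sample(value, grid, align_corners=False)    # (B, C, N*K, 1)
        sampled = sampled.view(B, C, N, self.num_points).permute(0, 2, 3, 1)  # (B, N, K, C)
        out = (weights.unsqueeze(-1) * sampled).sum(dim=2)           # (B, N, C)
        return self.out_proj(out)
```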

We denote $\tilde{F}$ and $\hat{F}$ as the BEV features calibrated by Eq. 7 and temporally fused by TDA, respectively. First, TDA concatenates two consecutive calibrated frames $\tilde{F}^{t-1}_{\mathrm{update}}$ and $\tilde{F}^{t}$ into $\tilde{F}^{t-1,t}$. The deformable attention mechanism is then applied to obtain $\hat{F}^{t-1}$ and $\hat{F}^{t}$, using $\tilde{F}^{t-1,t}$ as the query and $\tilde{F}^{t-1}$, $\tilde{F}^{t}$ as the values. Next, TDA computes the element-wise average of $\hat{F}^{t-1}$ and $\hat{F}^{t}$ and adds it to $\tilde{F}^{t}$ to obtain the updated frame $\tilde{F}^{t}_{\mathrm{update}}$, which is then used for the fusion of the subsequent BEV feature. The process can be formulated as follows:
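The original equations are missing from this excerpt; a reconstruction consistent with the prose above (with $\operatorname{DeformAttn}$ denoting the deformable attention operator taking a query followed by its values) would be:

$$
\hat{F}^{\,t-1},\; \hat{F}^{\,t} \;=\; \operatorname{DeformAttn}\!\big(\tilde{F}^{\,t-1,t},\; \tilde{F}^{\,t-1},\; \tilde{F}^{\,t}\big)
$$

$$
\tilde{F}^{\,t}_{\mathrm{update}} \;=\; \tilde{F}^{\,t} \;+\; \tfrac{1}{2}\big(\hat{F}^{\,t-1} + \hat{F}^{\,t}\big)
$$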

FusionFormer

https://arxiv.org/abs/2309.05257

https://github.com/ADLab-AutoDrive/FusionFormer (code not yet released)

Multi-sensor modal fusion has demonstrated strong advantages in 3D object detection tasks. However, existing methods that fuse multi-modal features through simple channel concatenation require transforming the features into bird's-eye-view space and may lose information along the Z-axis, which leads to inferior performance. To this end, we propose FusionFormer, an end-to-end multi-modal fusion framework that leverages transformers to fuse multi-modal features and obtain fused BEV features. Based on FusionFormer's flexible adaptability to the input modality representation, we propose a depth prediction branch that can be added to the framework to improve detection performance in camera-based detection tasks. In addition, we propose a plug-and-play temporal fusion module based on transformers that can fuse historical-frame BEV features for more stable and reliable detection results. We evaluate our method on the nuScenes dataset and achieve 72.6% mAP and 75.1% NDS for 3D object detection tasks, outperforming state-of-the-art methods.

3.3 TEMPORAL FUSION ENCODER

As shown in Figure 3, the Temporal Fusion Encoder (TFE) consists of three layers, each comprising BEV temporal-attention and feedforward networks. At the first layer, the queries are initialized with the BEV features of the current frame and then updated through temporal-attention using historical BEV features. The resulting queries are passed through a feedforward network and serve as input to the next layer. After three layers of fusion encoding, the final temporal fusion BEV feature is obtained. The temporal-attention process can be expressed as:
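The formula is missing from this excerpt; one plausible form, in which each layer's queries cross-attend (e.g. via deformable attention) to every historical BEV feature and the results are averaged, is:

$$
\mathrm{TA}(Q) \;=\; \frac{1}{N}\sum_{i=1}^{N} \operatorname{DeformAttn}\big(Q,\; B_{t-i}\big)
$$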

where $B_{t-i}$ represents the BEV feature at time $t-i$.
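As a rough illustration of the layer structure (not the released code, which is not yet public), the sketch below approximates the temporal-attention with standard multi-head cross-attention over flattened historical BEV features; the paper's actual attention module may differ.

```python
import torch
import torch.nn as nn


class TemporalFusionEncoder(nn.Module):
    """Minimal sketch of the 3-layer TFE described above (illustrative only)."""

    def __init__(self, dim: int = 256, num_layers: int = 3, num_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "temporal_attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)),
                "norm1": nn.LayerNorm(dim),
                "norm2": nn.LayerNorm(dim),
            })
            for _ in range(num_layers)
        ])

    def forward(self, bev_current: torch.Tensor, bev_history: torch.Tensor) -> torch.Tensor:
        # bev_current: (B, N, C) flattened current-frame BEV features (the initial queries)
        # bev_history: (B, M, C) flattened historical BEV features (keys/values)
        q = bev_current
        for layer in self.layers:
            attn_out, _ = layer["temporal_attn"](q, bev_history, bev_history)
            q = layer["norm1"](q + attn_out)         # queries updated by temporal-attention
            q = layer["norm2"](q + layer["ffn"](q))  # feedforward network
        return q  # temporally fused BEV feature
```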
