Joint 3D Proposal Generation and Object Detection from View Aggregation

利用视角聚合进行联合3D候选区域生成和目标检测

Abstract

We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark [1] while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles. Code is at: https://github.com/kujason/avod

摘要

我们提出AVOD,一种用于自动驾驶场景的聚合视图目标检测网络。所提出的神经网络架构使用LIDAR点云和RGB图像来生成由两个子网络共享的特征:区域建议网络(RPN)和第二阶段检测网络。所提出的RPN采用一种新颖的架构,能够在高分辨率特征图上执行多模态特征融合,为道路场景中的多个目标类别生成可靠的3D目标候选区域。利用这些候选区域,第二阶段检测网络执行精确的带方向3D边界框回归和类别分类,以预测3D空间中目标的空间范围、方向和类别。我们提出的架构在KITTI 3D目标检测基准[1]上取得了最先进的结果,同时能够实时运行且内存占用低,使其成为在自动驾驶车辆上部署的合适候选者。代码:https://github.com/kujason/avod

I. Introduction 介绍

The remarkable progress made by deep neural networks on the task of 2D object detection in recent years has not transferred well to the detection of objects in 3D. The gap between the two remains large on standard benchmarks such as the KITTI Object Detection Benchmark [1] where 2D car detectors have achieved over 90% Average Precision (AP), whereas the top scoring 3D car detector on the same scenes only achieves 70% AP. The reason for such a gap stems from the difficulty induced by adding a third dimension to the estimation problem, the low resolution of 3D input data, and the deterioration of its quality as a function of distance. Furthermore, unlike 2D object detection, the 3D object detection task requires estimating oriented bounding boxes (Fig. 1).

深度神经网络近年来在二维目标检测任务上取得的显著进步,并没有很好地迁移到三维目标检测中。两者之间的差距在标准基准(例如KITTI目标检测基准[1])上仍然很大:2D汽车检测器已达到超过90%的平均精度(AP),而同一场景下得分最高的3D汽车检测器仅达到70% AP。产生这种差距的原因在于:估计问题中增加了第三个维度、3D输入数据分辨率低、以及其质量随距离增加而恶化所带来的困难。此外,与2D目标检测不同,3D目标检测任务需要估计带方向的边界框(图1)。
Fig. 1: A visual representation of the 3D detection problem from Bird’s Eye View (BEV). The bounding box in green is used to determine the IoU overlap in the computation of the average precision. The importance of explicit orientation estimation can be seen as an object’s bounding box does not change when the orientation (purple) is shifted by ±π radians.
图1:鸟瞰图(BEV)下3D检测问题的直观表示。绿色边界框用于确定平均精度计算中的IoU重叠。显式方向估计的重要性在于:当方向(紫色)偏移±π弧度时,目标的边界框并不会发生变化。
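To make the orientation ambiguity in Fig. 1 concrete, the following minimal sketch (a hypothetical Python/NumPy illustration, not code from the paper or the released repository) computes the BEV corners of an oriented box at yaw θ and at θ + π and confirms that the two corner sets coincide, which is why a metric based purely on box IoU cannot penalize a heading that is off by π.

```python
import numpy as np

def bev_box_corners(x, z, length, width, yaw):
    """Return the 4 BEV (x, z) corners of an oriented box."""
    # Box-frame corners (centred at the origin before rotation).
    dx, dz = length / 2.0, width / 2.0
    corners = np.array([[dx, dz], [dx, -dz], [-dx, -dz], [-dx, dz]])
    # Rotate by yaw, then translate to the box centre.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s], [s, c]])
    return corners @ rot.T + np.array([x, z])

box = (10.0, 20.0, 3.9, 1.6, 0.3)               # x, z, l, w, yaw
flipped = (10.0, 20.0, 3.9, 1.6, 0.3 + np.pi)   # same box, heading shifted by pi

a = bev_box_corners(*box)
b = bev_box_corners(*flipped)
# The corner sets are identical (just listed in a different order),
# so BEV IoU is 1.0 even though the headings differ by pi.
print(np.allclose(sorted(map(tuple, a)), sorted(map(tuple, b))))  # True
```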

Similar to 2D object detectors, most state-of-the-art deep models for 3D object detection rely on a 3D region proposal generation step for 3D search space reduction. Using region proposals allows the generation of high quality detections via more complex and computationally expensive processing at later detection stages. However, any missed instances at the proposal generation stage cannot be recovered during the following stages. Therefore, achieving a high recall during the region proposal generation stage is crucial for good performance.

与2D目标检测器类似,大多数最先进的3D目标检测深度模型都依赖3D区域建议生成步骤来缩小3D搜索空间。使用区域建议允许在后续检测阶段通过更复杂、计算代价更高的处理来生成高质量的检测结果。但是,在区域建议生成阶段漏检的任何实例,在后续阶段都无法被找回。因此,在区域建议生成阶段实现高召回率对于获得好的检测性能至关重要。
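To illustrate why proposal recall caps the final detection performance, here is a small hedged sketch (illustrative only; it uses axis-aligned 2D IoU rather than the oriented 3D IoU of the benchmark, and the helper names are made up for this example). It measures the fraction of ground-truth boxes covered by at least one proposal: any ground truth left uncovered at this stage can never become a detection in the second stage.

```python
import numpy as np

def iou_2d(box_a, box_b):
    """Axis-aligned IoU for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def proposal_recall(gt_boxes, proposals, iou_thresh=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    matched = sum(
        any(iou_2d(gt, p) >= iou_thresh for p in proposals) for gt in gt_boxes)
    return matched / len(gt_boxes)

gt = [(0, 0, 4, 2), (10, 10, 14, 12)]
props = [(0.5, 0, 4.5, 2), (20, 20, 24, 22)]
print(proposal_recall(gt, props))  # 0.5 -> the second object can never be detected
```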

Region proposal networks (RPNs) were proposed in Faster-RCNN [2], and have become the prevailing proposal generators in 2D object detectors. RPNs can be considered weak amodal detectors, providing proposals with high recall and low precision. These deep architectures are attractive as they are able to share computationally expensive convolutional feature extractors with other detection stages. However, extending these RPNs to 3D is a non-trivial task. The Faster R-CNN RPN architecture is tailored for dense, high resolution image input, where objects usually occupy more than a couple of pixels in the feature map. When considering sparse and low resolution input such as the Front View [3] or Bird’s Eye View (BEV) [4] point cloud projections, this method is not guaranteed to have enough information to generate region proposals, especially for small object classes.

区域建议网络(RPN)最早在Faster-RCNN[2]中提出,并已成为2D目标检测器中主流的区域建议生成器。RPN可被视为弱的非模态(amodal)检测器,提供高召回率、低精度的区域建议。这些深度架构很有吸引力,因为它们能够与其他检测阶段共享计算代价高昂的卷积特征提取器。但是,将这些RPN扩展到3D并非易事。Faster R-CNN的RPN架构是为密集的高分辨率图像输入而设计的,其中目标通常在特征图中占据不止几个像素。当考虑稀疏、低分辨率的输入(如前视图[3]或鸟瞰图(BEV)[4]点云投影)时,该方法无法保证有足够的信息来生成区域建议,尤其是对于小目标类别。
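The resolution argument can be made concrete with simple arithmetic. Assuming for illustration a BEV discretization of 0.1 m per cell and a VGG-style feature extractor with a cumulative stride of 8 (both values are assumptions for this sketch, not taken from this section), a pedestrian-sized object occupies only a handful of input cells and less than one cell of the downsampled feature map:

```python
# Back-of-the-envelope check of how small objects shrink in a downsampled BEV map.
# The 0.1 m cell size and the 8x stride are assumptions for illustration.
bev_resolution_m = 0.1      # metres per BEV cell
downsample_factor = 8       # cumulative stride of a VGG-like encoder

for name, extent_m in [("car", 3.9), ("pedestrian", 0.5)]:
    cells_in = extent_m / bev_resolution_m
    cells_out = cells_in / downsample_factor
    print(f"{name}: {cells_in:.0f} input cells -> {cells_out:.2f} cells after downsampling")

# car: 39 input cells -> 4.88 cells after downsampling
# pedestrian: 5 input cells -> 0.62 cells after downsampling
```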

In this paper, we aim to resolve these difficulties by proposing AVOD, an Aggregate View Object Detection architecture for autonomous driving (Fig. 2). The proposed architecture delivers the following contributions:

在本文中,我们旨在通过提出AVOD(一种用于自动驾驶的聚合视图对象检测架构)来解决这些困难(图2)。所提出的架构贡献如下:

  • Inspired by feature pyramid networks (FPNs) [5] for 2D object detection, we propose a novel feature extractor that produces high resolution feature maps from LIDAR point clouds and RGB images, allowing for the localization of small classes in the scene.
  • 受用于2D目标检测的特征金字塔网络(FPN)[5]的启发,我们提出了一种新颖的特征提取器,能够从LIDAR点云和RGB图像生成高分辨率特征图,从而可以定位场景中的小目标类别。
  • We propose a feature fusion Region Proposal Network (RPN) that utilizes multiple modalities to produce high-recall region proposals for small classes.
  • 我们提出了一个特征融合区域建议网络(RPN),其利用多模态为小目标类别生成高召回率的区域建议。
  • We propose a novel 3D bounding box encoding that conforms to box geometric constraints, allowing for higher 3D localization accuracy.
  • 我们提出了一种新颖的3D边界框编码方案,其符合边界框的几何约束,从而实现更高的3D定位精度。
  • The proposed neural network architecture exploits 1x1 convolutions at the RPN stage, along with a fixed look-up table of 3D anchor projections, allowing high computational speed and a low memory footprint while maintaining detection performance (a sketch of this idea is given after Fig. 2 below).
  • 所提出的神经网络架构在RPN阶段使用1x1卷积,并配合一个固定的3D锚框投影查找表,从而在保持检测性能的同时实现高计算速度和低内存占用(图2之后给出一个示意性的代码草图)。
Fig. 2: The proposed method’s architectural diagram. The feature extractors are shown in blue, the region proposal network in pink, and the second stage detection network in green.
图2:所提出方法的架构图。蓝色表示特征提取器,粉色表示区域建议网络,绿色表示第二阶段检测网络。
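The 1x1 convolutions and the fixed look-up table of 3D anchor projections from the final contribution above can be sketched as follows. This is a hypothetical NumPy illustration, not the released implementation: anchors are taken to be axis-aligned boxes parameterized by centre and size, their BEV pixel bounds are precomputed once into a table, and that table is then used to crop a channel-reduced (1x1-convolved) feature map for each anchor. All parameter names, extents, and resolutions below are assumptions made for this example.

```python
import numpy as np

def project_anchors_to_bev(anchors, bev_extent, bev_shape):
    """Precompute a look-up table mapping each 3D anchor to BEV pixel bounds.

    anchors: (N, 6) array of [x, y, z, l, w, h] (axis-aligned, metres).
    bev_extent: ((x_min, x_max), (z_min, z_max)) covered by the BEV map.
    bev_shape: (rows, cols) of the BEV feature map.
    """
    (x_min, x_max), (z_min, z_max) = bev_extent
    rows, cols = bev_shape
    x_res = (x_max - x_min) / cols
    z_res = (z_max - z_min) / rows

    x, z, l, w = anchors[:, 0], anchors[:, 2], anchors[:, 3], anchors[:, 4]
    col_lo = np.floor((x - l / 2 - x_min) / x_res).astype(int)
    col_hi = np.ceil((x + l / 2 - x_min) / x_res).astype(int)
    row_lo = np.floor((z - w / 2 - z_min) / z_res).astype(int)
    row_hi = np.ceil((z + w / 2 - z_min) / z_res).astype(int)
    table = np.stack([row_lo, row_hi, col_lo, col_hi], axis=1)
    return np.clip(table, 0, [rows, rows, cols, cols])

# 1x1 "convolution" = per-pixel channel reduction, shown here as a matrix multiply.
feature_map = np.random.rand(700, 800, 32)           # full-resolution BEV features
w_1x1 = np.random.rand(32, 1)                         # reduce 32 channels to 1
reduced = feature_map @ w_1x1                         # (700, 800, 1)

anchors = np.array([[10.0, -1.0, 20.0, 3.9, 1.6, 1.5]])   # one car-sized anchor
lut = project_anchors_to_bev(anchors, ((-40, 40), (0, 70)), reduced.shape[:2])
r0, r1, c0, c1 = lut[0]
crop = reduced[r0:r1, c0:c1]                          # feature crop for this anchor
print(lut[0], crop.shape)
```

Because the anchor grid is fixed, a table like this can be computed once offline, which is the kind of bookkeeping that keeps the proposal stage fast and memory-light.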

The above contributions result in an architecture that delivers state-of-the-art detection performance at a low computational cost and memory footprint. Finally, we integrate the network into our autonomous driving stack, and show generalization to new scenes and detection under more extreme weather and lighting conditions, making it a suitable candidate for deployment on autonomous vehicles.

上述贡献使所提出的架构能够以较低的计算成本和内存占用获得最先进的检测性能。最后,我们将该网络集成到我们的自动驾驶软件栈中,并展示了其对新场景的泛化能力以及在更极端天气和光照条件下的检测能力,使其成为在自动驾驶车辆上部署的合适候选者。

II. Related Work 相关工作

Hand Crafted Features For Proposal Generation: Before the emergence of 3D Region Proposal Networks (RPNs) [2], 3D proposal generation algorithms typically used hand-crafted features to generate a small set of candidate boxes that retrieve most of the objects in 3D space. 3DOP [6] and Mono3D [7] use a variety of hand-crafted geometric features from stereo point clouds and monocular images to score 3D sliding windows in an energy minimization framework. The top K scoring windows are selected as region proposals, which are then consumed by a modified Fast R-CNN [?] to generate the final 3D detections. We use a region proposal network that learns features from both BEV and image spaces to generate higher quality proposals in an efficient manner.

用于区域建议生成的手工特征: 在3D区域建议网络(RPN)[2]出现之前,3D区域建议生成算法通常使用手工设计的特征来生成一个较小的候选框集合,以覆盖3D空间中的大多数目标。3DOP[6]和Mono3D[7]使用来自立体点云和单目图像的各种手工几何特征,在能量最小化框架中为3D滑动窗口打分。得分最高的K个窗口被选为区域建议,随后交由一个改进的Fast R-CNN[?]生成最终的3D检测结果。我们使用一个从BEV和图像空间学习特征的区域建议网络,以高效的方式生成更高质量的区域建议。

Proposal Free Single Shot Detectors: Single shot object detectors have also been proposed as RPN free architectures for the 3D object detection task. VeloFCN [3] projects a LIDAR point cloud to the front view, which is used as an input to a fully convolutional neural network to directly generate dense 3D bounding boxes. 3D-FCN [8] extends this concept by applying 3D convolutions on 3D voxel grids constructed from LIDAR point clouds to generate better 3D bounding boxes. Our two-stage architecture uses an RPN to retrieve most object instances in the road scene, providing better results when compared to both of these single shot methods. VoxelNet [9] extends 3D-FCN further by encoding voxels with point-wise features instead of occupancy values. However, even with sparse 3D convolution operations, VoxelNet’s computational speed is still 3× slower than our proposed architecture, which provides better results on the car and pedestrian classes.

无区域建议的单发检测器: 单发(single shot)目标检测器也被提出作为无RPN的架构,用于3D目标检测任务。VeloFCN[3]将LIDAR点云投影到前视图,并将其作为全卷积神经网络的输入,直接生成密集的3D边界框。3D-FCN[8]通过在由LIDAR点云构建的3D体素网格上应用3D卷积扩展了这一概念,以生成更好的3D边界框。我们的两阶段架构使用RPN来找回道路场景中的大多数目标实例,与这两种单发方法相比能提供更好的结果。VoxelNet[9]通过使用逐点特征而非占用值对体素进行编码,进一步扩展了3D-FCN。然而,即使使用稀疏3D卷积操作,VoxelNet的计算速度仍比我们提出的架构慢3倍,而我们的架构在汽车和行人类别上提供了更好的结果。

Monocular-Based Proposal Generation: Another direction in the state-of-the-art is using mature 2D object detectors for proposal generation in 2D, which are then extruded to 3D through amodal extent regression. This trend started with [10] for indoor object detection, which inspired Frustum-based PointNets (F-PointNet) [11] to use point-wise features of PointNet [12] instead of point histograms for extent regression. While these methods work well for indoor scenes and brightly lit outdoor scenes, they are expected to perform poorly in more extreme outdoor scenarios. Any missed 2D detections will lead to missed 3D detections and therefore, the generalization capability of such methods under such extreme conditions has yet to be demonstrated. LIDAR data is much less variable than image data and we show in Section IV that AVOD is robust to noisy LIDAR data and lighting changes, as it was tested in snowy scenes and in low light conditions.

基于单目的区域建议生成: 现有先进技术的另一个方向是使用成熟的2D目标检测器在2D中生成区域建议,然后通过非模态(amodal)范围回归将其外推到3D。这一趋势始于用于室内目标检测的[10],它启发了基于截头锥体(Frustum)的PointNet(F-PointNet)[11]使用PointNet[12]的逐点特征代替点直方图进行范围回归。虽然这些方法在室内场景和光照充足的室外场景中表现良好,但预计它们在更极端的室外场景中表现不佳。任何漏检的2D目标都会导致相应的3D目标漏检,因此,这类方法在此类极端条件下的泛化能力尚未得到证实。LIDAR数据的变化远小于图像数据,我们在第IV节中表明,AVOD对于含噪的LIDAR数据和光照变化具有鲁棒性,因为它在雪天场景和低光照条件下进行了测试。

Monocular-Based 3D Object Detectors: Another way to utilize mature 2D object detectors is to use prior knowledge to perform 3D object detection from monocular images only. Deep MANTA [13] proposes a many-task vehicle analysis approach from monocular images that optimizes region proposal, detection, 2D box regression, part localization, part visibility, and 3D template prediction simultaneously. The architecture requires a database of 3D models corresponding to several types of vehicles, making the proposed approach hard to generalize to classes where such models do not exist. Deep3DBox [14] proposes to extend 2D object detectors to 3D by exploiting the fact that the perspective projection of a 3D bounding box should fit tightly within its 2D detection window. However, in Section IV, these methods are shown to perform poorly on the 3D detection task compared to methods that use point cloud data.

基于单目的3D目标检测器: 利用成熟的2D目标检测器的另一种方式是使用先验知识,仅从单目图像执行3D目标检测。Deep MANTA[13]提出了一种基于单目图像的多任务车辆分析方法,可同时优化区域建议、检测、2D框回归、部件定位、部件可见性和3D模板预测。该架构需要一个与若干类型车辆相对应的3D模型数据库,使得所提出的方法难以泛化到不存在此类模型的类别。Deep3DBox[14]提出利用3D边界框的透视投影应紧密贴合其2D检测窗口这一事实,将2D目标检测器扩展到3D。但是,在第IV节中可以看到,与使用点云数据的方法相比,这些方法在3D检测任务上表现不佳。

3D Region Proposal Networks: 3D RPNs have previously been proposed in [15] for 3D object detection from RGBD images. However, to the best of our knowledge, MV3D [4] is the only architecture that proposed a 3D RPN targeted at autonomous driving scenarios. MV3D extends the image based RPN of Faster R-CNN [2] to 3D by corresponding every pixel in the BEV feature map to multiple prior 3D anchors. These anchors are then fed to the RPN to generate 3D proposals that are used to create view-specific feature crops from the BEV, front view of [3], and image view feature maps. A deep fusion scheme is used to combine information from these feature crops to produce the final detection output. However, this RPN architecture does not work well for small object instances in BEV. When downsampled by convolutional feature extractors, small instances will occupy a fraction of a pixel in the final feature map, resulting in insufficient data to extract informative features. Our RPN architecture aims to fuse full resolution feature crops from the image and the BEV feature maps as inputs to the RPN, allowing the generation of high recall proposals for smaller classes. Furthermore, our feature extractor provides full resolution feature maps, which are shown to greatly help in localization accuracy for small objects during the second stage of the detection framework.

3D区域建议网络: [15]曾提出用于从RGBD图像进行3D目标检测的3D RPN。然而,据我们所知,MV3D[4]是唯一提出面向自动驾驶场景的3D RPN的架构。MV3D通过将BEV特征图中的每个像素对应到多个先验3D锚框,将Faster R-CNN[2]中基于图像的RPN扩展到3D。然后将这些锚框馈送到RPN以生成3D区域建议,这些建议被用于从BEV、[3]的前视图以及图像视图特征图中创建视图特定的特征裁剪。随后使用深度融合方案组合来自这些特征裁剪的信息,以产生最终的检测输出。但是,这种RPN架构对于BEV中的小目标实例效果不佳。经过卷积特征提取器下采样后,小目标实例在最终特征图中只占据不到一个像素,导致没有足够的数据来提取有信息量的特征。我们的RPN架构旨在融合来自图像和BEV特征图的全分辨率特征裁剪作为RPN的输入,从而能够为较小的类别生成高召回率的区域建议。此外,我们的特征提取器提供全分辨率特征图,在检测框架的第二阶段,这些特征图被证明对小目标的定位精度有极大帮助。
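Below is a minimal sketch of the crop-and-fuse idea described above, assuming NumPy, nearest-neighbour resizing, and element-wise mean fusion of equal-sized crops. The pixel boxes, map sizes, and crop size are made-up values, so this shows only the data flow from full-resolution feature maps to a fused per-anchor feature, not the released code.

```python
import numpy as np

def crop_and_resize(feat, box, out_size=3):
    """Crop a (H, W, C) feature map with pixel box (r0, r1, c0, c1), resize by nearest neighbour."""
    r0, r1, c0, c1 = box
    crop = feat[r0:r1, c0:c1]
    rows = np.linspace(0, crop.shape[0] - 1, out_size).round().astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, out_size).round().astype(int)
    return crop[np.ix_(rows, cols)]

# Full-resolution feature maps from the two views (random stand-ins here).
bev_feat = np.random.rand(700, 800, 32)
img_feat = np.random.rand(360, 1200, 32)

# For one proposal/anchor: its precomputed BEV and image-plane pixel boxes.
bev_box = (192, 208, 480, 520)    # from the BEV look-up table
img_box = (150, 210, 600, 700)    # from projecting the same anchor into the image

bev_crop = crop_and_resize(bev_feat, bev_box)   # (3, 3, 32)
img_crop = crop_and_resize(img_feat, img_box)   # (3, 3, 32)

# Element-wise mean fusion of the equal-sized crops from both modalities.
fused = 0.5 * (bev_crop + img_crop)
print(fused.shape)   # (3, 3, 32)
```

Because the crops are taken from full-resolution maps rather than heavily downsampled ones, even a pedestrian-sized anchor still contributes a non-degenerate feature crop to the fusion step.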
