【双语论文】Joint 3D Proposal Generation and Object Detection from View Aggregation

Joint 3D Proposal Generation and Object Detection from View Aggregation

利用视角聚合进行联合3D候选区域生成和目标检测

Abstract

We present AVOD, an Aggregate View Object Detection network for autonomous driving scenarios. The proposed neural network architecture uses LIDAR point clouds and RGB images to generate features that are shared by two subnetworks: a region proposal network (RPN) and a second stage detector network. The proposed RPN uses a novel architecture capable of performing multimodal feature fusion on high resolution feature maps to generate reliable 3D object proposals for multiple object classes in road scenes. Using these proposals, the second stage detection network performs accurate oriented 3D bounding box regression and category classification to predict the extents, orientation, and classification of objects in 3D space. Our proposed architecture is shown to produce state of the art results on the KITTI 3D object detection benchmark [1] while running in real time with a low memory footprint, making it a suitable candidate for deployment on autonomous vehicles. Code is at: https://github.com/kujason/avod

摘要

我们提出了AVOD,一种用于自动驾驶场景的聚合视图目标检测网络。所提出的神经网络架构使用LIDAR点云和RGB图像来生成由两个子网络共享的特征:区域建议网络(RPN)和第二阶段检测网络。所提出的RPN采用一种新颖的架构,能够在高分辨率特征图上执行多模态特征融合,为道路场景中的多个目标类别生成可靠的3D目标候选区域。利用这些候选区域,第二阶段检测网络执行精确的带方向3D边界框回归和类别分类,以预测目标在3D空间中的范围、方向和类别。实验表明,我们提出的架构在KITTI 3D目标检测基准[1]上取得了最先进的结果,同时能够实时运行且内存占用低,使其成为部署在自动驾驶车辆上的合适候选。代码:https://github.com/kujason/avod

I. Introduction 介绍

The remarkable progress made by deep neural networks on the task of 2D object detection in recent years has not transferred well to the detection of objects in 3D. The gap between the two remains large on standard benchmarks such as the KITTI Object Detection Benchmark [1] where 2D car detectors have achieved over 90% Average Precision (AP), whereas the top scoring 3D car detector on the same scenes only achieves 70% AP. The reason for such a gap stems from the difficulty induced by adding a third dimension to the estimation problem, the low resolution of 3D input data, and the deterioration of its quality as a function of distance. Furthermore, unlike 2D object detection, the 3D object detection task requires estimating oriented bounding boxes (Fig. 1).

深度神经网络近年来在二维物体检测任务上取得的显著进步并没有很好地转移到三维物体检测中。两者之间的差距在标准基准(例如KITTI物体检测基准[1])上仍然很大:2D汽车检测器已达到超过90%的平均精度(AP),而同一场景下得分最高的3D汽车检测器仅达到70%的AP。产生这种差距的原因在于:估计问题增加了第三个维度、3D输入数据分辨率低、以及其质量随距离增大而下降所带来的困难。此外,与2D目标检测不同,3D目标检测任务需要估计带方向的边界框(图1)。
鸟瞰图3D目标检测图示
Fig. 1: A visual representation of the 3D detection problem from Bird's Eye View (BEV). The bounding box in Green is used to determine the IoU overlap in the computation of the average precision. The importance of explicit orientation estimation can be seen as an object's bounding box does not change when the orientation (purple) is shifted by ±π radians.
图1:鸟瞰图(BEV)下3D检测问题的直观表示。绿色框用于确定计算平均精度时的IoU重叠。显式方向估计的重要性可以从下面这一点看出:当方向(紫色)偏移±π弧度时,目标的边界框并不会改变。

Similar to 2D object detectors, most state-of-the-art deep models for 3D object detection rely on a 3D region proposal generation step for 3D search space reduction. Using region proposals allows the generation of high quality detections via more complex and computationally expensive processing at later detection stages. However, any missed instances at the proposal generation stage cannot be recovered during the following stages. Therefore, achieving a high recall during the region proposal generation stage is crucial for good performance.

与2D目标检测器类似,大多数最先进的3D目标检测深度模型都依赖3D区域建议生成步骤来缩小3D搜索空间。使用区域建议可以在后续检测阶段通过更复杂、计算开销更大的处理来生成高质量的检测结果。但是,在区域建议生成阶段漏掉的实例在后续阶段无法再被找回。因此,在区域建议生成阶段实现高召回率对于获得良好的检测性能至关重要。

Region proposal networks (RPNs) were proposed in Faster-RCNN [2], and have become the prevailing proposal generators in 2D object detectors. RPNs can be considered a weak amodal detector, providing proposals with high recall and low precision. These deep architectures are attractive as they are able to share computationally expensive convolutional feature extractors with other detection stages. However, extending these RPNs to 3D is a non-trivial task. The Faster R-CNN RPN architecture is tailored for dense, high resolution image input, where objects usually occupy more than a couple of pixels in the feature map. When considering sparse and low resolution input such as the Front View [3] or Bird’s Eye View (BEV) [4] point cloud projections, this method is not guaranteed to have enough information to generate region proposals, especially for small object classes.

区域建议网络(RPN)是在Faster R-CNN[2]中提出的,并已成为2D目标检测器中主流的区域建议生成器。RPN可被视为一种弱的非模态(amodal)检测器,提供高召回率、低精度的区域建议。这类深度架构很有吸引力,因为它们能够与其他检测阶段共享计算代价高昂的卷积特征提取器。但是,将这些RPN扩展到3D并非易事。Faster R-CNN的RPN架构是为稠密、高分辨率的图像输入定制的,其中目标通常在特征图中占据不止几个像素。在考虑稀疏、低分辨率的输入(如前视图[3]或鸟瞰图(BEV)[4]点云投影)时,该方法无法保证有足够的信息来生成区域建议,尤其是对于小目标类别。

In this paper, we aim to resolve these difficulties by proposing AVOD, an Aggregate View Object Detection architecture for autonomous driving (Fig. 2). The proposed architecture delivers the following contributions:

在本文中,我们旨在通过提出AVOD(一种用于自动驾驶的聚合视图对象检测架构)来解决这些困难(图2)。所提出的架构贡献如下:

  • Inspired by feature pyramid networks (FPNs) [5] for 2D object detection, we propose a novel feature extractor that produces high resolution feature maps from LIDAR point clouds and RGB images, allowing for the localization of small classes in the scene.
  • 受用于2D目标检测的特征金字塔网络(FPN)[5]的启发,我们提出了一种新颖的特征提取器,能够从LIDAR点云和RGB图像生成高分辨率特征图,从而可以定位场景中的小目标类别。
  • We propose a feature fusion Region Proposal Network (RPN) that utilizes multiple modalities to produce high-recall region proposals for small classes.
  • 我们提出了一个特征融合区域建议网络(RPN),其利用多模态为小目标类别生成高召回率的区域建议。
  • We propose a novel 3D bounding box encoding that conforms to box geometric constraints, allowing for higher 3D localization accuracy.
  • 我们提出了一种新颖的3D边界框编码方案,其符合边界框的几何约束,从而实现更高的3D定位精度。
  • The proposed neural network architecture exploits 1x1 convolutions at the RPN stage, along with a fixed look-up table of 3D anchor projections, allowing high computational speed and a low memory footprint while maintaining detection performance.
  • 所提出的神经网络架构在RPN阶段利用1x1卷积,伴随着一个3D锚框投影的固定查找表,从而保持检测性能的同时能够实现高计算速度和低内存占用。
    方案结构图
    Fig. 2: The proposed method’s architectural diagram. The feature extractors are shown in blue, the region proposal network in pink, and the second stage detection network in green.
    图2:所提方法的结构图。蓝色表示特征提取器,粉色表示区域建议网络,绿色表示第二阶段检测网络。

The above contributions result in an architecture that delivers state-of-the-art detection performance at a low computational cost and memory footprint. Finally, we integrate the network into our autonomous driving stack, and show generalization to new scenes and detection under more extreme weather and lighting conditions, making it a suitable candidate for deployment on autonomous vehicles.

上述贡献使得所提出的架构能够以低计算成本和低内存占用获得最先进的检测性能。最后,我们将网络集成到我们的自动驾驶堆栈中,并在更极端的天气和光照条件下展示对新场景的检测和泛化能力,使其成为在自动驾驶车辆上部署的合适候选者。

II. Related Work 相关工作

Hand Crafted Features For Proposal Generation: Before the emergence of 3D Region Proposal Networks (RPNs) [2], 3D proposal generation algorithms typically used hand-crafted features to generate a small set of candidate boxes that retrieve most of the objects in 3D space. 3DOP [6] and Mono3D [7] use a variety of hand-crafted geometric features from stereo point clouds and monocular images to score 3D sliding windows in an energy minimization framework. The top K scoring windows are selected as region proposals, which are then consumed by a modified Fast-RCNN [?] to generate the final 3D detections. We use a region proposal network that learns features from both BEV and image spaces to generate higher quality proposals in an efficient manner.

用于区域提议生成的手工特征: 在3D区域建议网络(RPNs)[2]出现之前,3D区域建议生成算法通常使用手工制作的特征来生成一个小的候选框集合,用于检索3D空间中的大多数目标。 3DOP[6]和Mono3D[7]使用立体点云和单目图像中的各种手工制作的几何特征,通过3D滑动窗口的方式在能量最小化框架中计算得分。选择得分前K的窗口作为区域建议,然后给修改的Fast-RCNN[?]使用,以生成最终的3D检测。我们使用区域建议网络来学习BEV和图像空间的特征,以有效的方式生成更高质量的区域建议。

Proposal Free Single Shot Detectors: Single shot object detectors have also been proposed as RPN free architectures for the 3D object detection task. VeloFCN [3] projects a LIDAR point cloud to the front view, which is used as an input to a fully convolutional neural network to directly generate dense 3D bounding boxes. 3D-FCN [8] extends this concept by applying 3D convolutions on 3D voxel grids constructed from LIDAR point clouds to generate better 3D bounding boxes. Our two-stage architecture uses an RPN to retrieve most object instances in the road scene, providing better results when compared to both of these single shot methods. VoxelNet [9] extends 3D-FCN further by encoding voxels with point-wise features instead of occupancy values. However, even with sparse 3D convolution operations, VoxelNet’s computational speed is still slower than our proposed architecture, which provides better results on the car and pedestrian classes.

无区域建议的单发检测器: 单发目标检测器也被提出作为无需RPN的架构用于3D目标检测任务。VeloFCN[3]将LIDAR点云投影到前视图,并将其作为全卷积神经网络的输入,直接生成稠密的3D边界框。3D-FCN[8]扩展了这一思路,在由LIDAR点云构建的3D体素网格上应用3D卷积,以生成更好的3D边界框。我们的两阶段架构使用RPN来找回道路场景中的大多数目标实例,与这两种单发方法相比取得了更好的结果。VoxelNet[9]进一步扩展了3D-FCN,用逐点特征而不是占据值来编码体素。然而,即使使用稀疏3D卷积操作,VoxelNet的计算速度仍然比我们提出的架构慢,而我们的架构在汽车和行人类别上取得了更好的结果。

Monocular-Based Proposal Generation: Another direction in the state-of-the-art is using mature 2D object detectors for proposal generation in 2D, which are then extruded to 3D through amodal extent regression. This trend started with [10] for indoor object detection, which inspired Frustum-based PointNets (F-PointNet) [11] to use point-wise features of PointNet [12] instead of point histograms for extent regression. While these methods work well for indoor scenes and brightly lit outdoor scenes, they are expected to perform poorly in more extreme outdoor scenarios. Any missed 2D detections will lead to missed 3D detections and therefore, the generalization capability of such methods under such extreme conditions has yet to be demonstrated. LIDAR data is much less variable than image data and we show in Section IV that AVOD is robust to noisy LIDAR data and lighting changes, as it was tested in snowy scenes and in low light conditions.

基于单目的区域建议生成: 现有先进技术中的另一个方向是使用成熟的2D目标检测器在2D中生成区域建议,再通过非模态(amodal)范围回归将其外推到3D。这一趋势始于用于室内目标检测的[10],其启发了基于视锥(Frustum)的PointNet(F-PointNet)[11]使用PointNet[12]的逐点特征代替点直方图来进行范围回归。虽然这些方法在室内场景和光照良好的室外场景中表现良好,但预计它们在更极端的室外场景中表现不佳:任何漏检的2D目标都会导致3D检测的漏检,因此这类方法在极端条件下的泛化能力尚未得到证实。LIDAR数据的变化远小于图像数据,我们在第IV节中表明,AVOD对带噪声的LIDAR数据和光照变化具有鲁棒性,因为它在雪天场景和低光照条件下进行了测试。

Monocular-Based 3D Object Detectors: Another way to utilize mature 2D object detectors is to use prior knowledge to perform 3D object detection from monocular images only. Deep MANTA [13] proposes a many-task vehicle analysis approach from monocular images that optimizes region proposal, detection, 2D box regression, part localization, part visibility, and 3D template prediction simultaneously. The architecture requires a database of 3D models corresponding to several types of vehicles, making the proposed approach hard to generalize to classes where such models do not exist. Deep3DBox [14] proposes to extend 2D object detectors to 3D by exploiting the fact that the perspective projection of a 3D bounding box should fit tightly within its 2D detection window. However, in Section IV, these methods are shown to perform poorly on the 3D detection task compared to methods that use point cloud data.

基于单目的3D物体检测器: 利用成熟2D物体检测器的另一种方式是使用先验知识,仅从单目图像执行3D物体检测。Deep MANTA[13]提出了一种基于单目图像的多任务车辆分析方法,可同时优化区域建议、检测、2D框回归、部件定位、部件可见性和3D模板预测。该架构需要一个包含多种车型的3D模型数据库,使得所提方法难以泛化到不存在此类模型的类别。Deep3DBox[14]提出利用"3D边界框的透视投影应紧密贴合其2D检测窗口"这一事实,将2D目标检测器扩展到3D。但是,在第IV节中可以看到,与使用点云数据的方法相比,这些方法在3D检测任务上表现不佳。

3D Region Proposal Networks: 3D RPNs have previously been proposed in [15] for 3D object detection from RGBD images. However, up to our knowledge, MV3D [4] is the only architecture that proposed a 3D RPN targeted at autonomous driving scenarios. MV3D extends the image based RPN of Faster R-CNN [2] to 3D by corresponding every pixel in the BEV feature map to multiple prior 3D anchors. These anchors are then fed to the RPN to generate 3D proposals that are used to create view-specific feature crops from the BEV, front view of [3], and image view feature maps. A deep fusion scheme is used to combine information from these feature crops to produce the final detection output. However, this RPN architecture does not work well for small object instances in BEV. When downsampled by convolutional feature extractors, small instances will occupy a fraction of a pixel in the final feature map, resulting in insufficient data to extract informative features. Our RPN architecture aims to fuse full resolution feature crops from the image and the BEV feature maps as inputs to the RPN, allowing the generation of high recall proposals for smaller classes. Furthermore, our feature extractor provides full resolution feature maps, which are shown to greatly help in localization accuracy for small objects during the second stage of the detection framework.

3D区域建议网络: [15]中曾提出用于从RGB-D图像进行3D目标检测的3D RPN。然而,据我们所知,MV3D[4]是唯一提出面向自动驾驶场景的3D RPN的架构。MV3D通过将BEV特征图中的每个像素对应到多个先验3D锚框,把Faster R-CNN[2]中基于图像的RPN扩展到3D。这些锚框被馈送到RPN以生成3D区域建议,再利用这些区域建议从BEV、[3]中的前视图以及图像视图特征图中裁剪出视图特定的特征。随后使用深度融合方案组合来自这些特征裁剪的信息,以产生最终的检测输出。但是,这种RPN结构并不适用于BEV中的小目标实例:经过卷积特征提取器的下采样后,小实例在最终特征图中只占不到一个像素,导致没有足够的数据来提取有信息量的特征。我们的RPN架构旨在融合来自图像和BEV特征图的全分辨率特征裁剪作为RPN的输入,从而为尺寸较小的类别生成高召回率的区域建议。此外,我们的特征提取器提供全分辨率的特征图,实验表明这在检测框架的第二阶段对小目标的定位精度有很大帮助。

III. THE AVOD ARCHITECTURE AVOD 架构

The proposed method, depicted in Fig. 2, uses feature extractors to generate feature maps from both the BEV map and the RGB image. Both feature maps are then used by the RPN to generate non-oriented region proposals, which are passed to the detection network for dimension refinement, orientation estimation, and category classification.

本文提出的方法如图2所示,使用特征提取器分别从BEV图和RGB图像生成特征图。RPN随后使用这两个特征图来生成无方向的区域建议,这些区域建议被传递到检测网络,进行尺寸细化、方向估计和类别分类。

A. Generating Feature Maps from Point Clouds and Images
We follow the procedure described in [4] to generate a six-channel BEV map from a voxel grid representation of the point cloud at 0.1 meter resolution. The point cloud is cropped at [-40, 40] × [0, 70] meters to contain points within the field of view of the camera. The first 5 channels of the BEV map are encoded with the maximum height of points in each grid cell, generated from 5 equal slices between [0, 2.5] meters along the Z axis. The sixth BEV channel contains point density information computed per cell as $\min\left(1.0, \frac{\log(N+1)}{\log 16}\right)$, where N is the number of points in the cell.

A. 从点云和图像生成特征映射
我们按照[4]中描述的步骤,从分辨率为0.1米的点云体素网格表示生成六通道的BEV图。点云被裁剪到[-40, 40] × [0, 70]米的范围,以仅包含摄像机视野内的点。BEV图的前5个通道编码每个网格单元中点的最大高度,由沿Z轴[0, 2.5]米范围内的5个等分切片生成。第六个BEV通道包含每个单元的点密度信息,按 $\min\left(1.0, \frac{\log(N+1)}{\log 16}\right)$ 计算,其中N是该单元中的点数。
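
下面给出一段示意性的代码草图(非论文官方实现;函数名、坐标轴约定等均为假设),用来说明上述六通道BEV图的构造方式:前5个通道为各高度切片内每个栅格的最大点高度,第6个通道为按上式计算的点密度。

```python
# 示意性草图(假设性实现):从点云构造 6 通道 BEV 图。
# 假设 points 为 (N, 3) 数组,x 向前、y 向左、z 为高度,需根据实际坐标系调整。
import numpy as np

def make_bev_map(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                 z_range=(0.0, 2.5), resolution=0.1, num_slices=5):
    # 裁剪到相机视野及高度范围内
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[mask]

    H = int((x_range[1] - x_range[0]) / resolution)   # 700
    W = int((y_range[1] - y_range[0]) / resolution)   # 800
    bev = np.zeros((H, W, num_slices + 1), dtype=np.float32)

    rows = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int32)
    cols = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int32)
    slice_h = (z_range[1] - z_range[0]) / num_slices
    slices = np.clip(((pts[:, 2] - z_range[0]) / slice_h).astype(np.int32),
                     0, num_slices - 1)

    # 前 5 个通道:每个切片内、每个栅格单元中点的最大高度
    np.maximum.at(bev, (rows, cols, slices), pts[:, 2])

    # 第 6 个通道:点密度 min(1.0, log(N+1)/log(16))
    counts = np.zeros((H, W), dtype=np.float32)
    np.add.at(counts, (rows, cols), 1.0)
    bev[:, :, num_slices] = np.minimum(1.0, np.log(counts + 1) / np.log(16))
    return bev
```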

B. The Feature Extractor 特征提取器
The proposed architecture uses two identical feature extractor architectures, one for each input view. The full-resolution feature extractor is shown in Fig. 3 and is comprised of two segments: an encoder and a decoder. The encoder is modeled after VGG-16 [16] with some modifications, mainly a reduction of the number of channels by half, and cutting the network at the conv-4 layer. The encoder therefore takes as an input an $M \times N \times D$ image or BEV map, and produces an $\frac{M}{8} \times \frac{N}{8} \times D^{*}$ feature map F. F has high representational power, but is 8× lower in resolution compared to the input. An average pedestrian in the KITTI dataset occupies 0.8 × 0.6 meters in the BEV. This translates to an 8 × 6 pixel area in a BEV map with 0.1 meter resolution. Downsampling by 8× results in these small classes occupying less than one pixel in the output feature map, without taking into account the increase in receptive field caused by convolutions. Inspired by the Feature Pyramid Network (FPN) [5], we create a bottom-up decoder that learns to upsample the feature map back to the original input size, while maintaining run time speed. The decoder takes as an input the output of the encoder, F, and produces a new $M \times N \times \widetilde{D}$ feature map. Fig. 3 shows the operations performed by the decoder, which include upsampling of the input via a conv-transpose operation, concatenation of a corresponding feature map from the encoder, and finally fusing the two via a 3 × 3 convolution operation. The final feature map is of high resolution and representational power, and is shared by both the RPN and the second stage detection network.

所提出的架构使用两个结构相同的特征提取器,分别对应一个输入视图(RGB或BEV)。全分辨率特征提取器如图3所示,由两部分组成:编码器和解码器。编码器以VGG-16[16]为基础并做了一些修改,主要是将通道数减半,并在conv-4层截断网络。因此,编码器以 $M \times N \times D$ 的图像或BEV图作为输入,输出 $\frac{M}{8} \times \frac{N}{8} \times D^{*}$ 的特征图F。F具有很强的表征能力,但分辨率比输入低8倍。在BEV中,KITTI数据集中的行人平均只占0.8 × 0.6米,对应到分辨率为0.1米的BEV图中仅为8 × 6像素的区域。下采样8倍会使这些小目标类别在输出特征图中占据不到一个像素,这还没有考虑卷积带来的感受野扩大。受特征金字塔网络(FPN)[5]的启发,我们构建了一个自底向上的解码器,学习将特征图上采样回原始输入大小,同时保持运行速度。解码器以编码器的输出F作为输入,产生新的 $M \times N \times \widetilde{D}$ 特征图。图3展示了解码器执行的操作,包括通过转置卷积(conv-transpose)对输入进行上采样、与编码器中对应的特征图级联,最后通过3 × 3卷积将两者融合。最终的特征图具有高分辨率和强表征能力,并由RPN和第二阶段检测网络共享。
全分辨率特征提取器
Fig. 3: The architecture of our proposed high resolution feature extractor shown here for the image branch. Feature maps are propagated from the encoder to the decoder section via red arrows. Fusion is then performed at every stage of the decoder by a learned upsampling layer, followed by concatenation, and then mixing via a convolutional layer, resulting in a full resolution feature map at the last layer of the decoder.
图3:我们提出的用于图像分支的高分辨率特征提取器的网络结构。特征图通过红色箭头从编码器传播到解码器部分。然后通过学习的上采样层在解码器的每个阶段执行融合,接着进行连接,然后通过卷积层进行特征融合,最终在解码器的最后一层产生全分辨率特征图。
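
下面是一个示意性的解码器模块草图(基于上文描述的假设性实现,通道数仅为示意),对应图3中"转置卷积上采样 → 与编码器特征级联 → 3×3卷积融合"的单级操作:

```python
# 示意性草图(假设性实现):FPN 风格解码器中的一级融合模块
import torch
import torch.nn as nn

class DecoderFusionBlock(nn.Module):
    """转置卷积上采样 -> 与编码器对应特征图级联 -> 3x3 卷积融合。"""
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.upconv = nn.ConvTranspose2d(in_channels, out_channels,
                                         kernel_size=2, stride=2)
        self.fuse = nn.Conv2d(out_channels + skip_channels, out_channels,
                              kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, skip):
        x = self.relu(self.upconv(x))      # 2 倍上采样
        x = torch.cat([x, skip], dim=1)    # 与编码器特征级联
        return self.relu(self.fuse(x))     # 3x3 卷积混合

# 用法示意:从 1/8 分辨率逐级恢复到全分辨率(通道数为假设值)
# block = DecoderFusionBlock(in_channels=256, skip_channels=128, out_channels=128)
```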

C. Multimodal Fusion Region Proposal Network
Similar to 2D two-stage detectors, the proposed RPN regresses the difference between a set of prior 3D boxes and the ground truth. These prior boxes are referred to as anchors, and are encoded using the axis aligned bounding box encoding shown in Fig. 4. Anchor boxes are parameterized by the centroid (tx, ty, tz) and axis aligned dimensions (dx, dy, dz). To generate the 3D anchor grid, (tx, ty) pairs are sampled at an interval of 0.5 meters in BEV, while tz is determined based on the sensor’s height above the ground plane. The dimensions of the anchors are determined by clustering the training samples for each class. Anchors without 3D points in BEV are removed efficiently via integral images resulting in 80 − 100K non-empty anchors per frame.

C. 多模态融合区域建议网络
类似于2D两阶段检测器,所提出的RPN回归一组先验3D框与真实值之间的差异。这些先验框被称为锚框,并使用图4中所示的轴对齐边界框编码来表示。锚框由中心 (tx, ty, tz) 和轴对齐的尺寸 (dx, dy, dz) 参数化。为了生成3D锚框网格,在BEV中以0.5米的间隔对 (tx, ty) 进行采样,而 tz 则根据传感器在地平面上方的高度确定。锚框的尺寸通过对每个类别的训练样本进行聚类得到。借助积分图,可以高效地移除BEV中内部没有3D点的锚框,最终每帧得到80–100K个非空锚框。
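
下面的示意性草图(假设性实现,坐标约定与参数均为假设)说明锚框中心网格的生成方式,以及如何用BEV占据图的积分图快速剔除内部没有激光点的空锚框:

```python
# 示意性草图(假设性实现):按 0.5 米间隔生成锚框中心,并用积分图剔除空锚框。
import numpy as np

def generate_anchor_centers(x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                            stride=0.5, tz=-1.0):
    # 锚框尺寸 (dx, dy, dz) 由各类别训练样本聚类得到,此处只生成中心;tz 为假设值
    xs = np.arange(x_range[0], x_range[1], stride)
    ys = np.arange(y_range[0], y_range[1], stride)
    gx, gy = np.meshgrid(xs, ys, indexing='ij')
    return np.stack([gx.ravel(), gy.ravel(), np.full(gx.size, tz)], axis=1)

def filter_empty_anchors(occupancy, anchors_bev):
    """occupancy: BEV 占据图 (H, W);anchors_bev: (K, 4) 的 [r0, c0, r1, c1] 像素范围。"""
    integral = occupancy.cumsum(axis=0).cumsum(axis=1)
    integral = np.pad(integral, ((1, 0), (1, 0)))
    r0, c0, r1, c1 = anchors_bev.T
    # 积分图求区域和:S = I[r1,c1] - I[r0,c1] - I[r1,c0] + I[r0,c0]
    counts = (integral[r1, c1] - integral[r0, c1]
              - integral[r1, c0] + integral[r0, c0])
    return counts > 0   # 只保留内部有点的非空锚框
```
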
边界框编码
Fig. 4: A visual comparison between the 8 corner box encoding proposed in [4], the axis aligned box encoding proposed in [15], and our 4 corner encoding.
图4:[4]中提出的8角点框编码、[15]中提出的轴对齐框编码与我们的4角点编码的可视化比较。

Extracting Feature Crops Via Multiview Crop And Resize Operations: To extract feature crops for every anchor from the view specific feature maps, we use the crop and resize operation [17]. Given an anchor in 3D, two regions of interest are obtained by projecting the anchor onto the BEV and image feature maps. The corresponding regions are then used to extract feature map crops from each view, which are then bilinearly resized to 3 × 3 to obtain equal-length feature vectors. This extraction method results in feature crops that abide by the aspect ratio of the projected anchor in both views, providing a more reliable feature crop than the 3 × 3 convolution used originally by Faster-RCNN.

通过多视图裁剪和调整大小操作提取特征裁剪: 为了从各视图特定的特征图中为每个锚框提取特征裁剪,我们使用裁剪并调整大小(crop and resize)操作[17]。给定一个3D锚框,通过将其分别投影到BEV和图像特征图上得到两个感兴趣区域,再用对应区域从每个视图中裁剪出特征图,并双线性地调整为3×3,从而得到等长的特征向量。这种提取方式使特征裁剪在两个视图中都保持投影锚框的长宽比,比Faster-RCNN最初使用的3×3卷积提供了更可靠的特征裁剪。
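
作为示意,下面用torchvision中的roi_align来近似模拟上文的"裁剪并调整大小"操作(特征图尺寸与框坐标均为假设输入,仅演示单一视图):

```python
# 示意性草图:用 roi_align 把每个锚框投影区域重采样为 3x3 的等长特征
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 32, 88, 100)            # (N, C, H, W),通道数为假设值
boxes = torch.tensor([[0, 10.0, 20.0, 25.0, 32.0]])  # [batch_idx, x1, y1, x2, y2],特征图坐标
crops = roi_align(feature_map, boxes, output_size=(3, 3), aligned=True)
print(crops.shape)   # torch.Size([1, 32, 3, 3])
```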

Dimensionality Reduction Via 1 × 1 Convolutional Layers: In some scenarios, the region proposal network is required to save feature crops for 100K anchors in GPU memory. Attempting to extract feature crops directly from high dimensional feature maps imposes a large memory overhead per input view. As an example, extracting 7 × 7 feature crops for 100K anchors from a 256-dimensional feature map requires around 5 gigabytes of memory assuming 32-bit floating point representation. Furthermore, processing such high-dimensional feature crops with the RPN greatly increases its computational requirements.

通过1×1卷积层降低维度: 在某些情况下,区域提议网络需要在GPU内存中保存100K锚点的特征裁剪。 尝试直接从高维特征映射中提取特征裁剪会使每个输入视图产生大量内存开销。 例如,从256维特征映射中提取100K锚点的7×7特征裁剪需要大约5千兆字节的内存,假设32位浮点表示。 此外,利用RPN处理这种高维特征裁剪极大地增加了其计算需求。
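
可以用一行算式验证上文的内存估计(按文中的100K锚框、7×7裁剪、256维特征和32位浮点假设计算):

```python
# 粗略验算文中的内存估计
num_anchors, crop, channels, bytes_per_float = 100_000, 7, 256, 4
total = num_anchors * crop * crop * channels * bytes_per_float
print(total / 1e9)   # ≈ 5.02 GB,与文中"约 5 千兆字节"一致
```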

Inspired by their use in [18], we propose to apply a 1 × 1 convolutional kernel on the output feature maps from each view, as an efficient dimensionality reduction mechanism that learns to select features that contribute greatly to the performance of the region proposal generation. This reduces the memory overhead for computing anchor specific feature crops by $\widetilde{D}\times$, allowing the RPN to process fused features of tens of thousands of anchors using only a few megabytes of additional memory.

受[18]中用法的启发,我们提出在每个视图输出的特征图上应用1×1卷积核,作为一种高效的降维机制,学习挑选对区域建议生成性能贡献最大的特征。这将计算锚框特定特征裁剪的内存开销降低了 $\widetilde{D}$ 倍,使RPN只需额外几兆字节的内存就能处理数万个锚框的融合特征。
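
下面是该降维机制的一个最小示意(假设从256维降到32维,具体通道数仅为示意,原文中由 $\widetilde{D}$ 决定):

```python
# 示意性草图:用 1x1 卷积把每个视图的特征图降到较小的维度,再做裁剪
import torch
import torch.nn as nn

reduce_bev = nn.Conv2d(256, 32, kernel_size=1)   # BEV 分支的 1x1 瓶颈层
reduce_img = nn.Conv2d(256, 32, kernel_size=1)   # 图像分支的 1x1 瓶颈层

bev_feat = torch.randn(1, 256, 88, 100)          # 特征图尺寸为假设值
bev_reduced = reduce_bev(bev_feat)               # (1, 32, 88, 100)
```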

3D Proposal Generation: The outputs of the crop and resize operation are equal-sized feature crops from both views, which are fused via an element-wise mean operation. Two task specific branches [2] of fully connected layers of size 256 use the fused feature crops to regress axis aligned object proposal boxes and output an object/background “objectness” score. 3D box regression is performed by computing (∆tx, ∆ty, ∆tz, ∆dx, ∆dy, ∆dz), the difference in centroid and dimensions between anchors and ground truth bounding boxes. Smooth L1 loss is used for 3D box regression, and cross-entropy loss for “objectness”. Similar to [2], background anchors are ignored when computing the regression loss. Background anchors are determined by calculating the 2D IoU in BEV between the anchors and the ground truth bounding boxes. For the car class, anchors with IoU less than 0.3 are considered background anchors, while ones with IoU greater than 0.5 are considered object anchors. For the pedestrian and cyclist classes, the object anchor IoU threshold is reduced to 0.45. To remove redundant proposals, 2D non-maximum suppression (NMS) at an IoU threshold of 0.8 in BEV is used to keep the top 1024 proposals during training. At inference time, 300 proposals are used for the car class, whereas 1024 proposals are kept for pedestrians and cyclists.

3D区域建议生成: 裁剪和调整大小操作的输出是来自两个视图的等尺寸特征裁剪,通过逐元素求平均进行融合。两个任务特定分支[2](各为256维的全连接层)使用融合后的特征裁剪来回归轴对齐的目标候选框,并输出目标/背景的"目标性"分数。3D框回归通过计算锚框与真值边界框之间的中心和尺寸差 (∆tx, ∆ty, ∆tz, ∆dx, ∆dy, ∆dz) 来完成。3D框回归使用平滑L1损失,"目标性"分类使用交叉熵损失。与[2]类似,计算回归损失时忽略背景锚框。背景锚框通过计算锚框与真值边界框在BEV中的2D IoU来确定:对于汽车类,IoU小于0.3的锚框被视为背景锚框,IoU大于0.5的锚框被视为目标锚框;对于行人和骑车者类,目标锚框的IoU阈值降低为0.45。为了去除冗余的区域建议,训练时在BEV中以0.8的IoU阈值做2D非极大值抑制(NMS),保留前1024个区域建议;推理时,汽车类使用300个区域建议,行人和骑车者类保留1024个。
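
下面的示意性草图(假设性实现)展示了上述RPN训练目标的基本形式:回归目标为锚框与真值框的中心和尺寸之差,正负锚框按BEV 2D IoU阈值划分:

```python
# 示意性草图(假设性实现):RPN 回归目标与正负锚框划分
import numpy as np

def rpn_regression_targets(anchors, gt_boxes):
    """anchors, gt_boxes: (K, 6) 的 [tx, ty, tz, dx, dy, dz]。"""
    delta_t = gt_boxes[:, :3] - anchors[:, :3]   # 中心差 (Δtx, Δty, Δtz)
    delta_d = gt_boxes[:, 3:] - anchors[:, 3:]   # 尺寸差 (Δdx, Δdy, Δdz)
    return np.concatenate([delta_t, delta_d], axis=1)

def assign_anchor_labels(bev_iou, neg_thresh=0.3, pos_thresh=0.5):
    """bev_iou: 每个锚框与其最佳匹配真值框的 BEV 2D IoU。
    返回 1=目标, 0=背景, -1=介于两阈值之间(此处标为忽略,属于假设性处理)。"""
    labels = np.full(bev_iou.shape, -1, dtype=np.int8)
    labels[bev_iou < neg_thresh] = 0
    labels[bev_iou >= pos_thresh] = 1
    return labels
```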

D. Second Stage Detection Network
3D Bounding Box Encoding: In [4], Chen et al. claim that 8 corner box encoding provides better results than the traditional axis aligned encoding previously proposed in [15]. However, an 8 corner encoding does not take into account the physical constraints of a 3D bounding box, as the top corners of the bounding box are forced to align with those at the bottom. To reduce redundancy and keep these physical constraints, we propose to encode the bounding box with four corners and two height values representing the top and bottom corner offsets from the ground plane, determined from the sensor height. Our regression targets are therefore (∆x1…∆x4, ∆y1…∆y4, ∆h1, ∆h2), the corner and height offsets from the ground plane between the proposals and the ground truth boxes. To determine corner offsets, we correspond the closest corner of the proposals to the closest corner of the ground truth box in BEV. The proposed encoding reduces the box representation from an overparameterized 24 dimensional vector to a 10 dimensional one.

D. 第二阶段检测网络
3D边界框编码: 在[4]中,Chen等人声称8角点框编码比[15]中先前提出的传统轴对齐编码效果更好。但是,8角点编码没有显式利用3D边界框的物理约束,即边界框的上部角点应与底部角点对齐。为了减少冗余并保持这些物理约束,我们提出用四个角点和两个高度值来编码边界框,这两个高度值表示相对于由传感器高度确定的地平面的顶部与底部角点偏移。因此,我们的回归目标是 (∆x1…∆x4, ∆y1…∆y4, ∆h1, ∆h2),即区域建议与真值框之间相对于地平面的角点和高度偏移。为了确定角点偏移,我们将区域建议框的最近角点与BEV中真值框的最近角点对应起来。所提出的编码将框的表示从过参数化的24维向量减少到10维向量。
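
下面给出一个把带方向3D框转换为上述10维"4角点+2高度"表示的示意性草图(假设框中心位于几何中心、朝向角为BEV平面内的旋转角;实际回归的是相对于区域建议的偏移量):

```python
# 示意性草图(假设性实现):3D 框 -> 10 维 "4 角点 + 2 高度" 编码
import numpy as np

def box_to_4c2h(cx, cy, cz, l, w, h, ry, ground_z=0.0):
    # BEV 下矩形的 4 个角点:先在局部坐标系定义,再按朝向 ry 旋转并平移
    corners = np.array([[ l / 2,  w / 2],
                        [ l / 2, -w / 2],
                        [-l / 2, -w / 2],
                        [-l / 2,  w / 2]])
    rot = np.array([[np.cos(ry), -np.sin(ry)],
                    [np.sin(ry),  np.cos(ry)]])
    corners = corners @ rot.T + np.array([cx, cy])   # (4, 2)
    h1 = (cz - h / 2) - ground_z                     # 底面相对地平面的偏移
    h2 = (cz + h / 2) - ground_z                     # 顶面相对地平面的偏移
    return np.concatenate([corners.reshape(-1), [h1, h2]])   # 10 维向量
```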

Explicit Orientation Vector Regression: To determine orientation from a 3D bounding box, MV3D [4] relies on the extents of the estimated bounding box where the orientation vector is assumed to be in the direction of the longer side of the box. This approach suffers from two problems. First, this method fails for detected objects that do not always obey the rule proposed above, such as pedestrians. Secondly, the resulting orientation is only known up to an additive constant of ±π radians. Orientation information is lost as the corner order is not preserved in closest corner to corner matching. Fig. 1 presents an example of how the same rectangular bounding box can contain two instances of an object with opposite orientation vectors. Our architecture remedies this problem by computing (xθ, yθ) = (cos(θ),sin(θ)). This orientation vector representation implicitly handles angle wrapping as every θ ∈ [−π, π] can be represented by a unique unit vector in the BEV space. We use the regressed orientation vector to resolve the ambiguity in the bounding box orientation estimate from the adopted four corner representation, as this experimentally found to be more accurate than using the regressed orientation directly. Specifically, we extract the four possible orientations of the bounding box, and then choose the one closest to the explicitly regressed orientation vector.

显式的方向向量回归: 为了从3D边界框确定方向,MV3D[4]依赖于所估计边界框的范围,并假定方向向量沿框较长边的方向。这种方法有两个问题。首先,对于并不总是满足上述假设的目标(例如行人),该方法会失效。其次,所得到的方向只能确定到相差±π弧度的程度:由于最近角点匹配不保留角点顺序,方向信息会丢失。图1给出了一个例子,说明同一个矩形边界框可以对应两个方向向量相反的目标实例。我们的架构通过回归 (xθ, yθ) = (cos(θ), sin(θ)) 来解决这个问题。这种方向向量表示隐式地处理了角度回绕,因为每个 θ ∈ [−π, π] 都可以由BEV空间中唯一的单位向量表示。我们使用回归得到的方向向量来消除由四角点表示推断出的边界框方向的歧义,实验表明这比直接使用回归方向更准确。具体来说,我们提取边界框的四个可能方向,然后选择与显式回归的方向向量最接近的那个。
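
下面的示意性草图(假设性实现,假定四个候选朝向相差π/2的整数倍)说明如何用回归得到的 (cosθ, sinθ) 在候选朝向中消除歧义:

```python
# 示意性草图(假设性实现):用显式回归的方向向量在 4 个候选朝向中做选择
import numpy as np

def resolve_orientation(box_heading, cos_sin):
    """box_heading: 由 4 角点框推断出的某一朝向(弧度);
    cos_sin: 网络回归出的 (cosθ, sinθ)。"""
    theta_reg = np.arctan2(cos_sin[1], cos_sin[0])
    candidates = box_heading + np.array([0.0, 0.5, 1.0, 1.5]) * np.pi
    # 取与回归方向角差(考虑角度回绕)最小的候选
    diff = np.abs(np.arctan2(np.sin(candidates - theta_reg),
                             np.cos(candidates - theta_reg)))
    return candidates[np.argmin(diff)]
```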

Generating Final Detections: Similar to the RPN, the inputs to the multiview detection network are feature crops generated from projecting the proposals into the two input views. As the number of proposals is an order of magnitude lower than the number of anchors, the original feature map with a depth of $\widetilde{D} = 32$ is used for generating these feature crops. Crops from both input views are resized to 7 × 7 and then fused with an element-wise mean operation. A single set of three fully connected layers of size 2048 process the fused feature crops to output box regression, orientation estimation, and category classification for each proposal. Similar to the RPN, we employ a multi-task loss combining two Smooth L1 losses for the bounding box and orientation vector regression tasks, and a cross-entropy loss for the classification task. Proposals are only considered in the evaluation of the regression loss if they have at least a 0.65 or 0.55 2D IoU in BEV with the ground truth boxes for the car and pedestrian/cyclist classes, respectively. To remove overlapping detections, NMS is used at a threshold of 0.01.

生成最终检测: 与RPN类似,多视图检测网络的输入是将区域建议投影到两个输入视图中所生成的特征裁剪。由于建议框的数量比锚框低一个数量级,因此使用深度为 $\widetilde{D} = 32$ 的原始特征图来生成这些特征裁剪。来自两个输入视图的裁剪被调整为7×7,再通过逐元素求平均进行融合。一组三个2048维的全连接层处理融合后的特征裁剪,为每个建议框输出框回归、方向估计和类别分类。与RPN类似,我们采用多任务损失,将边界框和方向向量回归任务的两个平滑L1损失与分类任务的交叉熵损失相结合。只有当区域建议在BEV中与真值框的2D IoU对汽车类至少达到0.65、对行人/骑车者类至少达到0.55时,才会被纳入回归损失的计算。最后使用阈值为0.01的NMS去除重叠的检测结果。
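
下面是第二阶段检测头的一个结构示意(假设性实现,输入通道数与类别数仅为示意):

```python
# 示意性草图(假设性实现):第二阶段检测头
# 7x7 裁剪经逐元素平均融合后,经三个 2048 维全连接层输出三个任务分支。
import torch
import torch.nn as nn

class SecondStageHead(nn.Module):
    def __init__(self, in_dim=32 * 7 * 7, num_classes=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, 2048), nn.ReLU(inplace=True),
        )
        self.box_reg = nn.Linear(2048, 10)            # 4 角点 + 2 高度偏移
        self.orient = nn.Linear(2048, 2)              # (cosθ, sinθ)
        self.cls = nn.Linear(2048, num_classes + 1)   # 类别 + 背景(输出形式为假设)

    def forward(self, bev_crop, img_crop):
        fused = (bev_crop + img_crop) / 2.0           # 逐元素平均融合
        x = self.fc(fused.flatten(1))
        return self.box_reg(x), self.orient(x), self.cls(x)
```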

E. Training
We train two networks, one for the car class and one for both the pedestrian and cyclist classes. The RPN and the detection networks are trained jointly in an end-to-end fashion using mini-batches containing one image with 512 and 1024 ROIs, respectively. The network is trained for 120K iterations using an ADAM optimizer with an initial learning rate of 0.0001 that is decayed exponentially every 30K iterations with a decay factor of 0.8.

E. 训练
我们训练两个网络:一个用于汽车类,另一个同时用于行人和骑车者两类。RPN和检测网络以端到端的方式联合训练,使用的小批量各包含一张图像,分别带有512和1024个感兴趣区域(ROI)。网络使用ADAM优化器训练120K次迭代,初始学习率为0.0001,每30K次迭代以0.8的衰减因子进行指数衰减。
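
文中的优化器与学习率设置可以用如下方式等价地写出(示意性草图,model 为占位):

```python
# 示意性草图:ADAM、初始学习率 1e-4、每 30K 次迭代按 0.8 衰减
import torch

model = torch.nn.Linear(10, 10)   # 占位模型,仅为示意
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30_000, gamma=0.8)

for step in range(120_000):
    # ... 前向传播、计算损失、反向传播 ...
    optimizer.step()
    scheduler.step()              # 每次迭代调用;学习率每 30K 步乘以 0.8
```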

IV. EXPERIMENTS AND RESULTS 实验和结果

We test AVOD’s performance on the proposal generation and object detection tasks on the three classes of the KITTI Object Detection Benchmark [1]. We follow [4] to split the provided 7481 training frames into a training and a validation set at approximately a 1 : 1 ratio. For evaluation, we follow the easy, medium, hard difficulty classification proposed by KITTI. We evaluate and compare two versions of our implementation, Ours with a VGG-like feature extractor similar to [4], and Ours (Feature Pyramid) with the proposed high resolution feature extractor described in Section III-B.

我们在KITTI目标检测基准[1]的三类目标上测试了AVOD在区域建议生成和目标检测任务上的性能。我们按照[4]的做法,将提供的7481帧训练数据以约1:1的比例划分为训练集和验证集。评测时,我们遵循KITTI提出的容易、中等、困难三个难度等级。我们评估并比较了我们实现的两个版本:Ours使用与[4]类似的VGG风格特征提取器,Ours (Feature Pyramid)使用III-B节中提出的高分辨率特征提取器。

3D Proposal Recall: 3D proposal generation is evaluated using 3D bounding box recall at a 0.5 3D IoU threshold. We compare three variants of our RPN against the proposal generation algorithms 3DOP [6] and Mono3D [7]. Fig. 5 shows the recall vs number of proposals curves for our RPN variants, 3DOP and Mono3D. It can be seen that our RPN variants outperform both 3DOP and Mono3D by a wide margin on all three classes. As an example, our Feature Pyramid based fusion RPN achieves an 86% 3D recall on the car class with just 10 proposals per frame. The maximum recall achieved by 3DOP and Mono3D on the car class is 73.87% and 65.74% respectively. This gap is also present for the pedestrian and cyclist classes, where our RPN achieves more than 20% increase in recall at 1024 proposals. This large gap in performance suggests the superiority of learning based approaches over methods based on hand crafted features. For the car class, our RPN variants achieve a 91% recall at just 50 proposals, whereas MV3D [4] reported requiring 300 proposals to achieve the same recall. It should be noted that MV3D does not publicly provide proposal results for cars, and was not tested on pedestrians or cyclists.

3D区域建议召回: 3D区域建议的生成用3D IoU阈值为0.5时的3D边界框召回率来评测。我们将RPN的三种变体与区域建议生成算法3DOP[6]和Mono3D[7]进行比较。图5展示了我们的RPN变体、3DOP和Mono3D的召回率随区域建议数量变化的曲线。可以看到,我们的RPN变体在全部三个类别上都大幅超过3DOP和Mono3D。例如,我们基于特征金字塔的融合RPN在汽车类上每帧仅用10个区域建议就达到了86%的3D召回率,而3DOP和Mono3D在汽车类上能达到的最大召回率分别只有73.87%和65.74%。这一差距同样出现在行人和骑车者类别上:在1024个区域建议下,我们的RPN的召回率高出20%以上。如此大的性能差距表明基于学习的方法优于基于手工特征的方法。对于汽车类,我们的RPN变体仅用50个区域建议就达到了91%的召回率,而MV3D[4]报告需要300个区域建议才能达到相同的召回率。需要指出的是,MV3D没有公开提供汽车类的区域建议结果,也没有在行人或骑车者上进行测试。
3D区域建议召回比较
Fig. 5: Recall vs. number of proposals at a 3D IoU threshold of 0.5 for the three classes evaluated on the validation set at moderate difficulty.
图5:三个类别在中等难度验证集上、3D IoU阈值为0.5时,召回率随区域建议数量变化的曲线。
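
图5所用召回率的计算方式可以概括为如下示意(假设已经得到候选框与真值框之间的3D IoU矩阵):

```python
# 示意性草图(假设性实现):给定 IoU 阈值下的区域建议召回率
import numpy as np

def proposal_recall(iou_matrix, iou_thresh=0.5):
    """iou_matrix: (num_proposals, num_gt) 的 3D IoU 矩阵。"""
    if iou_matrix.size == 0:
        return 0.0
    covered = (iou_matrix.max(axis=0) >= iou_thresh)   # 每个真值框的最佳 IoU 是否达标
    return covered.mean()
```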

3D Object Detection: 3D detection results are evaluated using the 3D and BEV AP and Average Heading Similarity (AHS) at 0.7 IoU threshold for the car class, and 0.5 IoU threshold for the pedestrian and cyclist classes. The AHS is the Average Orientation Similarity (AOS) [1], but evaluated using 3D IOU and global orientation angle instead of 2D IOU and observation angle, removing the metric’s dependence on localization accuracy. We compare against publicly provided detections from MV3D [4] and Deep3DBox [14] on the validation set. It has to be noted that no currently published method publicly provides results on the pedestrian and cyclist classes for the 3D object detection task, and hence comparison is done for the car class only. On the validation set (Table I), our architecture is shown to outperform MV3D by 2.09% AP on the moderate setting and 4.09% on the hard setting. However, AVOD achieves a 30.36% and 28.42% increase in AHS over MV3D at the moderate and hard setting respectively. This can be attributed to the loss of orientation vector direction discussed in Section III-D resulting in orientation estimation up to an additive error of ±π radians. To verify this assertion, Fig. 7 shows a visualization of the results of AVOD and MV3D in comparison to KITTI’s ground truth. It can be seen that MV3D assigns erroneous orientations for almost half of the cars shown. On the other hand, our proposed architecture assigns the correct orientation for all cars in the scene. As expected, the gap in 3D localization performance between Deep3DBox and our proposed architecture is very large. It can be seen in Fig. 7 that Deep3DBox fails at accurately localizing most of the vehicles in 3D. This further enforces the superiority of fusion based methods over monocular based ones. We also compare the performance of our architecture on the KITTI test set with MV3D, VoxelNet[9], and F-PointNet[11]. Test set results are provided directly by the evaluation server, which does not compute the AHS metric. Table II shows the results of AVOD on KITTI’s test set. It can be seen that even with only the encoder for feature extraction, our architecture performs quite well on all three classes, while being twice as fast as the next fastest method, F-PointNet. However, once we add our high-resolution feature extractor (Feature Pyramid), our architecture outperforms all other methods on the car class in 3D object detection, with a noticeable margin of 4.19% on hard (highly occluded or far) instances in comparison to the second best performing method, F-PointNet. On the pedestrian class, our Feature Pyramid architecture ranks first in BEV AP, while scoring slightly above F-PointNet on hard instances using 3D AP. On the cyclist class, our method falls short to F-PointNet. We believe that this is due to the low number of cyclist instances in the KITTI dataset, which induces a bias towards pedestrian detection in our joint pedestrian/cyclist network.

3D目标检测: 3D检测结果使用3D和BEV的AP以及平均航向相似度(AHS)进行评测,汽车类的IoU阈值为0.7,行人和骑车者类为0.5。AHS即平均方向相似度(AOS)[1],但使用3D IoU和全局方向角而不是2D IoU和观测角进行评测,从而消除该指标对定位精度的依赖。我们在验证集上与MV3D[4]和Deep3DBox[14]公开提供的检测结果进行比较。需要注意的是,目前没有已发表的方法公开提供行人和骑车者类别上3D目标检测任务的结果,因此只在汽车类上进行比较。在验证集上(表I),我们的架构在中等设置下比MV3D高出2.09%的AP,在困难设置下高出4.09%;而在AHS上,AVOD在中等和困难设置下分别比MV3D高出30.36%和28.42%。这可以归因于第III-D节中讨论的方向向量方向信息的丢失,导致其方向估计存在±π弧度的附加误差。为了验证这一判断,图7将AVOD和MV3D的结果与KITTI的真值进行了可视化对比:可以看出,MV3D给所示车辆中近一半分配了错误的方向,而我们提出的架构为场景中所有车辆都给出了正确的方向。正如预期的那样,Deep3DBox与我们提出的架构之间的3D定位性能差距非常大,从图7中可以看出,Deep3DBox无法在3D中准确定位大多数车辆,这进一步说明了基于融合的方法优于基于单目的方法。我们还在KITTI测试集上将我们架构的性能与MV3D、VoxelNet[9]和F-PointNet[11]进行了比较。测试集结果由评估服务器直接给出,该服务器不计算AHS指标。表II给出了AVOD在KITTI测试集上的结果。可以看出,即使只用编码器进行特征提取,我们的架构在全部三个类别上都表现良好,同时速度是次快方法F-PointNet的两倍。而一旦加入我们的高分辨率特征提取器(Feature Pyramid),我们的架构在汽车类的3D目标检测上超过了所有其他方法,在困难(严重遮挡或距离较远)实例上比表现第二好的F-PointNet高出4.19%。在行人类上,我们的特征金字塔架构在BEV AP上排名第一,在困难实例的3D AP上也略高于F-PointNet。在骑车者类上,我们的方法不及F-PointNet,我们认为这是由于KITTI数据集中骑车者实例数量较少,使我们的行人/骑车者联合网络偏向于行人检测。
表一
TABLE I: A comparison of the performance of Deep3DBox [14], MV3D [4], and our method evaluated on the car class in the validation set. For evaluation, we show the AP and AHS (in %) at 0.7 3D IoU.
表I:Deep3DBox[14],MV3D[4]和我们的方法在验证集中的汽车类上评估的性能比较。 为了评估,我们以0.7的 3D IoU阈值显示AP和AHS(%)。
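
作为补充说明(非原文内容,属于假设性的推断),AHS中每个匹配检测的航向相似度项与KITTI AOS的形式相同,只是匹配使用3D IoU、角度使用全局朝向角,大致如下:

```python
# 示意性草图(假设):航向相似度项,朝向完全一致为 1,相差 π 为 0
import numpy as np

def heading_similarity(theta_det, theta_gt):
    return (1.0 + np.cos(theta_det - theta_gt)) / 2.0
```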

方向定位
Fig. 7: A qualitative comparison between MV3D [4], Deep3DBox [14], and our architecture relative to KITTI’s ground truth on a sample in the validation set.
图7:MV3D[4],Deep3DBox[14]和我们的架构相对于KITTI验证集中样本的标注真值的定性比较。

KITTI测试集结果比较
TABLE II: A comparison of the performance of AVOD with the state of the art 3D object detectors evaluated on KITTI’s test set. Results are generated by KITTI’s evaluation server [19].
表II:AVOD与目前先进的3D对象检测器在KITTI测试集上评估的性能比较。 结果由KITTI的评估服务器生成[19]。

Runtime and Memory Requirements: We use FLOP count and number of parameters to assess the computational efficiency and the memory requirements of the proposed network. Our final Feature Pyramid fusion architecture employs roughly 38.073 million parameters, approximately 16% that of MV3D. The deep fusion scheme employed by MV3D triples the number of fully connected layers required for the second stage detection network, which explains the significant reduction in the number of parameters by our proposed architecture. Furthermore, our Feature Pyramid fusion architecture requires 231.263 billion FLOPs per frame allowing it to process frames in 0.1 seconds on a TITAN Xp GPU, taking 20ms for pre-processing and 80ms for inference. This makes it 1.7× faster than F-PointNet, while maintaining state-of-the-art results. Finally, our proposed architecture requires only 2 gigabytes of GPU memory at inference time, making it suitable to be used for deployment on autonomous vehicles.

运行时间和内存需求: 我们使用FLOP数和参数数量来评估所提网络的计算效率和内存需求。我们最终的特征金字塔融合架构约有3807.3万个参数,大约是MV3D的16%。MV3D采用的深度融合方案使第二阶段检测网络所需的全连接层数量增加为三倍,这解释了我们提出的架构为何能显著减少参数数量。此外,我们的特征金字塔融合架构每帧需要约2312.6亿次浮点运算(231.263 billion FLOPs),可以在TITAN Xp GPU上以每帧0.1秒的速度运行,其中预处理耗时20ms,推理耗时80ms,速度是F-PointNet的1.7倍,同时保持最先进的结果。最后,我们提出的架构在推理时只需要2 GB的GPU内存,适合部署在自动驾驶车辆上。

A. Ablation Studies:
Table III shows the effect of varying different hyperparameters on the performance measured by the AP and AHS, number of model parameters, and FLOP count of the proposed architecture. The base network uses hyperparameter values described throughout the paper up to this point, along with the feature extractor of MV3D. We study the effect of the RPN’s input feature vector origin and size on both the proposal recall and final detection AP by training two networks, one using BEV only features and the other using feature crops of size 1×1 as input to the RPN stage. We also study the effect of different bounding box encoding schemes shown in Fig. 4, and the effects of adding an orientation regression output layer on the final detection performance in terms of AP and AHS. Finally, we study the effect of our high-resolution feature extractor, compared to the original one proposed by MV3D.

A. 消融研究:
表III显示了不同超参数变化对性能(以AP和AHS衡量)、模型参数数量以及所提架构FLOP数的影响。基础网络使用本文至此所描述的超参数值,并采用MV3D的特征提取器。我们通过训练两个网络来研究RPN输入特征向量的来源和尺寸对区域建议召回率和最终检测AP的影响:一个仅使用BEV特征,另一个使用1×1大小的特征裁剪作为RPN阶段的输入。我们还研究了图4所示不同边界框编码方案的效果,以及增加方向回归输出层对最终检测性能(AP和AHS)的影响。最后,我们研究了我们的高分辨率特征提取器相对于MV3D原始特征提取器的效果。

RPN Input Variations: Fig. 5 shows the recall vs number of proposals curves for both the original RPN and BEV only RPN without the feature pyramid extractor on the three classes on the validation set. For the pedestrian and cyclist classes, fusing features from both views at the RPN stage is shown to provide a 10.1% and 8.6% increase in recall over the BEV only version at 1024 proposals. Adding our high-resolution feature extractor increases this difference to around 10.5% and 10.8% for the respective classes. For the car class, adding image features as an input to the RPN, or using the high resolution feature extractor does not seem to provide a higher recall value over the BEV only version. We attribute this to the fact that instances from the car class usually occupy a large space in the input BEV map, providing sufficient features in the corresponding output low resolution feature map to reliably generate object proposals. The effect of the increase in proposal recall on the final detection performance can be observed in Table III. Using both image and BEV features at the RPN stage results in a 6.9% and 9.4% increase in AP over the BEV only version for the pedestrian and cyclist classes respectively.

RPN输入变体: 图5展示了在验证集三个类别上,不带特征金字塔提取器的原始RPN和仅使用BEV的RPN的召回率随区域建议数量变化的曲线。对于行人和骑车者类别,在RPN阶段融合两个视图的特征,使1024个区域建议下的召回率比仅使用BEV的版本分别提高了10.1%和8.6%;加入我们的高分辨率特征提取器后,这一差距进一步扩大到约10.5%和10.8%。对于汽车类,将图像特征作为RPN的输入,或使用高分辨率特征提取器,似乎都没有比仅使用BEV的版本带来更高的召回率。我们将其归因于以下事实:汽车类实例通常在输入BEV图中占据较大空间,在对应的低分辨率输出特征图中已能提供足够的特征来可靠地生成区域建议。区域建议召回率提升对最终检测性能的影响可以在表III中看到:在RPN阶段同时使用图像和BEV特征,使行人和骑车者类别的AP相比仅使用BEV的版本分别提高了6.9%和9.4%。

Bounding Box Encoding: We study the effect of different bounding box encodings shown in Fig. 4 by training three additional networks. The first network estimates axis aligned bounding boxes, using the regressed orientation vector as the final box orientation. The second and the third networks use our 4 corner and MV3D's 8 corner encodings without additional orientation estimation as described in Section III-D. As expected, without orientation regression to provide orientation angle correction, the two networks employing the 4 corner and the 8 corner encodings provide a much lower AHS than the base network for all three classes. This phenomenon can be attributed to the loss of orientation information as described in Section III-D.

边界框编码: 我们通过训练三个额外的网络来研究图4中所示不同边界框编码的效果。第一个网络估计轴对齐的边界框,并使用回归得到的方向向量作为最终的框方向。第二和第三个网络分别使用我们的4角点编码和MV3D的8角点编码,但不带第III-D节中描述的额外方向估计。正如预期的那样,在没有方向回归提供方向角校正的情况下,采用4角点和8角点编码的两个网络在全部三个类别上的AHS都远低于基础网络。这一现象可归因于第III-D节中描述的方向信息丢失。

Feature Extractor: We compare the detection results of our feature extractor to that of the base VGG-based feature extractor proposed by MV3D. For the car class, our pyramid feature extractor only achieves a gain of 0.3% in AP and AHS. However, the performance gains on smaller classes are much more substantial. Specifically, we achieve a gain of 19.3% and 8.1% AP on the pedestrian and cyclist classes respectively. This shows that our high-resolution feature extractor is essential to achieve state-of-the-art results on these two classes with a minor increase in computational requirements.

特征提取器: 我们将我们的特征提取器与MV3D提出的基于VGG的基础特征提取器的检测结果进行比较。对于汽车类,我们的金字塔特征提取器在AP和AHS上仅带来0.3%的增益;然而,在较小类别上的性能提升要显著得多:我们在行人和骑车者类别上分别获得了19.3%和8.1%的AP增益。这表明我们的高分辨率特征提取器对于在这两个类别上取得最先进的结果至关重要,而计算需求仅略有增加。
消融实验
TABLE III: A comparison of the performance of different variations of hyperparameters, evaluated on the validation set at moderate difficulty. We use a 3D IoU threshold of 0.7 for the Car class, and 0.5 for the pedestrian and cyclist classes. The effect of variation of hyperparameters on the FLOPs and number of parameters are measured relative to the base network.
表III:在中等难度下于验证集上评估的不同超参数变化的性能比较。汽车类使用0.7的3D IoU阈值,行人和骑车者类使用0.5。超参数变化对FLOPs和参数数量的影响是相对于基础网络计算的。

Qualitative Results: Fig. 6 shows the output of the RPN and the final detections in both 3D and image space. More qualitative results including those of AVOD running in snow and night scenes are provided at https://youtu.be/mDaqKICiHyA.

定性结果: 图6显示了RPN的输出以及3D和图像空间中的最终检测结果。查看更多定性结果(包括AVOD运行在雪地和夜景环境下)的链接:https://youtu.be/mDaqKICiHyA.

定性结果
Fig. 6: Qualitative results of AVOD for cars (top) and pedestrians/cyclists (bottom). Left: 3D region proposal network output, Middle: 3D detection output, and Right: the projection of the detection output onto image space for all three classes. The 3D LIDAR point cloud has been colorized and interpolated for better visualization.
图6:AVOD在汽车(上)和行人/骑车者(下)上的定性结果。左:3D区域建议网络的输出;中:3D检测输出;右:三个类别的检测输出在图像空间上的投影。为便于可视化,3D LIDAR点云经过了着色和插值。

V. CONCLUSION 结论

In this work we proposed AVOD, a 3D object detector for autonomous driving scenarios. The proposed architecture is differentiated from the state of the art by using a high resolution feature extractor coupled with a multimodal fusion RPN architecture, and is therefore able to produce accurate region proposals for small classes in road scenes. Furthermore, the proposed architecture employs explicit orientation vector regression to resolve the ambiguous orientation estimate inferred from a bounding box. Experiments on the KITTI dataset show the superiority of our proposed architecture over the state of the art on the 3D localization, orientation estimation, and category classification tasks. Finally, the proposed architecture is shown to run in real time and with a low memory overhead.

在这项工作中,我们提出了AVOD,一种用于自动驾驶场景的3D目标检测器。所提出的架构与现有先进方法的不同之处在于使用了与多模态融合RPN架构相结合的高分辨率特征提取器,因此能够为道路场景中的小尺寸类别生成精确的区域建议。此外,所提架构采用显式的方向向量回归,以消除从边界框推断方向时的歧义。在KITTI数据集上的实验表明,我们提出的架构在3D定位、方向估计和类别分类任务上优于现有先进方法。最后,所提出的架构能够实时运行,且内存开销较低。

REFERENCES 引用

[1] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354–3361.
[2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards realtime object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28, 2015, pp. 91–99.
[3] B. Li, T. Zhang, and T. Xia, “Vehicle detection from 3d lidar using fully convolutional network,” in Proceedings of Robotics: Science and Systems, AnnArbor, Michigan, June 2016.
[4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-view 3d object detection network for autonomous driving,” in Computer Vision and Pattern Recognition, 2017. CVPR 2017. IEEE Conference on.
[5] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 4.
[6] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals for accurate object class detection,” in NIPS, 2015.
[7] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun, “Monocular 3d object detection for autonomous driving,” in Computer Vision and Pattern Recognition, 2016.
[8] B. Li, “3d fully convolutional network for vehicle detection in point cloud,” in IROS, 2017.
[9] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” arXiv preprint arXiv:1711.06396, 2017.
[10] J. Lahoud and B. Ghanem, “2d-driven 3d object detection in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4622–4630.
[11] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” arXiv preprint arXiv:1711.08488, 2017.
[12] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” arXiv preprint arXiv:1612.00593, 2016.
[13] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teuliere, and T. Chateau, “Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[14] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka, “3d bounding box estimation using deep learning and geometry,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[15] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 808–816.
[16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[17] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for modern convolutional object detectors,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[18] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
[19] “Kitti 3d object detection benchmark,” http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d, accessed: 2018-02-28.
