SECOND: Sparsely Embedded Convolutional Detection
Paper: SECOND
The paper was published by 主线科技 (TrunkTech), the company founded by Bo Li after he left Baidu.
Abstract
Problems:
- Inference speed is slow
- Orientation estimation performs poorly
Methods:
- An improved sparse convolution method that significantly speeds up training and inference
- A novel angle loss regression method that improves orientation estimation
- A new data augmentation method that improves convergence speed and performance
Results:
- State-of-the-art performance on KITTI at the time
- Relatively fast inference: 20 FPS for the larger model and 40 FPS for the smaller one
Introduction
VoxelNet: a single-stage, end-to-end network with good performance, but very time-consuming; its inference speed is too slow.
- The paper introduces the SECOND method, which applies sparse convolution to address this problem, together with a GPU-based rule-generation algorithm for sparse convolution that further accelerates it.
- Another advantage of point clouds is that objects can be scaled, rotated, and translated by transforming their points directly, so the paper introduces a data augmentation method that significantly improves convergence speed and final performance.
- A novel angle loss regression method that resolves the large loss incurred when the difference between the ground-truth and predicted angles equals π.
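The effect of this loss can be sketched with scalars; the paper's actual loss is vectorized over all positive anchors, and the function names here are illustrative:

```python
import math

def smooth_l1(x, beta=1.0):
    # Standard smooth-L1 (Huber-style) loss on a scalar.
    ax = abs(x)
    return 0.5 * x * x / beta if ax < beta else ax - 0.5 * beta

def sine_error_loss(theta_gt, theta_pred):
    # SECOND regresses sin(theta_gt - theta_pred): when the two angles
    # differ by pi, sin(pi) = 0, so opposite-facing boxes are no longer
    # punished with a huge loss.
    return smooth_l1(math.sin(theta_gt - theta_pred))

print(sine_error_loss(math.pi, 0.0))  # effectively 0
print(sine_error_loss(0.5, 0.0))      # a normal positive loss
```

Because this loss treats opposite orientations as identical, the paper adds a separate direction classifier to recover the facing direction.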
Related Work
Front-View- and Image-Based Methods
Image-based methods: Monocular 3D object detection for autonomous driving; 3D bounding box estimation using deep learning and geometry
Front-view-based methods: Vehicle detection from 3D lidar using fully convolutional network
Bird’s-Eye-View-Based Methods
MV3D, Complex-YOLO, PIXOR
3D-Based Methods
VoteNet, Vote3deep, PointNet, PointNet++, PointCNN, 3D FCN, VoxelNet
Fusion-Based Methods
Deep sliding shapes for amodal 3D object detection in RGB-D images, MV3D, Joint 3D Proposal Generation and Object Detection from View Aggregation, F-PointNet
SECOND Detector
Network Architecture
The SECOND detector consists of:
- a voxelwise feature extractor
- a sparse convolutional middle layer
- an RPN
Point Cloud Grouping
Car detection: points are cropped at [−3, 1] × [−40, 40] × [0, 70.4] m along the z × y × x axes
Pedestrian and cyclist detection: points are cropped at [−3, 1] × [−20, 20] × [0, 48] m
Smaller model: only points within [−3, 1] × [−32, 32] × [0, 52.8] m are used
Voxel size: vD = 0.4 × vH = 0.2 × vW = 0.2 m
Max points per voxel: T = 35 for car detection, T = 45 for pedestrian and cyclist detection
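The cropping and grouping step can be sketched as follows; `group_points` and its argument layout are assumptions for illustration, not the paper's API:

```python
import numpy as np

def group_points(points, pc_range, voxel_size, max_points=35):
    # Assign each (x, y, z) point to a voxel; keep at most `max_points`
    # points per voxel, as in SECOND (T = 35 for cars, 45 otherwise).
    pc_range = np.asarray(pc_range, dtype=np.float64)   # [x0,y0,z0,x1,y1,z1]
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    grid = np.round((pc_range[3:] - pc_range[:3]) / voxel_size).astype(int)
    voxels = {}
    for p in points:
        if np.any(p < pc_range[:3]) or np.any(p >= pc_range[3:]):
            continue  # crop points outside the detection range
        idx = tuple(((p - pc_range[:3]) / voxel_size).astype(int))
        bucket = voxels.setdefault(idx, [])
        if len(bucket) < max_points:
            bucket.append(p)
    return grid, voxels

# Car-detection range and voxel size yield a 352 x 400 x 10 grid.
grid, voxels = group_points(np.array([[10.0, 0.0, 0.0]]),
                            [0, -40, -3, 70.4, 40, 1], [0.2, 0.2, 0.4])
print(grid)  # [352 400  10]
```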
Voxelwise Feature Extractor
This part follows VoxelNet's voxel feature encoding (VFE) layers.
The resulting feature tensor has dimensions:
Car: D' × H' × W' = 10 × 400 × 352
Pedestrian, cyclist: D' × H' × W' = 10 × 200 × 240
Small model: D' × H' × W' = 10 × 320 × 264
Sparse Convolutional Middle Extractor
Sparse Convolution Algorithm
See 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks for how sparse convolution is implemented.
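A toy 2D illustration of the gather-GEMM-scatter idea behind sparse convolution (here in submanifold style, producing outputs only at active sites): for each kernel offset, a "rule" lists which input features contribute to which outputs. All names are illustrative; the real implementation builds these rule tables on the GPU.

```python
import numpy as np

def sparse_conv2d(features, coords, weights):
    # features: (N, c_in) feature vectors of the N active sites.
    # coords:   list of N (x, y) integer coordinates.
    # weights:  (k, k, c_in, c_out) dense filter.
    k = weights.shape[0] // 2
    idx_of = {c: i for i, c in enumerate(coords)}
    out = np.zeros((len(coords), weights.shape[-1]))
    for dx in range(-k, k + 1):
        for dy in range(-k, k + 1):
            # Rule for this offset: (input index, output index) pairs.
            pairs = [(i, idx_of[(x + dx, y + dy)])
                     for (x, y), i in idx_of.items()
                     if (x + dx, y + dy) in idx_of]
            if not pairs:
                continue
            ins, outs = zip(*pairs)
            # Gather the inputs, multiply by this offset's filter slice
            # (GEMM), and scatter-add into the outputs.
            np.add.at(out, list(outs),
                      features[list(ins)] @ weights[dx + k, dy + k])
    return out
```

Because computation only touches active sites and each offset's work is one dense matrix multiply, this avoids convolving over the overwhelmingly empty voxel grid.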
Region Proposal Network
An RPN structure similar to SSD's.
Anchors and Targets
Car anchor: w = 1.6 × l = 3.9 × h = 1.56 m, centered at z = −1.0 m
Pedestrian anchor: w = 0.6 × l = 0.8 × h = 1.73 m, centered at z = −0.6 m
Cyclist anchor: w = 0.6 × l = 1.76 × h = 1.73 m, centered at z = −0.6 m
Car matching: positive above IoU 0.6, negative below 0.45, ignored in between
Pedestrian and cyclist matching: positive above IoU 0.5, negative below 0.35, ignored in between
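The two-threshold matching can be sketched as follows (function name and label encoding are illustrative):

```python
def assign_anchors(ious, pos_thresh=0.6, neg_thresh=0.45):
    # Per-anchor assignment as in SECOND: IoU above pos_thresh is a
    # positive, below neg_thresh a negative, in between is ignored
    # during training. Thresholds shown are the car-class values.
    labels = []
    for iou in ious:
        if iou >= pos_thresh:
            labels.append(1)    # positive anchor
        elif iou < neg_thresh:
            labels.append(0)    # negative anchor
        else:
            labels.append(-1)   # ignored
    return labels

print(assign_anchors([0.7, 0.5, 0.2]))  # [1, -1, 0]
```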
Training and Inference
Loss
Sine-Error Loss for Angle Regression
Focal Loss for Classification
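SECOND adopts the focal loss for classification with α = 0.25 and γ = 2; a scalar sketch:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t): the (1 - p_t)^gamma
    # factor down-weights easy, well-classified examples so the many
    # background anchors do not dominate training.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```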
Total Training Loss
Data Augmentation
Sample Ground Truths from the Database
- Build an offline database from the training set containing all ground-truth labels and their associated point cloud data
- During training, randomly select some ground-truth samples from this database and place them into the current training point cloud
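A minimal BEV sketch of the sampling step; the function names, box layout `(cx, cy, w, l)`, and axis-aligned collision test are simplifying assumptions:

```python
import random

def boxes_overlap(a, b):
    # Axis-aligned BEV overlap test; box = (cx, cy, w, l).
    return (abs(a[0] - b[0]) * 2 < a[2] + b[2] and
            abs(a[1] - b[1]) * 2 < a[3] + b[3])

def sample_ground_truths(scene_boxes, database, num_samples=10, seed=0):
    # Draw (box, points) entries from the offline database and paste only
    # the ones that do not collide with boxes already in the scene,
    # mirroring the paper's collision check after sampling.
    rng = random.Random(seed)
    added = []
    for box, pts in rng.sample(database, min(num_samples, len(database))):
        placed = scene_boxes + [b for b, _ in added]
        if all(not boxes_overlap(box, b) for b in placed):
            added.append((box, pts))
    return added
```

Pasting extra ground truths into every scene raises the number of positive anchors per iteration, which is what speeds up convergence.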
Object Noise
Each ground-truth box and its associated points are perturbed independently: a random rotation with ∆θ sampled uniformly from [−π/2, π/2] and a random linear translation sampled from a Gaussian distribution with zero mean and standard deviation 1.0.
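A minimal 2D (BEV) sketch of this per-object noise, assuming `points` are already grouped per object; the collision check the paper applies afterwards is omitted:

```python
import math
import random

def perturb_object(points, center, rng=random):
    # Rotate the object's points about its own center by
    # dtheta ~ U[-pi/2, pi/2] and translate them by Gaussian noise
    # (zero mean, std 1.0), as described above.
    dtheta = rng.uniform(-math.pi / 2, math.pi / 2)
    dx, dy = rng.gauss(0, 1.0), rng.gauss(0, 1.0)
    c, s = math.cos(dtheta), math.sin(dtheta)
    out = []
    for x, y in points:
        rx, ry = x - center[0], y - center[1]          # object frame
        out.append((center[0] + c * rx - s * ry + dx,  # rotate + translate
                    center[1] + s * rx + c * ry + dy))
    return out
```

Rotating about each object's own center keeps the object rigid: pairwise distances between its points are preserved.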
Global Rotation and Scaling
Global scaling and rotation are applied to the whole point cloud and all ground-truth boxes, with scaling sampled from the uniform distribution [0.95, 1.05] and rotation from [−π/4, π/4].
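A BEV sketch of the global augmentation, with an assumed box layout `(cx, cy, w, l, yaw)`; one scale and one rotation are shared by every point and box in the scene:

```python
import math
import random

def global_augment(points, boxes, rng=random):
    # One global scale ~ U[0.95, 1.05] and rotation ~ U[-pi/4, pi/4]
    # applied jointly to all points and ground-truth boxes.
    scale = rng.uniform(0.95, 1.05)
    theta = rng.uniform(-math.pi / 4, math.pi / 4)
    c, s = math.cos(theta), math.sin(theta)
    tf = lambda x, y: ((c * x - s * y) * scale, (s * x + c * y) * scale)
    new_points = [tf(x, y) for x, y in points]
    new_boxes = [tf(bx, by) + (bw * scale, bl * scale, byaw + theta)
                 for bx, by, bw, bl, byaw in boxes]
    return new_points, new_boxes
```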
Optimization
Adam optimizer, trained on a GTX 1080 Ti GPU for 160 epochs
Initial learning rate 0.0002, with an exponential decay factor of 0.8 applied every 15 epochs
Weight decay 0.0001, beta1 = 0.9, beta2 = 0.999
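The stated schedule is a step-wise exponential decay; as a formula:

```python
def learning_rate(epoch, base_lr=2e-4, decay=0.8, step=15):
    # Multiply the base learning rate by 0.8 every 15 epochs,
    # matching the training setup described above.
    return base_lr * decay ** (epoch // step)

print(learning_rate(0))   # 0.0002
print(learning_rate(15))  # 0.0002 * 0.8
```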
Network Details
RPN details:
Experiments
Training set: 3712 samples; evaluation set: 3769 samples
Difficulty levels: easy, moderate, and hard
Evaluation Using the KITTI Test Set
3D detection performance:
BEV detection performance:
Evaluation Using the KITTI Validation Set
3D detection performance:
BEV detection performance:
Ablation Studies
Sparse Convolution Performance
Sampling Ground Truths for Faster Convergence
Conclusions
- An improved sparse convolution method that significantly speeds up training and inference
- A novel angle loss regression method that improves orientation estimation
- A new data augmentation method that improves convergence speed and performance