PointPillars: Fast Encoders for Object Detection from Point Clouds (Paper Notes)

Latest recommended article published on 2023-12-05 07:00:00

Tianchao龙虾

Category: 3D Object Detection Paper Notes. Tags: deep learning, artificial intelligence, computer vision

Article link: https://blog.csdn.net/wuchaohuo724/article/details/115693167

PointPillars: Fast Encoders for Object Detection from Point Clouds

Paper link: https://arxiv.org/abs/1812.05784

I. Problem Statement

The authors argue that the two existing ways of encoding a point cloud cannot deliver accuracy and speed at the same time, so they propose a new encoding called PointPillars.

voxel (fixed, hand-crafted encoding):
The point cloud is organized in voxels, and the set of voxels in each vertical column is encoded into a fixed-length, hand-crafted feature encoding to form a pseudo-image that can be processed by a standard image detection architecture.
Typical methods of this kind: MV3D, AVOD, PIXOR, Complex-YOLO
point cloud (learned encoding):
PointNets are applied to voxels, which are then processed by a set of 3D convolutional layers followed by a 2D backbone and a detection head.
Typical methods of this kind: VoxelNet, SECOND

II. Direction

The paper proposes a new encoder that uses PointNets to learn a representation of point clouds organized in vertical columns (pillars). The detection method is trained end to end and uses only 2D convolutional layers.

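III. Method

1. Network architecture

PointPillars takes point clouds as input and outputs oriented 3D boxes. It consists of three main stages:
1. A feature encoder network: converts the point cloud to a sparse pseudo-image
2. A 2D convolutional backbone: processes the pseudo-image
3. A detection head: detects and regresses 3D boxes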

(1) A feature encoder network

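The first stage converts the point cloud to a pseudo-image.
A point is represented as $(x, y, z, r)$, where $r$ is the reflectance.

Discretizing the point cloud
The point cloud is discretized into an evenly spaced grid in the x-y plane, which creates a set of pillars $P$ with $|P| = B$. No hyperparameter is needed to control the binning along the z axis.
The points in each pillar are then augmented with $x_c, y_c, z_c, x_p, y_p$: the subscript c denotes the distance to the arithmetic mean of all points in the pillar, and the subscript p denotes the offset from the pillar's $x, y$ center. Each point therefore has D = 9 dimensions: $(x, y, z, r, x_c, y_c, z_c, x_p, y_p)$.
The number of points differs from pillar to pillar: some pillars are very sparse and others are dense. Pillars with too many points are randomly sampled and pillars with too few are zero-padded, so that the point cloud becomes a dense tensor of size $(D, P, N)$.
D is the number of channels
P is the number of non-empty pillars
N is the number of points per pillar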
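
The pillarization steps above can be sketched in NumPy as follows. This is only an illustration, not the authors' implementation: the function name `pillarize` and its parameters (`x_range`, `y_range`, `pillar_size`, `max_points`, `seed`) are hypothetical, and the mean used for $x_c, y_c, z_c$ is taken over the points kept after sampling, which is an assumption.

```python
import numpy as np

def pillarize(points, x_range, y_range, pillar_size, max_points, seed=0):
    """points: (M, 4) array of x, y, z, r.

    Returns a (9, P, N) tensor and the (P, 2) integer grid indices (ix, iy)
    of the non-empty pillars. All names here are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Keep only points inside the x-y detection range.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[mask]
    # Grid index of every point in the x-y plane; there is no binning along z.
    ix = ((points[:, 0] - x_range[0]) / pillar_size).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / pillar_size).astype(np.int64)
    cells, inverse = np.unique(np.stack([ix, iy], axis=1), axis=0,
                               return_inverse=True)
    inverse = inverse.reshape(-1)  # some NumPy versions return a 2-D inverse
    origin = np.array([x_range[0], y_range[0]], dtype=np.float64)

    tensor = np.zeros((9, len(cells), max_points), dtype=np.float32)
    for p in range(len(cells)):
        pts = points[inverse == p]
        if len(pts) > max_points:  # too many points: random sampling
            pts = pts[rng.choice(len(pts), max_points, replace=False)]
        mean = pts[:, :3].mean(axis=0)                    # for x_c, y_c, z_c
        centre = origin + (cells[p] + 0.5) * pillar_size  # for x_p, y_p
        feats = np.concatenate(
            [pts, pts[:, :3] - mean, pts[:, :2] - centre], axis=1)  # (n, 9)
        tensor[:, p, :len(pts)] = feats.T  # remaining slots stay zero-padded
    return tensor, cells
```

For a cloud whose points fall into two grid cells and `max_points=4`, the returned tensor has shape `(9, 2, 4)`, and unused point slots are left at zero.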
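A simplified PointNet (a linear layer followed by BN and ReLU, applied to every point) then generates a tensor of size $(C, P, N)$, and a max operation over the N points of each pillar reduces it to a tensor of size $(C, P)$.
Finally, the features are scattered back to their original pillar locations to form a pseudo-image of size $(C, H, W)$, where H is the height and W is the width of the canvas.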
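
The encoding and scatter steps can be sketched as below. This is a simplified illustration rather than the paper's code: `encode_pillars`, `scatter_to_image`, `weight` and `bias` are hypothetical names, BatchNorm is left out, and zero-padded points are not masked (a real implementation would usually mask them so that they cannot win the max).

```python
import numpy as np

def encode_pillars(tensor, weight, bias):
    """tensor: (D, P, N); weight: (C, D); bias: (C,). Returns (C, P)."""
    # Shared linear layer applied to every point: (C, D) x (D, P, N) -> (C, P, N).
    x = np.einsum("cd,dpn->cpn", weight, tensor) + bias[:, None, None]
    x = np.maximum(x, 0.0)  # ReLU (BatchNorm is omitted in this sketch)
    return x.max(axis=2)    # max over the points of each pillar -> (C, P)

def scatter_to_image(features, coords, H, W):
    """features: (C, P); coords: (P, 2) integer (ix, iy). Returns (C, H, W)."""
    image = np.zeros((features.shape[0], H, W), dtype=features.dtype)
    # Row index is iy, column index is ix; empty pillars remain zero.
    image[:, coords[:, 1], coords[:, 0]] = features
    return image
```

Because only non-empty pillars are written, the pseudo-image is mostly zeros for a typical sparse lidar sweep.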

(2) 2D convolutional backbone

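The backbone is similar to the one in VoxelNet and has two sub-networks:

top-down
This network produces features at increasingly small spatial resolution.
upsample and concatenation
This network upsamples the top-down features and concatenates them.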
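
The data flow of the two sub-networks can be illustrated with a shape-only sketch. The learned layers are replaced by stand-ins here: average pooling instead of strided convolution blocks and nearest-neighbour repetition instead of transposed convolutions. The function names and the `strides` default are assumptions for illustration only.

```python
import numpy as np

def downsample(x, stride):
    """Stand-in for a strided conv block: average-pool (C, H, W) by `stride`."""
    C, H, W = x.shape
    return x.reshape(C, H // stride, stride, W // stride, stride).mean(axis=(2, 4))

def upsample(x, factor):
    """Stand-in for a transposed conv: nearest-neighbour upsampling."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def backbone(pseudo_image, strides=(2, 2, 2)):
    # Top-down path: every block lowers the spatial resolution further.
    feats, x = [], pseudo_image
    for s in strides:
        x = downsample(x, s)
        feats.append(x)
    # Second path: bring every level to the resolution of the first level,
    # then concatenate along the channel axis.
    target_h = feats[0].shape[1]
    ups = [upsample(f, target_h // f.shape[1]) for f in feats]
    return np.concatenate(ups, axis=0)
```

With a `(4, 16, 16)` input and the default strides, the three levels have spatial sizes 8, 4 and 2, and the concatenated output has shape `(12, 8, 8)`. H and W must be divisible by the product of the strides in this sketch.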

(3) Detection head

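The detection head uses the SSD setup for 3D object detection. As in SSD, prior boxes are matched to the ground truth using 2D IoU. Bounding box height and elevation are not used for matching; instead, given a 2D match, they become additional regression targets.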
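
A minimal sketch of 2D IoU matching between anchors and ground truth in the bird's-eye view is given below. It assumes axis-aligned boxes in the form `(x_min, y_min, x_max, y_max)` and ignores rotation, which is a simplification. The function names and the threshold defaults `pos_thr` and `neg_thr` are illustrative assumptions.

```python
def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_anchors(anchors, gt_boxes, pos_thr=0.6, neg_thr=0.45):
    """Label each anchor: index of its matched ground-truth box (positive),
    -1 (negative), or None (ignored because its best IoU lies between the
    two thresholds)."""
    labels = []
    for a in anchors:
        ious = [bev_iou(a, g) for g in gt_boxes]
        best = max(range(len(ious)), key=ious.__getitem__) if ious else None
        if best is not None and ious[best] >= pos_thr:
            labels.append(best)
        elif best is None or ious[best] < neg_thr:
            labels.append(-1)
        else:
            labels.append(None)
    return labels
```

Height plays no role in the match; for a positive anchor it would be handled by the regression targets only.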

2. Objective function

The loss functions are the same as in SECOND. Ground truth boxes and anchors are defined by $(x,y,z,w,l,h,\theta)$.

$L_{total}=\frac{1}{N_{pos}}(\beta_{loc} L_{loc} + \beta_{cls} L_{cls} + \beta_{dir} L_{dir})$

where $N_{pos}$ is the number of positive anchors.
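
The weighted sum can be written as a small helper. The function name is hypothetical; the default weights are the values reported in the paper ($\beta_{loc}=2$, $\beta_{cls}=1$, $\beta_{dir}=0.2$), and clamping $N_{pos}$ to at least 1 is an added assumption to avoid division by zero on frames without positive anchors.

```python
def total_loss(loc_loss, cls_loss, dir_loss, num_pos,
               beta_loc=2.0, beta_cls=1.0, beta_dir=0.2):
    """Combine the summed localization, classification and direction losses."""
    num_pos = max(num_pos, 1)  # assumption: guard against frames with no positives
    return (beta_loc * loc_loss + beta_cls * cls_loss + beta_dir * dir_loss) / num_pos
```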

(1) localization loss

$\begin{matrix} \Delta x = \frac{x^{gt}-x^a}{d^a}, \quad \Delta y = \frac{y^{gt}-y^a}{d^a}, \quad \Delta z = \frac{z^{gt}-z^a}{h^a} \\ \\ \Delta w = \log \frac{w^{gt}}{w^a}, \quad \Delta l = \log \frac{l^{gt}}{l^a}, \quad \Delta h = \log \frac{h^{gt}}{h^a} \\ \\ \Delta \theta = \sin(\theta^{gt}-\theta^a) \end{matrix}$
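
As a worked illustration, the residuals can be computed as follows. The sketch follows the paper's definition, in which $d^a = \sqrt{(w^a)^2 + (l^a)^2}$ is the diagonal of the anchor's bird's-eye-view footprint and the z residual is normalized by the anchor height $h^a$. The function name `encode_box` is hypothetical.

```python
import math

def encode_box(gt, anchor):
    """gt, anchor: (x, y, z, w, l, h, theta). Returns the 7 regression residuals."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = math.hypot(wa, la)  # diagonal of the anchor's BEV footprint
    return ((xg - xa) / da, (yg - ya) / da, (zg - za) / ha,
            math.log(wg / wa), math.log(lg / la), math.log(hg / ha),
            math.sin(tg - ta))
```

Since $\sin$ cannot distinguish a box from the same box flipped by $\pi$, this residual alone does not fix the heading, which is why the total loss contains a separate direction term.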