论文地址
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
0. Abstract 摘要
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues.
点云是一种重要的几何数据结构。由于其不规则的格式,大多数研究者将这类数据转换为规则的三维体素网格或图像集合。然而,这使得数据变得不必要地庞大,并引发了一些问题。
In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, pro- vides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective.
在本文中,我们设计了一种新型神经网络,能够直接处理点云,并很好地保持输入中点的排列不变性。我们的网络名为PointNet,为从物体分类、部件分割到场景语义解析的各种应用提供了统一的架构。尽管结构简单,PointNet却非常高效且有效。
Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.
实验表明,它的表现与当前最先进的方法相当,甚至更优。在理论上,我们还提供了对网络所学内容的分析,并解释了为什么网络对输入的扰动和损坏具有稳健性。
1. Introduction 引言
In this paper we explore deep learning architectures capable of reasoning about 3D geometric data such as point clouds or meshes. Typical convolutional architectures require highly regular input data formats, like those of image grids or 3D voxels, in order to perform weight sharing and other kernel optimizations. Since point clouds or meshes are not in a regular format, most researchers typically transform such data to regular 3D voxel grids or collections of images (e.g, views) before feeding them to a deep net architecture. This data representation transfor- mation, however, renders the resulting data unnecessarily voluminous — while also introducing quantization artifacts that can obscure natural invariances of the data.
在本文中,我们探讨了能够对三维几何数据(如点云或网格)进行推理的深度学习架构。典型的卷积架构需要高度规则的输入数据格式,如图像网格或三维体素,以便进行权重共享和其他卷积核优化。由于点云或网格并不是规则格式,大多数研究者通常将此类数据转换为规则的三维体素网格或图像集合(例如视图)后再输入深度网络架构。然而,这种数据表示转换使得生成的数据变得不必要地庞大,同时也引入了量化误差,可能会掩盖数据的自然不变性。
For this reason we focus on a different input rep- resentation for 3D geometry using simply point clouds
– and name our resulting deep nets PointNets. Point clouds are simple and unified structures that avoid the combinatorial irregularities and complexities of meshes, and thus are easier to learn from. The PointNet, however, still has to respect the fact that a point cloud is just a set of points and therefore invariant to permutations of its members, necessitating certain symmetrizations in the net computation. Further invariances to rigid motions also need to be considered.
因此,我们专注于使用点云作为三维几何数据的不同输入表示形式,并将我们设计的深度网络命名为PointNets。点云是简单且统一的结构,避免了网格的组合不规则性和复杂性,因此更容易进行学习。然而,PointNet必须考虑到点云仅仅是一个点集,因此对其成员的排列不敏感(即排列不变性),这要求在网络计算中引入某些对称化处理。还需要考虑对刚性运动(如旋转和平移)的不变性。
Figure 1. Applications of PointNet. We propose a novel deep net architecture that consumes raw point cloud (set of points) without voxelization or rendering. It is a unified architecture that learns both global and local point features, providing a simple, efficient and effective approach for a number of 3D recognition tasks.
图1. PointNet的应用。我们提出了一种新型深度网络架构,可以直接处理原始点云(点的集合),无需进行体素化或渲染。这是一个统一的架构,能够学习全局和局部的点特征,为多种三维识别任务提供了一种简单、高效且有效的方法。
Our PointNet is a unified architecture that directly takes point clouds as input and outputs either class labels for the entire input or per point segment/part labels for each point of the input. The basic architecture of our network is surprisingly simple as in the initial stages each point is processed identically and independently. In the basic setting each point is represented by just its three coordinates (x, y, z). Additional dimensions may be added by computing normals and other local or global features.
我们的PointNet是一个统一的架构,它直接以点云作为输入,并输出整个输入的类别标签或每个点的分割/部件标签。该网络的基本架构非常简单,在初始阶段,每个点都是以相同方式、独立地进行处理。在基本设置中,每个点仅用它的三个坐标(x, y, z)表示。还可以通过计算法线和其他局部或全局特征来添加更多维度。
Key to our approach is the use of a single symmetric function, max pooling. Effectively the network learns a set of optimization functions/criteria that select interesting or informative points of the point cloud and encode the reason for their selection. The final fully connected layers of the network aggregate these learnt optimal values into the global descriptor for the entire shape as mentioned above (shape classification) or are used to predict per point labels (shape segmentation).
我们方法的关键是使用了一个单一的对称函数——最大池化(max pooling)。该网络实际上学习了一组优化函数/标准,用来选择点云中有趣或有信息的点,并对其选择原因进行编码。网络的最后几层全连接层将这些学到的最优值聚合为整个形状的全局描述符(如前所述的形状分类),或用于预测每个点的标签(形状分割)。
Our input format is easy to apply rigid or affine transfor- mations to, as each point transforms independently. Thus we can add a data-dependent spatial transformer network that attempts to canonicalize the data before the PointNet processes them, so as to further improve the results.
由于每个点是独立变换的,我们的输入格式很容易应用刚性或仿射变换。因此,我们可以添加一个与数据相关的空间变换网络,在PointNet处理数据之前,尝试将数据规范化,以进一步改善结果。
We provide both a theoretical analysis and an ex- perimental evaluation of our approach. We show that our network can approximate any set function that is continuous. More interestingly, it turns out that our network learns to summarize an input point cloud by a sparse set of key points, which roughly corresponds to the skeleton of objects according to visualization. The theoretical analysis provides an understanding why our PointNet is highly robust to small perturbation of input points as well as to corruption through point insertion (outliers) or deletion (missing data).
我们提供了对该方法的理论分析和实验评估。我们证明了我们的网络可以逼近任何连续的集合函数。更有趣的是,网络学会了通过一组稀疏的关键点来总结输入的点云,按照可视化结果,这大致对应于物体的骨架。理论分析解释了为什么我们的PointNet对输入点的小扰动、点插入(异常点)或删除(缺失数据)具有高度鲁棒性。
On a number of benchmark datasets ranging from shape classification, part segmentation to scene segmentation, we experimentally compare our PointNet with state-of- the-art approaches based upon multi-view and volumetric representations. Under a unified architecture, not only is our PointNet much faster in speed, but it also exhibits strong performance on par or even better than state of the art.
在多个基准数据集上,从形状分类、部件分割到场景分割,我们对比了PointNet与基于多视图和体素表示的最先进方法的实验表现。在统一架构下,PointNet不仅速度更快,而且表现与当前最先进的方法相当,甚至更优。
The key contributions of our work are as follows:
- We design a novel deep net architecture suitable for consuming unordered point sets in 3D;
- We show how such a net can be trained to perform 3D shape classification, shape part segmentation and scene semantic parsing tasks;
- We provide thorough empirical and theoretical analy- sis on the stability and efficiency of our method;
- We illustrate the 3D features computed by the selected neurons in the net and develop intuitive explanations for its performance.
我们工作的主要贡献如下:
- 我们设计了一种适合处理三维无序点集的新型深度网络架构;
- 我们展示了如何训练该网络来执行三维形状分类、部件分割和场景语义解析任务;
- 我们提供了对该方法的稳定性和效率的全面实验与理论分析;
- 我们展示了网络中选定神经元计算的三维特征,并为其性能提供了直观的解释。
The problem of processing unordered sets by neural nets is a very general and fundamental problem – we expect that our ideas can be transferred to other domains as well.
通过神经网络处理无序集合是一个非常普遍和基础的问题——我们期待我们的思路可以应用到其他领域。
2. Related Work 相关工作
Point Cloud Features 点云特征
Most existing features for point cloud are handcrafted towards specific tasks. Point features often encode certain statistical properties of points and are designed to be invariant to certain transformations, which are typically classified as intrinsic [2, 24, 3] or extrinsic [20, 19, 14, 10, 5]. They can also be categorized as local features and global features. For a specific task, it is not trivial to find the optimal feature combination.
目前大多数点云特征是为特定任务手工设计的。点特征通常编码点的某些统计属性,并设计为对某些变换不变,这些特征通常被分类为内在特征或外在特征。它们还可以被分类为局部特征和全局特征。对于特定任务,找到最佳特征组合并非易事。
Deep Learning on 3D 三维数据的深度学习
Data 3D data has multiple popular representations, leading to various approaches for learning.
三维数据有多种流行表示形式,导致了多种学习方法。
Volumetric CNNs: [28, 17, 18] are the pioneers applying 3D convolutional neural networks on voxelized shapes. However, volumetric representation is constrained by its resolution due to data sparsity and computation cost of 3D convolution. FPNN [13] and Vote3D [26] proposed special methods to deal with the sparsity problem; however, their operations are still on sparse volumes, it’s challenging for them to process very large point clouds.
体素卷积神经网络:这些文献是将三维卷积神经网络应用于体素化形状的开创者。然而,体素表示受到数据稀疏性和三维卷积计算成本的分辨率限制。FPNN和Vote3D提出了处理稀疏性问题的特殊方法,但它们的操作仍然基于稀疏体素,在处理非常大的点云时具有挑战性。
Multiview CNNs: [23, 18] have tried to render 3D point cloud or shapes into 2D images and then apply 2D conv nets to classify them. With well engineered image CNNs, this line of methods have achieved dominating performance on shape classification and retrieval tasks [21]. However, it’s nontrivial to extend them to scene understanding or other 3D tasks such as point classification and shape completion.
多视图卷积神经网络:这些文献尝试将三维点云或形状渲染为二维图像,然后应用二维卷积网络对其进行分类。通过精心设计的图像卷积神经网络,这类方法在形状分类和检索任务中取得了主导表现。然而,将它们扩展到场景理解或其他三维任务(如点分类和形状完成)并不简单。
Spectral CNNs: Some latest works [4, 16] use spectral CNNs on meshes. However, these methods are currently constrained on manifold meshes such as organic objects and it’s not obvious how to extend them to non-isometric shapes such as furniture.
频谱卷积神经网络:一些最新的研究在网格上使用了频谱卷积神经网络。然而,这些方法目前仅限于流形网格(如有机物体),如何将它们扩展到非等距形状(如家具)并不明确。
Feature-based DNNs: [6, 8] firstly convert the 3D data into a vector, by extracting traditional shape features and then use a fully connected net to classify the shape. We think they are constrained by the representation power of the features extracted.
基于特征的深度神经网络:这些文献首先通过提取传统形状特征,将三维数据转换为向量,然后使用全连接网络对形状进行分类。我们认为这些方法受限于提取特征的表达能力。
Deep Learning on Unordered Sets 无序集合的深度学习
From a data structure point of view, a point cloud is an unordered set of vectors. While most works in deep learning focus on regular input representations like sequences (in speech and language processing), images and volumes (video or 3D data), not much work has been done in deep learning on point sets.
从数据结构的角度看,点云是一个无序的向量集合。尽管深度学习的多数工作集中在有序输入表示(如序列、图像和体积数据)上,但在无序点集上的研究不多。
One recent work from Oriol Vinyals et al [25] looks into this problem. They use a read-process-write network with attention mechanism to consume unordered input sets and show that their network has the ability to sort numbers. However, since their work focuses on generic sets and NLP applications, there lacks the role of geometry in the sets.
Oriol Vinyals等人的一项最近工作研究了这个问题。他们使用了带有注意力机制的读-处理-写网络来处理无序输入集合,并展示了其网络具有排序数字的能力。然而,由于他们的工作集中于通用集合和自然语言处理应用,几何信息在集合中的作用有所欠缺。
3. Problem Statement 问题陈述
We design a deep learning framework that directly consumes unordered point sets as inputs. A point cloud is represented as a set of 3D points {Pi | i = 1, …, n} , where each point Pi is a vector of its (x, y, z) coordinate plus extra feature channels such as color, normal etc. For simplicity and clarity, unless otherwise noted, we only use the (x, y, z) coordinate as our point’s channels.
我们设计了一个深度学习框架,能够直接以无序点集作为输入。一个点云表示为一组三维点
P
i
P_i
Pi,其中
i
=
1
,
.
.
.
,
n
i = 1, ..., n
i=1,...,n,每个点
P
i
P_i
Pi 都是一个向量,包含其
(
x
,
y
,
z
)
(x, y, z)
(x,y,z) 坐标以及其他特征通道(如颜色、法线等)。为了简化和清晰起见,除非特别说明,否则我们仅使用
(
x
,
y
,
z
)
(x, y, z)
(x,y,z) 坐标作为点的特征通道。
For the object classification task, the input point cloud is either directly sampled from a shape or pre-segmented from a scene point cloud. Our proposed deep network outputs k scores for all the k candidate classes. For semantic segmentation, the input can be a single object for part region segmentation, or a sub-volume from a 3D scene for object region segmentation. Our model will output n m scores for each of the n points and each of the m semantic sub- categories.
对于物体分类任务,输入的点云可以直接从一个形状中采样,或者从场景点云中预分割得到。我们提出的深度网络会为所有
k
k
k 个候选类别输出
k
k
k 个得分。对于语义分割任务,输入可以是单个对象用于部件区域分割,或来自三维场景的子体积用于物体区域分割。我们的模型将输出每个点
n
n
n 个得分,分别对应
m
m
m 个语义子类别中的每一个。
Figure 2. PointNet Architecture. The classification network takes n points as input, applies input and feature transformations, and then aggregates point features by max pooling. The output is classification scores for k classes. The segmentation network is an extension to the classification net. It concatenates global and local features and outputs per point scores. “mlp” stands for multi-layer perceptron, numbers in bracket are layer sizes. Batchnorm is used for all layers with ReLU. Dropout layers are used for the last mlp in classification net.
图2. PointNet架构。分类网络以
n
n
n 个点作为输入,应用输入和特征变换,然后通过最大池化汇聚点特征。输出为
k
k
k 个类别的分类得分。分割网络是分类网络的扩展,它将全局和局部特征进行拼接,并为每个点输出得分。“mlp” 代表多层感知机,括号中的数字表示层的大小。所有层都使用了带ReLU的Batchnorm,分类网络中的最后一个多层感知机使用了Dropout层。
4. Deep Learning on Point Sets 点集上的深度学习
The architecture of our network (Sec 4.2) is inspired by the properties of point sets in Rn (Sec 4.1).
我们的网络架构(第4.2节)受到
R
n
R^n
Rn 中点集性质的启发(第4.1节)。
4.1. Properties of Point Sets in R n R^n Rn R n R^n Rn 中点集的性质
Our input is a subset of points from an Euclidean space.It has three main properties:
我们的输入是一部分来自欧氏空间的点集,它具有以下三个主要性质:
-
Unordered. Unlike pixel arrays in images or voxel arrays in volumetric grids, point cloud is a set of points without specific order. In other words, a network that consumes N 3D point sets needs to be invariant to N ! permutations of the input set in data feeding order.
无序性:与图像中的像素数组或体素网格中的体素数组不同,点云是一个没有特定顺序的点集。换句话说,能够处理 N N N 个三维点集的网络需要对输入集在数据输入顺序上的 N ! N! N! 种排列保持不变性。 -
Interaction among points. The points are from a space with a distance metric. It means that points are not isolated, and neighboring points form a meaningful subset. Therefore, the model needs to be able to capture local structures from nearby points, and the combinatorial interactions among local structures.
点之间的交互:这些点来自一个具有距离度量的空间。这意味着点并不是孤立的,邻近的点形成了有意义的子集。因此,模型需要能够捕捉来自邻近点的局部结构以及局部结构之间的组合交互。 -
Invariance under transformations. As a geometric object, the learned representation of the point set should be invariant to certain transformations. For example, rotating and translating points all together should not modify the global point cloud category nor the segmentation of the points.
变换下的不变性:作为几何对象,点集的学习表示应该对某些变换保持不变。例如,将所有点同时旋转或平移不应改变点云的整体类别,也不应影响点的分割结果。
4.2. PointNet Architecture PointNet 架构
Our full network architecture is visualized in Fig 2, where the classification network and the segmentation network share a great portion of structures. Please read the caption of Fig 2 for the pipeline.
我们的完整网络架构在图2中可视化展示,其中分类网络和分割网络在结构上有很大部分是共享的。请参考图2的说明以了解具体流程。
Our network has three key modules: the max pooling layer as a symmetric function to aggregate information from all the points, a local and global information combination structure, and two joint alignment networks that align both input points and point features.
我们的网络有三个关键模块:最大池化层(作为一个对称函数,用于聚合所有点的信息)、局部和全局信息的组合结构、以及两个联合对齐网络,用于对齐输入点和点特征。
We will discuss our reason behind these design choices in separate paragraphs below.
在下文中,我们将逐段讨论这些设计选择背后的原因。
Symmetry Function for Unordered Input 无序输入的对称函数
In order to make a model invariant to input permutation, three strategies exist: 1) sort input into a canonical order; 2) treat the input as a sequence to train an RNN, but augment the training data by all kinds of permutations; 3) use a simple symmetric function to aggregate the information from each point. Here, a symmetric function takes n vectors as input and outputs a new vector that is invariant to the input order. For example, + and * operators are symmetric binary functions.
为了使模型对输入的排列保持不变性,有三种策略:1)将输入排序为标准顺序;2)将输入视为序列来训练一个RNN,并通过各种排列来扩充训练数据;3)使用简单的对称函数来聚合每个点的信息。这里,对称函数接收
n
n
n 个向量作为输入并输出一个对输入顺序不变的新向量。例如,加法和乘法操作都是对称的二元函数。
While sorting sounds like a simple solution, in high dimensional space there in fact does not exist an ordering that is stable w.r.t. point perturbations in the general sense. This can be easily shown by contradiction. If such an ordering strategy exists, it defines a bijection map between a high-dimensional space and a 1d real line. It is not hard to see, to require an ordering to be stable w.r.t point perturbations is equivalent to requiring that this map preserves spatial proximity as the dimension reduces, a task that cannot be achieved in the general case. Therefore, sorting does not fully resolve the ordering issue, and it’s hard for a network to learn a consistent mapping from input to output as the ordering issue persists. As shown in experiments (Fig 5), we find that applying a MLP directly on the sorted point set performs poorly, though slightly better than directly processing an unsorted input.
虽然排序看起来是一个简单的解决方案,但在高维空间中,事实上不存在一种对点扰动稳定的通用排序方法。这可以通过反证法容易地证明。如果这种排序策略存在,它将定义一个从高维空间到一维实数线的双射映射。不难看出,要求排序对点扰动稳定等同于要求此映射在降维时保留空间的邻近性,而这是一般情况下无法实现的。因此,排序无法完全解决顺序问题,并且由于顺序问题的存在,网络难以学习一致的输入输出映射。如实验(图5)所示,我们发现直接对排序后的点集应用MLP表现较差,尽管略优于直接处理未排序输入。
The idea to use RNN considers the point set as a sequential signal and hopes that by training the RNN with randomly permuted sequences, the RNN will become invariant to input order. However in “OrderMatters” [25] the authors have shown that order does matter and cannot be totally omitted. While RNN has relatively good robustness to input ordering for sequences with small length (dozens), it’s hard to scale to thousands of input elements, which is the common size for point sets. Empirically, we have also shown that model based on RNN does not perform as well as our proposed method (Fig 5).
使用RNN的方法将点集视为序列信号,并希望通过对随机排列的序列训练RNN,使其对输入顺序不敏感。然而,在“OrderMatters”[25]中,作者已表明顺序确实重要,无法完全忽略。尽管RNN在处理小长度序列(数十个元素)时对输入顺序有相对较好的鲁棒性,但难以扩展到成千上万个输入元素,这是点集的常见规模。通过实验证明,基于RNN的模型表现不如我们提出的方法(图5)。
Our idea is to approximate a general function defined on a point set by applying a symmetric function on transformed elements in the set:
我们的想法是,通过在集合中的转换元素上应用对称函数,来近似定义在点集上的通用函数:
f
(
{
x
1
,
…
,
x
n
}
)
≈
g
(
h
(
x
1
)
,
…
,
h
(
x
n
)
)
,
(1)
f ( \{x_{1}, \ldots, x_{n} \} ) \approx g ( h ( x_{1} ), \ldots, h ( x_{n} ) ) , \tag{1}
f({x1,…,xn})≈g(h(x1),…,h(xn)),(1)where
f
:
2
R
N
→
R
,
h
:
R
N
→
R
K
f : 2^{\mathbb{R}^N} \to \mathbb{R}, \ h : \mathbb{R}^N \to \mathbb{R}^K
f:2RN→R, h:RN→RK and
g
:
R
K
×
⋯
×
R
K
⏟
n
→
R
g : \underbrace{\mathbb{R}^K \times \dots \times \mathbb{R}^K}_{n} \to \mathbb{R}
g:n
RK×⋯×RK→R is a symmetric function.
Empirically, our basic module is very simple: we approximate h by a multi-layer perceptron network and g by a composition of a single variable function and a max pooling function. This is found to work well by experiments. Through a collection of h, we can learn a number of f ’s to capture different properties of the set.
从经验上看,我们的基本模块非常简单:我们使用多层感知机网络近似函数
h
h
h,并使用单变量函数与最大池化函数的组合来近似函数
g
g
g。实验表明,这种方法效果良好。通过一系列的
h
h
h,我们可以学习多个
f
f
f,从而捕捉点集的不同性质。
While our key module seems simple, it has interesting properties (see Sec 5.3) and can achieve strong performace (see Sec 5.1) in a few different applications. Due to the simplicity of our module, we are also able to provide theoretical analysis as in Sec 4.3.
虽然我们的关键模块看起来很简单,但它具有一些有趣的性质(见第5.3节),并且在多个不同的应用中能够实现强大的性能(见第5.1节)。由于这个模块的简洁性,我们还能够进行理论分析,如在第4.3节中所述。
Local and Global Information Aggregation 局部与全局信息聚合
The output from the above section forms a vector [f1, . . . , fK], which is a global signature of the input set. We can easily train a SVM or multi-layer perceptron classifier on the shape global features for classification. However, point segmentation requires a combination of local and global knowledge. We can achieve this by a simple yet highly effective manner.
前述部分的输出形成了一个向量
[
f
1
,
…
,
f
K
]
[f_1, \dots, f_K]
[f1,…,fK],这是输入点集的全局特征表示。我们可以轻松地在这些形状的全局特征上训练一个SVM或多层感知机分类器进行分类。然而,点分割任务要求结合局部和全局信息。这一点我们可以通过一种简单而高效的方法来实现。
Our solution can be seen in Fig 2 (Segmentation Net- work). After computing the global point cloud feature vec- tor, we feed it back to per point features by concatenating the global feature with each of the point features. Then we extract new per point features based on the combined point features - this time the per point feature is aware of both the local and global information.
我们的解决方案可以在图2的分割网络中看到。在计算出全局点云特征向量后,我们将其与每个点的特征进行拼接,然后基于这种组合后的点特征提取新的每点特征——这一次,每个点的特征同时包含了局部和全局信息。
With this modification our network is able to predict per point quantities that rely on both local geometry and global semantics. For example we can accurately predict per-point normals (fig in supplementary), validating that the network is able to summarize information from the point’s local neighborhood. In experiment session, we also show that our model can achieve state-of-the-art performance on shape part segmentation and scene segmentation.
通过这一修改,我们的网络能够预测依赖于局部几何和全局语义的每点信息。例如,我们可以准确预测每个点的法线(见附录中的图),这验证了网络能够总结来自点局部邻域的信息。在实验部分,我们还展示了我们的模型在形状部件分割和场景分割任务上实现了当前最先进的性能。
Joint Alignment Network 联合对齐网络
The semantic labeling of a point cloud has to be invariant if the point cloud undergoes certain geometric transformations, such as rigid transforma- tion. We therefore expect that the learnt representation by our point set is invariant to these transformations.
如果点云经历某些几何变换(如刚性变换),其语义标注必须保持不变。因此,我们希望通过点集学习到的表示能够对这些变换保持不变。
A natural solution is to align all input set to a canonical space before feature extraction. Jaderberg et al. [9] introduces the idea of spatial transformer to align 2D images through sampling and interpolation, achieved by a specifically tailored layer implemented on GPU.
一种自然的解决方案是,在特征提取之前,将所有输入点集对齐到一个标准空间。Jaderberg等人[9]引入了空间变换器的概念,通过采样和插值对二维图像进行对齐,这一过程是通过在GPU上实现的特定定制层完成的。
Our input form of point clouds allows us to achieve this goal in a much simpler way compared with [9]. We do not need to invent any new layers and no alias is introduced as in the image case. We predict an affine transformation matrix by a mini-network (T-net in Fig 2) and directly apply this transformation to the coordinates of input points. The mini- network itself resembles the big network and is composed by basic modules of point independent feature extraction, max pooling and fully connected layers. More details about the T-net are in the supplementary.
相比于[9]中的方法,我们的点云输入形式使得我们能够以更简单的方式实现这一目标。我们不需要发明任何新的层,也不会像图像处理中那样引入混叠效应。我们通过一个小型网络(图2中的T-net)预测一个仿射变换矩阵,并将该变换直接应用于输入点的坐标。这个小型网络本身类似于大网络,由点独立的特征提取模块、最大池化层和全连接层组成。关于T-net的更多细节见附录。
This idea can be further extended to the alignment of feature space, as well. We can insert another alignment net- work on point features and predict a feature transformation matrix to align features from different input point clouds. However, transformation matrix in the feature space has much higher dimension than the spatial transform matrix, which greatly increases the difficulty of optimization. We therefore add a regularization term to our softmax training loss. We constrain the feature transformation matrix to be close to orthogonal matrix:
这个思路也可以进一步扩展到特征空间的对齐。我们可以在点特征上插入另一个对齐网络,并预测一个特征变换矩阵,以对齐来自不同输入点云的特征。然而,特征空间中的变换矩阵维度比空间变换矩阵高得多,这大大增加了优化的难度。因此,我们在softmax训练损失中加入了一个正则项,限制特征变换矩阵接近于正交矩阵:
L reg = ∥ I − A A T ∥ F 2 , (2) L_{\text{reg}} = \|I - AA^T\|_F^2, \tag{2} Lreg=∥I−AAT∥F2,(2)
where A is the feature alignment matrix predicted by a mini-network. An orthogonal transformation will not lose information in the input, thus is desired. We find that by adding the regularization term, the optimization becomes more stable and our model achieves better performance.
其中,
A
A
A 是由一个小型网络预测的特征对齐矩阵。正交变换不会丢失输入中的信息,因此这是理想的。我们发现,通过添加这个正则项,优化过程变得更加稳定,且我们的模型性能有所提升。
4.3. Theoretical Analysis 理论分析
Universal approximation 通用逼近性
We first show the universal approximation ability of our neural network to continuous set functions. By the continuity of set functions, intuitively, a small perturbation to the input point set should not greatly change the function values, such as classification or segmentation scores.
我们首先展示了我们的神经网络对连续集合函数的通用逼近能力。通过集合函数的连续性,直观上,输入点集的微小扰动不应显著改变函数值,比如分类或分割的得分。
Formally, let
X
=
{
S
:
S
⊆
[
0
,
1
]
m
and
∣
S
∣
=
n
}
\mathcal{X} = \{S : S \subseteq [0, 1]^m \text{ and } |S| = n \}
X={S:S⊆[0,1]m and ∣S∣=n},
f
:
X
→
R
f : \mathcal{X} \to \mathbb{R}
f:X→R is a continuous set function on
X
\mathcal{X}
X w.r.t to Hausdorff distance
d
H
(
⋅
,
⋅
)
d_H(\cdot, \cdot)
dH(⋅,⋅), i.e.,
∀
ϵ
>
0
\forall \epsilon > 0
∀ϵ>0,
∃
δ
>
0
\exists \delta > 0
∃δ>0, for any
S
,
S
′
∈
X
S, S' \in \mathcal{X}
S,S′∈X, if
d
H
(
S
,
S
′
)
<
δ
d_H(S, S') < \delta
dH(S,S′)<δ, then
∣
f
(
S
)
−
f
(
S
′
)
∣
<
ϵ
|f(S) - f(S')| < \epsilon
∣f(S)−f(S′)∣<ϵ. Our theorem says that
f
f
f can be arbitrarily approximated by our network given enough neurons at the max pooling layer, i.e.,
K
K
K in (1) is sufficiently large.
正式地,令
X
=
{
S
:
S
⊆
[
0
,
1
]
m
且
∣
S
∣
=
n
}
\mathcal{X} = \{S : S \subseteq [0, 1]^m \text{ 且 } |S| = n \}
X={S:S⊆[0,1]m 且 ∣S∣=n},
f
:
X
→
R
f : \mathcal{X} \to \mathbb{R}
f:X→R 是关于Hausdorff距离
d
H
(
⋅
,
⋅
)
d_H(\cdot, \cdot)
dH(⋅,⋅) 的连续集合函数,即,
∀
ϵ
>
0
\forall \epsilon > 0
∀ϵ>0,
∃
δ
>
0
\exists \delta > 0
∃δ>0,对于任意的
S
,
S
′
∈
X
S, S' \in \mathcal{X}
S,S′∈X,如果
d
H
(
S
,
S
′
)
<
δ
d_H(S, S') < \delta
dH(S,S′)<δ,则
∣
f
(
S
)
−
f
(
S
′
)
∣
<
ϵ
|f(S) - f(S')| < \epsilon
∣f(S)−f(S′)∣<ϵ。我们的定理表明,给定最大池化层有足够多的神经元(即(1)式中的
K
K
K 足够大),则可以通过我们的网络对
f
f
f 进行任意精度的逼近。
Figure 3. Qualitative results for part segmentation. We visualize the CAD part segmentation results across all 16 object categories. We show both results for partial simulated Kinect scans (left block) and complete ShapeNet CAD models (right block).
图3. 部件分割的定性结果。我们展示了在所有16个物体类别中的CAD部件分割结果。图中显示了部分模拟的Kinect扫描结果(左侧块)和完整的ShapeNet CAD模型分割结果(右侧块)。
Theorem 1. Suppose
f
:
X
→
R
f : \mathcal{X} \to \mathbb{R}
f:X→R is a continuous set function w.r.t Hausdorff distance
d
H
(
⋅
,
⋅
)
d_H(\cdot, \cdot)
dH(⋅,⋅).
∀
ϵ
>
0
\forall \epsilon > 0
∀ϵ>0,
∃
\exists
∃ a continuous function
h
h
h and a symmetric function
g
(
x
1
,
…
,
x
n
)
=
γ
o MAX
g(x_1, \dots, x_n) = \gamma \text{ o MAX}
g(x1,…,xn)=γ o MAX, such that for any
S
∈
X
S \in \mathcal{X}
S∈X,
定理 1. 假设
f
:
X
→
R
f : \mathcal{X} \to \mathbb{R}
f:X→R 是关于Hausdorff距离
d
H
(
⋅
,
⋅
)
d_H(\cdot, \cdot)
dH(⋅,⋅) 的连续集合函数。对于任意
ϵ
>
0
\epsilon > 0
ϵ>0,存在一个连续函数
h
h
h 和一个对称函数
g
(
x
1
,
…
,
x
n
)
=
γ
∘
MAX
g(x_1, \dots, x_n) = \gamma \circ \text{MAX}
g(x1,…,xn)=γ∘MAX,使得对于任何
S
∈
X
S \in \mathcal{X}
S∈X,满足
∣ f ( S ) − γ ( max x i ∈ S { h ( x i ) } ) ∣ < ϵ | f(S) - \gamma \left( \max_{x_i \in S} \{ h(x_i) \} \right) | < \epsilon ∣f(S)−γ(xi∈Smax{h(xi)})∣<ϵ
where
x
1
,
…
,
x
n
x_1, \ldots, x_n
x1,…,xn is the full list of elements in
S
S
S ordered arbitrarily,
γ
\gamma
γ is a continuous function, and MAX is a vector max operator that takes
n
n
n vectors as input and returns a new vector of the element-wise maximum.
其中,
x
1
,
…
,
x
n
x_1, \dots, x_n
x1,…,xn 是
S
S
S 中所有元素的任意排列,
γ
\gamma
γ 是一个连续函数,MAX 是一个向量最大值操作符,它接收
n
n
n 个向量作为输入,并返回一个按元素取最大值的新向量。
The proof to this theorem can be found in our supple- mentary material. The key idea is that in the worst case the network can learn to convert a point cloud into a volumetric representation, by partitioning the space into equal-sized voxels. In practice, however, the network learns a much smarter strategy to probe the space, as we shall see in point function visualizations.
该定理的证明可以在我们的附录材料中找到。关键思想是,在最坏情况下,网络可以通过将空间划分为等大小的体素,将点云转换为体积表示。然而,实际中,网络会学习一种更智能的空间探索策略,如我们在点函数的可视化中所见。
Bottleneck dimension and stability 瓶颈维度与稳定性
Theoretically and experimentally, we find that the expressiveness of our network is strongly affected by the dimension of the max pooling layer, i.e.,
K
K
K in (1). Here we provide an analysis, which also reveals properties related to the stability of our model.
从理论和实验上来看,我们发现网络的表达能力受到最大池化层维度的强烈影响,即(1)式中的
K
K
K。在此,我们提供一个分析,揭示模型稳定性相关的性质。
We define
u
=
max
x
i
∈
S
{
h
(
x
i
)
}
u = \max_{x_i \in S} \{h(x_i)\}
u=maxxi∈S{h(xi)} to be the sub-network of
f
f
f which maps a point set in
[
0
,
1
]
m
[0, 1]^m
[0,1]m to a
K
K
K-dimensional vector. The following theorem tells us that small corruptions or extra noise points in the input set are not likely to change the output of our network:
我们定义
u
=
max
x
i
∈
S
{
h
(
x
i
)
}
u = \max_{x_i \in S} \{h(x_i)\}
u=maxxi∈S{h(xi)} 为
f
f
f 的子网络,它将
[
0
,
1
]
m
[0, 1]^m
[0,1]m 中的点集映射到一个
K
K
K 维向量。以下定理表明,输入集中的小扰动或额外噪声点不太可能改变网络的输出:
Theorem 2. Suppose u : X → R K u : \mathcal{X} \to \mathbb{R}^K u:X→RK such that u = max x i ∈ S { h ( x i ) } u = \max_{x_i \in S} \{h(x_i)\} u=maxxi∈S{h(xi)} and f = γ ∘ u f = \gamma \circ u f=γ∘u. Then,
(a) ∀ S , ∃ c S , N S ⊆ X , f ( T ) = f ( S ) \forall S, \ \exists c_S, N_S \subseteq \mathcal{X}, \ f(T) = f(S) ∀S, ∃cS,NS⊆X, f(T)=f(S) if c S ⊆ T ⊆ N S ; c_S \subseteq T \subseteq N_S; cS⊆T⊆NS;
(b) ∣ c S ∣ ≤ K |c_S| \leq K ∣cS∣≤K
We explain the implications of the theorem. (a) says that
f
(
S
)
f(S)
f(S) is unchanged up to the input corruption if all points in
C
S
C_S
CS are preserved; it is also unchanged with extra noise points up to
N
S
N_S
NS. (b) says that
C
S
C_S
CS only contains a bounded number of points, determined by
K
K
K in (1). In other words,
f
(
S
)
f(S)
f(S) is in fact totally determined by a finite subset
C
S
⊆
S
C_S \subseteq S
CS⊆S of less or equal to
K
K
K elements. We therefore call
C
S
C_S
CS the critical point set of
S
S
S and
K
K
K the bottleneck dimension of
f
f
f.
我们解释该定理的含义。(a)指出,如果
C
S
C_S
CS 中的所有点都被保留,那么即使输入受到一些损坏,
f
(
S
)
f(S)
f(S) 也不会发生变化;同样,即使有额外的噪声点,只要不超过
N
S
N_S
NS,
f
(
S
)
f(S)
f(S) 也保持不变。(b)说明
C
S
C_S
CS 仅包含有限数量的点,其数量由 (1) 式中的
K
K
K 决定。换句话说,
f
(
S
)
f(S)
f(S) 实际上完全由一个有限子集
C
S
⊆
S
C_S \subseteq S
CS⊆S 确定,且这个子集的元素数量不超过
K
K
K。因此,我们称
C
S
C_S
CS 为
S
S
S 的关键点集,而
K
K
K 为
f
f
f 的瓶颈维度。
Combined with the continuity of
h
h
h, this explains the robustness of our model w.r.t point perturbation, corruption and extra noise points. The robustness is gained in analogy to the sparsity principle in machine learning models. Intuitively, our network learns to summarize a shape by a sparse set of key points. In the experiment section we see that the key points form the skeleton of an object.
结合
h
h
h 的连续性,这解释了我们的模型在点扰动、损坏和额外噪声点方面的鲁棒性。这种鲁棒性与机器学习模型中的稀疏性原则类似。**直观上,我们的网络学习通过一组稀疏的关键点来概括形状。**在实验部分,我们可以看到这些关键点构成了物体的“骨架”。
Table 1. Classification results on ModelNet40. Our net achieves state-of-the-art among deep nets on 3D input.
表1. ModelNet40上的分类结果。我们的网络在三维输入的深度网络中达到了当前最先进的性能。
5. Experiment 实验
Experiments are divided into four parts. First, we show PointNets can be applied to multiple 3D recognition tasks (Sec 5.1). Second, we provide detailed experiments to validate our network design (Sec 5.2). At last we visualize what the network learns (Sec 5.3) and analyze time and space complexity (Sec 5.4).
实验分为四个部分。首先,我们展示了PointNet可以应用于多种三维识别任务(第5.1节)。其次,我们提供了详细的实验来验证网络设计(第5.2节)。最后,我们可视化网络的学习内容(第5.3节)并分析时间和空间复杂性(第5.4节)。
5.1. Applications 应用
In this section we show how our network can be trained to perform 3D object classification, object part segmentation, and semantic scene segmentation. Even though we are working on a brand new data representation (point sets), we are able to achieve comparable or even better performance on benchmarks for several tasks.
在本节中,我们展示了如何训练我们的网络来执行三维物体分类、物体部件分割和语义场景分割。尽管我们采用了一种全新的数据表示形式(点集),但在多个任务的基准测试中,我们仍然能够实现相当或更好的性能。
3D Object Classification 三维物体分类
Our network learns global point cloud feature that can be used for object classification. We evaluate our model on the ModelNet40 [28] shape classification benchmark. There are 12,311 CAD models from 40 man-made object categories, split into 9,843 for training and 2,468 for testing. While previous methods focus on volumetric and mult-view image representations, we are the first to directly work on raw point cloud.
我们的网络学习了全局点云特征,可用于物体分类。我们在ModelNet40 [28] 形状分类基准测试上评估了模型性能。该数据集包含来自40个人工物体类别的12,311个CAD模型,划分为9,843个用于训练和2,468个用于测试。此前的方法主要关注体素和多视角图像表示,而我们是首个直接在原始点云上进行操作的方法。
We uniformly sample 1024 points on mesh faces accord- ing to face area and normalize them into a unit sphere. During training we augment the point cloud on-the-fly by randomly rotating the object along the up-axis and jitter the position of each points by a Gaussian noise with zero mean and 0.02 standard deviation.
我们在网格面上根据面面积均匀采样1024个点,并将它们归一化到单位球体。在训练过程中,通过随机沿上轴旋转物体和对每个点的位置进行高斯噪声(均值为0,标准差为0.02)的扰动,动态增强点云。
In Table 1, we compare our model with previous works as well as our baseline using MLP on traditional features extracted from point cloud (point density, D2, shape contour etc.). Our model achieved state-of-the-art performance among methods based on 3D input (volumetric and point cloud). With only fully connected layers and max pooling, our net gains a strong lead in inference speed and can be easily parallelized in CPU as well. There is still a small gap between our method and multi-view based method (MVCNN [23]), which we think is due to the loss of fine geometry details that can be captured by rendered images.
在表1中,我们将模型与先前的工作以及使用点云传统特征(如点密度、D2、形状轮廓等)通过MLP的基线进行比较。我们的模型在基于三维输入(体素和点云)的方法中实现了最先进的性能。仅使用全连接层和最大池化,我们的网络在推理速度上具有显著优势,也可以轻松在CPU上并行化。尽管与多视角方法(MVCNN [23])之间仍有小差距,我们认为这可能是由于图像渲染捕获的精细几何细节损失所致。
3D Object Part Segmentation 物体部件分割
Part segmentation is a challenging fine-grained 3D recognition task. Given a 3D scan or a mesh model, the task is to assign part category label (e.g. chair leg, cup handle) to each point or face.
部件分割是一项具有挑战性的细粒度三维识别任务。给定一个三维扫描或网格模型,任务是为每个点或面分配部件类别标签(如椅子腿、杯子手柄)。
We evaluate on ShapeNet part data set from [29], which contains 16,881 shapes from 16 categories, annotated with 50 parts in total. Most object categories are labeled with two to five parts. Ground truth annotations are labeled on sampled points on the shapes.
我们在ShapeNet部件数据集上进行评估,该数据集包含来自16个类别的16,881个形状,总共标注了50种部件。大多数物体类别标注了两到五个部件。真实标签是对形状上采样点的标注。
We formulate part segmentation as a per-point classifi- cation problem. Evaluation metric is mIoU on points. For each shape S of category C, to calculate the shape’s mIoU: For each part type in category C, compute IoU between groundtruth and prediction. If the union of groundtruth and prediction points is empty, then count part IoU as 1. Then we average IoUs for all part types in category C to get mIoU for that shape. To calculate mIoU for the category, we take average of mIoUs for all shapes in that category.
我们将部件分割设定为每点分类问题,评价指标是点上的平均交并比(mIoU)。对于类别
C
C
C 中的每个形状
S
S
S,计算其mIoU的方法是:对于类别
C
C
C 中的每个部件类型,计算预测和真实标签之间的IoU。如果预测和真实标签的点集合为空,则该部件的IoU计为1。然后,平均类别
C
C
C 中所有部件类型的IoU以获得该形状的mIoU。对于类别的mIoU,我们取该类别中所有形状的mIoU平均值。
In this section, we compare our segmentation version PointNet (a modified version of Fig 2, Segmentation Network) with two traditional methods [27] and [29] that both take advantage of point-wise geometry features and correspondences between shapes, as well as our own 3D CNN baseline. See supplementary for the detailed modifications and network architecture for the 3D CNN.
在本节中,我们将分割版PointNet(图2中的分割网络的改进版)与两种传统方法[27]和[29]进行比较,这两种方法均利用了点的几何特征以及形状之间的对应关系。同时,我们还对比了我们自己的3D CNN基线。详细的修改和3D CNN的网络架构请见附录。
In Table 2, we report per-category and mean IoU(%) scores. We observe a 2.3% mean IoU improvement and our net beats the baseline methods in most categories.
在表2中,我们报告了每个类别和总体的IoU(%)得分。我们观察到平均IoU提升了2.3%,并且我们的网络在大多数类别中优于基线方法。
Table 2. Segmentation results on ShapeNet part dataset. Metric is mIoU(%) on points. We compare with two traditional methods [27] and [29] and a 3D fully convolutional network baseline proposed by us. Our PointNet method achieved the state-of-the-art in mIoU.
表2. ShapeNet部件数据集上的分割结果。评价指标是点上的平均交并比(mIoU%)。我们与两种传统方法[27]和[29]以及我们提出的3D全卷积网络基线进行了比较。我们的PointNet方法在mIoU上实现了当前最先进的性能。
We also perform experiments on simulated Kinect scans to test the robustness of these methods. For every CAD model in the ShapeNet part data set, we use Blensor Kinect Simulator [7] to generate incomplete point clouds from six random viewpoints. We train our PointNet on the complete shapes and partial scans with the same network architecture and training setting. Results show that we lose only 5.3% mean IoU. In Fig 3, we present qualitative results on both complete and partial data. One can see that though partial data is fairly challenging, our predictions are reasonable.
我们还在模拟的Kinect扫描数据上进行了实验,以测试这些方法的鲁棒性。对于ShapeNet部件数据集中的每个CAD模型,我们使用Blensor Kinect模拟器[7]从六个随机视角生成不完整的点云。我们在完整形状和部分扫描数据上使用相同的网络架构和训练设置来训练PointNet。结果显示,我们的平均IoU仅下降了5.3%。在图3中,我们展示了完整和部分数据的定性结果。可以看到,尽管部分数据具有相当的挑战性,我们的预测仍然合理。
Semantic Segmentation in Scenes 场景中的语义分割
Our network on part segmentation can be easily extended to semantic scene segmentation, where point labels become semantic object classes instead of object part labels.
我们的部件分割网络可以很容易地扩展到场景中的语义分割,其中点的标签变为语义物体类别(如椅子、桌子)而非物体部件标签。
We experiment on the Stanford 3D semantic parsing data set [1]. The dataset contains 3D scans from Matterport scanners in 6 areas including 271 rooms. Each point in the scan is annotated with one of the semantic labels from 13 categories (chair, table, floor, wall etc. plus clutter).
我们在Stanford 3D语义解析数据集[1]上进行了实验。该数据集包含来自Matterport扫描仪的三维扫描数据,涵盖6个区域中的271个房间。扫描中每个点都标注了13个类别之一的语义标签(如椅子、桌子、地板、墙壁等,外加杂物)。
To prepare training data, we firstly split points by room, and then sample rooms into blocks with area 1m by 1m. We train our segmentation version of PointNet to predict per point class in each block. Each point is represented by a 9-dim vector of XYZ, RGB and normalized location as to the room (from 0 to 1). At training time, we randomly sample 4096 points in each block on-the-fly. At test time, we test on all the points. We follow the same protocol as [1] to use k-fold strategy for train and test.
为准备训练数据,我们首先按房间分割点,然后将房间分割成1米×1米的区域块。我们训练PointNet的分割版本以预测每个块中每点的类别。每个点由一个9维向量表示,包括XYZ坐标、RGB值和归一化位置(0到1之间)。在训练时,我们动态采样每个块中的4096个点;测试时,测试所有点。我们采用与[1]相同的k折策略进行训练和测试。
We compare our method with a baseline using hand- crafted point features. The baseline extracts the same 9- dim local features and three additional ones: local point density, local curvature and normal. We use standard MLP as the classifier. Results are shown in Table 3, where our PointNet method significantly outperforms the baseline method. In Fig 4, we show qualitative segmentation results. Our network is able to output smooth predictions and is robust to missing points and occlusions.
我们将方法与使用手工设计点特征的基线进行比较。该基线提取了相同的9维局部特征以及三个额外的特征:局部点密度、局部曲率和法线。我们使用标准MLP作为分类器。结果如表3所示,PointNet方法显著优于基线方法。在图4中,我们展示了定性的分割结果。我们的网络能够输出平滑的预测,并且对缺失点和遮挡具有较强的鲁棒性。
Based on the semantic segmentation output from our network, we further build a 3D object detection system using connected component for object proposal (see sup- plementary for details). We compare with previous state- of-the-art method in Table 4. The previous method is based on a sliding shape method (with CRF post processing) with SVMs trained on local geometric features and global room context feature in voxel grids. Our method outperforms it by a large margin on the furniture categories reported.
基于网络的语义分割输出,我们进一步使用连通组件生成物体候选区域,构建了一个三维物体检测系统(详情见附录)。我们在表4中与先前的最先进方法进行了比较。先前的方法基于滑动窗口方法(配合CRF后处理),并在体素网格中的局部几何特征和全局房间上下文特征上训练SVM。我们的方法在家具类别上显著优于先前的方法。
Table 3. Results on semantic segmentation in scenes. Metric is average IoU over 13 classes (structural and furniture elements plus clutter) and classification accuracy calculated on points.
表3. 场景中的语义分割结果。评价指标为13个类别(结构元素、家具元素和杂物)的平均IoU和基于点的分类准确率。
Table 4. Results on 3D object detection in scenes. Metric is average precision with threshold IoU 0.5 computed in 3D volumes.
表4. 场景中三维物体检测的结果。评价指标为平均精度(AP),基于3D体积计算,IoU阈值为0.5。
Figure 4. Qualitative results for semantic segmentation. Top row is input point cloud with color. Bottom row is output semantic segmentation result (on points) displayed in the same camera viewpoint as input.
图4. 语义分割的定性结果。上排为带颜色的输入点云,下排为输出的语义分割结果(在点上),与输入显示相同的摄像机视角。
5.2. Architecture Design Analysis 架构设计分析
In this section we validate our design choices by control experiments. We also show the effects of our network’s hyperparameters.
在本节中,我们通过控制实验验证设计选择的合理性。同时,我们展示了网络超参数的影响。
Comparison with Alternative Order-invariant Methods 与其他顺序不变方法的比较
As mentioned in Sec 4.2, there are at least three options for consuming unordered set inputs. We use the ModelNet40 shape classification problem as a test bed for comparisons of those options, the following two control experiment will also use this task.
如第4.2节所述,处理无序集合输入至少有三种方案。我们使用ModelNet40形状分类问题作为测试基准,以比较这些方案,接下来的两个控制实验也将使用此任务。
The baselines (illustrated in Fig 5) we compared with include multi-layer perceptron on unsorted and sorted points as nX3 arrays, RNN model that considers input point as a sequence, and a model based on symmetry functions. The symmetry operation we experimented include max pooling, average pooling and an attention based weighted sum. The attention method is similar to that in [25], where a scalar score is predicted from each point feature, then the score is normalized across points by computing a softmax. The weighted sum is then computed on the normalized scores and the point features. As shown in Fig 5, max- pooling operation achieves the best performance by a large winning margin, which validates our choice.
我们比较的基线方法(如图5所示)包括:将无序和有序点作为
n
×
3
n \times 3
n×3 数组输入到多层感知机模型;将输入点视为序列的RNN模型;以及基于对称函数的模型。我们实验的对称操作包括最大池化、平均池化和基于注意力的加权求和。注意力方法类似于[25]中的方法,其中每个点特征预测出一个标量分数,然后通过softmax在点之间对分数进行归一化,最后使用归一化分数与点特征计算加权和。如图5所示,最大池化操作取得了最佳性能,并且优势显著,这验证了我们的选择。
Figure 5. Three approaches to achieve order invariance. Multi- layer perceptron (MLP) applied on points consists of 5 hidden layers with neuron sizes 64,64,64,128,1024, all points share a single copy of MLP. The MLP close to the output consists of two layers with sizes 512,256.
图5. 实现顺序不变性的三种方法。应用于点的多层感知机(MLP)包含5个隐藏层,神经元数量分别为64、64、64、128、1024,所有点共享同一个MLP副本。接近输出的MLP由两个层组成,大小分别为512和256。
Effectiveness of Input and Feature Transformations 输入和特征变换的有效性
In Table 5 we demonstrate the positive effects of our input and feature transformations (for alignment). It’s interesting to see that the most basic architecture already achieves quite reasonable results. Using input transformation gives a 0.8% performance boost. The regularization loss is necessary for the higher dimension transform to work. By combining both transformations and the regularization term, we achieve the best performance.
在表5中,我们展示了输入和特征变换(用于对齐)对性能的积极影响。值得注意的是,最基本的架构已经能取得相当合理的结果。使用输入变换可提升0.8%的性能。正则化损失对于高维变换的正常工作是必要的。通过结合输入和特征变换以及正则化项,我们获得了最佳性能。
Robustness Test 鲁棒性测试
We show our PointNet, while simple and effective, is robust to various kinds of input corruptions. We use the same architecture as in Fig 5’s max pooling network. Input points are normalized into a unit sphere. Results are in Fig 6.
我们展示了PointNet虽然简单高效,但对各种输入损坏具有很强的鲁棒性。我们使用与图5的最大池化网络相同的架构。输入点被归一化到单位球体中。结果如图6所示。
As to missing points, when there are 50% points missing, the accuracy only drops by 2.4% and 3.8% w.r.t. furthest and random input sampling. Our net is also robust to outlier points, if it has seen those during training. We evaluate two models: one trained on points with (x, y, z) coordinates; the other on (x, y, z) plus point density. The net has more than 80% accuracy even when 20% of the points are outliers. Fig 6 right shows the net is robust to point perturbations.
对于缺失点情况,当缺失50%的点时,相对于最远点采样和随机采样,准确率分别仅下降了2.4%和3.8%。如果训练时见过这些噪声点,我们的网络也对离群点具有鲁棒性。我们评估了两种模型:一种基于
(
x
,
y
,
z
)
(x, y, z)
(x,y,z) 坐标训练,另一种基于
(
x
,
y
,
z
)
(x, y, z)
(x,y,z) 加上点密度训练。即使20%的点是离群点,网络的准确率也超过80%。图6右侧显示了网络对点扰动的鲁棒性。
Table 5. Effects of input feature transforms. Metric is overall classification accuracy on ModelNet40 test set.
表5. 输入特征变换的效果。评价指标是ModelNet40测试集上的整体分类准确率。
Figure 6. PointNet robustness test. The metric is overall
classification accuracy on ModelNet40 test set. Left: Delete points. Furthest means the original 1024 points are sampled with furthest sampling. Middle: Insertion. Outliers uniformly scattered in the unit sphere. Right: Perturbation. Add Gaussian noise to each point independently.
图6. PointNet的鲁棒性测试。评价指标是ModelNet40测试集上的整体分类准确率。左图:删除点。Furthest表示原始的1024个点通过最远点采样获得。中图:插入离群点,离群点均匀散布在单位球体中。右图:扰动,向每个点独立添加高斯噪声。
5.3. Visualizing PointNet 可视化PointNet
In Fig 7, we visualize critical point sets
C
S
C_S
CS and upper-bound shapes
N
S
N_S
NS (as discussed in Thm 2) for some sample shapes
S
S
S. The point sets between the two shapes will give exactly the same global shape feature
f
(
S
)
f(S)
f(S).
在图7中,我们对一些样本形状
S
S
S 可视化了其关键点集
C
S
C_S
CS 和上界形状
N
S
N_S
NS(如定理2中所讨论)。这两种形状之间的点集将生成完全相同的全局形状特征
f
(
S
)
f(S)
f(S)。
We can see clearly from Fig 7 that the critical point sets
C
S
C_S
CS, those contributed to the max pooled feature, summarizes the skeleton of the shape. The upper-bound shapes
N
S
N_S
NS illustrates the largest possible point cloud that give the same global shape feature
f
(
S
)
f(S)
f(S) as the input point cloud
S
S
S.
C
S
C_S
CS and
N
S
N_S
NS reflect the robustness of PointNet, meaning that losing some non-critical points does not change the global shape signature
f
(
S
)
f(S)
f(S) at all.
从图7可以清楚地看到,关键点集
C
S
C_S
CS(对最大池化特征有贡献的点)概括了形状的骨架。上界形状
N
S
N_S
NS 则展示了可能生成与输入点云
S
S
S 相同的全局形状特征
f
(
S
)
f(S)
f(S) 的最大点云。
C
S
C_S
CS 和
N
S
N_S
NS 反映了 PointNet 的鲁棒性,这意味着丢失一些非关键点对全局形状特征
f
(
S
)
f(S)
f(S) 完全没有影响。
The
N
S
N_S
NS is constructed by forwarding all the points in an edge-length-2 cube through the network and select points
p
p
p whose point function values
(
h
1
(
p
)
,
h
2
(
p
)
,
…
,
h
K
(
p
)
)
(h_1(p), h_2(p), \ldots, h_K(p))
(h1(p),h2(p),…,hK(p)) are no larger than the global shape descriptor.
N
S
N_S
NS 是通过将边长为2的立方体中的所有点输入网络构建的,选择点
p
p
p 使其点函数值
(
h
1
(
p
)
,
h
2
(
p
)
,
…
,
h
K
(
p
)
)
(h_1(p), h_2(p), \ldots, h_K(p))
(h1(p),h2(p),…,hK(p)) 不大于全局形状描述符。
Figure 7. Critical points and upper bound shape. While critical points jointly determine the global shape feature for a given shape, any point cloud that falls between the critical points set and the upper bound shape gives exactly the same feature. We color-code all figures to show the depth information.
图7. 关键点和上界形状。关键点共同决定了给定形状的全局形状特征,任何位于关键点集和上界形状之间的点云都会生成完全相同的特征。所有图形使用颜色编码以显示深度信息。
5.4. Time and Space Complexity Analysis 时间和空间复杂度分析
Table 6 summarizes space (number of parameters in the network) and time (floating-point operations/sample) complexity of our classification PointNet. We also compare PointNet to a representative set of volumetric and multi-view based architectures in previous works.
表6总结了我们分类网络PointNet的空间(网络中的参数数量)和时间(每样本的浮点运算数)复杂度。同时,我们还将PointNet与以前工作的代表性体素和多视角架构进行了比较。
While MVCNN [23] and Subvolume (3D CNN) [18] achieve high performance, PointNet is orders more efficient in computational cost (measured in FLOPs/sample:
141
×
141\times
141× and
8
×
8\times
8× more efficient, respectively). Besides, PointNet is much more space efficient than MVCNN in terms of #param in the network (
17
×
17\times
17× less parameters). Moreover, PointNet is much more scalable — its space and time complexity is
O
(
N
)
O(N)
O(N) — linear in the number of input points. However, since convolution dominates computing time, multi-view method’s time complexity grows squarely on image resolution and volumetric convolution based method grows cubically with the volume size.
虽然MVCNN [23]和Subvolume(3D CNN)[18]在性能上表现良好,但PointNet在计算成本上更具效率(以FLOPs/样本衡量,分别提高了
141
×
141\times
141×和
8
×
8\times
8×的效率)。此外,PointNet在参数数量方面也比MVCNN节省了大量空间(参数减少了
17
×
17\times
17×)。更重要的是,PointNet具有更好的可扩展性——其空间和时间复杂度为
O
(
N
)
O(N)
O(N),即与输入点的数量成线性关系。然而,由于卷积占主导地位,多视角方法的时间复杂度随图像分辨率平方增长,而基于体素卷积的方法则随体积大小立方增长。
Empirically, PointNet is able to process more than one million points per second for point cloud classification (around 1K objects/second) or semantic segmentation (around 2 rooms/second) with a 1080X GPU on TensorFlow, showing great potential for real-time applications.
实验结果显示,使用1080X GPU和TensorFlow,PointNet能够以每秒超过百万点的速度处理点云分类(约每秒1,000个对象)或语义分割(约每秒2个房间),这显示了其在实时应用中的巨大潜力。
Table 6. Time and space complexity of deep architectures for 3D data classification. PointNet (vanilla) is the classification PointNet without input and feature transformations. FLOP stands for floating-point operation. The “M” stands for million. Subvolume and MVCNN used pooling on input data from multiple rotations or views, without which they have much inferior performance.
表6. 用于三维数据分类的深度架构的时间和空间复杂度。PointNet(vanilla)是指不带输入和特征变换的分类PointNet。FLOP表示浮点运算,“M”表示百万。Subvolume和MVCNN对来自多个旋转或视角的输入数据使用了池化,否则其性能会显著下降。
6. Conclusion 结论
In this work, we propose a novel deep neural network PointNet that directly consumes point cloud. Our network provides a unified approach to a number of 3D recognition tasks including object classification, part segmentation and semantic segmentation, while obtaining on par or better results than state of the arts on standard benchmarks. We also provide theoretical analysis and visualizations towards understanding of our network.
在本研究中,我们提出了一种新颖的深度神经网络PointNet,可以直接处理点云数据。我们的网络为多种三维识别任务(包括物体分类、部件分割和语义分割)提供了统一的方法,同时在标准基准测试上获得了与当前最先进方法相当或更好的结果。我们还提供了理论分析和可视化,以帮助理解我们的网络。
Acknowledgement. The authors gratefully acknowledge the support of a Samsung GRO grant, ONR MURI N00014- 13-1-0341 grant, NSF grant IIS-1528025, a Google Fo- cused Research Award, a gift from the Adobe corporation and hardware donations by NVIDIA.
致谢 感谢三星GRO资助、ONR MURI N00014-13-1-0341资助、NSF IIS-1528025资助、谷歌重点研究奖、Adobe公司的捐赠以及NVIDIA的硬件捐赠对本研究的支持。