论文翻译 | PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation


前言：最近在看点云方面的论文，点云的文章往往看完没多久就忘。PointNet 是点云领域最经典的论文之一，网上已有不少对 PointNet 的详细解读，因此这里只写个翻译，再熟悉一下这篇文章。
论文: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation
pytorch版本代码:【here】

摘要

点云是一种重要的几何数据结构。由于其格式不规则，大多数研究者会把这类数据转换成规则的三维体素网格或图像集合来处理，但这样会使数据变得不必要地庞大，并带来其他问题。在这篇文章中，我们设计了一种直接处理点云的新型神经网络，它很好地顾及了输入点的排列不变性。我们的网络名为 PointNet，为从物体分类、部件分割到场景语义解析等应用提供了统一的架构。尽管结构简单，PointNet 却非常高效且有效：在实验上，它表现出与现有最好方法相当甚至更强的性能；在理论上，我们给出了分析，帮助理解网络学到了什么，以及网络为什么对输入的扰动和破坏具有鲁棒性。

Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.

引言

在这篇文章中，我们探索能够对点云、网格（mesh）等 3D 几何数据进行推理的深度学习架构。典型的卷积架构要求高度规则的输入数据格式，比如图像网格或三维体素，以便进行权值共享和其他核优化。由于点云和网格都不是规则格式，大多数研究者通常会在把这类数据送入深度网络之前，先把它们转换成规则的三维体素网格或图像集合（例如多视角图像）。然而，这种数据表示的转换会使得到的数据变得不必要地庞大，同时还会引入量化伪影（quantization artifacts），掩盖数据本身的自然不变性。

In this paper we explore deep learning architectures capable of reasoning about 3D geometric data such as point clouds or meshes. Typical convolutional architectures require highly regular input data formats, like those of image grids or 3D voxels, in order to perform weight sharing and other kernel optimizations. Since point clouds or meshes are not in a regular format, most researchers typically transform such data to regular 3D voxel grids or collections of images (e.g, views) before feeding them to a deep net architecture. This data representation transformation, however, renders the resulting data unnecessarily voluminous — while also introducing quantization artifacts that can obscure natural invariances of the data.

出于这个原因，我们关注一种不同的 3D 几何输入表示——直接使用点云，并把由此得到的深度网络命名为 PointNet。点云是简单而统一的结构，避免了网格的组合不规则性和复杂性，因此更容易学习。不过，PointNet 仍然必须顾及这样一个事实：点云只是点的集合，因而对其成员的排列是不变的，这就要求网络计算中引入某种对称化操作；此外，对刚体运动等变换的不变性也需要考虑。

For this reason we focus on a different input representation for 3D geometry using simply point clouds and name our resulting deep nets PointNets. Point clouds are simple and unified structures that avoid the combinatorial irregularities and complexities of meshes,and thus are easier to learn from. The PointNet, however, still has to respect the fact that a point cloud is just a set of points and therefore invariant to permutations of its members, necessitating certain symmetrizations in the net computation. Further invariances to rigid motions also need to be considered.

我们的 PointNet 是一个统一架构：它直接以点云为输入，输出整个输入的类别标签，或输入中每个点的分割/部件标签。我们网络的基本架构出奇地简单：在初始阶段，每个点都被相同且独立地处理。在基本设置中，每个点仅由三维坐标 (x, y, z) 表示；也可以通过计算法线以及其他局部或全局特征来增加额外的维度。

Our PointNet is a unified architecture that directly takes point clouds as input and outputs either class labels for the entire input or per point segment/part labels for each point of the input. The basic architecture of our network is surprisingly simple as in the initial stages each point is processed identically and independently. In the basic setting each point is represented by just its three coordinates (x, y, z). Additional dimensions may be added by computing normals and other local or global features.

我们方法的关键是使用一个简单的对称函数——最大池化（max pooling）。网络实际上学到了一组优化函数/准则，用来挑选点云中有趣或有信息量的点，并编码选择它们的原因。网络最后的全连接层把这些学到的最优值聚合成整个形状的全局描述符（用于形状分类，如上文所述），或者用于预测每个点的标签（用于形状分割）。

Key to our approach is the use of a single symmetric function, max pooling. Effectively the network learns a set of optimization functions/criteria that select interesting or informative points of the point cloud and encode the reason for their selection. The final fully connected layers of the network aggregate these learnt optimal values into the global descriptor for the entire shape as mentioned above (shape classification) or are used to predict per point labels (shape segmentation).


图1：PointNet 的应用。我们提出一种新型深度网络架构，直接处理原始点云（点的集合），无需体素化或渲染。这是一个统一的架构，能学习全局和局部点特征，为物体分类、部件分割、场景语义解析等一系列 3D 识别任务提供了简单、高效且有效的方法。

我们这种输入形式很容易施加刚性变换或仿射变换，因为每个点是独立变换的。因此，我们可以加入一个依赖于数据的空间变换网络（spatial transformer network），在 PointNet 处理数据之前尝试把数据规范化（canonicalize），从而进一步提高结果。

Our input format is easy to apply rigid or affine transformations to, as each point transforms independently. Thus we can add a data-dependent spatial transformer network that attempts to canonicalize the data before the PointNet processes them, so as to further improve the results.

我们对方法给出了理论分析和实验评估。我们证明了我们的网络可以逼近任意连续的集合函数。更有意思的是，结果表明我们的网络学会了用一个稀疏的关键点集合来概括输入点云，从可视化结果看，这些关键点大致对应物体的骨架。理论分析也解释了为什么 PointNet 对输入点的微小扰动，以及插入点（离群点）或删除点（数据缺失）造成的破坏具有很强的鲁棒性。

We provide both a theoretical analysis and an experimental evaluation of our approach. We show that our network can approximate any set function that is continuous. More interestingly, it turns out that our network learns to summarize an input point cloud by a sparse set of key points, which roughly corresponds to the skeleton of objects according to visualization. The theoretical analysis provides an understanding why our PointNet is highly robust to small perturbation of input points as well as to corruption through point insertion (outliers) or deletion (missing data).

在从形状分类、部件分割到场景分割的一系列基准数据集上，我们通过实验将 PointNet 与基于多视图和体积表示的最先进方法进行了比较。在统一的架构下，PointNet 不仅速度快得多，而且表现出与现有最好方法相当甚至更好的性能。

On a number of benchmark datasets ranging from shape classification, part segmentation to scene segmentation, we experimentally compare our PointNet with state-of-the-art approaches based upon multi-view and volumetric representations. Under a unified architecture, not only is our PointNet much faster in speed, but it also exhibits strong performance on par or even better than state of the art.

本文的贡献点:

  • 我们设计了一种适合处理三维无序点集的新型深度网络架构；
  • 我们展示了如何训练这样的网络来完成 3D 形状分类、形状部件分割和场景语义解析任务；
  • 我们对该方法的稳定性和效率给出了全面的实验与理论分析；
  • 我们展示了网络中部分神经元计算出的 3D 特征，并对其性能给出了直观解释。

用神经网络处理无序集合是一个非常通用而基础的问题——我们期待我们的想法也能迁移到其他领域。

• We design a novel deep net architecture suitable for consuming unordered point sets in 3D;
• We show how such a net can be trained to perform 3D shape classification, shape part segmentation and scene semantic parsing tasks;
• We provide thorough empirical and theoretical analysis on the stability and efficiency of our method;
• We illustrate the 3D features computed by the selected neurons in the net and develop intuitive explanations for its performance.
The problem of processing unordered sets by neural nets is a very general and fundamental problem – we expect that our ideas can be transferred to other domains as well.

相关工作

点云特征：大多数已有的点云特征都是针对特定任务手工设计的。点特征通常编码点的某些统计特性，并被设计为对特定变换保持不变，这些特征通常分为内在（intrinsic）与外在（extrinsic）两类，也可以分为局部特征和全局特征。对于特定任务，找到最优的特征组合并非易事。

Point Cloud Features
Most existing features for point cloud are handcrafted towards specific tasks. Point features often encode certain statistical properties of points and are designed to be invariant to certain transformations, which are typically classified as intrinsic or extrinsic. They can also be categorized as local features and global features. For a specific task, it is not trivial to find the optimal feature combination.

3D数据上的深度学习：3D 数据有多种流行的表示形式，也因此产生了多种学习方法。体积 CNN（Volumetric CNNs）是将 3D 卷积神经网络应用到体素化形状上的先驱。然而，由于数据稀疏以及 3D 卷积的计算开销，体积表示受到分辨率的限制。FPNN 和 Vote3D 提出了特殊方法来处理稀疏性问题；但它们的操作仍然基于稀疏体素，难以处理非常大的点云。

Deep Learning on 3D Data
3D data has multiple popular representations, leading to various approaches for learning.
Volumetric CNNs: are the pioneers applying 3D convolutional neural networks on voxelized shapes. However, volumetric representation is constrained by its resolution due to data sparsity and computation cost of 3D convolution. FPNN and Vote3D proposed special methods to deal with the sparsity problem; however, their operations are still on sparse volumes, and it is challenging for them to process very large point clouds.

多视图CNN：一些工作尝试将 3D 点云或形状渲染成 2D 图像，再用 2D 卷积网络进行分类。借助设计成熟的图像 CNN，这类方法在形状分类和检索任务上取得了领先的性能。但是，将它们扩展到场景理解，或点分类、形状补全等其他 3D 任务上并非易事。

Multiview CNNs: have tried to render 3D point cloud or shapes into 2D images and then apply 2D conv nets to classify them. With well engineered image CNNs, this line of methods have achieved dominating performance on shape classification and retrieval tasks. However, it’s nontrivial to extend them to scene understanding or other 3D tasks such as point classification and shape completion.

谱域CNN：一些最新的工作在 meshes 上使用谱域 CNN。然而，这些方法目前仍局限于流形网格（例如有机物体），如何把它们扩展到家具等非等距形状上也并不显而易见。

Spectral CNNs: Some latest works use spectral CNNs on meshes. However, these methods are currently constrained on manifold meshes such as organic objects and it’s not obvious how to extend them to non-isometric shapes such as furniture.

基于特征的DNN：这类方法先提取传统的形状特征，把 3D 数据转换成一个向量，再用全连接网络对形状分类。我们认为它们受限于所提取特征的表示能力。

Feature-based DNNs: firstly convert the 3D data into a vector, by extracting traditional shape features and then use a fully connected net to classify the shape. We think they are constrained by the representation power of the features extracted.

无序集合上的深度学习：从数据结构的角度看，点云就是一个无序的向量集合。深度学习中的大多数工作都针对规整的输入表示，比如序列（语音和语言处理）、图像和体积（视频或 3D 数据），而直接在点集上做深度学习的工作并不多。Oriol Vinyals 等人最近的一项工作研究了这个问题：他们使用带注意力机制的 read-process-write 网络来处理无序输入集合，并展示了其网络具有对数字排序的能力。但由于他们的工作关注的是一般集合和 NLP 应用，集合中缺少几何结构所起的作用。

Deep Learning on Unordered Sets From a data structure point of view, a point cloud is an unordered set of vectors. While most works in deep learning focus on regular input representations like sequences (in speech and language processing), images and volumes (video or 3D data), not much work has been done in deep learning on point sets. One recent work from Oriol Vinyals et al looks into this problem. They use a read-process-write network with attention mechanism to consume unordered input sets and show that their network has the ability to sort numbers. However, since their work focuses on generic sets and NLP applications, there lacks the role of geometry in the sets.

问题陈述

我们设计了一个可以直接把无序点集作为输入的深度学习框架。一个点云被表示为 3D 点的集合 {Pi | i = 1, …, n}，其中每个点 Pi 是其 (x, y, z) 坐标的向量，外加颜色、法线等额外特征通道。为简单明确起见，除非另有说明，我们只使用 (x, y, z) 坐标作为点的通道。

We design a deep learning framework that directly consumes unordered point sets as inputs. A point cloud is represented as a set of 3D points {Pi| i = 1, …, n}, where each point Pi is a vector of its (x, y, z) coordinate plus extra feature channels such as color, normal etc. For simplicity and clarity, unless otherwise noted, we only use the (x, y, z) coordinate as our point’s channels.

对于物体分类任务，输入点云或者直接从某个形状上采样，或者从场景点云中预先分割出来。我们提出的深度网络为全部 k 个候选类别输出 k 个分数。对于语义分割，输入可以是用于部件区域分割的单个物体，也可以是来自 3D 场景、用于物体区域分割的子体积（sub-volume）。我们的模型会为 n 个点和 m 个语义子类别输出 n × m 个分数。

For the object classification task, the input point cloud is either directly sampled from a shape or pre-segmented from a scene point cloud. Our proposed deep network outputs k scores for all the k candidate classes. For semantic segmentation, the input can be a single object for part region
segmentation, or a sub-volume from a 3D scene for object region segmentation. Our model will output n × m scores for each of the n points and each of the m semantic subcategories.
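下面用一小段 PyTorch 代码（仅为示意，张量内容是随机占位数据，并非论文官方实现）说明上述输入与输出的张量形状约定：

```python
import torch

# 约定（示意）：一个 batch 的点云表示为形状 (B, n, 3) 的张量，
# 每个点只用 (x, y, z) 坐标作为通道
B, n, k, m = 32, 1024, 40, 50
points = torch.randn(B, n, 3)           # 输入点集

# 分类：网络为整个输入输出 k 个类别分数
cls_scores = torch.randn(B, k)          # (B, k)
pred_class = cls_scores.argmax(dim=1)   # 每个形状的预测类别

# 分割：网络为 n 个点、m 个语义子类别输出 n × m 个分数
seg_scores = torch.randn(B, n, m)       # (B, n, m)
pred_label = seg_scores.argmax(dim=2)   # 每个点的预测标签
```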

点集的深度学习

R^n 中点集的性质：
我们的输入是来自欧几里得空间的点的子集，它有三个主要性质。

  • 无序性。不像图像中的像素阵列或体素网格中的体素阵列，点云是一个没有特定顺序的点集。换句话说，处理 N 个 3D 点的网络需要对输入集合在数据馈入顺序上的 N! 种排列保持不变。
  • 点之间的相互作用。这些点来自一个具有距离度量的空间，这意味着点不是孤立的，相邻的点会构成有意义的子集。因此，模型需要能够从邻近点捕捉局部结构，以及局部结构之间的组合关系。
  • 变换不变性。作为几何物体，点集学到的表示应该对某些变换保持不变。例如，整体旋转和平移不应改变全局点云的类别，也不应改变点的分割结果。

Properties of Point Sets in Rn
Our input is a subset of points from an Euclidean space. It has three main properties:
• Unordered. Unlike pixel arrays in images or voxel arrays in volumetric grids, point cloud is a set of points without specific order in the points. In other words, a network that consumes N 3D point sets needs to be invariant to N! permutations of the input set in data feeding order.
• Interaction among points. The points are from a space with a distance metric. It means that points are not isolated, and neighboring points form a meaningful subset. Therefore, the model needs to be able to capture local structures from nearby points, and the combinatorial interactions among local structures.
• Invariance under transformations. As a geometric object, the learned representation of the point set
should be invariant to certain transformations. For example, rotating and translating points all together should not modify the global point cloud category nor the segmentation of the points.

我们的网络有三个关键模块：作为对称函数、用于聚合所有点信息的最大池化层；一个局部与全局信息组合的结构；以及两个联合对齐网络，分别对齐输入点和点特征。我们将在下面的段落中分别讨论这些设计选择背后的原因。

Our network has three key modules: the max pooling layer as a symmetric function to aggregate information from all the points, a local and global information combination structure, and two joint alignment networks that align both input points and point features. We will discuss our reason behind these design choices in separate paragraphs below.

无序输入的对称函数：为了使模型对输入排列不变，存在三种策略：(1) 将输入排序为某种规范顺序；(2) 把输入当作序列来训练 RNN，但用各种排列来增强训练数据；(3) 用一个简单的对称函数聚合每个点的信息。这里，对称函数以 n 个向量为输入，输出一个对输入顺序不变的新向量。例如，+ 和 * 就是对称的二元运算符。

Symmetry Function for Unordered Input
In order to make a model invariant to input permutation, three strategies exist: 1) sorting input into a canonical order; 2) treat the input as a sequence to train an RNN, but augment the training data by all kinds of permutations; 3) use a simple symmetric function to aggregate the information from each point. Here, a symmetric function takes n vectors as input and outputs a new vector that is invariant to the input order. For example, + and ∗ operators are symmetric binary functions.
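下面用一小段 PyTorch 代码验证第 (3) 种策略的核心性质：先对每个点独立计算特征，再在点的维度上取最大值，所得结果与点的输入顺序无关（仅为示意，这里用一个线性层代替论文中的多层感知机）：

```python
import torch

torch.manual_seed(0)
n, d = 1024, 64
h = torch.nn.Linear(3, d)               # h：逐点特征函数（这里仅用单层线性层示意）

points = torch.randn(n, 3)              # 一个无序点集
shuffled = points[torch.randperm(n)]    # 随机打乱点的顺序

feat1 = h(points).max(dim=0).values     # g：在点的维度上取最大值（对称函数）
feat2 = h(shuffled).max(dim=0).values

print(torch.allclose(feat1, feat2))     # True：全局特征与输入顺序无关
```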

排序听起来是个简单的解决办法，但在高维空间中，事实上并不存在一个对点的扰动稳定的一般排序。这可以用反证法说明：如果这样的排序策略存在，它就定义了高维空间与一维实数轴之间的一个双射。不难看出，要求排序对点的扰动稳定，就等价于要求该映射在降维时保持空间邻近性，而这在一般情况下是无法做到的。因此，排序并不能完全解决顺序问题；由于顺序问题依然存在，网络也很难学到从输入到输出的一致映射。实验（图5）表明，直接在排序后的点集上应用 MLP 表现很差，只是比直接处理未排序的输入略好一点。

While sorting sounds like a simple solution, in high dimensional space there in fact does not exist an ordering that is stable w.r.t. point perturbations in the general sense. This can be easily shown by contradiction. If such an ordering strategy exists, it defines a bijection map between a high-dimensional space and a 1d real line. It is not hard to see, to require an ordering to be stable w.r.t point perturbations is equivalent to requiring that this map preserves spatial proximity as the dimension reduces, a task that cannot be achieved in the general case. Therefore, sorting does not fully resolve the ordering issue, and it’s hard for a network to learn a consistent mapping from input to output as the ordering issue persists. As shown in experiments (Fig 5), we find that applying a MLP directly on the sorted point set performs poorly, though slightly better than directly processing an unsorted input.

使用 RNN 的想法是把点集看成序列信号，并希望通过用随机排列的序列来训练，让 RNN 对输入顺序变得不变。然而在 "OrderMatters" 一文中，作者表明顺序确实重要，不能被完全忽略。虽然 RNN 对较短序列（几十个元素）的输入顺序有相对较好的鲁棒性，但很难扩展到成千上万个输入元素，而这正是点集的常见规模。实验也表明，基于 RNN 的模型表现不如我们提出的方法（图5）。

The idea to use RNN considers the point set as a sequential signal and hopes that by training the RNN with randomly permuted sequences, the RNN will become invariant to input order. However in “OrderMatters” the authors have shown that order does matter and cannot be totally omitted. While RNN has relatively good robustness to input ordering for sequences with small length (dozens), it’s hard to scale to thousands of input elements, which is the common size for point sets. Empirically, we have also shown that model based on RNN does not perform as well as our proposed method (Fig 5).

我们的想法是：对集合中经过变换的元素应用一个对称函数，以此来近似定义在点集上的一般函数：

Our idea is to approximate a general function defined on a point set by applying a symmetric function on transformed elements in the set:

f({x_1, …, x_n}) ≈ g(h(x_1), …, h(x_n))    (1)

其中 f : 2^(R^N) → R，h : R^N → R^K，而 g : R^K × … × R^K → R 是一个对称函数。
经验上，我们的基础模块非常简单：我们用一个多层感知机网络来近似 h，用一个单变量函数与最大池化函数的复合来近似 g。实验表明这样做效果很好。通过一组不同的 h，我们可以学到若干个 f，以捕捉点集的不同性质。
虽然我们的关键模块看起来很简单，但它具有有趣的性质（见 5.3 节），并且能在多个不同应用中取得强大的性能（见 5.1 节）。由于模块简单，我们也能像 4.3 节那样给出理论分析。

Empirically, our basic module is very simple: we approximate h by a multi-layer perceptron network and g by a composition of a single variable function and a max pooling function. This is found to work well by experiments. Through a collection of h, we can learn a number of f’s to capture different properties of the set.
While our key module seems simple, it has interesting properties (see Sec 5.3) and can achieve strong performance (see Sec 5.1) in a few different applications. Due to the simplicity of our module, we are also able to provide theoretical analysis as in Sec 4.3.
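按这段描述，可以写出一个极简的 PointNet 编码器草图：用逐点共享的 MLP（以 1×1 的 Conv1d 实现）近似 h，用点维度上的 max pooling 近似 g。以下代码仅供理解，层宽（64、128、1024）参考论文常用配置，省略了 T-Net、BatchNorm 等细节：

```python
import torch
import torch.nn as nn

class VanillaPointNetEncoder(nn.Module):
    """h：逐点共享的 MLP（对每个点独立、相同地处理）；g：在点维度上做 max pooling。"""
    def __init__(self, global_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, global_dim, 1),
        )

    def forward(self, x):                              # x: (B, n, 3)
        x = x.transpose(1, 2)                          # (B, 3, n)，Conv1d 要求通道在前
        point_feat = self.mlp(x)                       # (B, K, n)，K 维逐点特征
        global_feat = point_feat.max(dim=2).values     # (B, K)，对称聚合成全局特征
        return global_feat, point_feat

# 用法示意
enc = VanillaPointNetEncoder()
g, f = enc(torch.randn(8, 1024, 3))
print(g.shape, f.shape)        # torch.Size([8, 1024]) torch.Size([8, 1024, 1024])
```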

局部和全局信息聚合：上一节的输出构成一个向量 [f1, …, fK]，它是输入点集的全局签名。我们可以很容易地在这个全局形状特征上训练 SVM 或多层感知机分类器来做分类。然而，点分割需要结合局部和全局知识，我们可以用一种简单而高效的方式实现这一点。
我们的解决方案见图2（分割网络）。在计算出全局点云特征向量之后，我们把全局特征与每个点的特征拼接，将其反馈给每个点；然后基于组合后的点特征再提取新的逐点特征——这时每个点的特征就同时感知了局部和全局信息。
经过这一修改，我们的网络能够预测同时依赖局部几何和全局语义的逐点量。例如，我们可以准确地预测每个点的法线（见补充材料中的图），这验证了网络能够汇总点的局部邻域信息。在实验部分，我们还展示了我们的模型在形状部件分割和场景分割上可以达到最先进（SOTA）的性能。

Local and Global Information Aggregation
The output from the above section forms a vector [f1, . . . , fK], which is a global signature of the input set. We can easily train a SVM or multi-layer perceptron classifier on the shape global features for classification. However, point segmentation requires a combination of local and global knowledge. We can achieve this in a simple yet highly effective manner.
Our solution can be seen in Fig 2 (Segmentation Network). After computing the global point cloud feature vector, we feed it back to per point features by concatenating the global feature with each of the point features. Then we extract new per point features based on the combined point features - this time the per point feature is aware of both the local and global information.
With this modification our network is able to predict per point quantities that rely on both local geometry and global semantics. For example we can accurately predict per-point normals (fig in supplementary), validating that the network is able to summarize information from the point’s local neighborhood. In experiments, we also show that our model can achieve state-of-the-art performance on shape part segmentation and scene segmentation.
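下面是"把全局特征拼回每个点"这一步的一个 PyTorch 草图（非官方实现，维度数字仅作举例）：

```python
import torch
import torch.nn as nn

def concat_local_global(point_feat, global_feat):
    """point_feat: (B, C1, n) 逐点特征；global_feat: (B, C2) 全局特征。
    将全局特征沿点的维度复制 n 份，再与逐点特征拼接，得到 (B, C1 + C2, n)。"""
    B, _, n = point_feat.shape
    g = global_feat.unsqueeze(2).expand(-1, -1, n)     # (B, C2, n)
    return torch.cat([point_feat, g], dim=1)

# 拼接之后再接若干共享 MLP，即可输出每个点的 m 类分数（示意，假设 m = 50）
combined = concat_local_global(torch.randn(8, 64, 1024), torch.randn(8, 1024))
seg_head = nn.Conv1d(64 + 1024, 50, 1)
print(seg_head(combined).shape)        # torch.Size([8, 50, 1024])
```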

联合对齐网络：如果点云经历了某些几何变换（例如刚性变换），其语义标注必须保持不变。因此我们期望点集学到的表示对这些变换也是不变的。
一个自然的解决办法是在特征提取之前，把所有输入点集对齐到一个规范空间。Jaderberg 等人提出了空间变换器（spatial transformer）的思想，通过采样和插值来对齐 2D 图像，并依靠在 GPU 上实现的专门定制层来完成。
我们点云的输入形式使我们能够以简单得多的方式实现这一目标：我们不需要发明任何新的层，也不会像图像情形那样引入混叠。我们用一个微型网络预测一个仿射变换矩阵，并把这个变换直接作用到输入点的坐标上。这个微型网络本身与大网络类似，由点独立特征提取、最大池化和全连接层等基本模块组成。

Joint Alignment Network
The semantic labeling of a point cloud has to be invariant if the point cloud undergoes certain geometric transformations, such as rigid transformation. We therefore expect that the learnt representation by our point set is invariant to these transformations.
A natural solution is to align all input set to a canonical space before feature extraction. Jaderberg et al. introduces the idea of spatial transformer to align 2D images through sampling and interpolation, achieved by a specifically tailored layer implemented on GPU.
Our input form of point clouds allows us to achieve this goal in a much simpler way compared with. We do not need to invent any new layers and no alias is introduced as in the image case. We predict an affine transformation matrix by a mini-network and directly apply this transformation to the coordinates of input points. The mini-network itself resembles the big network and is composed by basic modules of point independent feature extraction, max pooling and fully connected layers.
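下面给出一个输入变换 T-Net 的 PyTorch 草图：它与主干结构类似（逐点共享 MLP + max pooling + 全连接），输出一个 3×3 矩阵并直接作用于输入坐标。层宽以及把输出初始化到单位阵附近等细节参考常见实现，可能与论文官方代码略有出入：

```python
import torch
import torch.nn as nn

class InputTNet(nn.Module):
    """预测 3x3 变换矩阵的微型网络草图：逐点共享 MLP + max pooling + 全连接。"""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 9),
        )

    def forward(self, x):                          # x: (B, n, 3)
        feat = self.mlp(x.transpose(1, 2))         # (B, 1024, n)
        feat = feat.max(dim=2).values              # (B, 1024)
        mat = self.fc(feat).view(-1, 3, 3)         # (B, 3, 3)
        # 让预测矩阵初始化在单位阵附近，训练更稳定（常见做法）
        mat = mat + torch.eye(3, device=x.device).unsqueeze(0)
        return torch.bmm(x, mat)                   # 把变换直接作用到输入坐标上

aligned = InputTNet()(torch.randn(8, 1024, 3))     # (8, 1024, 3)
```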

这个思想还可以进一步扩展到特征空间的对齐。我们可以在点特征上插入另一个对齐网络，预测一个特征变换矩阵来对齐来自不同输入点云的特征。然而，特征空间中的变换矩阵比空间变换矩阵的维度高得多，这会大大增加优化的难度。因此我们在 softmax 训练损失中加入一个正则化项，约束特征变换矩阵接近正交矩阵（见下式）。
其中 A 是微型网络预测的特征对齐矩阵。正交变换不会丢失输入中的信息，因此是我们希望的。我们发现，加入这个正则化项后，优化变得更稳定，模型也取得了更好的性能。

This idea can be further extended to the alignment of feature space, as well. We can insert another alignment network on point features and predict a feature transformation matrix to align features from different input point clouds. However, transformation matrix in the feature space has much higher dimension than the spatial transform matrix, which greatly increases the difficulty of optimization. We therefore add a regularization term to our softmax training loss. We constrain the feature transformation matrix to be close to an orthogonal matrix:
L_reg = || I − A A^T ||_F^2    (2)
where A is the feature alignment matrix predicted by a mini-network. An orthogonal transformation will not lose information in the input, thus is desired. We find that by adding the regularization term, the optimization become more stable and our model achieves better performance.
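这个正则化项可以按下面的草图实现（非官方代码；与分类损失相加时的权重 reg_weight 是假设的超参数，常见取值如 0.001）：

```python
import torch

def feature_transform_regularizer(A):
    """A: (B, K, K)，微型网络预测的特征变换矩阵。
    计算 L_reg = || I - A A^T ||_F^2，鼓励 A 接近正交矩阵。"""
    K = A.size(1)
    I = torch.eye(K, device=A.device).unsqueeze(0)      # (1, K, K)
    diff = I - torch.bmm(A, A.transpose(1, 2))          # (B, K, K)
    return (diff ** 2).sum(dim=(1, 2)).mean()           # Frobenius 范数平方，按 batch 取均值

loss_reg = feature_transform_regularizer(torch.randn(8, 64, 64))
# total_loss = cls_loss + reg_weight * loss_reg        # reg_weight 为假设的权重超参数
```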

实验

应用
在本节中，我们展示如何训练我们的网络来完成 3D 物体分类、物体部件分割和场景语义分割。尽管我们采用的是一种全新的数据表示（点集），我们仍能在多个任务的基准测试上取得相当甚至更好的性能。

In this section we show how our network can be trained to perform 3D object classification, object part segmentation and semantic scene segmentation 1. Even though we are working on a brand new data representation(point sets), we are able to achieve comparable or even better performance on benchmarks for several tasks.

3D物体分类：我们的网络学习到可用于物体分类的全局点云特征。我们在 ModelNet40 形状分类基准上评估模型：它包含来自 40 个人造物体类别的 12,311 个 CAD 模型，其中 9,843 个用于训练、2,468 个用于测试。以往的方法大多关注体素和多视图图像表示，而我们是第一个直接在原始点云上工作的。

3D Object Classification
Our network learns global point cloud feature that can be used for object classification. We evaluate our model on the ModelNet40 [25] shape classification benchmark. There are 12,311 CAD models from 40 man-made object categories, split into 9,843 for training and 2,468 for testing. While previous methods focus on volumetric and multi-view image representations, we are the first to directly work on raw point cloud.

我们根据面的面积在网格面上均匀采样 1024 个点，并把这些点归一化到单位球内。训练时，我们在线地对点云做数据增强：沿竖直向上轴随机旋转物体，并对每个点的位置加入均值为 0、标准差为 0.02 的高斯抖动。
在表1中，我们将我们的模型与以往工作进行比较，也与我们自己的基线进行比较——该基线在从点云提取的传统特征（点密度、D2、形状轮廓等）上使用 MLP。我们的模型在基于 3D 输入（体素和点云）的方法中取得了最先进的性能。由于只使用全连接层和最大池化，我们的网络在推理速度上遥遥领先，也很容易在 CPU 上并行化。我们的方法与基于多视图的方法（MVCNN）之间仍有小的差距，我们认为这是因为渲染图像能捕捉到的精细几何细节在我们的表示中有所丢失。

We uniformly sample 1024 points on mesh faces according to face area. Points are normalized into a unit sphere. During training we augment the point cloud on-the-fly by randomly rotating the object along the up-axis and jitter the position of each points by a Gaussian noise with zero mean and 0.02 standard deviation.
In Table 1, we compare our model with previous works as well as our baseline using MLP on traditional features extracted from point cloud (point density, D2, shape contour etc.). Our model achieved state-of-the-art performance among methods based on 3D input (volumetric and point cloud). With only fully connected layers and max pooling, our net gains a strong lead in inference speed and can be easily parallelized in CPU as well. There is still a small gap between our method and multi-view based method (MVCNN [20]), which we think is due to the loss of fine geometry details that can be captured by rendered images.
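这里描述的训练时数据增强可以写成如下草图（假设竖直向上轴是 z 轴，实际数据集中向上轴可能不同；非官方实现）：

```python
import math
import torch

def augment_point_cloud(points, sigma=0.02):
    """points: (n, 3)。绕向上轴随机旋转整个物体，并对每个点加均值 0、标准差 sigma 的高斯抖动。"""
    theta = 2.0 * math.pi * torch.rand(1).item()       # 随机旋转角
    c, s = math.cos(theta), math.sin(theta)
    rot = points.new_tensor([[c, -s, 0.0],
                             [s,  c, 0.0],
                             [0.0, 0.0, 1.0]])          # 绕 z 轴（假设为向上轴）的旋转矩阵
    points = points @ rot.T                             # 整体旋转
    return points + sigma * torch.randn_like(points)    # 逐点高斯抖动

aug = augment_point_cloud(torch.randn(1024, 3))
```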

3D物体部件分割：部件分割是一项具有挑战性的细粒度 3D 识别任务。给定一个 3D 扫描或网格模型，任务是为每个点或面分配部件类别标签（例如椅子腿、杯子把手）。我们在 ShapeNet part 数据集上进行评估，它包含来自 16 个类别的 16,881 个形状，总共标注了 50 个部件；大多数物体类别被标注为 2 到 5 个部件，真值标注在形状上采样的点上。
我们把部件分割形式化为逐点分类问题。评价指标是形状上每个部件的点 IoU，以及跨形状的平均 IoU。在这一部分，我们将分割版本的 PointNet 与两种传统方法（它们都利用了逐点几何特征以及形状之间的对应关系）和一种体积式深度学习方法 3D CNN 进行比较。我们把用于分割的 3D CNN 设计为全卷积结构，各层保持体积大小不变；最终每个体素的感受野为 19 个体素，空间分辨率为 32。
在表2中，我们报告了每个类别和平均 IoU（%）分数。我们观察到平均 IoU 提升了 2.3%，并且我们的网络在大多数类别上都优于基线方法。

3D Object Part Segmentation
Part segmentation is a challenging fine-grained 3D recognition task. Given a 3D scan or a mesh model, the task is to assign part category label (e.g. chair leg, cup handle) to each point or face.
We evaluate on the ShapeNet part data set, which contains 16,881 shapes from 16 categories, annotated with 50 parts in total. Most object categories are labeled with two to five parts. Ground truth annotations are labeled on sampled points on the shapes.
We formulate part segmentation task as a per-point classification problem. Evaluation metric is IoU on points for every part on a shape and mean IoU across shapes. In this section, we compare our segmentation version PointNet (Fig 2, Segmentation Network) with two traditional methods that both take advantage of point-wise geometry features and correspondences between shapes, as well as 3D CNN, a volumetric deep learning method. We design the segmentation 3D CNN architecture as a fully convolutional one that keeps volume size through all layers. In the end each voxel has a receptive field of 19 voxels with spatial resolution of 32.
In Table 2, we report per-category and mean IoU(%) scores. We observe a 2.3% mean IoU improvement and our net beats the baseline methods in most categories.

我们还在模拟的 Kinect 扫描（simulated Kinect scans）上做了实验，以检验这些方法的鲁棒性。对 ShapeNet part 数据集中的每一个 CAD 模型，我们用 Blensor Kinect Simulator 从 6 个随机视角生成不完整的点云。我们用相同的网络架构和训练设置，分别在完整形状和部分扫描上训练 PointNet。结果显示平均 IoU 仅下降了 5.3%。图3中我们展示了完整数据和部分数据上的定性结果：可以看到，尽管部分数据相当有挑战性，我们的预测仍然是合理的。

We also perform experiments on simulated Kinect scans to test the robustness of these methods. For every CAD model in the ShapeNet part data set, we use Blensor Kinect Simulator to generate incomplete point clouds from six random viewpoints. We train our PointNet on the complete shapes and partial scans with the same network architecture and training setting. Results show that we lose only 5.3% mean IoU. In Fig 3, we present qualitative results on both complete and partial data. One can see that though partial data is fairly challenging, our predictions are reasonable.
[图3：完整点云与部分扫描上的定性分割结果]

场景语义分割：我们用于部件分割的网络可以很容易地扩展到场景语义分割，此时每个点的标签变成语义物体类别，而不是物体部件标签。
我们在 Stanford 3D semantic parsing 数据集上做实验。该数据集包含来自 Matterport 扫描仪的 3D 扫描，覆盖 6 个区域、共 271 个房间，扫描中的每个点都被标注为 13 个语义类别之一（椅子、桌子、地板、墙等，外加杂物）。为了准备训练数据，我们先按房间划分点，再把每个房间切分成 1m×1m 的块（block）。我们训练分割版本的 PointNet 来预测每个块中每个点的类别。每个点由 9 维向量表示：XYZ、RGB，以及相对所在房间的归一化位置（0 到 1）。训练时，我们在每个块中在线随机采样 4096 个点；测试时使用全部点。我们遵循与以往工作相同的协议，用 k-fold 策略划分训练集和测试集。

Semantic Segmentation in Scenes
Our network on part segmentation can be easily extended to semantic scene segmentation, where point labels become semantic object classes instead of object part labels.
We experiment on the Stanford 3D semantic parsing data set [1]. The dataset contains 3D scans from Matterport scanners in 6 areas including 271 rooms. Each point in the scan is annotated with one of the semantic labels from 13 categories (chair, table, floor, wall etc. plus clutter). To prepare training data, we firstly split points by room, and then sample rooms into blocks with area 1m by 1m. We train our segmentation version of PointNet to predict per point class in each block. Each point is represented by a 9-dim vector of XYZ, RGB and normalized location as to the room (from 0 to 1). At training time, we randomly sample 4096 points in each block on-the-fly. At test time,
we test on all the points. We follow the same protocol as [1] to use k-fold strategy for train and test.
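按上述描述，每个点的 9 维输入特征和每块 4096 个点的采样可以写成如下草图（非官方实现，函数名和房间尺寸等均为示意）：

```python
import torch

def make_block_input(block_xyz, block_rgb, room_min, room_size, num_points=4096):
    """准备语义分割训练输入的草图：每个点用 9 维向量表示
    （XYZ、RGB、相对所在房间的归一化位置），再从块内随机采样 4096 个点。"""
    normalized = (block_xyz - room_min) / room_size              # 相对房间的归一化坐标，约在 0~1
    feats = torch.cat([block_xyz, block_rgb, normalized], dim=1)  # (N, 9)
    idx = torch.randint(0, feats.size(0), (num_points,))         # 有放回随机采样
    return feats[idx]                                            # (4096, 9)

# 用法示意（占位数据：一个 1m x 1m、高约 3m 的块）
xyz = torch.rand(20000, 3) * torch.tensor([1.0, 1.0, 3.0])
rgb = torch.rand(20000, 3)
block = make_block_input(xyz, rgb, room_min=torch.zeros(3),
                         room_size=torch.tensor([10.0, 8.0, 3.0]))
print(block.shape)   # torch.Size([4096, 9])
```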

我们将我们的方法与一个使用手工点特征的基线进行比较。该基线提取同样的 9 维局部特征，外加三个额外特征：局部点密度、局部曲率和法线，并用标准 MLP 作为分类器。结果见表3，我们的 PointNet 方法显著优于该基线。图4中我们展示了定性的分割结果：我们的网络能够输出平滑的预测，并且对缺失点和遮挡具有鲁棒性。

We compare our method with a baseline using handcrafted point features. The baseline extracts the same 9-dim local features and three additional ones: local point density, local curvature and normal. We use standard MLP as the classifier. Results are shown in Table 3, where our PointNet method significantly outperforms the baseline method.
In Fig 4, we show qualitative segmentation results. Our network is able to output smooth predictions and is robust to missing points and occlusions.

基于我们网络的语义分割输出，我们进一步构建了一个 3D 物体检测系统，它使用连通分量（connected component）来产生物体候选（详见补充材料）。我们在表4中与此前最先进的方法进行比较：该方法基于滑动形状法（带 CRF 后处理），在体素网格上用局部几何特征和全局房间上下文特征训练 SVM。在所报告的家具类别上，我们的方法大幅优于它。

Based on the semantic segmentation output from our network, we further build a 3D object detection system using connected component for object proposal (see supplementary for details). We compare with previous state-of-the-art method in Table 4. The previous method is based on a sliding shape method (with CRF post processing) with SVMs trained on local geometric features and global room context feature in voxel grids. Our method outperforms it by a large margin on the furniture categories reported.

结构设计分析：在这一节中，我们通过对照实验验证我们的设计选择，并展示网络超参数的影响。
与其他顺序不变方法的比较：如 4.2 节所述，处理无序集合输入至少有三种选择。我们以 ModelNet40 形状分类问题作为比较这些选择的测试平台，后面两个对照实验也将使用这一任务。

5.2. Architecture Design Analysis
In this section we validate our design choices by control experiments. We also show the effects of our network’s hyperparameters.
Comparison with Alternative Order-invariant Methods
As mentioned in Sec 4.2, there are at least three options for consuming unordered set inputs. We use the ModelNet40 shape classification problem as a test bed for comparisons of those options, and the following two control experiments will also use this task.

我们比较的基线（见图5）包括：把未排序和排序后的点当作 n×3 数组处理的多层感知机、把输入点视为序列的 RNN 模型，以及基于对称函数的模型。我们实验的对称操作包括最大池化、平均池化和基于注意力的加权求和。注意力方法与 [22] 中的做法类似：从每个点的特征预测一个标量得分，然后用 softmax 在各点之间归一化，再根据归一化的得分对点特征加权求和。如图5所示，最大池化操作以较大优势获得最佳性能，验证了我们的选择。

The baselines (illustrated in Fig 5) we compared with include multi-layer perceptron on unsorted and sorted points as n×3 arrays, RNN model that considers input point as a sequence, and a model based on symmetry functions. The symmetry operation we experimented include max pooling, average pooling and an attention based weighted sum. The attention method is similar to that in [22], where a scalar score is predicted from each point feature, then the score is normalized across points by computing a softmax. The weighted sum is then computed on the normalized scores and the point features. As shown in Fig 5, max pooling operation achieves the best performance by a large winning margin, which validates our choice.
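作为对照项之一的"基于注意力的加权求和"可以按如下草图实现（细节为假设，仅说明它同样是一种对称聚合）：

```python
import torch
import torch.nn as nn

class AttentionAggregator(nn.Module):
    """从每个点的特征预测一个标量得分，softmax 归一化后对点特征加权求和。"""
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, point_feat):                 # point_feat: (B, n, C)
        w = self.score(point_feat)                 # (B, n, 1) 每点一个得分
        w = torch.softmax(w, dim=1)                # 在点的维度上归一化
        return (w * point_feat).sum(dim=1)         # (B, C) 加权求和，也是对称操作

agg = AttentionAggregator(1024)
global_feat = agg(torch.randn(8, 1024, 1024))      # (8, 1024)
```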

输入和特征变换的有效性：在表5中，我们展示了输入变换和特征变换（用于对齐）带来的正面效果。有趣的是，最基本的架构就已经取得了相当不错的结果；使用输入变换能带来 0.8% 的性能提升。正则化损失是让高维（特征）变换有效工作的必要条件。把两种变换与正则化项结合起来，我们取得了最好的性能。

Effectiveness of Input and Feature Transformations
In Table 5 we demonstrate the positive effects of our input and feature transformations (for alignment). It’s interesting to see that the most basic architecture already achieves quite reasonable results. Using input transformation gives a 0.8% performance boost. The regularization loss is necessary for the higher dimension transform to work. By combining both transformations and the regularization term, we achieve the best performance.

鲁棒性测试：我们证明了 PointNet 虽然简单有效，但对各种输入破坏具有鲁棒性。我们使用与图5中最大池化网络相同的架构，输入点被归一化到单位球内，结果见图6。对于点缺失的情况，当 50% 的点缺失时，按最远点采样和随机采样丢点的准确率分别仅下降 2.4% 和 3.8%。如果训练中见过离群点，我们的网络对离群点也很鲁棒。我们评估了两个模型：一个只用 (x, y, z) 坐标训练，另一个用 (x, y, z) 加点密度训练。即使有 20% 的点是离群点，网络的准确率仍超过 80%。图6右显示了网络对点扰动具有鲁棒性。

Robustness Test
We show our PointNet, while simple and effective, is robust to various kinds of input corruptions.
We use the same architecture as in Fig 5’s max pooling network. Input points are normalized into a unit sphere. Results are in Fig 6. As to missing points, when there are 50% points missing, the accuracy only drops by 2.4% and 3.8% w.r.t. furthest and random input sampling. Our net is also robust to outlier points, if it has seen those during training. We evaluate two models: one trained on points with (x, y, z) coordinates; the other on (x, y, z) plus point density. The net has more than 80% accuracy even when 20% of the points are outliers. Fig 6 right shows the net is robust to point perturbations.
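鲁棒性实验中"按最远点采样丢弃点"用到的最远点采样（FPS）可以用下面的简单草图实现（O(n·k) 的朴素写法，仅作示意，未做效率优化）：

```python
import torch

def furthest_point_sampling(points, k):
    """points: (n, 3)。返回 k 个采样点的下标：每次选择距离已选点集最远的点。"""
    n = points.size(0)
    idx = torch.zeros(k, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    idx[0] = int(torch.randint(0, n, (1,)))            # 随机选一个起始点
    for i in range(1, k):
        d = ((points - points[idx[i - 1]]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)                   # 每个点到已选集合的最近距离
        idx[i] = dist.argmax()                          # 选最远的点
    return idx

pts = torch.randn(1024, 3)
keep = furthest_point_sampling(pts, 512)                # 例如只保留 50% 的点
```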

PointNet 可视化：我们设计了两个实验来可视化 PointNet 学到了什么，结果与 4.3 节的理论分析一致。在第一个实验中，我们可视化了式(1)中学到的点函数 h(x)，它表明网络学到了一族从点云中挑选有信息量的点的优化准则。第二个实验则说明了定理2所刻画的网络鲁棒性。

5.3. Visualizing PointNet
We design two experiments to visualize what has been learnt by the PointNet. The results are consistent with the theoretical analysis in Sec 4.3. In the first experiment, we visualize the learnt point function h(x) in Eq 1, which demonstrates that our network learns a family of optimization criteria that selects informative points from the cloud. Our second experiment illustrates the robustness of our network, as explained in Thm 2.

点函数可视化：如 4.2 节所述，我们的网络为每个点计算 K 维（本实验取 K=1024）的逐点特征，并通过最大池化层把所有逐点局部特征聚合成一个 K 维向量，作为全局形状描述符。为了更深入了解学到的逐点函数 h 检测到了什么，我们在图7中可视化了使逐点函数取值较高的点 pi。这一可视化清楚地表明，不同的点函数学会检测散布于整个空间、具有不同形状的不同区域中的点。

Point Function Visualization
As discussed in Sec 4.2,our network computes K (we take K = 1024 in this experiment) dimension point features for each point and aggregates all the per-point local features via a max pooling layer into a single K-dim vector, which forms the global shape descriptor
To gain more insights on what the learnt per-point functions h’s detect, we visualize the points pi’s that give high per-point function value f(pi) in Fig 7. This visualization clearly shows that different point functions learn to detect for points in different regions with various shapes scattered in the whole space.
[图7：点函数可视化]

全局特征可视化：如定理2所述，我们可视化形状族——即介于临界点集 CS 与上界形状 NS 之间、对给定形状 S 给出相同全局特征 f(S) 的所有形状。
图8展示了四个选定形状的临界点集 CS 和上界形状 NS。原始输入点云画在第一行，它们的 CS 和 NS 分别显示在第二行和第三行。给定形状 S 的 CS 包含原始点云中使某些逐点函数 hi 取得最大激活的点；NS 的构造方式是：把一个直径为 2 的立方体中的所有点送入网络，选出逐点函数值 (h1(p), h2(p), …, hK(p)) 都不大于全局形状描述符对应值的点。不难看出，所有包含 CS 且被 NS 包含的形状 S′ 都给出完全相同的全局特征 f(S′) = f(S)。这一过渡形状族意味着 PointNet 的鲁棒性：加入噪声抖动或丢失一些非关键点不会改变学到的形状签名，因此也不会影响我们的分类或分割结果。

Global Feature Visualization
We visualize the shape family, as discussed in Thm 2, including all the shapes between the critical point set CS and the upper-bound shape NS that gives the same global feature f(S) with respect to a given shape S.
Fig 8 shows the critical point sets CS and upper-bound shapes NS for four selected shapes. The original input point clouds are rendered in the first row while the CS and NS for them are shown in the second and third rows respectively. The CS for a given shape S includes the points from the original point cloud that activate some per-point functions hi the most. The NS is constructed by pushing all the points in a diameter-2 cube through the network and selecting the points p whose per-point function values (h1(p), h2(p), . . . , hK(p)) are no larger than the global shape descriptor. It is not hard to see that all the shapes S′ that cover CS and are contained by NS give exactly the same global feature f(S′) = f(S). The transition shape family entails the robustness of our PointNet, meaning that adding noisy jitterings or losing some non-critical points do not change the learnt shape signature and thus do not affect our classification or segmentation results.
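临界点集 CS 的一种简单计算方式如下草图所示：对 max pooling 的每一个特征维度，找出贡献最大值的那个点，这些点的并集就是临界点集（非官方实现）：

```python
import torch

def critical_point_indices(point_feat):
    """point_feat: (K, n)，最大池化之前的逐点特征。
    对每一维全局特征找出取得最大值的点，去重后即临界点的下标集合。"""
    argmax_per_dim = point_feat.argmax(dim=1)      # (K,) 每个特征维度由哪个点贡献最大值
    return torch.unique(argmax_per_dim)

feat = torch.randn(1024, 2048)                     # K = 1024 维特征，n = 2048 个点
cs_idx = critical_point_indices(feat)
print(cs_idx.numel(), "个临界点")                  # 通常远小于 n，对应稀疏的关键点集
```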

总结

在这项工作中，我们提出了一种直接处理点云的新型深度神经网络 PointNet。我们的网络为物体分类、部件分割和语义分割等多种 3D 识别任务提供了统一的方法，同时在标准基准上取得了与现有最好方法相当或更好的结果。我们还提供了理论分析和可视化，以帮助理解我们的网络。

In this work, we propose a novel deep neural network PointNet that directly consumes point cloud. Our network provides a unified approach to a number of 3D recognition tasks including object classification, part segmentation and semantic segmentation, while obtaining on par or better results than state of the arts on standard benchmarks. We also provide theoretical analysis and visualizations towards understanding of our network.

