Understanding Point Cloud Network Papers (Part 1): The Proposal of Point Cloud Networks. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

CUHK-SZ-relu


Article link: https://blog.csdn.net/qq_43210957/article/details/118336042


1. Abstract

1.1 Sentence-by-sentence translation

Point cloud is an important type of geometric data structure.
点云是一种重要的数据结构。
Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images.
由于其格式不规则，大多数研究人员将此类数据转换为规则的3D体素网格或图像集合。
This, however, renders data unnecessarily voluminous and causes issues.
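In other words, the conversion makes the data needlessly large and causes problems of its own. ("Renders" here means "makes"; it has nothing to do with graphics rendering.)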
In this paper, we design a novel type of neural network that directly consumes point clouds,
本文设计了一种直接消耗点云的新型神经网络，
which well respects the permutation invariance of points in the input.
这很好地考虑了输入中点的排列不变性。
Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing.
我们的网络名为PointNet，为从对象分类、内容部分分割到场景语义解析等应用程序提供了统一的体系结构。
Though simple, PointNet is highly efficient and effective.
虽然简单，PointNet却非常高效。
Empirically, it shows strong performance on par or even better than state of the art.
从经验上看，它显示了与现有技术水平相当甚至更好的强大性能。
Theoretically,we provide analysis towards understanding of what the network has learnt and why the network is robust with
respect to input perturbation and corruption.
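On the theory side, the authors analyze what the network has learned and why it stays robust when the input is perturbed or corrupted, that is, when the input data is disturbed or damaged.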

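1.2 Summary

The abstract is the most important part of a paper, so it should be read first. It makes roughly the following points:

1. Point clouds are an important geometric data structure, so they are worth studying.
2. Point clouds have a characteristic property, their irregular format. Earlier work converted them into regular 3D voxel grids or collections of images, and that conversion is exactly where the data's own properties (its natural invariances) get obscured.
My first reading was that this resembles classical machine learning: features are extracted first and then analyzed to produce a result, and the quality of the intermediate features is hard to judge, so it is better to hand both extraction and analysis to one deep model.
Here the intermediate result is actually easy to judge (just check whether it looks like the 3D shape), but the conversion may still discard information I would not think to look for. That is why the paper goes straight from the point cloud to each application.
(The above was my impression on first reading. After finishing the paper I would add the following.)
The invariance named in the abstract is permutation invariance: a point cloud is a set, so reordering its points must not change the result. This property is what motivates the symmetric function (max pooling), which works well in the experiments. Invariance to rotation and other rigid motions is a separate requirement, handled by the alignment network (T-Net).
3. The conversion also makes the data unnecessarily voluminous.
I read this mainly as a dimensionality problem: a dense voxel grid grows with the cube of its resolution, so the model has to process a much larger input, which needs more parameters, more tuning, and more training data. None of that is desirable.
4. The paper therefore proposes a model that works end to end, starting from the raw point cloud.
I did not understand this at first. After reading the whole paper, I take it to mean that earlier models needed an extra structure-conversion step, whereas this one gets its final result directly from the points.
5. The paper also offers some explanation of what the model learns.
This reflects a recent trend in deep learning toward interpretability: authors want to explain the models they design.
6. The paper also addresses robustness.
Robustness matters for any method, and the way it is handled here is worth borrowing. I think anything meant for real use should go through robustness testing, so although these tests are not the paper's novelty, they are necessary.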
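To make the "voluminous" point concrete, here is a minimal sketch comparing the storage of a dense voxel grid with that of a raw point cloud. The grid resolutions and the point count are illustrative assumptions, not figures from the paper, and both helper functions are hypothetical names.

```python
# Illustrative only: resolutions and point count below are assumptions.

def voxel_cells(resolution: int) -> int:
    """Number of cells in a dense cubic grid with `resolution` cells per side."""
    return resolution ** 3


def point_cloud_values(num_points: int, channels: int = 3) -> int:
    """Number of stored values for a point cloud with (x, y, z) per point."""
    return num_points * channels


for resolution in (32, 64, 128):
    print(f"{resolution}^3 grid -> {voxel_cells(resolution)} cells")

print(f"1024 points -> {point_cloud_values(1024)} values")
```

Doubling the grid resolution multiplies the number of cells by eight, while the point cloud's size depends only on how many points are sampled.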

2. Introduction

2.1 Sentence-by-sentence translation

Original text
Paragraph 1
In this paper we explore deep learning architectures capable of reasoning about 3D geometric data such as point clouds or meshes.
在本文中，我们探索了能够推理三维几何数据(如点云或网格图)的深度学习体系结构。
Typical convolutional architectures require highly regular input data formats, like those of image grids or 3D voxels, in order to perform weight sharing and other kernel optimizations.
典型的卷积架构要求高度规则的输入数据格式，如图像网格或3D体素，以执行权重共享和其他内核优化。
Since point clouds or meshes are not in a regular format, most researchers typically transform such data to regular 3D voxel grids or collections of images (e.g, views) before feeding them to a deep net architecture.
由于点云或网格不是常规格式，大多数研究人员通常会将这些数据转换为常规的3D体素网格或图像集合(例如视图)，然后再将它们输入到深层网络架构中。
This data representation transformation, however, renders the resulting data unnecessarily voluminous — while also introducing quantization artifacts that can obscure natural invariances of the data
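In other words, converting the representation makes the data unnecessarily bulky, and it also introduces quantization artifacts (distortions caused by discretizing the shape) that can obscure the data's natural invariances.

Paragraph 2
For this reason we focus on a different input representation for 3D geometry using simply point clouds– and name our resulting deep nets PointNets.
That is, the authors switch to a different input representation, plain point clouds, and call the resulting networks PointNets. (The change is in the input representation; the new network follows from it.)
Point clouds are simple and unified structures that avoid the combinatorial irregularities and complexities of meshes,and thus are easier to learn from.
Point clouds do not have the combinatorial irregularities and complexities that meshes have, which makes them easier to learn from.
The PointNet, however,still has to respect the fact that a point cloud is just a set of points and therefore invariant to permutations of its members, necessitating certain symmetrizations in the net computation.
Even so, a point cloud is only a set, so the output must not depend on the order of its members. This forces some symmetrization into the network's computation, meaning operations whose result is the same for every ordering.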
Further invariances to rigid motions also need to be considered.
还需要考虑刚性运动的进一步不变性。

Paragraph 3
Our PointNet is a unified architecture that directly takes point clouds as input and outputs either class labels for the entire input or per point segment/part labels for each point of the input.
That is, the network takes a point cloud as input, and what it outputs is either one class label for the whole input or a segment/part label for every point.
The basic architecture of our network is surprisingly simple as in the initial stages each point is processed identically and independently.
我们的网络的基本架构出奇地简单，因为在初始阶段，每个点都以相同和独立的方式处理。
In the basic setting each point is represented by just its three coordinates (x, y, z). Additional dimensions may be added by computing normals and other local or global features.
在基本设置中，每个点仅用它的三个坐标(x, y, z)表示。额外的维度可以通过计算法线和其他局部或全局特征来添加。

Paragraph 4
Key to our approach is the use of a single symmetric function, max pooling.
我们的方法的关键是使用单一的对称函数，最大池化。
Effectively the network learns a set of optimization functions/criteria that select interesting or informative points of the point cloud and encode the reason for their selection.
该网络有效地学习了一组优化函数/准则，这些函数/准则选择点云中有趣的或有信息的点，并对其选择的原因进行编码。
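(As I understand it, the network learns functions that pick out the most informative points and encode them, that is, transform them or attach information to them.)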
The final fully connected layers of the network aggregate these learnt optimal values into the global descriptor for the entire shape as mentioned above (shape classification) or are used to predict per point labels (shape segmentation).
网络的最后一层全连接层：将这些学习到的最优值聚集到整个形状的全局描述符中(形状分类)或用于预测每个点标签(形状分割)。

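Paragraph 5
Our input format is easy to apply rigid or affine transformations to, as each point transforms independently.
Rigid or affine transformations are easy to apply to this input format because every point is transformed independently. (The original post showed a figure here; the image is missing from the source.)

Thus we can add a data-dependent spatial transformer network that attempts to canonicalize the data before the PointNet processes them, so as to further improve the results.
So a data-dependent spatial transformer network can be placed in front of PointNet to bring the data into a canonical form before it is processed, which further improves the results. ("Transformer" here refers to a geometric transformation, not an electrical transformer.)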
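The claim that each point transforms independently can be sketched in a few lines of NumPy. This is only an illustration under assumed values: the rotation angle, the translation, and the helper names `rotation_z` and `rigid_transform` are all hypothetical.

```python
import numpy as np


def rotation_z(theta: float) -> np.ndarray:
    """3x3 rotation matrix about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rigid_transform(points: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Apply p' = R p + t to every row of an (n, 3) array."""
    return points @ R.T + t


rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))          # a tiny stand-in point cloud
R = rotation_z(np.pi / 6)                 # assumed angle
t = np.array([1.0, -2.0, 0.5])            # assumed translation

moved = rigid_transform(points, R, t)
# Transforming one point on its own gives the same row as the batched call.
single = rigid_transform(points[2:3], R, t)
```

Because the transformation is one matrix multiplication per point, a matrix predicted by a small network can be applied to the whole cloud in the same way.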

Paragraph 6
We provide both a theoretical analysis and an experimental evaluation of our approach.
我们提供了我们的方法的理论分析和实验评估。
We show that our network can approximate any set function that is continuous.
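That is, the network can approximate any continuous set function, meaning any continuous function whose input is a set (here, a set of points).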
More interestingly, it turns out that our network learns to summarize an input point cloud by a sparse set of key points, which roughly corresponds to the skeleton of objects according to visualization.
更有趣的是，我们的网络将输入的点云总结成了一个稀疏的点的集合，根据可视化，关键点恰好大致对应于对象的骨架。
The theoretical analysis provides an understanding why our PointNet is highly robust to small perturbation of input points as well as to corruption through point insertion (outliers) or deletion(missing data).
通过这个，理论分析提供了一个理解，为什么我们的PointNet对于输入点的小扰动以及通过点插入(异常值)或删除(缺失数据)造成的损坏是高度鲁棒的。
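The link between max pooling, the sparse set of key points, and robustness to deleted points can be sketched as follows. The per-point features are random stand-ins for a trained network's output, and the sizes (200 points, 16 channels) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))      # stand-in per-point features, one row per point

# Global feature: channel-wise maximum over all points.
global_feature = features.max(axis=0)

# "Critical" points: those that attain the maximum in at least one channel.
critical_idx = np.unique(features.argmax(axis=0))
critical_feature = features[critical_idx].max(axis=0)

# Deleting points that are not critical leaves the global feature unchanged.
non_critical = np.setdiff1d(np.arange(features.shape[0]), critical_idx)
kept = np.delete(features, non_critical[:50], axis=0)
feature_after_deletion = kept.max(axis=0)

print(len(critical_idx), "critical points out of", features.shape[0])
```

With K channels, at most K points can determine the pooled vector, so the remaining points can be removed without changing it.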

Paragraph 7
On a number of benchmark datasets ranging from shape classification, part segmentation to scene segmentation,we experimentally compare our PointNet with state-of-the-art approaches based upon multi-view and volumetric representations.
在一些基准数据集上，从形状分类，部分分割到场景分割，我们实验性地比较了PointNet与基于多视图和体积表示的最先进的方法。
Under a unified architecture, not only is our PointNet much faster in speed, but it also exhibits strong performance on par or even better than state of the art.
在统一的体系结构下，我们的PointNet不仅速度快得多，而且表现出与现有技术水平相当甚至更好的性能。

Paragraph 8: the contributions the paper claims for itself
The key contributions of our work are as follows:
我们工作的主要贡献如下:
• We design a novel deep net architecture suitable for consuming unordered point sets in 3D;
我们设计了一种新的适用于消费无序点集的三维深网体系结构;
• We show how such a net can be trained to perform 3D shape classification, shape part segmentation and scene semantic parsing tasks;
我们展示了如何训练这样一个网络来执行三维形状分类，形状部分分割和场景语义分析任务;
• We provide thorough empirical and theoretical analysis on the stability and efficiency of our method;
我们对我们的方法的稳定性和效率提供了深入的经验和理论分析;
• We illustrate the 3D features computed by the selected neurons in the net and develop intuitive explanations for its performance.
我们举例说明网络中选定的神经元计算出的三维特征，并对其性能发展出直观的解释。

Paragraph 9
The problem of processing unordered sets by neural nets is a very general and fundamental problem – we expect that our ideas can be transferred to other domains as well.
利用神经网络处理无序集是一个非常普遍和基本的问题，所以，我们希望我们的想法也能转移到其他领域。

Figure
[Figure: image missing from the source; its caption follows.]

We propose a novel deep net architecture that consumes raw point cloud (set of points) without voxelization or rendering.
我们提出了一种新的深度网络架构，它使用原始点云(点集)，而不需要体素化或渲染。
It is a unified architecture that learns both global and local point features, providing a simple, efficient and effective approach for a number of 3D recognition tasks.
它是一个统一的体系结构，可以学习全局和局部点特征，为许多三维识别任务提供了一个简单、高效和有效的方法。

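2.2 Summary

The authors start by reviewing what conventional network models require before designing a new one:
Convolutional architectures do not fit point clouds well. Convolution depends on reusing the kernel's weights across a regular grid, and a point cloud has no such regular structure, so convolution cannot be applied to it directly.

They then restate the drawbacks of converting to voxels or images: 1. the data becomes voluminous; 2. the conversion introduces quantization artifacts that obscure the data's natural invariances.

Next they introduce PointNet and make the following points:

1. Point clouds are simple and easy to learn from. My reading is that no hand-designed, specialized processing of the point data is required.
In other words, no technical conversion step is needed.
2. A point cloud is just a set, so the result must not depend on the order of its points, and the computation needs some symmetrization to guarantee that.
Strictly speaking this is a property of point sets that the network has to respect, not a feature of PointNet itself.
3. Invariance to rigid motions also has to be considered.
Again, this is a property of point sets rather than of PointNet.
4. PointNet handles three common tasks with one architecture: 1) object classification, one label for the whole point cloud; 2) part segmentation, a part label for every point of an object; 3) scene semantic parsing, a semantic label for every point of a scene.
These are the point-cloud counterparts of today's popular image recognition and semantic segmentation tasks, so I think this covers most practical needs.
5. The architecture is simple because, in the early stages, each point is processed identically and independently.
When I first read this I was quite confused. I guessed that some nonlinear operation must be involved, since a purely linear operation could be replaced by a single n×n linear layer, which would work better.
After reading the whole paper I understood that "independently" is meant literally: the early layers operate on one point's coordinates at a time and never involve other points. My first reading missed this.
6. In the basic setting each point is represented only by its (x, y, z) coordinates. Extra input dimensions, such as normals or other local or global features, are optional and have to be computed and supplied with the input; the network does not fill them in automatically.
Separately, the per-point MLP does widen each point's feature vector inside the network, which is what I first compared to the way convolution increases the channel count. When I implemented it, that comparison held up, though other approaches should work as well.
7. Through max pooling the network learns to pick out a sparse set of informative points and to encode why they were selected. The visualizations show that these key points roughly trace the skeleton of the object, and the same mechanism explains in theory why the network resists perturbation.
I read this as feature extraction, as in the shallow layers of an ordinary network, just done in an unusual way. I decided to read on before judging.
After finishing the paper I still do not fully understand why it works, but in my experiments the behavior is as described.
8. Fully connected layers at the end produce the output.
Not much to say here. A classification head almost always ends in fully connected layers (a fully convolutional network replaces them with 1×1 convolutions, which act as per-position fully connected layers).
9. The input supports rigid and affine transformations, which makes alignment straightforward.
My first reading was that we could preprocess poorly captured data with affine transformations, or use this property for data augmentation when real-world data is of poor quality, and I thought augmentation was the more important use.
When I reproduced the code I found that reading was wrong. What the paper actually does is train a small transformation network (T-Net) that predicts a transformation for each input, so that the network itself deals with rotation and affine changes.
10. The authors show theoretically that PointNet can approximate any continuous set function.
That is a strong claim. It suggests this approach is worth considering whenever the data is an unordered collection of discrete elements.
11. The network is much faster than the compared methods, with accuracy on par with or better than the state of the art.
I think it is fast mainly because the structure is simple and probably has fewer parameters. As for accuracy, the likely reason is the one the authors give: it does not discard information in the point cloud that we cannot observe directly.

Finally, the paper notes that processing unordered sets with neural networks is a general and fundamental problem and hopes the ideas carry over to other fields. I agree.

One more thing: the figure caption contains a phrase that surprised me, "learns both global and local point features". I put a question mark next to it, because as far as I know balancing local detail and global context is a major problem in its own right. Usually shallow layers are better at local information and deep layers at global information. I guessed that a skip connection was used, and that turned out to be essentially right: the global feature is concatenated back onto each point's local feature.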

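3. Related Work

Skipped for now.

4. Problem Statement

A lesson worth noting: when reading a paper in an unfamiliar field, go back and reread the earlier sections once you have finished the problem statement.

4.1 Sentence-by-sentence translation

Paragraph 1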
We design a deep learning framework that directly consumes unordered point sets as inputs.
我们设计了一个直接使用无序点集作为输入的深度学习框架。

A point cloud is represented as a set of 3D points {Pi| i = 1, …, n}, where each point Pi is a vector of its (x, y, z) coordinate plus extra feature channels such as color, normal etc.
点云表示为一组三维点{Pi| i = 1，…， n}，其中每个点Pi是其(x, y, z)坐标加上额外的特征通道，如颜色，法线等的一个向量。

For simplicity and clarity, unless otherwise noted, we only use the (x, y, z) coordinate as our point’s channels.
为了简单和清晰，除非另有说明，我们只使用(x, y, z)坐标作为点的通道。

Paragraph 2
For the object classification task, the input point cloud is either directly sampled from a shape or pre-segmented from a scene point cloud.
对于目标分类任务，输入点云要么直接从形状中采样，要么从场景点云中预先分割。
Our proposed deep network outputs k scores for all the k candidate classes.
我们提出的深度网络为所有的k个候选类输出k个分数

For semantic segmentation, the input can be a single object for part region segmentation, or a sub-volume from a 3D scene for object region segmentation.
对于语义分割，输入可以是用于部分区域分割的单个对象，也可以是用于对象区域分割的3D场景的一个子体。

Our model will output n × m scores for each of the n points and each of the m semantic sub-categories.
我们的模型将为这n个点和m个语义子类别输出n × m个分数。

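4.2 Summary

To make the experimental setup explicit:
**1. Input points:** every point is a 3D coordinate (x, y, z). Other channels such as color or normals can be added, but for simplicity only the coordinates are used.
**2. Input for object classification:** a point cloud sampled directly from a shape, or one segmented in advance from a scene point cloud.
**3. Input for semantic segmentation:** either a single object (for part segmentation) or a sub-volume of a 3D scene (for object region segmentation).
**4. Output for object classification:** k scores, one per candidate class. As I understand it, passing these scores through a softmax gives the class probabilities.
**5. Output for semantic segmentation:** n × m scores, that is, m scores for each of the n points. Each point's m scores can go through a softmax to give that point's probabilities over the m semantic sub-categories.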
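The output shapes described above can be sketched with random scores. The sizes n = 1024, k = 40, and m = 13 are assumptions made for illustration, and `softmax` is a small helper written here, not part of any library.

```python
import numpy as np


def softmax(scores: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


n, k, m = 1024, 40, 13                      # assumed sizes
rng = np.random.default_rng(0)

cls_scores = rng.normal(size=(k,))          # classification: k scores for the whole cloud
seg_scores = rng.normal(size=(n, m))        # segmentation: m scores for each of n points

cls_prob = softmax(cls_scores)              # probabilities over k classes
seg_prob = softmax(seg_scores, axis=1)      # per-point probabilities over m categories
point_labels = seg_prob.argmax(axis=1)      # one predicted category per point
```

The classification output is a single vector, while the segmentation output has one row per input point.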

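5. Deep Learning on Point Sets

This is where the paper gets into the implementation. To really understand it, I reimplemented the network first and then came back to read this part.

Figure

[Figure 2 of the paper (network architecture): image missing from the source.]
I will not explain the figure here. My other blog post covers it more fully: 基于Pytorch的PointNet复现 (a PyTorch reimplementation of PointNet).

5.0 Overview

The architecture of our network (Sec 4.2) is inspired by the properties of point sets in Rn (Sec 4.1).
The network architecture in Sec 4.2 of the paper is motivated by the point-set properties listed in Sec 4.1. (Section numbers inside quoted sentences are the paper's own, not this post's.)

5.1.1 Translation of Sec 4.1: properties of point sets

**Overview**
Our input is a subset of points from an Euclidean space.
It has three main properties:
The input is a subset of points from a Euclidean space, and it has three main properties.
First property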
• Unordered. Unlike pixel arrays in images or voxel arrays in volumetric grids, point cloud is a set of points without specific order. In other words, a network that consumes N 3D point sets needs to be invariant to N! permutations of the input set in data feeding order.
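**Unordered.** A point cloud has no inherent order, unlike the pixel array of an image or the voxel array of a volumetric grid. A network that takes N points as input must therefore give the same output for all N! orderings in which those points can be fed in.
(My reading: the data is a set and so has no order of its own, yet any input we feed in necessarily arrives in some order.)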
• Interaction among points. The points are from a space with a distance metric.
**点间的相互作用。**这些点来自一个有距离度量的空间。
It means that points are not isolated, and neighboring points form a meaningful subset.
这意味着点不是孤立的，相邻的点形成一个有意义的子集。
Therefore, the model needs to be able to capture local structures from nearby points, and the combinatorial interactions among local structures.
因此，模型需要能够从附近的点获取局部结构，以及局部结构之间的组合相互作用。
• Invariance under transformations. As a geometric object, the learned representation of the point set should be invariant to certain transformations.
**•变换下的不变性。**作为一个几何对象，学习到的点集表示对于某些变换应该是不变的。
For example, rotating and translating points all together should not modify the global point cloud category nor the segmentation of the points.
例如，将全部的点一起旋转和平移不应改变全局点云类别，也不应改变点云的分割。

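5.1.2 Summary of point-set properties

1. Unordered: a point set is a collection of points and has no inherent order, yet any input we feed in arrives in some order.
2. Interaction among points: each point is influenced by its neighbors, so the model has to combine a point with the points around it. This is roughly what convolution does, but points are not arranged in a regular structure, so ordinary convolution cannot be used.
3. Invariance under transformations: rotating or translating the whole point cloud should change neither its category nor the segmentation label of any point.

5.2 Translation of Sec 4.2: PointNet Architecture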

Our full network architecture is visualized in Fig 2, where the classification network and the segmentation network share a great portion of structures.
我们的整个网络架构如图2所示，其中分类网络和分割网络共享了很大一部分的结构。也就是我们上面的大图。
Please read the caption of Fig 2 for the pipeline.
Please read the caption of Fig 2 (the figure above) for the processing pipeline.
Our network has three key modules: the max pooling layer as a symmetric function to aggregate information from all the points, a local and global information combination structure, and two joint alignment networks that align both input points and point features.
我们的网络有三个关键模块:最大池化层作为一个对称函数来聚合所有点的信息，一个局部和全局的信息组合结构，以及两个联合对齐网络来对齐输入点和点特征。
We will discuss our reason behind these design choices in separate paragraphs below.
我们将在下面的单独段落中讨论这些设计选择背后的原因。

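5.2.2 Summary of the model structure

The paper introduces three components of the model:
1. A max pooling layer used as a symmetric function to aggregate information from all points. It is easy to spot in the architecture. Max pooling is the core of the whole model and yet the simplest structure in it, which I find interesting. There may be room for new work here.
2. A structure that combines local and global information (the part I described as a skip connection). The implementation confirms this. A nice detail is that after concatenating the two parts, the paper applies another group of fully connected layers, which I think lets training learn suitable weights for the concatenated features.
(I first took the "mlp" in the figure to be an ordinary linear layer. When reimplementing it in PyTorch I found that, with features stored as (batch, channels, points), the channel axis is dim=-2, while a linear layer acts on dim=-1. A convolution with kernel size 1 is therefore the convenient choice; transposing the tensor and using a linear layer is equivalent.)
[Figure: image missing from the source.]

3. Two joint alignment networks that align the input points and the point features (the T-Nets).

[Figure: image missing from the source.]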
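The relation between a shared per-point linear layer and a kernel-size-1 convolution can be checked numerically. The sketch below uses NumPy rather than PyTorch so that it stays self-contained; the layer sizes are assumptions, and the `einsum` line stands in for what a 1×1 convolution computes on a (channels, points) layout.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c_in, c_out = 6, 3, 8                     # assumed sizes
W = rng.normal(size=(c_out, c_in))           # one weight matrix shared by all points
b = rng.normal(size=(c_out,))

points_nc = rng.normal(size=(n, c_in))       # layout (points, channels)
points_cn = points_nc.T                      # layout (channels, points), as a 1D conv expects

# Linear layer acting on the last axis of the (points, channels) layout.
linear_out = points_nc @ W.T + b             # shape (n, c_out)

# Kernel-size-1 convolution on the (channels, points) layout:
# every point (column) is multiplied by the same W.
conv_out = np.einsum("oc,cn->on", W, points_cn) + b[:, None]   # shape (c_out, n)
```

The two results are transposes of each other, so the choice between them is about tensor layout, not about what is computed.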

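5.3 Symmetry Function for Unordered Input

5.3.1 Translation of the symmetric-function discussion

Figure used here: Fig 5
[Figure 5: image missing from the source.]
Figure 5. Three approaches to achieve order invariance.
Three ways to make the network invariant to input order.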
Multilayer perceptron (MLP) applied on points consists of 5 hidden layers with neuron sizes 64,64,64,128,1024, all points share a single copy of MLP. The MLP close to the output consists of two layers with sizes 512,256.
应用于点的多层感知器(MLP)由5个隐藏层组成，神经元大小分别为64、64、64、128、1024，所有点共享一个单一副本的MLP。靠近输出的MLP由两层组成，大小为512,256。
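In short: max pooling performs better than the other approaches.

This part discusses in detail how a symmetric function handles unordered point sets.
Paragraph 1
In order to make a model invariant to input permutation, three strategies exist:
There are three ways to make a model invariant to the order of its input: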

sort input into a canonical order;
将输入按照规范的顺序排列
treat the input as a sequence to train an RNN, but augment the training data by all kinds of permutations;
将输入作为一个序列来训练RNN，但通过各种排列来增强训练数据;
use a simple symmetric function to aggregate the information from each point. Here, a symmetric function takes n vectors as input and outputs a new vector that is invariant to the input order.
使用一个简单的对称函数来聚合每个点的信息。这里，一个对称函数取n个向量作为输入，然后输出一个新的向量，这个向量对输入顺序不变。
For example, + and ∗ operators are symmetric binary
functions.
例如，+和∗运算符是对称的二元函数。
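A symmetric function, as defined in the third strategy, can be demonstrated on a handful of random vectors. Sum and max are used here as examples; the array sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 4))     # n = 10 vectors of dimension 4
shuffled = vectors[::-1]               # the same set in a different order

# Symmetric aggregations: the result ignores the order of the rows.
sum_a, sum_b = vectors.sum(axis=0), shuffled.sum(axis=0)
max_a, max_b = vectors.max(axis=0), shuffled.max(axis=0)

# An order-sensitive operation for contrast: flattening keeps the row order.
flat_a, flat_b = vectors.reshape(-1), shuffled.reshape(-1)

print(np.allclose(sum_a, sum_b), np.array_equal(max_a, max_b), np.array_equal(flat_a, flat_b))
```

Any layer that consumes the flattened array sees a different input after reordering, while a layer that consumes the sum or the max does not.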

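Paragraph 2
While sorting sounds like a simple solution, in high dimensional space there in fact does not exist an ordering that is stable w.r.t. point perturbations in the general sense.

Sorting sounds like a simple fix, but in a high-dimensional space no ordering exists that stays stable under small perturbations of the points in general ("w.r.t." is short for "with respect to").

This can be easily shown by contradiction. If such an ordering strategy exists, it defines a bijection map between a high-dimensional space and a 1d real line.

This follows from a proof by contradiction: if such an ordering existed, it would define a bijection between the high-dimensional space and the 1D real line.

It is not hard to see, to require an ordering to be stable w.r.t point perturbations is equivalent to requiring that this map preserves spatial proximity as the dimension reduces, a task that cannot be achieved in the general case.
Requiring the ordering to be stable under point perturbations amounts to requiring that this map preserve spatial proximity while reducing the dimension, which cannot be achieved in general.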
Therefore, sorting does not fully resolve the ordering issue, and it’s hard for a network to learn a consistent mapping from input to output as the ordering issue persists.
因此，排序并不能完全解决排序问题，而且当排序问题持续存在时，网络很难学会从输入到输出的一致映射。
As shown in experiments (Fig 5), we find that applying a MLP directly on the sorted point set performs poorly, though slightly better than directly processing an unsorted input.
如图5所示，我们发现直接在排序的点集上应用MLP的性能较差，但略好于直接处理未排序的输入。
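The instability of sorting can be seen with two points and a tiny perturbation. The sketch below assumes a lexicographic sort by x, then y, then z as the "canonical order"; the paper does not prescribe a particular sort, and `lexsort_points` is a hypothetical helper.

```python
import numpy as np


def lexsort_points(points: np.ndarray) -> np.ndarray:
    """Sort rows by x, then y, then z (np.lexsort uses the last key as primary)."""
    order = np.lexsort((points[:, 2], points[:, 1], points[:, 0]))
    return points[order]


a = np.array([[0.5, 0.0, 0.0],
              [0.5 + 1e-6, 9.0, 9.0]])
b = a.copy()
b[1, 0] -= 2e-6                            # perturb one coordinate by two millionths

sorted_a = lexsort_points(a)
sorted_b = lexsort_points(b)

input_shift = np.abs(a - b).max()                  # how much the input moved
output_shift = np.abs(sorted_a - sorted_b).max()   # how much the sorted array moved

print(input_shift, output_shift)
```

The perturbation swaps the two rows, so a network reading the sorted array position by position sees a very different input even though the point set has barely changed.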
Paragraph 3
The idea to use RNN considers the point set as a sequential signal and hopes that by training the RNN with randomly permuted sequences, the RNN will become invariant to input order.
The RNN approach treats the point set as a sequence and hopes that training on randomly permuted sequences will make the RNN invariant to input order.
However in “OrderMatters” [22] the authors have shown that order does matter and cannot be totally omitted.
然而，在“OrderMatters”[22]中，作者已经证明了顺序是重要的，不能被完全忽略。
While RNN has relatively good robustness to input ordering for sequences with small length (dozens), it’s hard to scale to thousands of input elements, which is the common size for point sets.
RNN对小长度序列(几十个)的输入排序具有较好的鲁棒性，但难以伸缩到数千个输入元素，这是点集的常见规模。
Empirically, we have also shown that model based on RNN does not perform as well as our proposed method (Fig 5)
经验上，我们也证明了基于RNN的模型并没有我们所提出的方法表现的好。
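Paragraph 4
Our idea is to approximate a general function defined on a point set by applying a symmetric function on transformed elements in the set:
The idea is to approximate a general function on a point set by first transforming each element and then applying a symmetric function to the transformed elements. ("Applying" here simply means applying the function, not deforming it.)
The paper gives the following expression:
$$f(\{x_1, \dots, x_n\}) \approx g(h(x_1), \dots, h(x_n))$$
where $f$ is the function on point sets being approximated, $h$ transforms each point, and $g$ is a symmetric function.
Paragraph 5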
Empirically, our basic module is very simple: we approximate h by a multi-layer perceptron network and g by a composition of a single variable function and a max pooling function.
根据经验，我们的基本模块非常简单:我们用多层感知器网络近似h，用单变量函数和最大池化函数的组合近似g。
This is found to work well by experiments. Through a collection of h, we can learn a number of f’s to capture different properties of the set.
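Experiments confirm that this works. With a collection of different h functions, several f's can be learned, each capturing a different property of the set.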
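The composition described here, a shared MLP for h and max pooling followed by another MLP for g, can be sketched as a forward pass with random weights. The layer widths follow the Fig 5 caption (64, 64, 64, 128, 1024, then 512 and 256); the number of classes k = 40, the weight initialization, and the function names are assumptions, and nothing is trained.

```python
import numpy as np


def init_mlp(sizes, rng):
    """Random weights for a stack of dense layers (illustrative initialization)."""
    return [(rng.normal(scale=np.sqrt(2.0 / i), size=(i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]


def mlp(x, layers, final_relu=True):
    """Apply dense layers with ReLU; works on one vector or on rows of a matrix."""
    for idx, (W, b) in enumerate(layers):
        x = x @ W + b
        if final_relu or idx < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x


def pointnet_vanilla(points, h_layers, gamma_layers):
    per_point = mlp(points, h_layers)              # h: same MLP applied to every point
    global_feature = per_point.max(axis=0)         # symmetric max pooling over points
    return mlp(global_feature, gamma_layers, final_relu=False)   # class scores


rng = np.random.default_rng(0)
k = 40                                             # assumed number of classes
h_layers = init_mlp([3, 64, 64, 64, 128, 1024], rng)
gamma_layers = init_mlp([1024, 512, 256, k], rng)

points = rng.normal(size=(256, 3))
scores = pointnet_vanilla(points, h_layers, gamma_layers)
scores_reordered = pointnet_vanilla(points[::-1], h_layers, gamma_layers)
```

Feeding the same points in reverse order yields the same scores up to floating-point rounding, because the only operation that mixes points is the channel-wise maximum.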

Paragraph 6
While our key module seems simple, it has interesting properties (see Sec 5.3) and can achieve strong performace (see Sec 5.1) in a few different applications.
虽然我们的key模块看起来很简单，但它具有有趣的属性(见章节5.3)，并且可以在一些不同的应用程序中实现强大的性能(见章节5.1)。
Due to the simplicity of our module, we are also able to provide theoretical analysis as in Sec 4.3.
由于我们的模块很简单，我们也可以像4.3节那样提供理论分析。

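5.3.2 Summary of the symmetric-function discussion

The paper considers three approaches; the last one works best.

1) Sort the points into a fixed order. Is that a good idea, and is it even achievable?
From a deep learning point of view: this is hard to learn and not what we want.
My own reading is that sorting imposes an order on something that has none, and once positions carry meaning, each position needs its own parameters, which increases the parameter count.
The paper backs this point with an experiment (Fig 5) in addition to the argument. That is something to learn from: when an argument is hard to make fully convincing on its own, an experiment can fill the gap.
From a geometric point of view: strictly speaking, the points have no well-behaved order.
The paper's argument is a proof by contradiction: a sorting strategy that is stable under small perturbations would define a map between the high-dimensional space and the 1D real line that preserves spatial proximity, and such a map does not exist in general.
The obstacle is not the number of points. 3D space and the real line have the same cardinality, so one-to-one maps between them exist. What cannot be achieved is a one-to-one map that also keeps nearby points nearby, so a tiny perturbation can move a point far away in the sorted order.
2) Treat the input as a sequence for an RNN and augment the training data with permutations. The number of orderings grows factorially, so the computation required climbs steeply, and the extra training extracts no new information; it only pushes the RNN toward treating all orderings alike.
There are two further problems:
1. An RNN can hardly become completely insensitive to input order.
2. An RNN is reasonably robust to ordering only for short sequences (a few dozen elements); it does not scale to the thousands of elements in a typical point set.
3) Use a symmetric function. The paper goes on to justify this choice theoretically, showing that a symmetric function combined with multi-layer perceptrons can approximate any continuous function on sets.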
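The factorial growth mentioned under approach 2 is easy to quantify with the standard library. The point counts below are assumptions chosen to span "a few dozen" to "about a thousand", and `num_orderings` is a hypothetical helper name.

```python
import math


def num_orderings(n: int) -> int:
    """Number of distinct orderings of n distinct points."""
    return math.factorial(n)


for n in (3, 10, 20):
    print(n, num_orderings(n))

# For a cloud of about a thousand points the count is astronomically large;
# only its number of decimal digits is printed here.
digits_for_1024 = len(str(num_orderings(1024)))
print("digits in 1024!:", digits_for_1024)
```

Even ten points already have millions of orderings, so augmenting with every permutation is out of reach for realistic point sets.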

5.4 Local and Global Information Aggregation

5.4.1 Translation

Paragraph 1
The output from the above section forms a vector [f1, . . . , fK], which is a global signature of the input set.
The output of the previous section is a vector [f1, …, fK] that serves as a global signature of the input set. In other words, the function described above ends up producing one global feature.
We can easily train a SVM or multi-layer perceptron classifier on the shape global features for classification.
我们可以很容易地训练一个支持向量机或多层感知器分类器对形状的全局特征进行分类。
However, point segmentation requires a combination of local and global knowledge. We can achieve this by a simple yet highly effective manner.
然而，点语义分割需要结合局部知识和全局知识。我们可以通过一种简单而高效的方式来实现这一目标。
Paragraph 2
Our solution can be seen in Fig 2 (Segmentation Network).
我们的解决方案如图2 (Segmentation Network)所示。
After computing the global point cloud feature vector, we feed it back to per point features by concatenating the global feature with each of the point features.
在计算出全局点云特征向量后，通过将全局特征与每个点云特征连接起来，将其反馈到每个点云特征。
Then we extract new per point features based on the combined point features - this time the per point feature is aware of both the local and global information.
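New per-point features are then extracted from the combined (local plus global) point features, so each point's feature now carries both local and global information.
Paragraph 3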
With this modification our network is able to predict per point quantities that rely on both local geometry and global semantics.
With this change the network can predict per-point quantities that depend on both local geometry and global semantics.
For example we can accurately predict per-point normals (fig in supplementary), validating that the network is able to summarize information from the point’s local neighborhood.
例如，我们可以准确地预测每个点的法线(图在补充)，验证网络能够从点的局部邻域总结信息。
In experiment session, we also show that our model can achieve state-of-the-art performance on shape part segmentation and scene segmentation.
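In the experiments section the authors also show that the model reaches state-of-the-art performance on shape part segmentation and scene segmentation.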

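5.4.2 Summary of local and global information

In short, the mechanism here is simple rather than novel:
the two kinds of information are fused by concatenating the global feature onto every per-point feature, much like a conventional skip connection.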
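The fusion step can be sketched as a concatenation of arrays. The feature widths (64 for the local per-point features and 1024 for the global feature) are assumptions taken from the layer sizes the paper reports, and the features themselves are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024                                             # assumed number of points

local_features = rng.normal(size=(n, 64))            # stand-in early per-point features
deep_features = rng.normal(size=(n, 1024))           # stand-in per-point features before pooling
global_feature = deep_features.max(axis=0)           # one vector for the whole cloud

# Repeat the global vector for every point and append it to that point's local feature.
tiled = np.broadcast_to(global_feature, (n, global_feature.shape[0]))
combined = np.concatenate([local_features, tiled], axis=1)

print(combined.shape)
```

Every row now carries the point's own feature followed by an identical copy of the global vector, which is what lets later per-point layers use both.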

5.5 Joint Alignment Network (mainly about invariance to rotation and other geometric transformations)

5.5.1 Translation

Paragraph 1
The semantic labeling of a point cloud has to be invariant if the point cloud undergoes certain geometric transformations, such as rigid transformation.
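If a point cloud undergoes certain geometric transformations, such as a rigid transformation, its semantic labels must stay the same. (That is, the segmentation does not change however the cloud is rotated or moved.)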
We therefore expect that the learnt representation by our point set is invariant to these transformations.
因此，我们期望我们的点集所学习的表示对于这些变换是不变的。

Paragraph 2
A natural solution is to align all input set to a canonical space before feature extraction. Jaderberg et al. [9] introduces the idea of spatial transformer to align 2D images through sampling and interpolation, achieved by a specifically tailored layer implemented on GPU.
一个自然的解决方案是在特征提取之前将所有的输入集对齐到一个规范空间。Jaderberg等人的[9]引入了空间变换的思想，通过采样和插值来对齐2D图像，通过在GPU上实现专门定制的层来实现。

Paragraph 3
Our input form of point clouds allows us to achieve this goal in a much simpler way compared with [9].
我们的点云输入形式允许我们以比[9]简单得多的方式实现这一目标。
We do not need to invent any new layers and no alias is introduced as in the image case.
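No new layer has to be invented, and no aliasing is introduced as it is in the image case. ("Alias" here refers to the aliasing artifact of signal sampling, not an alternative name.)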
We predict an affine（仿射） transformation matrix by a mini-network (T-net in Fig 2) and directly apply this transformation to the coordinates（坐标） of input points.
我们通过一个微型网络(图2中的T-net)预测一个仿射变换矩阵，并直接将这个变换应用到输入点的坐标上。
The mininetwork itself resembles the big network and is composed by basic modules of point independent feature extraction, max pooling and fully connected layers.
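The mini-network resembles the main network: it is built from the same basic modules of point-independent feature extraction, max pooling, and fully connected layers. (Apart from its specific job of predicting the transformation, its design mirrors the overall network.)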
More details about the T-net are in the supplementary.
更多关于T-net的细节在补充中。

Paragraph 4
This idea can be further extended to the alignment of feature space, as well. We can insert another alignment network on point features and predict a feature transformation matrix to align features from different input point clouds. However, transformation matrix in the feature space has much higher dimension than the spatial transform matrix, which greatly increases the difficulty of optimization. We therefore add a regularization term to our softmax training loss. We constrain the feature transformation matrix to be close to orthogonal matrix:
$$L_{reg} = \| I - A A^{T} \|_F^2$$

Paragraph 5
where A is the feature alignment matrix predicted by a mini-network. An orthogonal transformation will not lose information in the input, thus is desired. We find that by adding the regularization term, the optimization becomes more stable and our model achieves better performance.
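The orthogonality constraint on A can be written as a squared Frobenius norm of I minus A times its transpose. The sketch below evaluates that penalty on an exactly orthogonal matrix and on a scaled one; the 64×64 size and the function name `orthogonality_penalty` are assumptions made for illustration.

```python
import numpy as np


def orthogonality_penalty(A: np.ndarray) -> float:
    """Squared Frobenius norm of (I - A A^T); zero when A is orthogonal."""
    k = A.shape[0]
    diff = np.eye(k) - A @ A.T
    return float(np.sum(diff ** 2))


rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # Q is orthogonal up to rounding

penalty_orthogonal = orthogonality_penalty(Q)        # close to zero
penalty_scaled = orthogonality_penalty(2.0 * Q)      # (2Q)(2Q)^T = 4I, so I - 4I = -3I

print(penalty_orthogonal, penalty_scaled)
```

For the scaled matrix the penalty is 9 per diagonal entry, which gives 576 for a 64×64 matrix; during training this term would be added to the classification loss with some weight.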

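5.5.2 Summary

This section raises the issue of transformation invariance: classification and semantic labels do not change when the object is rotated, so the network's learned representation should be invariant to such transformations too.
It cites an earlier method, the spatial transformer of Jaderberg et al. [9], which aligns 2D images through sampling and interpolation inside the network. I had not read that paper, so at first I assumed it was a hand-designed preprocessing step that went against the spirit of deep learning and would not work especially well. In fact it is a learned module, and PointNet applies the same idea to point clouds in a simpler form.
Later I watched the author's talk on the paper, which gave me a better picture of what earlier 3D approaches did. They mainly fall into two groups:

1. Map the 3D point set into 2D so that familiar 2D operations such as convolution can be used. This mapping clearly loses much of the original structure. I see two issues with it:
First, the one-to-one issue I discussed earlier: a projection from 3D to 2D sends many 3D points to the same 2D location, so depth and occluded geometry are lost.
Second, to really use image-processing methods, the isolated points of the cloud have to be turned into a continuous image.
2. Extract features from 3D space by hand. This needs little explanation: a guiding principle of deep learning is to avoid manual operations wherever possible.

How this paper does it: the T-Net predicts a transformation matrix that is applied to the points by matrix multiplication. Its other building blocks (pooling and MLP) follow the same design as the main network.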
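A T-Net of this kind can be sketched as a tiny network that maps a point cloud to a 3×3 matrix. This is a simplified illustration, not the paper's layer configuration: the hidden width of 32, the single-layer per-point feature extractor, and the choice to add the identity matrix to the output (so that zero weights give an identity transform) are all assumptions.

```python
import numpy as np


def tnet_matrix(points: np.ndarray, W_point: np.ndarray, W_fc: np.ndarray) -> np.ndarray:
    """Predict a 3x3 transformation from an (n, 3) point cloud."""
    per_point = np.maximum(points @ W_point, 0.0)    # point-independent feature extraction
    pooled = per_point.max(axis=0)                   # max pooling over points
    return (pooled @ W_fc).reshape(3, 3) + np.eye(3) # fully connected output, offset by identity


rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))
W_point = rng.normal(size=(3, 32))

# With a zero-initialized output layer the predicted transform is the identity.
T_identity = tnet_matrix(points, W_point, np.zeros((32, 9)))
aligned = points @ T_identity

# With non-zero weights the prediction depends on the data but not on point order.
W_fc = rng.normal(scale=0.01, size=(32, 9))
T_a = tnet_matrix(points, W_point, W_fc)
T_b = tnet_matrix(points[::-1], W_point, W_fc)
```

Because the matrix is computed through max pooling, it is the same for any ordering of the input points, and applying it is a single matrix multiplication on the coordinates.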

5.6 Theoretical Analysis

I will discuss this in a separate blog post. If you only want to use the network, you do not really need to read the theory.
