Basics: PointNet
PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space
[arXiv version] [Code and Data (GitHub)]
Abstract
Few prior works study deep learning on point sets. PointNet by Qi et al. is a pioneer in this direction. However, by design PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes. In this work, we introduce a hierarchical neural network that applies PointNet recursively on a nested partitioning of the input point set. By exploiting metric space distances, our network is able to learn local features with increasing contextual scales. With further observation that point sets are usually sampled with varying densities, which results in greatly decreased performance for networks trained on uniform densities, we propose novel set learning layers to adaptively combine features from multiple scales. Experiments show that our network called PointNet++ is able to learn deep point set features efficiently and robustly. In particular, results significantly better than state-of-the-art have been obtained on challenging benchmarks of 3D point clouds.
In mathematics, a metric space is a non-empty set together with a metric on the set. The metric is a function that defines a concept of distance between any two members of the set, which are usually called points. The metric satisfies the following properties:
- the distance from A to B is zero if and only if A and B are the same point,
- the distance between two distinct points is positive,
- the distance from A to B is the same as the distance from B to A, and
- the distance from A to B is less than or equal to the distance from A to B via any third point C.
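In symbols, for a metric d on a non-empty set X and any points x, y, z in X:

d(x, y) = 0  ⟺  x = y          (identity of indiscernibles)
d(x, y) > 0  for x ≠ y          (positivity)
d(x, y) = d(y, x)               (symmetry)
d(x, z) ≤ d(x, y) + d(y, z)     (triangle inequality)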
A metric on a space induces topological properties like open and closed sets, which lead to the study of more abstract topological spaces.
Figure 1. PointNet++ Architecture for Point Set Segmentation and Classification. We introduce a type of novel neural network, named as PointNet++, to process a set of points sampled in a metric space in a hierarchical fashion (2D points in Euclidean space are used for this illustration). The general idea of PointNet++ is simple. We first partition the set of points into overlapping local regions by the distance metric of the underlying space. Similar to CNNs, we extract local features capturing fine geometric structures from small neighborhoods; such local features are further grouped into larger units and processed to produce higher level features. This process is repeated until we obtain the features of the whole point set.
Figure 2. Visualization of Point Cloud Patterns Learned from PointNet++. Patterns learnt from 20 (out of the 1,024) neurons in the first level are shown. We visualize the point cloud patterns learnt by searching for point clouds (in the unit sphere) that activate the neurons the most. Since the model is trained for ModelNet40 shape classification, which contains mostly furniture, we see clear structures of planes, double planes, lines, corners, etc. Color indicates point depth (red is near, blue is far).
Quick View
https://medium.com/@sanketgujar95/https-medium-com-sanketgujar95-pointnetplus-5d2642560c0d
Breakdown
1 motivation
As the name suggests, PointNet++ is an improved, iterated version of PointNet.
For an introduction to PointNet, see my earlier article:
刘昕宸: 细嚼慢咽读论文: 点云特征学习开天辟地PointNet ("Chewing through the paper: PointNet, the groundbreaking work on point cloud feature learning", Zhihu article)
PointNet does not capture local structures induced by the metric space points live in, limiting its ability to recognize fine-grained patterns and generalizability to complex scenes.
Because PointNet uses only point-wise MLPs and max pooling, it has no mechanism for capturing local structure, so its ability to handle fine details and to generalize to complex scenes is limited.
My summary of PointNet's problems:
- The point-wise MLP only encodes each point individually; its ability to aggregate local structural information is too weak. --> PointNet++'s fix: sampling and grouping to aggregate local neighborhoods.
- The global feature is obtained directly by max pooling, which causes a huge loss of information for both classification and segmentation. --> PointNet++'s fix: a hierarchical feature learning framework that downsamples level by level through multiple set abstraction modules, producing local-global features at different scales and levels.
- For segmentation, the global feature is simply copied and concatenated with the local features, which limits how discriminative the resulting features can be. --> PointNet++'s fix: an encoder-decoder structure for segmentation that first downsamples and then upsamples, using skip link concatenation to combine the local-global features of corresponding levels.
2 solution
Overall, PointNet++ is an encoder-decoder network.
The encoder is a downsampling process: multiple set abstraction levels downsample hierarchically, producing point-wise features at different scales; the output of the last set abstraction level can be regarded as the global feature. Each set abstraction level consists of three modules: sampling, grouping, and a PointNet layer.
The decoder differs between classification and segmentation. The classification decoder is simple and needs little discussion. The segmentation decoder is an upsampling process: through inverse-distance interpolation and skip link concatenation, it upsamples while still obtaining local+global point-wise features, making the final representation discriminative.
==> "skip link concatenation", the paper's author never used "skip connection"; might be a mistake by the article, whose author is clearly not an English speaker;
===> the paper use the term in the sense: "The interpolated features on N_l-1 points are then concatenated with skip linked point features from the set abstraction level." so "skip connection" is a bad expression, be aware and mentally replace it with the correct "skip link concatenation".
Before reading further, it is best to keep two questions in mind:
- How does PointNet++ implement its downsampling process? That is, how does it build the global feature? (Focus on set abstraction: the sampling layer, grouping layer, and PointNet layer.)
- How does PointNet++ implement the upsampling process for segmentation? That is, how does it build point-wise features for segmentation? (Focus on inverse-distance interpolation and skip link concatenation.)
Below I walk through the code to explain in detail how PointNet++ performs a forward pass (i.e., what the network actually computes), which is essential for understanding the design.
Notation: d denotes the dimension of the coordinate space, C the dimension of the feature space.
2.1 encoder
On top of PointNet, the encoder adds a hierarchical feature learning framework composed of set abstraction levels.
At each set abstraction level, the point set is processed and abstracted to produce a smaller point set; this can be understood as a downsample-and-encode step (see the left half of Figure 1).
A set abstraction level consists of three parts (code below):
def pointnet_sa_module(xyz, points, npoint, radius, nsample, mlp, mlp2, group_all, is_training, bn_decay, scope, bn=True, pooling='max', knn=False, use_xyz=True, use_nchw=False):
    ''' PointNet Set Abstraction (SA) Module
        Input:
            xyz: (batch_size, ndataset, 3) TF tensor
            points: (batch_size, ndataset, channel) TF tensor
            npoint: int32 -- #points sampled in farthest point sampling
            radius: float32 -- search radius in local region
            nsample: int32 -- how many points in each local region
            mlp: list of int32 -- output size for MLP on each point
            mlp2: list of int32 -- output size for MLP on each region
            group_all: bool -- group all points into one PC if set true, OVERRIDES
                npoint, radius and nsample settings
            use_xyz: bool, if True concat XYZ with local point features, otherwise just use point features
            use_nchw: bool, if True, use NCHW data format for conv2d, which is usually faster than NHWC format
        Return:
            new_xyz: (batch_size, npoint, 3) TF tensor
            new_points: (batch_size, npoint, mlp[-1] or mlp2[-1]) TF tensor
            idx: (batch_size, npoint, nsample) int32 -- indices for local regions
    '''
    data_format = 'NCHW' if use_nchw else 'NHWC'
    with tf.variable_scope(scope) as sc:
        # Sampling and Grouping
        if group_all:
            nsample = xyz.get_shape()[1].value
            new_xyz, new_points, idx, grouped_xyz = sample_and_group_all(xyz, points, use_xyz)
        else:
            new_xyz, new_points, idx, grouped_xyz = sample_and_group(npoint, radius, nsample, xyz, points, knn, use_xyz)

        # Point Feature Embedding
        if use_nchw: new_points = tf.transpose(new_points, [0,3,1,2])
        for i, num_out_channel in enumerate(mlp):
            new_points = tf_util.conv2d(new_points, num_out_channel, [1,1],
                                        padding='VALID', stride=[1,1],
                                        bn=bn, is_training=is_training,
                                        scope='conv%d'%(i), bn_decay=bn_decay,
                                        data_format=data_format)
        if use_nchw: new_points = tf.transpose(new_points, [0,2,3,1])

        # Pooling in Local Regions
        if pooling=='max':
            new_points = tf.reduce_max(new_points, axis=[2], keep_dims=True, name='maxpool')
        elif pooling=='avg':
            new_points = tf.reduce_mean(new_points, axis=[2], keep_dims=True, name='avgpool')
        elif pooling=='weighted_avg':
            with tf.variable_scope('weighted_avg'):
                dists = tf.norm(grouped_xyz, axis=-1, ord=2, keep_dims=True)
                exp_dists = tf.exp(-dists * 5)
                weights = exp_dists / tf.reduce_sum(exp_dists, axis=2, keep_dims=True) # (batch_size, npoint, nsample, 1)
                new_points *= weights # (batch_size, npoint, nsample, mlp[-1])
                new_points = tf.reduce_sum(new_points, axis=2, keep_dims=True)
        elif pooling=='max_and_avg':
            max_points = tf.reduce_max(new_points, axis=[2], keep_dims=True, name='maxpool')
            avg_points = tf.reduce_mean(new_points, axis=[2], keep_dims=True, name='avgpool')
            new_points = tf.concat([avg_points, max_points], axis=-1)

        # [Optional] Further Processing
        if mlp2 is not None:
            if use_nchw: new_points = tf.transpose(new_points, [0,3,1,2])
            for i, num_out_channel in enumerate(mlp2):
                new_points = tf_util.conv2d(new_points, num_out_channel, [1,1],
                                            padding='VALID', stride=[1,1],
                                            bn=bn, is_training=is_training,
                                            scope='conv_post_%d'%(i), bn_decay=bn_decay,
                                            data_format=data_format)
            if use_nchw: new_points = tf.transpose(new_points, [0,2,3,1])

        new_points = tf.squeeze(new_points, [2]) # (batch_size, npoints, mlp2[-1])
        return new_xyz, new_points, idx
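To see how set abstraction levels compose, here is a hedged sketch of stacking three SA levels into a classification encoder; the npoint/radius/nsample/mlp values are illustrative assumptions in the spirit of the paper's single-scale config, not quoted from the repo:

# Three stacked SA levels (SSG style); parameter values are illustrative assumptions.
l1_xyz, l1_points, _ = pointnet_sa_module(xyz, None, npoint=512, radius=0.2, nsample=32,
    mlp=[64, 64, 128], mlp2=None, group_all=False,
    is_training=is_training, bn_decay=bn_decay, scope='sa1')
l2_xyz, l2_points, _ = pointnet_sa_module(l1_xyz, l1_points, npoint=128, radius=0.4, nsample=64,
    mlp=[128, 128, 256], mlp2=None, group_all=False,
    is_training=is_training, bn_decay=bn_decay, scope='sa2')
# The last level groups all remaining points: its output is the global feature.
l3_xyz, l3_points, _ = pointnet_sa_module(l2_xyz, l2_points, npoint=None, radius=None, nsample=None,
    mlp=[256, 512, 1024], mlp2=None, group_all=True,
    is_training=is_training, bn_decay=bn_decay, scope='sa3')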
2.1.1 sampling layer
The sampling layer selects well-spread centroids with iterative farthest point sampling (FPS). A NumPy reference implementation (the repo itself uses a CUDA op):

import numpy as np

class FarthestSampler:
    def __init__(self):
        pass

    def _calc_distances(self, p0, points):
        # squared Euclidean distance from p0 to every point
        return ((p0 - points) ** 2).sum(axis=1)

    def __call__(self, pts, k):
        farthest_pts = np.zeros((k, 3), dtype=np.float32)
        # start from a random point
        farthest_pts[0] = pts[np.random.randint(len(pts))]
        distances = self._calc_distances(farthest_pts[0], pts)
        for i in range(1, k):
            # greedily pick the point farthest from everything chosen so far
            farthest_pts[i] = pts[np.argmax(distances)]
            distances = np.minimum(
                distances, self._calc_distances(farthest_pts[i], pts))
        return farthest_pts
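A quick usage sketch of this reference implementation (the toy point cloud is an assumption for illustration):

import numpy as np

sampler = FarthestSampler()
pts = np.random.rand(1024, 3).astype(np.float32)  # toy point cloud with 1024 points
centroids = sampler(pts, 64)                      # 64 well-spread seed points
print(centroids.shape)                            # (64, 3)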
The concrete implementations of sampling and grouping are written in a single function:
def sample_and_group(npoint, radius, nsample, xyz, points, knn=False, use_xyz=True):
    '''
    Input:
        npoint: int32
        radius: float32
        nsample: int32
        xyz: (batch_size, ndataset, 3) TF tensor
        points: (batch_size, ndataset, channel) TF tensor, if None will just use xyz as points
        knn: bool, if True use kNN instead of radius search
        use_xyz: bool, if True concat XYZ with local point features, otherwise just use point features
    Output:
        new_xyz: (batch_size, npoint, 3) TF tensor
        new_points: (batch_size, npoint, nsample, 3+channel) TF tensor
        idx: (batch_size, npoint, nsample) TF tensor, indices of local points as in ndataset points
        grouped_xyz: (batch_size, npoint, nsample, 3) TF tensor, normalized point XYZs
            (subtracted by seed point XYZ) in local regions
    '''
    new_xyz = gather_point(xyz, farthest_point_sample(npoint, xyz)) # (batch_size, npoint, 3)
    if knn:
        _, idx = knn_point(nsample, xyz, new_xyz)
    else:
        idx, pts_cnt = query_ball_point(radius, nsample, xyz, new_xyz)
    grouped_xyz = group_point(xyz, idx) # (batch_size, npoint, nsample, 3)
    grouped_xyz -= tf.tile(tf.expand_dims(new_xyz, 2), [1,1,nsample,1]) # translation normalization
    if points is not None:
        grouped_points = group_point(points, idx) # (batch_size, npoint, nsample, channel)
        if use_xyz:
            new_points = tf.concat([grouped_xyz, grouped_points], axis=-1) # (batch_size, npoint, nsample, 3+channel)
        else:
            new_points = grouped_points
    else:
        new_points = grouped_xyz
    return new_xyz, new_points, idx, grouped_xyz
The sampling step corresponds to this line:
new_xyz = gather_point(xyz, farthest_point_sample(npoint, xyz)) # (batch_size, npoint, 3)
def farthest_point_sample(npoint, inp):
    '''
    Input:
        npoint: int32
        inp: batch_size * ndataset * 3, float32
    Returns:
        batch_size * npoint, int32 -- indices of the sampled points
    '''
    return sampling_module.farthest_point_sample(inp, npoint)
gather_point converts the indices produced above into the actual sampled points:
def gather_point(inp, idx):
    '''
    Input:
        inp: batch_size * ndataset * 3, float32
        idx: batch_size * npoints, int32
    Returns:
        batch_size * npoints * 3, float32
    '''
    return sampling_module.gather_point(inp, idx)
The lower-level implementations of these two functions are CUDA kernels; I will analyze them some other time.
2.1.2 grouping layer
if knn:
    _, idx = knn_point(nsample, xyz, new_xyz)
else:
    idx, pts_cnt = query_ball_point(radius, nsample, xyz, new_xyz)
grouped_xyz = group_point(xyz, idx) # (batch_size, npoint, nsample, 3)
Two things to note:
1) Neighborhood search is also performed in coordinate space (the inputs and outputs of the code above have only the d coordinate dimensions, not C; the C feature dimensions are concatenated later), not in feature space.
2) There are two ways to find the neighborhood: kNN and ball query (query_ball_point).
kNN is the familiar k-nearest-neighbor search: take the K points closest in coordinate space.
Ball query fixes a radius and takes the points inside that ball as neighbors.
One question remains: how does ball query guarantee that every local neighborhood has the same number of sampled points?
In fact, if the ball contains more than K points, the first K are taken as the local neighborhood; if it contains fewer, some point is resampled (duplicated) to pad the neighborhood up to size K.
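To make the padding behavior concrete, here is a minimal NumPy sketch of the ball-query logic; the repo's real query_ball_point is a CUDA op, and this function (query_ball_point_np, including its empty-ball fallback) is an illustrative assumption:

import numpy as np

def query_ball_point_np(radius, nsample, xyz, new_xyz):
    '''Sketch of ball query for a single (unbatched) point cloud.
    xyz:     (N, 3) all points; new_xyz: (S, 3) query centroids.
    Returns idx: (S, nsample) neighbor indices per centroid.
    '''
    # squared distance from every centroid to every point
    sqrdists = ((new_xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)  # (S, N)
    idx = np.zeros((new_xyz.shape[0], nsample), dtype=np.int64)
    for i, d in enumerate(sqrdists):
        inside = np.where(d < radius ** 2)[0]          # points inside the ball
        if len(inside) == 0:                           # assumed fallback: nearest point
            inside = np.array([d.argmin()])
        if len(inside) >= nsample:
            idx[i] = inside[:nsample]                  # more than K points: take the first K
        else:                                          # fewer than K: pad by resampling a point
            pad = np.full(nsample - len(inside), inside[0])
            idx[i] = np.concatenate([inside, pad])
    return idx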
The difference between kNN and ball query (quoting the paper): "Compared with kNN, ball query's local neighborhood guarantees a fixed region scale thus making local region feature more generalizable across space, which is preferred for tasks requiring local pattern recognition (e.g. semantic point labeling)." In other words, ball query is better suited to tasks that depend on local/fine-grained pattern recognition, such as segmentation.
The supplementary material also contains experiments comparing kNN and ball query.
The remaining part of the sample_and_group code:
The sample and group operations work in coordinate space, so if feature-space information (point-wise features) is also available, it is concatenated with the coordinates here to form the new point-wise features, which are then fed into the following PointNet layer for feature learning.
if points is not None:
    grouped_points = group_point(points, idx) # (batch_size, npoint, nsample, channel)
    if use_xyz:
        new_points = tf.concat([grouped_xyz, grouped_points], axis=-1) # (batch_size, npoint, nsample, 3+channel)
    else:
        new_points = grouped_points
else:
    new_points = grouped_xyz
2.1.3 PointNet layer
"表征" ==> extract features
The code below has three main parts:
1) Point Feature Embedding
2) Pooling in Local Regions
3) [Optional] Further Processing
For the first part, Point Feature Embedding: a stack of shared fully connected layers (implemented as 1x1 conv2d) is applied to every point of every local region, i.e. a per-point MLP as in PointNet.
For the second part, Pooling in Local Regions: a symmetric pooling function aggregates the points of each local region into a single regional feature vector; max pooling is the default, with average, distance-weighted average, and max+avg as alternatives.
For the third part, [Optional] Further Processing: if mlp2 is given, another MLP stack is applied to the pooled regional features.
# Point Feature Embedding
if use_nchw: new_points = tf.transpose(new_points, [0,3,1,2])
for i, num_out_channel in enumerate(mlp):
    new_points = tf_util.conv2d(new_points, num_out_channel, [1,1],
                                padding='VALID', stride=[1,1],
                                bn=bn, is_training=is_training,
                                scope='conv%d'%(i), bn_decay=bn_decay,
                                data_format=data_format)
if use_nchw: new_points = tf.transpose(new_points, [0,2,3,1])

# Pooling in Local Regions
if pooling=='max':
    new_points = tf.reduce_max(new_points, axis=[2], keep_dims=True, name='maxpool')
elif pooling=='avg':
    new_points = tf.reduce_mean(new_points, axis=[2], keep_dims=True, name='avgpool')
elif pooling=='weighted_avg':
    with tf.variable_scope('weighted_avg'):
        dists = tf.norm(grouped_xyz, axis=-1, ord=2, keep_dims=True)
        exp_dists = tf.exp(-dists * 5)
        weights = exp_dists / tf.reduce_sum(exp_dists, axis=2, keep_dims=True) # (batch_size, npoint, nsample, 1)
        new_points *= weights # (batch_size, npoint, nsample, mlp[-1])
        new_points = tf.reduce_sum(new_points, axis=2, keep_dims=True)
elif pooling=='max_and_avg':
    max_points = tf.reduce_max(new_points, axis=[2], keep_dims=True, name='maxpool')
    avg_points = tf.reduce_mean(new_points, axis=[2], keep_dims=True, name='avgpool')
    new_points = tf.concat([avg_points, max_points], axis=-1)

# [Optional] Further Processing
if mlp2 is not None:
    if use_nchw: new_points = tf.transpose(new_points, [0,3,1,2])
    for i, num_out_channel in enumerate(mlp2):
        new_points = tf_util.conv2d(new_points, num_out_channel, [1,1],
                                    padding='VALID', stride=[1,1],
                                    bn=bn, is_training=is_training,
                                    scope='conv_post_%d'%(i), bn_decay=bn_decay,
                                    data_format=data_format)
    if use_nchw: new_points = tf.transpose(new_points, [0,2,3,1])
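To make the tensor shapes concrete, here is a hypothetical shape trace through one SA level; the batch size and parameter values are assumptions for illustration only:

# Hypothetical shape trace (batch_size=32, ndataset=1024, npoint=512, nsample=32, mlp=[64,64,128])
# xyz:            (32, 1024, 3)        input coordinates
# new_xyz:        (32, 512, 3)         FPS centroids (sampling layer)
# new_points in:  (32, 512, 32, 3+C)   grouped, centered neighborhoods (grouping layer)
# after mlp:      (32, 512, 32, 128)   per-point embeddings from the 1x1 conv2d stack
# after pooling:  (32, 512, 1, 128)    max over the 32 neighbors (axis=2)
# after squeeze:  (32, 512, 128)       regional features returned as new_points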
2.1.4 one more problem with the encoder
PointNet++ essentially builds features of local neighborhoods.
This raises a challenge: non-uniform sampling density. Training on local neighborhoods of sparse point clouds may fail to capture the local structure well.
PointNet++'s answer: learn to combine features from regions of different scales when the input sampling density changes.
The paper therefore proposes two schemes:
1) Multi-scale grouping (MSG)
For each centroid at the current level, run ball query with several different radii. This yields multiple concentric balls, i.e. several local neighborhoods with the same center but different scales. Features are extracted from each neighborhood separately and all of them are concatenated, as shown in the figure above.
At the code level this simply adds a loop over radius_list, processes each scale, and concatenates at the end:
new_xyz = gather_point(xyz, farthest_point_sample(npoint, xyz))
new_points_list = []
for i in range(len(radius_list)):
    radius = radius_list[i]
    nsample = nsample_list[i]
    idx, pts_cnt = query_ball_point(radius, nsample, xyz, new_xyz)
    grouped_xyz = group_point(xyz, idx)
    grouped_xyz -= tf.tile(tf.expand_dims(new_xyz, 2), [1,1,nsample,1])
    if points is not None:
        grouped_points = group_point(points, idx)
        if use_xyz:
            grouped_points = tf.concat([grouped_points, grouped_xyz], axis=-1)
    else:
        grouped_points = grouped_xyz
    if use_nchw: grouped_points = tf.transpose(grouped_points, [0,3,1,2])
    for j, num_out_channel in enumerate(mlp_list[i]):
        grouped_points = tf_util.conv2d(grouped_points, num_out_channel, [1,1],
                                        padding='VALID', stride=[1,1], bn=bn, is_training=is_training,
                                        scope='conv%d_%d'%(i,j), bn_decay=bn_decay)
    if use_nchw: grouped_points = tf.transpose(grouped_points, [0,2,3,1])
    new_points = tf.reduce_max(grouped_points, axis=[2])
    new_points_list.append(new_points)
new_points_concat = tf.concat(new_points_list, axis=-1)
2) Multi-resolution grouping (MRG)
(Quoting the paper:) "features of a region at some level L_i is a concatenation of two vectors. One vector (left in figure) is obtained by summarizing the features at each sub-region from the lower level L_{i-1} using the set abstraction level. The other vector (right) is the feature that is obtained by directly processing all raw points in the local region using a single PointNet."
In short, the local-region feature at the current set abstraction level is the concatenation of two parts:
Left vector: aggregate the features of the sub-regions (centroids) of the previous set abstraction level L_{i-1} (recall that the previous level has more points), using a set abstraction level.
Right vector: process the raw points of the local region directly with a single PointNet.
2.2 decoder
2.2.1 decoder for classification
This one is simple: feed the global feature produced by the encoder through a few fully connected layers, then a softmax classifier.
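A hedged sketch of such a classification head in the repo's TF1 style; the layer widths (512, 256) and names (global_feature, num_classes) are assumptions based on common PointNet++ configurations, not quoted code:

# global_feature: output of the last (group_all) set abstraction level
net = tf.reshape(global_feature, [batch_size, -1])
net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                              scope='fc1', bn_decay=bn_decay)
net = tf_util.dropout(net, keep_prob=0.5, is_training=is_training, scope='dp1')
net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                              scope='fc2', bn_decay=bn_decay)
net = tf_util.dropout(net, keep_prob=0.5, is_training=is_training, scope='dp2')
pred = tf_util.fully_connected(net, num_classes, activation_fn=None, scope='fc3')  # logits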
2.2.2 decoder for segmentation
After the encoder, we are left with the global feature, or with features on only a very small number of points (which is effectively a global feature).
But segmentation requires point-wise features for every input point. How do we get those back?
PointNet's approach is blunt: copy the global feature and concatenate it with the earlier per-point local features, so that the new point-wise feature carries some neighborhood context. This crude method clearly cannot produce very discriminative representations.
PointNet++ does better.
PointNet++ designs an inverse-distance interpolation method to implement an upsampling decoder; interpolation plus skip link concatenation yields discriminative point-wise features:
1) Inverse-distance interpolation. To propagate features from the N_l points of one level to the N_{l-1} points of the previous level (N_{l-1} >= N_l), each point's feature is computed as an inverse-distance weighted average of the features of its k nearest neighbors (Eq. 2 of the paper; by default p = 2, k = 3):

f^(j)(x) = ( sum_{i=1..k} w_i(x) f_i^(j) ) / ( sum_{i=1..k} w_i(x) ),   where w_i(x) = 1 / d(x, x_i)^p,   j = 1, ..., C

Here d(x, x_i) is the distance between two points of the point cloud.
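A minimal NumPy sketch of this interpolation, assuming unbatched inputs (the repo implements the corresponding three_nn/three_interpolate steps as CUDA ops; the function below is illustrative):

import numpy as np

def interpolate_features(xyz1, xyz2, points2, k=3, p=2, eps=1e-8):
    '''Propagate features from xyz2 (N2 points, features points2: (N2, C))
    onto xyz1 (N1 points, N1 >= N2) by inverse-distance weighting (Eq. 2).'''
    # distance from every target point to every source point
    d = np.linalg.norm(xyz1[:, None, :] - xyz2[None, :, :], axis=-1)   # (N1, N2)
    nn = np.argsort(d, axis=1)[:, :k]                  # k nearest source indices
    nd = np.take_along_axis(d, nn, axis=1)             # their distances, (N1, k)
    w = 1.0 / np.maximum(nd, eps) ** p                 # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)                  # normalize per target point
    return (points2[nn] * w[..., None]).sum(axis=1)    # (N1, C) interpolated features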
2) Skip link concatenation. The interpolated features on the N_{l-1} points are concatenated with the point features saved from the corresponding set abstraction level (the skip link); the concatenated features are then passed through a "unit PointNet" (per-point fully connected layers, similar to the per-point MLPs in the encoder).
2.3 loss
Whether for classification or segmentation, the task is ultimately a classification problem (of shapes or of points), so the loss is the standard cross-entropy loss used for classification tasks.
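In TF1 this is the standard sparse softmax cross entropy; a sketch (pred holds logits, label holds integer class ids; both names are assumptions):

loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=pred, labels=label)
classify_loss = tf.reduce_mean(loss)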
more on cross entropy (loss):
Cross Entropy (Loss) — EverNoob's blog (CSDN)
2.4 other questions
Q: How do gradients backpropagate through PointNet++?
A: The FPS step in PointNet++ does not take part in gradient computation or backpropagation.
You can think of it as PointNet++ preparing the FPS-downsampled point sets of different scales in advance and then feeding them into the network for training; the sampling itself is pure indexing, so no gradients flow through it.
3 dataset and experiments
3.1 dataset
- MNIST: images of handwritten digits, with 60k training and 10k testing samples. (classification)
- ModelNet40: CAD models of 40 categories (mostly man-made). We use the official split with 9,843 shapes for training and 2,468 for testing. (classification)
- SHREC15: 1,200 shapes from 50 categories. Each category contains 24 shapes, mostly organic ones in various poses, such as horses, cats, etc. We use five-fold cross-validation to acquire classification accuracy on this dataset. (classification)
- ScanNet: 1,513 scanned and reconstructed indoor scenes. We follow the experiment setting in [5] and use 1,201 scenes for training, 312 scenes for testing. (segmentation)
3.2 experiments
The two experimental results of main interest:
- ModelNet40 classification results
- ShapeNet Part segmentation results
The supplementary material also reports part segmentation on the ShapeNet Part dataset.
4 conclusion
When I first saw the PointNet++ architecture I found the design very elegant, especially the concrete upsampling and downsampling mechanisms and how they are used to build features for segmentation. In practice, though, for both classification and segmentation the improvement over PointNet is only about 1-2 points. ==> PointNet is already quite good at classification, so no surprise there; as for segmentation, Table 4 shows that PointNet++ is not consistently better than PointNet, but it does provide a ~5% improvement on certain shapes. Overall it is fair to say PointNet already performs at a high level; the main contribution of PointNet++ is a controllable interpolation process that retrieves regional structural information as needed (perhaps using a larger k and making p a hyperparameter could give consistently better results?).
PointNet++, and especially its encoder, provides a very good feature extraction network; many later papers on point cloud processing use PointNet++ as their feature extractor.
Extension
a lecture on PointNet vs. PointNet++:
http://www.pair.toronto.edu/csc2547-w21/assets/slides/lec2-csc2547-w21-dylan-pointnet.pdf