RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds (Study Notes)

Purpose of this paper: to overcome the drawbacks of random sampling, the authors propose a local feature aggregation module.

Abstract

We study the problem of efficient semantic segmentation of large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches can only be trained on and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection methods. Although remarkably computation- and memory-efficient, random sampling can discard key features by chance (flaw). To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field of each 3D point, thereby effectively preserving geometric details. Extensive experiments show that RandLA-Net can process 1 million points in a single pass, up to 200× faster than existing approaches. Moreover, RandLA-Net clearly surpasses state-of-the-art methods on two large-scale semantic segmentation benchmarks, Semantic3D and SemanticKITTI.


1. Introduction

  • Motivation (challenge): raw point clouds are typically irregularly sampled, unstructured, and unordered

Efficient semantic segmentation of large-scale 3D point clouds is a fundamental and essential capability for real-time intelligent systems, such as autonomous driving and augmented reality. A key challenge is that the raw point clouds acquired by depth sensors are typically irregularly sampled, unstructured and unordered. Although deep convolutional networks show excellent performance in structured 2D computer vision tasks, they cannot be directly applied to this type of unstructured data.


  • Introduces PointNet, a promising recent technique that can process point clouds directly;
  • Describes PointNet's flaw (it cannot capture wider context information for each point) and the follow-up techniques proposed to overcome it
  • Describes the flaw of those solutions (they only handle small-scale point clouds and rarely scale to large ones) and the causes of this limitation

Recently, the pioneering work PointNet [37] has emerged as a promising approach for directly processing 3D point clouds. It learns per-point features using shared multi-layer perceptrons (MLPs). This is computationally efficient but fails to capture wider context information for each point. To learn richer local structures, many dedicated neural modules have been subsequently and rapidly introduced. These modules can be generally categorized as:

  1. neighbouring feature pooling [38, 28, 19, 64, 63]
  2. graph message passing [51, 42, 49, 50, 5, 20, 30]
  3. kernel-based convolution [43, 18, 54, 26, 21, 22, 48, 33]
  4. attention-based aggregation [55, 62, 60, 36].

Although these approaches achieve impressive results for object recognition and semantic segmentation, almost all of them are limited to extremely small 3D point clouds (e.g., 4k points or 1×1 meter blocks) and cannot be directly extended to larger point clouds (e.g., millions of points and up to 200×200 meters).

The reasons for this limitation are three-fold.

  1. The commonly used point-sampling methods of these networks are either computationally expensive or memory inefficient. For example, the widely employed farthest-point sampling [38] takes over 200 seconds to sample 10% of 1 million points (see the sketch after this list).
  2. Most existing local feature learners usually rely on computationally expensive kernelisation or graph construction, and are therefore unable to process massive numbers of points.
  3. For a large-scale point cloud, which usually consists of hundreds of objects, the existing local feature learners are either incapable of capturing complex structures, or do so inefficiently, due to the limited size of their receptive fields.
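
To make the sampling cost concrete, here is a minimal NumPy sketch (not the authors' benchmark code) contrasting random sampling with a naive farthest-point sampling (FPS) loop; the point count and sample size are illustrative assumptions, scaled down so the demo finishes in seconds:

```python
# Contrast O(N) random sampling with O(kN) naive farthest-point sampling.
import time
import numpy as np

def random_sample(points: np.ndarray, k: int) -> np.ndarray:
    """Pick k points uniformly at random: one O(N) draw, no distance computations."""
    idx = np.random.choice(len(points), k, replace=False)
    return points[idx]

def farthest_point_sample(points: np.ndarray, k: int) -> np.ndarray:
    """Iteratively pick the point farthest from the already-chosen set: O(kN)."""
    n = len(points)
    selected = np.empty(k, dtype=np.int64)
    selected[0] = np.random.randint(n)
    min_dist = np.full(n, np.inf)
    for i in range(1, k):
        # distance of every point to the most recently selected point
        d = np.sum((points - points[selected[i - 1]]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)  # nearest-selected-point distance
        selected[i] = np.argmax(min_dist)   # farthest point becomes the next sample
    return points[selected]

pts = np.random.rand(100_000, 3)  # far fewer than 1M points, so the demo stays quick
t0 = time.time(); random_sample(pts, 10_000); t1 = time.time()
farthest_point_sample(pts, 10_000); t2 = time.time()
print(f"random sampling: {t1 - t0:.3f}s, naive FPS: {t2 - t1:.3f}s")
```

Even at this reduced scale, the gap is large; the per-sample distance sweep is what makes FPS prohibitive at a million points.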

  • Existing techniques for directly processing large-scale point clouds, and their drawbacks:

A handful of recent works have started to tackle the task of directly processing large-scale point clouds.

  1. SPG [23] preprocesses the large point clouds as super graphs before applying neural networks to learn per super-point semantics.
  2. Both FCPN [39] and PCT [7] combine voxelization and point-level networks to process massive point clouds. Although they achieve decent segmentation accuracy, the preprocessing and voxelization steps are too computationally heavy to be deployed in real-time applications.

  • A brief overview of RandLA-Net (its advantages and the requirements it must meet)
  • To overcome the drawback of random sampling (loss of key semantic information), an efficient local feature aggregation module is proposed

In this paper, we aim to design a memory and computationally efficient neural architecture, which is able to directly process large-scale 3D point clouds in a single pass, without requiring any pre/post-processing steps such as voxelization, block partitioning or graph construction. However, this task is extremely challenging as it requires:

  1. a memory and computationally efficient sampling approach to progressively downsample large-scale point clouds to fit in the limits of current GPUs
  2. an effective local feature learner to progressively increase the receptive field size to preserve complex geometric structures.

To this end, we first systematically demonstrate that random sampling is a key enabler for deep neural networks to efficiently process large-scale point clouds. However, random sampling can discard key semantic information, especially for objects with low point densities. To counter the potentially detrimental impact of random sampling, we propose a new and efficient local feature aggregation module to capture complex local structures over progressively smaller point-sets.
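
The sketch below illustrates what "progressively smaller point-sets" means in practice: random sampling shrinks the cloud layer by layer. The layer count and decimation ratio here are assumptions for illustration, not the paper's exact configuration:

```python
# Progressively downsample a synthetic cloud with random sampling, one subset per layer.
import numpy as np

def progressive_random_downsample(points: np.ndarray, n_layers: int = 4,
                                  ratio: int = 4):
    """Yield one random subset per encoder layer, each `ratio`x smaller than the last."""
    current = points
    for _ in range(n_layers):
        k = max(1, len(current) // ratio)
        idx = np.random.choice(len(current), k, replace=False)
        current = current[idx]
        yield current

cloud = np.random.rand(1_000_000, 3)  # a synthetic million-point cloud
for layer, subset in enumerate(progressive_random_downsample(cloud), start=1):
    print(f"encoder layer {layer}: {len(subset):>7} points")
```

Because each subset is drawn uniformly, sparse objects can lose their points along the way; the local feature aggregation module is meant to compensate for exactly that.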


  • Commonly used sampling methods: farthest point sampling, inverse density sampling
  • Evaluation dimensions: computational complexity, plus empirically measured memory consumption and processing time
  • Drawback of these commonly used sampling methods: they are limited to small-scale point clouds; random sampling is nevertheless the method of choice for large-scale processing, because it is fast and scales efficiently
  • Drawback of applying random sampling to large-scale data: prominent point features may be dropped by chance. Solution: the local feature aggregation module
  • Local feature aggregation module: progressively increases the receptive field size in each neural layer. Detailed in Section 3.3

Amongst existing sampling methods, farthest point sampling and inverse density sampling are the most frequently used for small-scale point clouds [38, 54, 29, 64, 15]. As point sampling is such a fundamental step within these networks, we investigate the relative merits of different approaches in Section 3.2, both by examining their computational complexity and empirically by measuring their memory consumption and processing time. From this, we see that the commonly used sampling methods limit scaling towards large point clouds, and act as a significant bottleneck to real-time processing. However, we identify random sampling as by far the most suitable component for large-scale point cloud processing as it is fast and scales efficiently. Random sampling is not without cost, because prominent point features may be dropped by chance and it cannot be used directly in existing networks without incurring a performance penalty. To overcome this issue, we design a new local feature aggregation module in Section 3.3, which is capable of effectively learning complex local structures by progressively increasing the receptive field size in each neural layer.

In particular, for each 3D point, we

  • Firstly, we introduce a local spatial encoding (LocSE) unit to explicitly preserve local geometric structures.
  • Secondly, we leverage attentive pooling to automatically keep the useful local features.
  • Thirdly, we stack multiple LocSE units and attentive poolings as a dilated residual block, greatly increasing the effective receptive field for each point.

Note that all these neural components are implemented as shared MLPs, and are therefore remarkably memory- and computationally efficient.
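
Based only on the description above, here is a minimal PyTorch sketch of one LocSE unit followed by attentive pooling. The layer widths, tensor shapes, and the assumption that neighbour indices come from a precomputed KNN search are mine, not the authors' reference implementation:

```python
# One LocSE unit + one attentive pooling, per the paper's description.
import torch
import torch.nn as nn

class LocSEAttentivePooling(nn.Module):
    """Encode relative neighbour positions, then pool neighbour features
    with learned attention scores."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        # 10 = center xyz (3) + neighbour xyz (3) + relative offset (3) + distance (1)
        self.pos_mlp = nn.Sequential(nn.Linear(10, d_out // 2), nn.ReLU())
        self.score_fn = nn.Linear(d_in + d_out // 2, d_in + d_out // 2, bias=False)
        self.out_mlp = nn.Sequential(nn.Linear(d_in + d_out // 2, d_out), nn.ReLU())

    def forward(self, xyz, feats, neighbour_idx):
        # xyz: (N, 3), feats: (N, d_in), neighbour_idx: (N, K) from any KNN search
        N, K = neighbour_idx.shape
        nbr_xyz = xyz[neighbour_idx]                  # (N, K, 3)
        nbr_feats = feats[neighbour_idx]              # (N, K, d_in)
        rel = xyz.unsqueeze(1) - nbr_xyz              # relative offsets (N, K, 3)
        dist = rel.norm(dim=-1, keepdim=True)         # Euclidean distances (N, K, 1)
        pos_enc = torch.cat([xyz.unsqueeze(1).expand(-1, K, -1),
                             nbr_xyz, rel, dist], dim=-1)        # (N, K, 10)
        fused = torch.cat([nbr_feats, self.pos_mlp(pos_enc)], dim=-1)
        attn = torch.softmax(self.score_fn(fused), dim=1)  # attention over K neighbours
        pooled = (attn * fused).sum(dim=1)                 # attention-weighted sum
        return self.out_mlp(pooled)                        # (N, d_out)

# Hypothetical usage, assuming `knn` returns (N, 16) neighbour indices:
# idx = knn(xyz, k=16)
# new_feats = LocSEAttentivePooling(d_in=8, d_out=32)(xyz, feats, idx)
```

As the text above notes, stacking multiple such units with a residual connection yields the dilated residual block: each extra unit lets a point aggregate information from neighbours of neighbours, which is what progressively enlarges the receptive field.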


  • Summary
    1. Why random sampling: the authors evaluated and compared the alternatives, and it is by far the most suitable
    2. The core idea of the local feature aggregation module: progressively increase the receptive field size in each neural layer
    3. Memory and computational gains clearly surpass the baselines

Overall, being built on the principles of simple random sampling and an effective local feature aggregator, our efficient neural architecture, named RandLA-Net, not only is up to 200× faster than existing approaches on large-scale point clouds, but also surpasses the state-of-the-art semantic segmentation methods on both Semantic3D [16] and SemanticKITTI [3] benchmarks. Figure 1 shows qualitative results of our approach. Our key contributions are:

  • We analyse and compare existing sampling approaches, identifying random sampling as the most suitable component for efficient learning on large-scale point clouds.
  • We propose an effective local feature aggregation module to automatically preserve complex local structures by progressively increasing the receptive field for each point.
  • We demonstrate significant memory and computational gains over baselines, and surpass the state-of-the-art semantic segmentation methods on multiple large-scale benchmarks.