Point-TnT: A Self-Attention Method for 3D Shape Recognition
Abstract
- Problem: as point clouds grow larger, the Transformer (self-attention) becomes inefficient; the attention mechanism struggles to find meaningful connections between individual points at a global scale
- Approach: a two-stage method, Point Transformer-in-Transformer (Point-TnT), which combines local and global attention mechanisms so that both individual points and patches of points can attend to each other effectively
- Applications: shape classification, scene reconstruction
- Code: https://github.com/axeber01/point-tnt
1. Introduction
- Proposes a two-stage Transformer architecture that combines attention mechanisms at local and global scales
- Experiments show that, compared with applying self-attention to the entire point cloud, this method achieves better performance at lower computational cost
- Performs well on both classification and reconstruction tasks
2. Related Work
Shape Recognition
- volumetric
- multi-view
- point set
Self-Attention and the Transformer
- Transformer is able to learn interactions between elements
- permutation-equivariant
Transformers for point clouds:
- Set Transformer: A framework for attention-based permutation-invariant neural networks (Lee et al., 2019)
- Point Transformer (Zhao et al., 2021)
- PCT: Point Cloud Transformer (Guo et al., 2021)
3. Method
3.1 Preliminaries
- $X \in \mathbb{R}^{N \times d}$ denotes the matrix representation of a point set, containing $N$ features in a $d$-dimensional space
- $Q = XW_{Q}$, $K = XW_{K}$ and $V = XW_{V}$ denote the queries, keys and values respectively, where $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{d \times d_{h}}$ are learnable parameters and $d_{h}$ is the attention-head dimension
- Define self-attention (SA) as:
$$\operatorname{SA}(X)=\operatorname{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_{h}}}\right) V$$
where the softmax function is applied to each row of the matrix.
- The SA operation is permutation-equivariant: letting $\pi$ denote a permutation of the rows of $X$, we have $\operatorname{SA}(\pi X)=\pi \operatorname{SA}(X)$.
- When there are multiple SA operations, each SA operator has its own learnable weights; they run in parallel and their outputs are concatenated column-wise. Multi-headed SA (MSA):
$$\operatorname{MSA}(X)=\left[\operatorname{SA}_{1}(X) ; \operatorname{SA}_{2}(X) ; \ldots ; \operatorname{SA}_{h}(X)\right] W_{P}$$
where $;$ denotes column-wise concatenation, $W_{P} \in \mathbb{R}^{h d_{h} \times d}$ are learnable parameters, and $h$ is the number of attention heads.
- Define the Transformer layer $T_{\theta}$ as:
$$\begin{aligned} \tilde{X} &=\operatorname{MSA}(\operatorname{LN}(X))+X, \\ T_{\theta}(X) &=\operatorname{MLP}(\operatorname{LN}(\tilde{X}))+\tilde{X} \end{aligned}$$
where MLP contains one hidden layer with a GELU activation, LN is a LayerNorm operation, and $\theta$ denotes the trainable parameters. Note that LayerNorm is applied before the MSA and the MLP (pre-norm).
- To aggregate features in a permutation-invariant way, the maximum and mean over all features are computed and concatenated column-wise:
$$\alpha(X)=\left[\max _{i} X_{i} ; \frac{1}{N} \sum_{i} X_{i}\right], \quad i=1, \ldots, N$$
where $X_{i}$ is a row of $X$.
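To make these definitions concrete, here is a minimal PyTorch sketch of the pre-norm Transformer layer $T_{\theta}$ and the aggregation $\alpha$; the hidden dimension of the MLP is an illustrative choice, not taken from the paper.

```python
# Minimal sketch of the pre-norm Transformer layer and alpha aggregation.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d=192, heads=3):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(  # one hidden layer with GELU (width is illustrative)
            nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))

    def forward(self, x):                                     # x: (B, N, d)
        h = self.ln1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]      # X~ = MSA(LN(X)) + X
        return x + self.mlp(self.ln2(x))                      # T(X) = MLP(LN(X~)) + X~

def alpha(x):
    # Column-wise concat of per-set max and mean: (B, N, d) -> (B, 2d)
    return torch.cat([x.max(dim=1).values, x.mean(dim=1)], dim=-1)

x = torch.randn(2, 128, 192)
layer = TransformerLayer()
print(layer(x).shape, alpha(x).shape)                  # (2, 128, 192) (2, 384)
perm = torch.randperm(128)
# Permutation-equivariance check: T(pi X) == pi T(X) up to float error
print((layer(x[:, perm]) - layer(x)[:, perm]).abs().max())
```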
3.2 Point Transformer-in-Transformer
- $\mathcal{X}=\left\{ x_i\right\}_{i=1}^{N}$ denotes a set of $N$ points in three-dimensional space
- Compute a k-nearest-neighbour (k-NN) graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}=\{i\}_{i=1}^{N}$ is the set of vertices; $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$ contains a directed edge from $i$ to $j$ whenever $x_{j}$ is one of the k-nearest neighbours of $x_{i}$
- Sample $M$ anchor points $\mathcal{Y}=\left\{ x_m\right\}_{m \in \mathcal{V}^{\prime}}$ with farthest point sampling (FPS) and aggregate features from their neighbours (a code sketch at the end of this subsection walks through these steps)
- Extract edge features between each anchor point and its neighbouring points:
$$E^{m}=\left[e_{m 1}, \ldots, e_{m k}\right]^{T}, \quad \text{where } e_{m j}=x_{j}-x_{m}, \quad (m, j) \in \mathcal{E}^{\prime}$$
- Let $Y=\left[x_{1}, \ldots, x_{M}\right]^{T} \in \mathbb{R}^{M \times 3}$ denote the matrix of anchor points; it and the edge features are projected into a higher-dimensional space:
$$Y_{0}=Y W_{Y}, \qquad E_{0}^{m}=E^{m} W_{E}$$
where $W_{Y} \in \mathbb{R}^{3 \times d_{Y}}$ and $W_{E} \in \mathbb{R}^{3 \times d_{E}}$.
- The anchor features and edge features are then fed into two separate stacks of $L$ consecutive Transformer blocks.
- The local branch computes self-attention among edge features:
$$E_{l}^{m}=T_{\theta_{l}}^{\text{local}}\left(E_{l-1}^{m}\right), \quad l=1, \ldots, L$$
After each local Transformer layer, the neighbourhood features are aggregated and concatenated into a new matrix $E_{l}=\left[\alpha\left(E_{l}^{1}\right), \ldots, \alpha\left(E_{l}^{M}\right)\right] \in \mathbb{R}^{M \times 2 d_{E}}$, which is then added to each anchor point through an additional linear projection:
$$\tilde{Y}_{l-1}=Y_{l-1}+E_{l} W_{l}, \quad l=1, \ldots, L$$
where $W_{l} \in \mathbb{R}^{2 d_{E} \times d_{Y}}$.
- The global branch computes attention between patches:
$$Y_{l}=T_{\theta_{l}}^{\text{global}}\left(\tilde{Y}_{l-1}\right), \quad l=1, \ldots, L$$
- To exploit intermediate feature representations, the anchor-point features from every layer are concatenated, and a single-layer MLP followed by aggregation yields a global feature for the whole point cloud:
$$Z=\alpha\left(\operatorname{MLP}\left(\left[Y_{1} ; \ldots ; Y_{L}\right]\right)\right)$$
where $Z \in \mathbb{R}^{2 d_{f}}$ and $d_{f}$ is the embedding dimension of the global feature. This global feature can then be used for downstream tasks.
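The sketch below walks through the pipeline above: FPS anchors, k-NN edge features, one local+global stage. It is a hedged illustration, not the authors' implementation: `fps`, `knn` and `PointTnTStage` are hypothetical names, the greedy FPS and brute-force k-NN are naive stand-ins, and the local branch uses 4 heads rather than the paper's 3 because PyTorch's attention requires `embed_dim % num_heads == 0`.

```python
# Illustrative sketch of Point-TnT's patch construction and one stage.
import torch
import torch.nn as nn

def fps(x, m):
    # Greedy farthest point sampling: x (N, 3) -> m anchor indices
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((x.shape[0],), float("inf"))
    for i in range(1, m):
        dist = torch.minimum(dist, (x - x[idx[i - 1]]).pow(2).sum(-1))
        idx[i] = dist.argmax()
    return idx

def knn(points, queries, k):
    # Indices of the k nearest points for each query: (M, k)
    return torch.cdist(queries, points).topk(k, largest=False).indices

class PointTnTStage(nn.Module):
    """One stage: self-attention over edge features inside each patch,
    alpha-aggregation into the anchors, then attention between anchors."""
    def __init__(self, d_E=32, d_Y=192):
        super().__init__()
        tf = lambda d, h: nn.TransformerEncoderLayer(
            d, h, dim_feedforward=2 * d, batch_first=True, norm_first=True)
        self.local_tf = tf(d_E, 4)            # pre-norm, as in Sec. 3.1
        self.global_tf = tf(d_Y, 3)
        self.proj = nn.Linear(2 * d_E, d_Y)   # W_l

    def forward(self, E, Y):
        # E: (M, k, d_E) edge features; Y: (1, M, d_Y) anchor features
        E = self.local_tf(E)                                  # local branch
        agg = torch.cat([E.max(1).values, E.mean(1)], -1)     # alpha(E_l^m)
        Y = Y + self.proj(agg).unsqueeze(0)                   # Y~ = Y + E_l W_l
        return E, self.global_tf(Y)                           # global branch

pts = torch.randn(1024, 3)
anchors = pts[fps(pts, 192)]                           # (M, 3)
edges = pts[knn(pts, anchors, 20)] - anchors[:, None]  # e_mj = x_j - x_m
E = nn.Linear(3, 32)(edges)                            # E_0^m = E^m W_E
Y = nn.Linear(3, 192)(anchors)[None]                   # Y_0 = Y W_Y
E, Y = PointTnTStage()(E, Y)
print(E.shape, Y.shape)                  # (192, 20, 32) (1, 192, 192)
```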
3.3 Computational Analysis
Although a naive Transformer implementation of self-attention has $\mathcal{O}\left(N^{2}\right)$ complexity, splitting the attention into local and global branches reduces it substantially.
The local Transformers in our method have complexity $\mathcal{O}\left(M k^{2}\right)$ and the global Transformer $\mathcal{O}\left(M^{2}\right)$. With a limited number of neighbours and anchor points, $M k^{2}+M^{2}<N^{2}$ is easily satisfied.
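As a back-of-the-envelope check, take an assumed input size of $N = 2048$ points with the $M$ and $k$ values used later in the classification experiments:

```python
# Attention-cost comparison; N = 2048 is an illustrative input size.
N, M, k = 2048, 192, 20
print(N**2)             # full self-attention:    4,194,304 pairwise scores
print(M * k**2 + M**2)  # Point-TnT local+global:   113,664 (~37x fewer)
```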
4. Experiments
4.1 Shape Classification
Dataset = ScanObjectNN (PB_T50_RS variant)
- contains 14,510 real-world 3D objects in 15 categories
- train/test split = 80/20
- Anchor points $M = 192$
- Neighbours $k = 20$
- Embedding dimensions $d_Y = 192$, $d_E = 32$
- Global feature embedding dimension $d_f = 1024$
- Transformer blocks $L = 7$
- Attention heads $h = 3$
- Final MLP: two hidden layers of size 512 and 256 respectively, with batch normalization and dropout applied in each layer
- Loss = the standard cross-entropy loss function
- Epochs = 500
- Batch_size = 32
- Optimizer = AdamW (see the configuration sketch after this list)
- Weight decay = 0.1
- Initial learning rate = 0.001, decreased by cosine schedule
- Data augmentation: RSMix in addition to random anisotropic scaling and shifting
- Baseline: use all points as anchor points with no neighbours, i.e., drop the local branch of the network
- Protocol 1: evaluate on the test set with the model from the final training epoch
- Protocol 2: evaluate on the test set after every training epoch and report the best test accuracy
- Finding: local geometric properties are important, since the baseline without the local branch performs worse
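The optimizer settings above map onto standard PyTorch calls; a hedged sketch, with `model` as a placeholder, is:

```python
# Sketch of the stated training configuration: AdamW, weight decay 0.1,
# lr 0.001 with cosine decay over 500 epochs. `model` is a stand-in.
import torch

model = torch.nn.Linear(3, 15)  # placeholder for Point-TnT (15 classes)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=500)

for epoch in range(500):
    ...  # train one epoch with cross-entropy loss, batch size 32
    sched.step()  # cosine-decay the learning rate once per epoch
```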
Dataset = ModelNet40
- contains 9,843 synthetic shapes for training and 2,468 for testing, in 40 different categories
- obtains 92.6 ± 0.2% and 93.2 ± 0.2% accuracy under Protocol 1 and Protocol 2 respectively
4.2 Ablation Study
Model scaling
Attention mechanisms
global attention → helps stabilize training
Number of anchors and neighbours
4.3 Feature Matching on 3DMatch
3DMatch dataset consists of 62 indoor scenes collected from RGB-D measurements.
Feature matching task: produce descriptors for local patches of a scene, which are used to match scenes with overlapping regions.
Given a pair of point cloud scenes $(\mathcal{X}, \mathcal{X}^{\prime})$ with at least 30% overlap, corresponding points are found by extracting local features and matching them with a nearest-neighbour search. A rigid transformation is then estimated with a method such as RANSAC to register the two scenes.
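To illustrate the correspondence step, here is a small sketch using SciPy; `mutual_nn_matches` and the random descriptors are hypothetical stand-ins for the learned Point-TnT features, and the mutual filtering is one common choice rather than something the paper prescribes.

```python
# Nearest-neighbour feature matching between two scenes' descriptors.
import numpy as np
from scipy.spatial import cKDTree

def mutual_nn_matches(feat_a, feat_b):
    # feat_a: (Na, d), feat_b: (Nb, d) local patch descriptors
    ab = cKDTree(feat_b).query(feat_a)[1]    # nearest b for each a
    ba = cKDTree(feat_a).query(feat_b)[1]    # nearest a for each b
    keep = ba[ab] == np.arange(len(feat_a))  # keep only mutual matches
    return np.stack([np.arange(len(feat_a))[keep], ab[keep]], axis=1)

matches = mutual_nn_matches(np.random.rand(500, 32), np.random.rand(480, 32))
# `matches` would then seed a RANSAC rigid-transform estimate for registration
```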
- train/test = 54/8
- Loss = hardest contrastive loss (sketched after this list)
- Epochs = 10
- Optimizer = AdamW
- Weight decay = 0.1
- Initial learning rate =0.001, decreased by cosine schedule
- Points per local patch $N = 256$
- Anchor points $M = 48$
- Neighbours $k = 10$
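For reference, a minimal sketch of a hardest-contrastive loss of the kind named above, assuming PyTorch; the margins `m_pos`/`m_neg` and the in-batch negative mining are illustrative and may differ from the exact formulation the authors follow.

```python
# Hardest-contrastive loss over P corresponding descriptor pairs.
import torch

def hardest_contrastive_loss(f_a, f_b, m_pos=0.1, m_neg=1.4):
    # f_a, f_b: (P, d) descriptors of matched (positive) pairs
    d_pos = (f_a - f_b).norm(dim=1)
    pos = (d_pos - m_pos).clamp(min=0).pow(2).mean()     # pull positives together
    dists = torch.cdist(f_a, f_b)                        # (P, P) all-pairs distances
    dists.fill_diagonal_(float("inf"))                   # exclude the true matches
    hardest = dists.min(dim=1).values                    # hardest negative per anchor
    neg = (m_neg - hardest).clamp(min=0).pow(2).mean()   # push hardest negatives apart
    return pos + neg
```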
5. Conclusion
- Applying self-attention to the entire point cloud is unsatisfactory in both performance and computational cost.
- Adding local patches of points improves performance.
- The Transformer architecture has been shown to work better on local image patches than on individual pixels.
- This also makes feature extraction more computationally tractable.