[Point Cloud Processing: Intensive Paper Reading, Frontier Edition 4] Points to Patches: Enabling the Use of Self-Attention for 3D Shape Recognition


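  1. Problem: as point clouds grow, Transformer self-attention becomes inefficient, and the attention mechanism struggles to find useful connections between individual points at the global scale.
  2. Solution: a two-stage method, Point Transformer-in-Transformer (Point-TnT), which combines local and global attention mechanisms so that both individual points and patches of points can attend to each other effectively.
  3. Applications: shape classification, scene reconstruction
  4. Code: https://github.com/axeber01/point-tnt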

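2. Introduction

  1. Proposes a two-stage Transformer architecture that combines attention mechanisms at the local and global scales.
  2. Experiments show that, compared with applying self-attention to the whole point cloud, the proposed method performs better at lower computational complexity.
  3. Performs well on both classification and reconstruction tasks.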

3. Related Work

Shape Recognition

  1. volumetric
  2. multi-view
  3. point set

Self-Attention and the Transformer

  1. Transformer is able to learn interactions between elements
  2. permutation-equivariant

Transformers for point clouds:

  • Set Transformer: A framework for attention-based permutation-invariant neural networks (2019, Lee)
  • Point Transformer (2021, Zhao)
  • PCT: Point cloud transformer (2021, Guo)

3. Method

3.1 Preliminaries

  • $X \in \mathbb{R}^{N \times d}$ is the matrix representation of the point set, containing $N$ features in $d$-dimensional space.
  • $Q=X_{l} W_{Q}$, $K=X_{l} W_{K}$ and $V=X_{l} W_{V}$ are the queries, keys and values, where $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{d \times d_{h}}$ are learnable parameters and $d_{h}$ is the dimension of the attention head.
  • Self-attention (SA) is defined as:
    $$\operatorname{SA}(X)=\operatorname{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_{h}}}\right) V$$
    where the softmax is applied to each row of the matrix.
  • SA is permutation-equivariant: if $\pi$ denotes a permutation of the rows of $X$, then $\mathrm{SA}(\pi X)=\pi \mathrm{SA}(X)$.
  • When several SA operations are used, each SA operator has its own learnable weights, they run in parallel, and their outputs are concatenated column-wise. This gives multi-headed SA (MSA):
    $$\operatorname{MSA}(X)=\left[\mathrm{SA}_{1}(X) ; \mathrm{SA}_{2}(X) ; \ldots ; \mathrm{SA}_{h}(X)\right] W_{P}$$
    where $;$ denotes column-wise concatenation, $W_{P} \in \mathbb{R}^{h d_{h} \times d}$ is a learnable parameter, and $h$ is the number of attention heads.
  • A Transformer layer $T_{\theta}$ is defined as:
    $$\begin{aligned} \tilde{X} &=\operatorname{MSA}(\operatorname{LN}(X))+X, \\ T_{\theta}(X) &=\operatorname{MLP}(\operatorname{LN}(\tilde{X}))+\tilde{X} \end{aligned}$$
    where $\operatorname{MLP}$ has one hidden layer with a GELU activation, $\operatorname{LN}$ is a LayerNorm operation, and $\theta$ denotes the trainable parameters. Note that LayerNorm is placed before MSA and MLP.
  • To aggregate features in a permutation-invariant way, the maximum and the mean over all features are computed and concatenated column-wise:
    $$\alpha(X)=\left[\max _{i} X_{i} ; \frac{1}{N} \sum_{i} X_{i}\right], \quad i=1, \ldots, N$$
    where $X_{i}$ is a row of $X$.
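The definitions above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the authors' implementation: the helper names (`matmul`, `softmax_rows`, `self_attention`, `aggregate`), the toy sizes and the random weights are all assumptions made for the example. It checks numerically that SA is permutation-equivariant and that $\alpha$ is permutation-invariant.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """SA(X) = Softmax(Q K^T / sqrt(d_h)) V for a single head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_h = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_h) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def aggregate(x):
    """alpha(X): column-wise max and mean over all rows, concatenated."""
    cols = list(zip(*x))
    return [max(c) for c in cols] + [sum(c) / len(c) for c in cols]


def random_matrix(rows, cols, rng):
    """Toy stand-in for learned weights or input features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, d, d_h = 5, 4, 3                      # toy sizes, chosen arbitrarily
X = random_matrix(N, d, rng)
W_q, W_k, W_v = (random_matrix(d, d_h, rng) for _ in range(3))

perm = [3, 0, 4, 1, 2]                   # a fixed row permutation pi
X_perm = [X[i] for i in perm]

out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X_perm, W_q, W_k, W_v)

# Equivariance: row i of SA(pi X) equals row perm[i] of SA(X).
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i in range(N)
    for a, b in zip(out_perm[i], out[perm[i]])
)
# Invariance: alpha gives the same vector regardless of row order.
invariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(aggregate(out), aggregate(out_perm))
)
print(equivariant, invariant)
```

The comparison uses a small tolerance because permuting the rows changes the order of floating-point summation. Note also that `aggregate` returns a vector of length $2 d_h$ here, twice the feature width, since the max and the mean are concatenated.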

3.2 Point Transformer-in-Transformer

  • $\mathcal{X}=\left\{x_{i}\right\}_{i=1}^{N}$ denotes $N$ points in 3D space.
  • Compute the k-nearest neighbour (k-NN) graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}=\{i\}_{i=1}^{N}$ are the vertices and $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$ contains a directed edge from $i$ to $j$ whenever $x_{j}$ is one of the k-nearest neighbours of $x_{i}$.
  • Use farthest point sampling (FPS) to sample $M$ anchor points $\mathcal{Y}=\left\{x_{m}\right\}_{m \in \mathcal{V}^{\prime}}$, which aggregate features from their neighbours.
  • Extract the edge features between each anchor point and its neighbours:
    $$E^{m}=\left[e_{m 1}, \ldots, e_{m k}\right]^{T}$$
    where $e_{m j}=x_{j}-x_{m}, \quad(m, j) \in \mathcal{E}^{\prime}$
  • $Y=\left[x_{1}, \ldots, x_{M}\right]^{T} \in \mathbb{R}^{M \times 3}$ is the matrix of anchor points. Together with the edge features, it is projected to a higher dimension:
    $$\begin{aligned} Y_{0} &=Y W_{Y}, \\ E_{0}^{m} &=E^{m} W_{E} \end{aligned}$$
    where $W_{Y} \in \mathbb{R}^{3 \times d_{Y}}$ and $W_{E} \in \mathbb{R}^{3 \times d_{E}}$.
  • The anchor features and edge features are fed into two separate branches, each consisting of $L$ consecutive Transformer blocks.
  • The local branch computes self-attention among the edge features:
    $$E_{l}^{m}=T_{\theta_{l}}^{\text{local}}\left(E_{l-1}^{m}\right), \quad l=1, \ldots, L$$
    After each local Transformer layer, the features of every neighbourhood are aggregated and stacked into a new matrix $E_{l}=\left[\alpha\left(E_{l}^{1}\right), \ldots, \alpha\left(E_{l}^{M}\right)\right] \in \mathbb{R}^{M \times 2 d_{E}}$, which is then added to the anchor features through a further linear projection:
    $$\tilde{Y}_{l-1}=Y_{l-1}+E_{l} W_{l}, \quad l=1, \ldots, L$$
    where $W_{l} \in \mathbb{R}^{2 d_{E} \times d_{Y}}$.
  • The global branch computes attention between patches:
    $$Y_{l}=T_{\theta_{l}}^{\text{global}}\left(\tilde{Y}_{l-1}\right), \quad l=1, \ldots, L$$
  • To make use of the intermediate representations, the anchor point features from every layer are concatenated, then passed through a single-layer MLP and aggregated into one global feature for the point cloud:
    $$Z=\alpha\left(\operatorname{MLP}\left(\left[Y_{1} ; \ldots ; Y_{L}\right]\right)\right)$$
    where $Z \in \mathbb{R}^{2 d_{f}}$ and $d_{f}$ is the embedding dimension of the global feature. This global feature can be used for downstream tasks.
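The patch construction that precedes the two Transformer branches (FPS anchors, k-NN neighbourhoods, edge features $e_{mj}=x_j-x_m$) can be sketched as follows. This is a brute-force, standard-library illustration under stated assumptions, not the code from the repository: the function names are hypothetical, FPS starts from an arbitrary index, and a point is assumed not to count as its own neighbour (the notes above do not say either way).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def farthest_point_sampling(points, m, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [start]
    min_d = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        chosen.append(nxt)
        # Keep, for every point, its distance to the nearest chosen anchor.
        min_d = [min(old, sq_dist(p, points[nxt])) for old, p in zip(min_d, points)]
    return chosen


def knn(points, i, k):
    """Indices of the k nearest neighbours of point i (assumption: i excluded)."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: sq_dist(points[i], points[j]))[:k]


def edge_features(points, anchors, k):
    """Map each anchor m to E^m = [e_m1, ..., e_mk]^T with e_mj = x_j - x_m."""
    patches = {}
    for m in anchors:
        patches[m] = [
            [a - b for a, b in zip(points[j], points[m])]
            for j in knn(points, m, k)
        ]
    return patches


rng = random.Random(0)
points = [[rng.random() for _ in range(3)] for _ in range(64)]  # toy cloud, N = 64
anchors = farthest_point_sampling(points, m=8)                  # M = 8 anchors
patches = edge_features(points, anchors, k=4)                   # k = 4 neighbours

print(len(anchors), len(patches[anchors[0]]), len(patches[anchors[0]][0]))
```

Each value in `patches` is a $k \times 3$ matrix $E^m$, which the local branch would project with $W_E$ and process with self-attention, while the anchor coordinates form the $M \times 3$ matrix $Y$ for the global branch. The brute-force k-NN here is quadratic in $N$ and is meant only for readability.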

3.3 Computational Analysis

A naive Transformer self-attention implementation has $\mathcal{O}\left(N^{2}\right)$ complexity, but splitting attention into a local and a global branch reduces this substantially.

In the proposed method, the local Transformer has complexity $\mathcal{O}\left(M k^{2}\right)$ and the global Transformer has complexity $\mathcal{O}\left(M^{2}\right)$. By limiting the number of neighbours and anchor points, it is easy to satisfy $M k^{2}+M^{2}<N^{2}$.
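To make the inequality concrete, the sketch below plugs in the settings listed in the experiments. The 3DMatch values ($N=256$, $M=48$, $k=10$) come from these notes; for classification the notes give $M=192$ and $k=20$ but not $N$, so $N=1024$ is an assumption used purely for illustration. The counts ignore constant factors and the feature dimension.

```python
def attention_cost(n, m, k):
    """Return (full, local, global) pairwise-attention counts, ignoring constants."""
    return n * n, m * k * k, m * m


settings = {
    "3DMatch patches (N=256, M=48, k=10)": (256, 48, 10),
    "classification (assumed N=1024, M=192, k=20)": (1024, 192, 20),
}

for name, (n, m, k) in settings.items():
    full, local, global_ = attention_cost(n, m, k)
    split = local + global_
    print(f"{name}: N^2 = {full}, M k^2 + M^2 = {split}, ratio = {full / split:.1f}")
```

For the 3DMatch setting this gives 7,104 against 65,536, and for the assumed classification setting 113,664 against 1,048,576, so in both cases the split attention needs roughly nine times fewer pairwise interactions than full self-attention.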

4. Experiments

4.1 Shape Classification

Dataset = ScanObjectNN (PB_T50_RS)

  • contains 14,510 real-world 3D objects in 15 categories
  • train/test split = 80/20
  • Anchor points $M=192$
  • Neighbours $k=20$
  • Embedding dimensions $d_Y=192$, $d_E=32$
  • Global feature embedding dimension $d_f=1024$
  • Transformer blocks $L=7$
  • Attention heads $h=3$
  • Final MLP : two hidden layers of size 512 and 256 respectively, with batch normalization and dropout applied in each layer
  • Loss = the standard cross-entropy loss function
  • Epochs = 500
  • Batch_size = 32
  • Optimizer = AdamW
  • Weight decay = 0.1
  • Initial learning rate = 0.001, decreased by cosine schedule
  • Data augmentation: RSMix in addition to random anisotropic scaling and shifting
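  • Baseline: uses all points as anchor points with no neighbours, i.e. the local branch of the network is dropped.
  • Protocol 1: evaluate on the test set using the model from the last training epoch.
  • Protocol 2: evaluate on the test set after every training epoch and report the best test accuracy.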
  • local geometric properties are important
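The difference between the two evaluation protocols is easy to state in code. The sketch below is only an illustration: the function names are hypothetical and the accuracy history is made up, not taken from the paper.

```python
def protocol_1(test_acc_per_epoch):
    """Protocol 1: test accuracy of the model from the last training epoch."""
    return test_acc_per_epoch[-1]


def protocol_2(test_acc_per_epoch):
    """Protocol 2: best test accuracy observed over all training epochs."""
    return max(test_acc_per_epoch)


# Hypothetical per-epoch test accuracies (%), made up for illustration only.
history = [70.1, 78.4, 82.9, 83.6, 83.2]

print(protocol_1(history), protocol_2(history))
```

By construction Protocol 2 is never lower than Protocol 1, because it picks the epoch using the test set itself. Reporting both makes results comparable with prior work that follows either convention.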

Dataset = ModelNet40

  • contains 9,843 synthetic shapes for training and 2,468 for testing, in 40 different categories

  • Point-TnT obtains 92.6 ± 0.2% and 93.2 ± 0.2% accuracy under Protocol 1 and Protocol 2, respectively

4.2 Ablation Study

Model scaling

Attention mechanisms

Global attention helps stabilize training.

Number of anchors and neighbours

4.3 Feature Matching on 3DMatch

3DMatch dataset consists of 62 indoor scenes collected from RGB-D measurements.

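Feature matching task: generate descriptors for local patches of a scene, which are then used to match scenes with overlapping regions.

Given a pair of point cloud scenes $(\mathcal{X}, \mathcal{X}^{\prime})$ with at least 30% overlap, corresponding points are found by extracting local features and matching them with nearest neighbour search. A rigid transformation is then estimated with a method such as RANSAC in order to register the two scenes.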
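The matching step can be illustrated with a brute-force nearest neighbour search in descriptor space. This is a minimal sketch with hypothetical names and hand-written toy descriptors; it stands in for whatever search structure a real pipeline would use, and it stops before the RANSAC stage.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest_neighbour_matches(desc_a, desc_b):
    """For each descriptor in scene A, find the closest descriptor in scene B."""
    matches = []
    for i, da in enumerate(desc_a):
        j = min(range(len(desc_b)), key=lambda idx: sq_dist(da, desc_b[idx]))
        matches.append((i, j))
    return matches


# Toy 3-dimensional descriptors, made up for illustration. Scene B holds
# slightly perturbed copies of the scene A descriptors in a different order.
desc_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
desc_b = [[0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.9, 0.1, 0.0]]

matches = nearest_neighbour_matches(desc_a, desc_b)
print(matches)  # [(0, 2), (1, 0), (2, 1)]
```

Each pair `(i, j)` is a putative correspondence between patch `i` in the first scene and patch `j` in the second. A robust estimator such as RANSAC would then take the 3D positions of these pairs and fit the rigid transformation while rejecting wrong matches.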

  • train/test = 54/8 scenes
  • Loss = hardest contrastive loss
  • Epochs = 10
  • Optimizer = AdamW
  • Weight decay = 0.1
  • Initial learning rate =0.001, decreased by cosine schedule
  • Local patches: $N=256$ points
  • Anchor points $M=48$
  • Neighbours $k=10$
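The "decreased by cosine schedule" setting can be sketched as below. The notes give only the initial learning rate (0.001) and the number of epochs (10), so the remaining details are assumptions: the rate is updated once per epoch, there is no warm-up, and it anneals towards zero. The function name `cosine_lr` is hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=1e-3, min_lr=0.0):
    """Cosine-annealed learning rate at a given 0-based epoch."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One value per training epoch for the 10-epoch 3DMatch setting.
schedule = [cosine_lr(e, total_epochs=10) for e in range(10)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

The rate starts at the initial value, passes through half of it at the midpoint, and decays smoothly after that. The same function covers the classification setting by passing `total_epochs=500`.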

5. Conclusion

  1. Applying self-attention to the entire point cloud gives poor results in both accuracy and complexity.
  2. Adding local patches of points improves performance.
  3. The Transformer architecture has been shown to work better on local image patches than on individual pixels.
  4. Working on patches makes feature extraction more computationally tractable.




