Point-TnT: A Self-Attention Method for 3D Shape Recognition
Abstract
- Problem: as point clouds grow larger, the Transformer (self-attention) becomes inefficient; the attention mechanism struggles to find meaningful connections between individual points at a global scale
- Approach: a two-stage method, Point Transformer-in-Transformer (Point-TnT), which combines local and global attention mechanisms so that both individual points and patches of points can attend to each other effectively
- Applications: shape classification, scene reconstruction
- Code: https://github.com/axeber01/point-tnt
1. Introduction
- Proposes a two-stage Transformer architecture that combines attention mechanisms at local and global scales
- Experiments show that, compared with applying self-attention to the entire point cloud, this method achieves better performance at lower computational cost
- Performs well on both classification and reconstruction tasks
2. Related Work
Shape Recognition
- volumetric
- multi-view
- point set
Self-Attention and the Transformer
- Transformer is able to learn interactions between elements
- permutation-equivariant
Transformers for point clouds:
- Set Transformer: A framework for attention-based permutation-invariant neural networks (Lee et al., 2019)
- Point Transformer (Zhao et al., 2021)
- PCT: Point Cloud Transformer (Guo et al., 2021)
3. Method
3.1 Preliminaries
- $X \in \mathbb{R}^{N \times d}$ denotes the matrix representation of a point set, containing $N$ features in a $d$-dimensional space
- $Q = XW_{Q}$, $K = XW_{K}$ and $V = XW_{V}$ denote the queries, keys and values respectively, where $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{d \times d_{h}}$ are learnable parameters and $d_{h}$ is the attention-head dimension
- Define self-attention (SA) as:
$$\operatorname{SA}(X)=\operatorname{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_{h}}}\right) V$$
where the softmax function is applied to each row of the matrix.
- The SA operation is permutation-equivariant: letting $\pi$ denote a permutation of the rows of $X$, we have $\operatorname{SA}(\pi X)=\pi \operatorname{SA}(X)$.
- When there are multiple SA operations, each SA operator has its own learnable weights; they run in parallel and their outputs are concatenated column-wise. Multi-headed SA (MSA):
$$\operatorname{MSA}(X)=\left[\operatorname{SA}_{1}(X) ; \operatorname{SA}_{2}(X) ; \ldots ; \operatorname{SA}_{h}(X)\right] W_{P}$$
where $;$ denotes column-wise concatenation, $W_{P} \in \mathbb{R}^{h d_{h} \times d}$ are learnable parameters, and $h$ is the number of attention heads.
- Define the Transformer layer $T_{\theta}$ as:
$$\begin{aligned} \tilde{X} &=\operatorname{MSA}(\operatorname{LN}(X))+X, \\ T_{\theta}(X) &=\operatorname{MLP}(\operatorname{LN}(\tilde{X}))+\tilde{X} \end{aligned}$$
where MLP contains one hidden layer with a GELU activation, LN is a LayerNorm operation, and $\theta$ denotes the trainable parameters. Note that LayerNorm is applied before the MSA and the MLP (pre-norm).
- To aggregate features in a permutation-invariant way, the maximum and mean over all features are computed and concatenated column-wise:
$$\alpha(X)=\left[\max _{i} X_{i} ; \frac{1}{N} \sum_{i} X_{i}\right], \quad i=1, \ldots, N$$
where $X_{i}$ is a row of $X$.
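To make these definitions concrete, here is a minimal PyTorch sketch of the pre-norm Transformer layer $T_{\theta}$ and the aggregation $\alpha$; the hidden dimension of the MLP is an illustrative choice, not taken from the paper.

```python
# Minimal sketch of the pre-norm Transformer layer and alpha aggregation.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d=192, heads=3):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.msa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(  # one hidden layer with GELU (width is illustrative)
            nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))

    def forward(self, x):                                     # x: (B, N, d)
        h = self.ln1(x)
        x = x + self.msa(h, h, h, need_weights=False)[0]      # X~ = MSA(LN(X)) + X
        return x + self.mlp(self.ln2(x))                      # T(X) = MLP(LN(X~)) + X~

def alpha(x):
    # Column-wise concat of per-set max and mean: (B, N, d) -> (B, 2d)
    return torch.cat([x.max(dim=1).values, x.mean(dim=1)], dim=-1)

x = torch.randn(2, 128, 192)
layer = TransformerLayer()
print(layer(x).shape, alpha(x).shape)                  # (2, 128, 192) (2, 384)
perm = torch.randperm(128)
# Permutation-equivariance check: T(pi X) == pi T(X) up to float error
print((layer(x[:, perm]) - layer(x)[:, perm]).abs().max())
```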
3.2 Point Transformer-in-Transformer
- $\mathcal{X}=\left\{ x_i\right\}_{i=1}^{N}$ denotes a set of $N$ points in three-dimensional space
- Compute a k-nearest-neighbour (k-NN) graph $\mathcal{G}=(\mathcal{V}, \mathcal{E})$, where $\mathcal{V}=\{i\}_{i=1}^{N}$ is the set of vertices; $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$ contains a directed edge from $i$ to $j$ whenever $x_{j}$ is one of the k-nearest neighbours of $x_{i}$
- Sample $M$ anchor points $\mathcal{Y}=\left\{ x_m\right\}_{m \in \mathcal{V}^{\prime}}$ with farthest point sampling (FPS) and aggregate features from their neighbours (a code sketch at the end of this subsection walks through these steps)
- Extract edge features between each anchor point and its neighbouring points:
$$E^{m}=\left[e_{m 1}, \ldots, e_{m k}\right]^{T}, \quad \text{where } e_{m j}=x_{j}-x_{m}, \quad (m, j) \in \mathcal{E}^{\prime}$$
- Let $Y=\left[x_{1}, \ldots, x_{M}\right]^{T} \in \mathbb{R}^{M \times 3}$ denote the matrix of anchor points; it and the edge features are projected into a higher-dimensional space:
$$Y_{0}=Y W_{Y}, \qquad E_{0}^{m}=E^{m} W_{E}$$
where $W_{Y} \in \mathbb{R}^{3 \times d_{Y}}$ and $W_{E} \in \mathbb{R}^{3 \times d_{E}}$.
- The anchor features and edge features are then fed into two separate stacks of $L$ consecutive Transformer blocks.
- The local branch computes self-attention among edge features:
$$E_{l}^{m}=T_{\theta_{l}}^{\text{local}}\left(E_{l-1}^{m}\right), \quad l=1, \ldots, L$$
After each local Transformer layer, the neighbourhood features are aggregated and concatenated into a new matrix $E_{l}=\left[\alpha\left(E_{l}^{1}\right), \ldots, \alpha\left(E_{l}^{M}\right)\right] \in \mathbb{R}^{M \times 2 d_{E}}$, which is then added to each anchor point through an additional linear projection:
$$\tilde{Y}_{l-1}=Y_{l-1}+E_{l} W_{l}, \quad l=1, \ldots, L$$
where $W_{l} \in \mathbb{R}^{2 d_{E} \times d_{Y}}$.
- The global branch computes attention between patches:
$$Y_{l}=T_{\theta_{l}}^{\text{global}}\left(\tilde{Y}_{l-1}\right), \quad l=1, \ldots, L$$
- To exploit intermediate feature representations, the anchor-point features from every layer are concatenated, and a single-layer MLP followed by aggregation yields a global feature for the whole point cloud:
$$Z=\alpha\left(\operatorname{MLP}\left(\left[Y_{1} ; \ldots ; Y_{L}\right]\right)\right)$$
where $Z \in \mathbb{R}^{2 d_{f}}$ and $d_{f}$ is the embedding dimension of the global feature. This global feature can then be used for downstream tasks.
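The sketch below walks through the pipeline above: FPS anchors, k-NN edge features, one local+global stage. It is a hedged illustration, not the authors' implementation: `fps`, `knn` and `PointTnTStage` are hypothetical names, the greedy FPS and brute-force k-NN are naive stand-ins, and the local branch uses 4 heads rather than the paper's 3 because PyTorch's attention requires `embed_dim % num_heads == 0`.

```python
# Illustrative sketch of Point-TnT's patch construction and one stage.
import torch
import torch.nn as nn

def fps(x, m):
    # Greedy farthest point sampling: x (N, 3) -> m anchor indices
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((x.shape[0],), float("inf"))
    for i in range(1, m):
        dist = torch.minimum(dist, (x - x[idx[i - 1]]).pow(2).sum(-1))
        idx[i] = dist.argmax()
    return idx

def knn(points, queries, k):
    # Indices of the k nearest points for each query: (M, k)
    return torch.cdist(queries, points).topk(k, largest=False).indices

class PointTnTStage(nn.Module):
    """One stage: self-attention over edge features inside each patch,
    alpha-aggregation into the anchors, then attention between anchors."""
    def __init__(self, d_E=32, d_Y=192):
        super().__init__()
        tf = lambda d, h: nn.TransformerEncoderLayer(
            d, h, dim_feedforward=2 * d, batch_first=True, norm_first=True)
        self.local_tf = tf(d_E, 4)            # pre-norm, as in Sec. 3.1
        self.global_tf = tf(d_Y, 3)
        self.proj = nn.Linear(2 * d_E, d_Y)   # W_l

    def forward(self, E, Y):
        # E: (M, k, d_E) edge features; Y: (1, M, d_Y) anchor features
        E = self.local_tf(E)                                  # local branch
        agg = torch.cat([E.max(1).values, E.mean(1)], -1)     # alpha(E_l^m)
        Y = Y + self.proj(agg).unsqueeze(0)                   # Y~ = Y + E_l W_l
        return E, self.global_tf(Y)                           # global branch

pts = torch.randn(1024, 3)
anchors = pts[fps(pts, 192)]                           # (M, 3)
edges = pts[knn(pts, anchors, 20)] - anchors[:, None]  # e_mj = x_j - x_m
E = nn.Linear(3, 32)(edges)                            # E_0^m = E^m W_E
Y = nn.Linear(3, 192)(anchors)[None]                   # Y_0 = Y W_Y
E, Y = PointTnTStage()(E, Y)
print(E.shape, Y.shape)                  # (192, 20, 32) (1, 192, 192)
```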
3.3 Computational Analysis
Although a naive Transformer implementation of self-attention has $\mathcal{O}\left(N^{2}\right)$ complexity, splitting the attention into local and global branches reduces it substantially.
The local Transformers in our method have complexity $\mathcal{O}\left(M k^{2}\right)$ and the global Transformer $\mathcal{O}\left(M^{2}\right)$. With a limited number of neighbours and anchor points, $M k^{2}+M^{2}<N^{2}$ is easily satisfied.
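As a back-of-the-envelope check, take an assumed input size of $N = 2048$ points with the $M$ and $k$ values used later in the classification experiments:

```python
# Attention-cost comparison; N = 2048 is an illustrative input size.
N, M, k = 2048, 192, 20
print(N**2)             # full self-attention:    4,194,304 pairwise scores
print(M * k**2 + M**2)  # Point-TnT local+global:   113,664 (~37x fewer)
```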
4. Experiments
4.1 Shape Classification
Dataset = ScanObjectNN (PB_T50_RS variant)
- contains 14,510 real-world 3D objects in 15 categories
- train/test split = 80/20
- Anchor points $M = 192$
- Neighbours $k = 20$
- Embedding dimensions $d_Y = 192$, $d_E = 32$
- Global feature embedding dimension $d_f = 1024$
- Transformer blocks $L = 7$
- Attention heads $h = 3$
- Final MLP: two hidden layers of size 512 and 256 respectively, with batch normalization and dropout applied in each layer
- Loss = the standard cross-entropy loss function
- Epochs = 500
- Batch_size = 32
- Optimizer = AdamW (see the configuration sketch after this list)
- Weight decay = 0.1
- Initial learning rate = 0.001, decreased by cosine schedule
- Data augmentation: RSMix in addition to random anisotropic scaling and shifting
- Baseline: use all points as anchor points with no neighbours, i.e., drop the local branch of the network
- Protocol 1: evaluate on the test set with the model from the final training epoch
- Protocol 2: evaluate on the test set after every training epoch and report the best test accuracy
- Finding: local geometric properties are important, since the baseline without the local branch performs worse
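The optimizer settings above map onto standard PyTorch calls; a hedged sketch, with `model` as a placeholder, is:

```python
# Sketch of the stated training configuration: AdamW, weight decay 0.1,
# lr 0.001 with cosine decay over 500 epochs. `model` is a stand-in.
import torch

model = torch.nn.Linear(3, 15)  # placeholder for Point-TnT (15 classes)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=500)

for epoch in range(500):
    ...  # train one epoch with cross-entropy loss, batch size 32
    sched.step()  # cosine-decay the learning rate once per epoch
```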
Dataset = ModelNet40
- contains 9,843 synthetic shapes for training and 2,468 for testing, in 40 different categories
- obtains 92.6 ± 0.2% and 93.2 ± 0.2% accuracy under Protocol 1 and Protocol 2 respectively
4.2 Ablation Study
Model scaling
Attention mechanisms
global attention → helps stabilize training
Number of anchors and neighbours
4.3 Feature Matching on 3DMatch
3DMatch dataset consists of 62 indoor scenes collected from RGB-D measurements.
Feature matching task: produce descriptors for local patches of a scene, which are used to match scenes with overlapping regions.
Given a pair of point cloud scenes $(\mathcal{X}, \mathcal{X}^{\prime})$ with at least 30% overlap, corresponding points are found by extracting local features and matching them with a nearest-neighbour search. A rigid transformation is then estimated with a method such as RANSAC to register the two scenes.
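To illustrate the correspondence step, here is a small sketch using SciPy; `mutual_nn_matches` and the random descriptors are hypothetical stand-ins for the learned Point-TnT features, and the mutual filtering is one common choice rather than something the paper prescribes.

```python
# Nearest-neighbour feature matching between two scenes' descriptors.
import numpy as np
from scipy.spatial import cKDTree

def mutual_nn_matches(feat_a, feat_b):
    # feat_a: (Na, d), feat_b: (Nb, d) local patch descriptors
    ab = cKDTree(feat_b).query(feat_a)[1]    # nearest b for each a
    ba = cKDTree(feat_a).query(feat_b)[1]    # nearest a for each b
    keep = ba[ab] == np.arange(len(feat_a))  # keep only mutual matches
    return np.stack([np.arange(len(feat_a))[keep], ab[keep]], axis=1)

matches = mutual_nn_matches(np.random.rand(500, 32), np.random.rand(480, 32))
# `matches` would then seed a RANSAC rigid-transform estimate for registration
```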
- train/test = 54/8
- Loss = hardest contrastive loss (sketched after this list)
- Epochs = 10
- Optimizer = AdamW
- Weight decay = 0.1
- Initial learning rate =0.001, decreased by cosine schedule
- Points per local patch $N = 256$
- Anchor points $M = 48$
- Neighbours $k = 10$
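For reference, a minimal sketch of a hardest-contrastive loss of the kind named above, assuming PyTorch; the margins `m_pos`/`m_neg` and the in-batch negative mining are illustrative and may differ from the exact formulation the authors follow.

```python
# Hardest-contrastive loss over P corresponding descriptor pairs.
import torch

def hardest_contrastive_loss(f_a, f_b, m_pos=0.1, m_neg=1.4):
    # f_a, f_b: (P, d) descriptors of matched (positive) pairs
    d_pos = (f_a - f_b).norm(dim=1)
    pos = (d_pos - m_pos).clamp(min=0).pow(2).mean()     # pull positives together
    dists = torch.cdist(f_a, f_b)                        # (P, P) all-pairs distances
    dists.fill_diagonal_(float("inf"))                   # exclude the true matches
    hardest = dists.min(dim=1).values                    # hardest negative per anchor
    neg = (m_neg - hardest).clamp(min=0).pow(2).mean()   # push hardest negatives apart
    return pos + neg
```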
5. Conclusion
- Applying self-attention to the entire point cloud is unsatisfactory in both performance and computational cost.
- Adding local patches of points improves performance.
- The Transformer architecture has been shown to work better on local image patches than on individual pixels.
- This also makes feature extraction more computationally tractable.