Paper Reading 12: Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition (2021, STTFormer)

Latest recommended article published on 2022-12-07 10:27:20

梅津太郎

Category: Paper Reading. Tags: paper reading, transformer, deep learning

Article link: https://blog.csdn.net/gaocui883/article/details/128190674


Current problem: existing Transformer-based methods cannot capture the correlation of different joints between frames.
The skeleton sequence is divided into several parts, and the consecutive frames contained in each part are encoded. A spatio-temporal tuples self-attention module is then proposed to capture the relationship of different joints in consecutive frames. In addition, a feature aggregation module is introduced between non-adjacent frames to enhance the ability to distinguish similar actions.

These are the three main points of the paper.

The earlier methods (RNN, CNN, GCN) cannot effectively model the long-term dependence of sequences or the global correlation of spatio-temporal joints. ~~Could strategies such as non-local or SlowFast be used for skeleton-based action recognition?~~
Focus of this paper: extract the related features of different joints between adjacent frames.

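To me, this reads more like an explanation than a motivation.
Two modules:
- STTA: spatio-temporal tuples self-attention. Attention within local consecutive frames.
- IFFA: attention between non-consecutive frames, for information aggregation.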

Overall Architecture
Spatio-Temporal Tuples Encoding
- Input: $X \in \mathbb{R}^{C_0 \times T_0 \times V_0}$ ($C_0$ channels, $T_0$ frames, $V_0$ joints)
- $\mathrm{conv1} = \mathrm{Conv} + \mathrm{BatchNorm} + \mathrm{LeakyReLU}$.
  
  $X$ shape: $(C_0, T_0, V_0) \rightarrow (C_1, T_0, V_0)$
- Divide the skeleton sequence: $X.\mathrm{reshape}(C_1, T, n, V_0)$ with $T_0 = T \times n$. Here $T \times n$ means that the original sequence is split into $T$ segments, each containing $n$ consecutive frames of the original sequence.
- Flatten: $X \gets X.\mathrm{reshape}(C_1, T, n \cdot V_0)$
- $\mathrm{conv2} = \text{convolutional layer} + \mathrm{LeakyReLU}$
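The divide and flatten steps can be sketched with NumPy. This is only an illustration of the two reshapes, not the authors' code: the function name `encode_tuples` and the toy sizes are made up here, and both convolutions are left out.

```python
import numpy as np

def encode_tuples(x, n):
    """Split a (C1, T0, V0) feature map into T tuples of n consecutive frames.

    Returns an array of shape (C1, T, n * V0), where T = T0 // n.
    """
    c1, t0, v0 = x.shape
    if t0 % n != 0:
        raise ValueError("T0 must be divisible by n")
    t = t0 // n
    # Divide: consecutive frames are grouped n at a time.
    x = x.reshape(c1, t, n, v0)
    # Flatten: the n frames of one tuple share a single axis of n * V0 joints.
    return x.reshape(c1, t, n * v0)

# Hypothetical sizes: 2 channels, 6 frames, 3 joints, tuples of n = 2 frames.
x = np.arange(2 * 6 * 3).reshape(2, 6, 3)
tuples = encode_tuples(x, n=2)
print(tuples.shape)  # (2, 3, 6)
```

After the flatten, each of the $T$ tuples holds $n \cdot V_0$ joint tokens, so attention inside a tuple can relate a joint in one frame to a different joint in a neighbouring frame.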
Positional Encoding

$\begin{array}{l} PE(p, 2i)=\sin \left(p / 10000^{2i / C_{in}}\right) \\ PE(p, 2i+1)=\cos \left(p / 10000^{2i / C_{in}}\right) \end{array}$

where $p$ denotes the position and $i$ denotes the dimension index.
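A small pure-Python sketch of this formula follows. The function name `positional_encoding` and the sizes are hypothetical, and what counts as a "position" in the actual model is not specified here.

```python
import math

def positional_encoding(num_positions, c_in):
    """Return a num_positions x c_in table of sinusoidal encodings."""
    pe = [[0.0] * c_in for _ in range(num_positions)]
    for p in range(num_positions):
        for k in range(c_in):
            i = k // 2  # channels 2i and 2i+1 share one frequency
            angle = p / (10000 ** (2 * i / c_in))
            # Even channels use sin, odd channels use cos.
            pe[p][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

pe = positional_encoding(num_positions=4, c_in=6)
print(pe[0])  # position 0: every sin term is 0.0, every cos term is 1.0
```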
Spatio-Temporal Tuples Transformer
- Spatio-Temporal Tuples Attention
1. Compute Q, K, V
  
  $\mathbf{Q}, \mathbf{K}, \mathbf{V}=\operatorname{Conv}_{2D(1 \times 1)}\left(\mathbf{X}_{in}\right)$
2. Compute X with multi-head attention
  - $\mathbf{X}_{attn}=\operatorname{Tanh}\left(\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{C}}\right) \mathbf{V}$
  - $\mathbf{X}_{Attn}=\operatorname{Concat}\left(\mathbf{X}_{attn}^{1}, \cdots, \mathbf{X}_{attn}^{h}\right)$
  - $\mathbf{X}_{STTA}=\operatorname{Conv}_{2D\left(1 \times k_{1}\right)}\left(\mathbf{X}_{Attn}\right)$
3. Apply a residual connection before the feed-forward layer; the residual branch is a $1 \times 1$ convolution.
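The attention step for one tuple can be sketched in NumPy as below. This is a rough sketch under several assumptions, not the paper's implementation. The $1 \times 1$ convolutions are written as plain matrix products over the channel axis. $C$ is taken to be the per-head channel count. The final $1 \times k_1$ convolution, the residual branch, and the feed-forward layer are omitted. All names and sizes are made up.

```python
import numpy as np

def tanh_attention(x, w_q, w_k, w_v):
    """One attention head over the tokens of a single tuple.

    x: (N, C_in) tokens; w_q, w_k, w_v: (C_in, C) projection matrices.
    A 1x1 convolution acts on each token independently, so it is written
    here as a matrix product over the channel axis.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    c = q.shape[-1]
    # Tanh is used in place of softmax: weights lie in [-1, 1]
    # and the rows are not normalised to sum to 1.
    weights = np.tanh(q @ k.T / np.sqrt(c))
    return weights @ v, weights

def multi_head(x, heads):
    """Concatenate the per-head outputs along the channel axis."""
    return np.concatenate([tanh_attention(x, *w)[0] for w in heads], axis=-1)

rng = np.random.default_rng(0)
n_tokens, c_in, c, h = 6, 8, 4, 2  # hypothetical sizes
x = rng.standard_normal((n_tokens, c_in))
heads = [tuple(rng.standard_normal((c_in, c)) for _ in range(3)) for _ in range(h)]
out = multi_head(x, heads)
print(out.shape)  # (6, 8)
```

Here `n_tokens` plays the role of $n \cdot V_0$, the joints of all frames in one tuple, so the weight matrix relates every joint to every other joint across the tuple's frames.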

...

Aggregate the $T$ sub-actions.

Of course, residual connections are still needed.

...
