3. Model Architecture
3.1. Graph Convolutional Neural Network
Graph convolutional neural networks (GCNs) are a general and effective framework for learning representations of graph-structured data. Various GCN variants have achieved state-of-the-art results on many tasks. For skeleton-based action recognition, let $G_t = \{ V_t, E_t \}$ denote a graph of the human skeleton on a single frame at time $t$, where $V_t$ is the set of $N$ joint nodes and $E_t$ is the set of skeleton edges. The neighbor set of a node $v_{ti}$ is defined as:
$$N(v_{ti}) = \{ v_{tj} \mid d(v_{ti}, v_{tj}) \leq D \},$$
where $d(v_{ti}, v_{tj})$ is the minimum path length from $v_{tj}$ to $v_{ti}$. A graph labeling function $\ell : V_t \rightarrow \{ 1, 2, \ldots, K \}$ is designed to assign a label from $\{ 1, 2, \ldots, K \}$ to each graph node $v_{ti} \in V_t$, which partitions the neighbor set $N(v_{ti})$ of node $v_{ti}$ into a fixed number of $K$ subsets. The graph convolution is generally computed as:
$$Y_{\text{out}}(v_{ti}) = \sum_{v_{tj} \in N(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} X(v_{tj}) W(\ell(v_{tj})) \quad (1)$$
where $X(v_{tj})$ is the feature of node $v_{tj}$, and $W(\cdot)$ is a weight function that allocates one of $K$ weight matrices indexed by the label $\ell(v_{tj})$. $Z_{ti}(v_{tj})$ is the cardinality of the subset containing $v_{tj}$, which normalizes the feature representations. $Y_{\text{out}}(v_{ti})$ denotes the output of the graph convolution at node $v_{ti}$. More specifically, with the adjacency matrix, Eqn. 1 can be represented as:
$$Y_{\text{out}} = \sum_{k=1}^{K} \Lambda_k^{-\frac{1}{2}} A_k \Lambda_k^{-\frac{1}{2}} X W_k \quad (2)$$
where $A_k$ is the adjacency matrix in spatial configuration for the label $k \in \{ 1, 2, \ldots, K \}$, and $\Lambda_k^{ii} = \sum_j A_k^{ij}$ is the corresponding degree matrix.
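To make Eqn. 2 concrete, here is a minimal PyTorch sketch of the normalized graph convolution. It is not the authors' implementation; the toy sizes, the random adjacency partitions, and the `eps` guard for isolated nodes are our own illustrative assumptions.

```python
import torch

def graph_conv(X, A_list, W_list, eps=1e-6):
    """Normalized graph convolution of Eqn. (2):
    Y = sum_k Lambda_k^{-1/2} A_k Lambda_k^{-1/2} X W_k.
    X: (N, C_in) node features; A_list: K adjacency matrices of shape (N, N),
    one per label subset; W_list: K weight matrices of shape (C_in, C_out)."""
    Y = 0.0
    for A_k, W_k in zip(A_list, W_list):
        deg = A_k.sum(dim=1)                     # diagonal of the degree matrix Lambda_k
        d = (deg + eps).rsqrt()                  # Lambda_k^{-1/2}; eps guards isolated nodes
        A_norm = d[:, None] * A_k * d[None, :]   # Lambda_k^{-1/2} A_k Lambda_k^{-1/2}
        Y = Y + A_norm @ X @ W_k
    return Y

# Toy example: N = 5 joints, K = 3 label subsets, 4 -> 8 channels.
torch.manual_seed(0)
N, K, C_in, C_out = 5, 3, 4, 8
A_list = [torch.randint(0, 2, (N, N)).float() for _ in range(K)]
W_list = [0.1 * torch.randn(C_in, C_out) for _ in range(K)]
X = torch.randn(N, C_in)
print(graph_conv(X, A_list, W_list).shape)  # torch.Size([5, 8])
```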
3.2. Attention Enhanced Graph Convolutional LSTM
For sequence modeling, many studies have demonstrated that LSTM, as a variant of RNN, has a strong ability to model long-term temporal dependencies, and various LSTM-based models have been employed to learn the temporal dynamics of skeleton sequences. However, due to the fully connected operator within LSTM, such models largely ignore the spatial correlations among joints that matter for skeleton-based action recognition. Compared with LSTM, AGC-LSTM can not only capture discriminative features in the spatial configuration and temporal dynamics, but also explore the co-occurrence relationship between the spatial and temporal domains.
Like LSTM, AGC-LSTM contains three gates: an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$. However, these gates are obtained with the graph convolution operator. The input $X_t$, hidden state $H_t$, and cell memory $C_t$ of AGC-LSTM are graph-structured data. Fig. 3 shows the structure of the AGC-LSTM unit. Due to the graph convolutional operator within AGC-LSTM, the cell memory $C_t$ and hidden state $H_t$ not only exhibit temporal dynamics but also contain spatial structural information. The functions of the AGC-LSTM unit are defined as follows:
$$\begin{aligned}
i_t &= \sigma(W_{xi} *_{g} X_t + W_{hi} *_{g} H_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} *_{g} X_t + W_{hf} *_{g} H_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} *_{g} X_t + W_{ho} *_{g} H_{t-1} + b_o) \\
u_t &= \tanh(W_{xc} *_{g} X_t + W_{hc} *_{g} H_{t-1} + b_c) \\
C_t &= f_t \odot C_{t-1} + i_t \odot u_t \\
\hat{H}_t &= o_t \odot \tanh(C_t) \\
H_t &= f_{att}(\hat{H}_t) + \hat{H}_t
\end{aligned} \quad (3)$$
where $*_{g}$ denotes the graph convolution operator and $\odot$ denotes the Hadamard product. $\sigma$ is the sigmoid activation function, $u_t$ is the modulated input, and $\hat{H}_t$ is an intermediate hidden state. $W_{xi} *_{g} X_t$ denotes a graph convolution of $X_t$ with $W_{xi}$, which can be written as Eqn. 1. $f_{att}(\cdot)$ is an attention network that selects discriminative information of key nodes. Taking the sum of $f_{att}(\hat{H}_t)$ and $\hat{H}_t$ as the output strengthens the information of key nodes without weakening the information of non-focused nodes, which maintains the integrity of the spatial information.
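The following is a minimal sketch of one AGC-LSTM step of Eqn. 3, reusing the `graph_conv` function from the sketch above. The class name `AGCLSTMCell`, the equal input/hidden channel width, and the placeholder `attention` argument (a drop-in for the $f_{att}$ network sketched after Eqns. 4-7) are our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class AGCLSTMCell(nn.Module):
    """One AGC-LSTM step (Eqn. 3): every gate replaces the fully connected
    operator of a standard LSTM with the graph convolution *_g, realized here
    by the graph_conv function above. Input and hidden channel widths are kept
    equal for brevity."""
    def __init__(self, num_subsets, channels, attention):
        super().__init__()
        def stack():  # K weight matrices, one per label subset
            return nn.ParameterList(
                [nn.Parameter(0.1 * torch.randn(channels, channels))
                 for _ in range(num_subsets)])
        self.Wx = nn.ModuleList([stack() for _ in range(4)])  # W_{xi}, W_{xf}, W_{xo}, W_{xc}
        self.Wh = nn.ModuleList([stack() for _ in range(4)])  # W_{hi}, W_{hf}, W_{ho}, W_{hc}
        self.bias = nn.Parameter(torch.zeros(4, channels))    # b_i, b_f, b_o, b_c
        self.att = attention                                  # f_att, the attention network

    def forward(self, X_t, H_prev, C_prev, A_list):
        # Gate pre-activations: graph convolutions of X_t and H_{t-1}.
        g = [graph_conv(X_t, A_list, list(self.Wx[k]))
             + graph_conv(H_prev, A_list, list(self.Wh[k])) + self.bias[k]
             for k in range(4)]
        i_t, f_t, o_t = [torch.sigmoid(v) for v in g[:3]]
        u_t = torch.tanh(g[3])                 # modulated input
        C_t = f_t * C_prev + i_t * u_t         # cell memory
        H_hat = o_t * torch.tanh(C_t)          # intermediate hidden state
        H_t = self.att(H_hat) + H_hat          # attention-enhanced output
        return H_t, C_t

# Placeholder f_att (a no-op); the SpatialAttention sketch below is a drop-in.
cell = AGCLSTMCell(num_subsets=3, channels=8,
                   attention=lambda H: torch.zeros_like(H))
H, C = torch.zeros(5, 8), torch.zeros(5, 8)
H, C = cell(torch.randn(5, 8), H, C, A_list)  # A_list from the previous sketch
```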
The attention network is employed to adaptively focus on key joints with a soft attention mechanism that automatically measures the importance of joints. An illustration of the spatial attention network is shown in Fig. 4. The intermediate hidden state $\hat{H}_t$ of AGC-LSTM contains rich spatial structural information and temporal dynamics that are beneficial in guiding the selection of key joints, so we first aggregate the information of all nodes as a query feature:
$$q_t = \mathrm{ReLU}\left(\sum_{i=1}^{N} W \hat{H}_{ti}\right) \quad (4)$$
where $W$ is a learnable parameter matrix. Then the attention scores of all nodes can be calculated as:
$$\alpha_t = \mathrm{Sigmoid}\left(U_s \tanh(W_h \hat{H}_t + W_q q_t + b_s) + b_u\right) \quad (5)$$
where $\alpha_t = (\alpha_{t1}, \alpha_{t2}, \ldots, \alpha_{tN})$, $U_s, W_h, W_q$ are learnable parameter matrices, and $b_s, b_u$ are bias terms. We use the sigmoid non-linearity because multiple key joints may exist simultaneously. The hidden state $H_{ti}$ of node $v_{ti}$ can thus also be written as $(1 + \alpha_{ti}) \cdot \hat{H}_{ti}$. The attention-enhanced hidden state $H_t$ is then fed into the next AGC-LSTM layer. Note that, at the last AGC-LSTM layer, the aggregation of all node features serves as a global feature $F^g_t$, and the weighted sum of the focused nodes serves as a local feature $F^l_t$:
$$F^g_t = \sum_{i=1}^{N} H_{ti} \quad (6)$$
$$F^l_t = \sum_{i=1}^{N} \alpha_{ti} \cdot \hat{H}_{ti} \quad (7)$$
The global feature $F^g_t$ and local feature $F^l_t$ are used to predict the class of the human action.
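Below is a minimal sketch of the spatial attention network of Eqns. 4-7. The module name `SpatialAttention`, the dimensions `d_h` and `d_q`, and the use of `nn.Linear` biases to play the roles of $b_s$ and $b_u$ are illustrative assumptions. Its forward pass returns $\alpha_t \odot \hat{H}_t$, so that $f_{att}(\hat{H}_t) + \hat{H}_t = (1+\alpha_t) \odot \hat{H}_t$ as in the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Soft spatial attention over the N joint nodes (Eqns. 4-7).
    H_hat: (N, d_h) intermediate hidden states of one time step."""
    def __init__(self, d_h, d_q):
        super().__init__()
        self.W  = nn.Linear(d_h, d_q, bias=False)  # W in Eqn. (4)
        self.Wh = nn.Linear(d_h, d_q, bias=False)  # W_h in Eqn. (5)
        self.Wq = nn.Linear(d_q, d_q, bias=True)   # W_q; its bias plays the role of b_s
        self.Us = nn.Linear(d_q, 1, bias=True)     # U_s; its bias plays the role of b_u

    def scores(self, H_hat):
        q_t = torch.relu(self.W(H_hat).sum(dim=0))              # query feature, Eqn. (4)
        e = self.Us(torch.tanh(self.Wh(H_hat) + self.Wq(q_t)))  # (N, 1)
        return torch.sigmoid(e)                                 # alpha_t, Eqn. (5)

    def forward(self, H_hat):
        # f_att(H_hat) = alpha_t * H_hat, so H_t = f_att(H_hat) + H_hat = (1 + alpha_t) * H_hat.
        return self.scores(H_hat) * H_hat

# At the last AGC-LSTM layer: global and local features, Eqns. (6)-(7).
att = SpatialAttention(d_h=8, d_q=16)
H_hat = torch.randn(5, 8)
alpha = att.scores(H_hat)                  # (N, 1)
F_g = ((1 + alpha) * H_hat).sum(dim=0)     # Eqn. (6): sum of enhanced states H_ti
F_l = (alpha * H_hat).sum(dim=0)           # Eqn. (7): weighted sum of focused nodes
```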
3.3. AGC-LSTM Network
We propose an end-to-end attention-enhanced graph convolutional LSTM network (AGC-LSTM) for skeleton-based human action recognition. The overall pipeline of our model is shown in Fig.2. In the following, we discuss the rationale behind the proposed framework in detail.
Joint Feature Representation
For the skeleton sequence, we first map the 3D coordinates of each joint into a high-dimensional feature space using a linear layer and an LSTM layer. The linear layer encodes the coordinates of the joints into 256-dimensional position features $P_t \in \mathbb{R}^{N \times 256}$, where $P_{ti} \in \mathbb{R}^{1 \times 256}$ denotes the position representation of joint $i$. Since it contains only position information, the position feature $P_{ti}$ is beneficial for learning spatially structured characteristics in the graph model, while the frame difference feature $V_{ti}$ between two consecutive frames facilitates the acquisition of dynamic information in AGC-LSTM. To take advantage of both, the two features are concatenated into an augmented feature that enriches the feature information. However, the direct concatenation of the position feature $P_{ti}$ and the frame difference feature $V_{ti}$ suffers from scale variance between the two feature vectors. We therefore adopt an LSTM layer to dispel the scale variance between the two features:
$$E_{ti} = f_{lstm}(\mathrm{concat}(P_{ti}, V_{ti})) = f_{lstm}(\mathrm{concat}(P_{ti}, P_{ti} - P_{(t-1)i})) \quad (8)$$
where $E_{ti}$ is the augmented feature of joint $i$ at time $t$. Note that the linear layer and the LSTM are shared among the different joints.
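A sketch of this augmented feature encoder (Eqn. 8) under assumed layer sizes follows; the 256-dim position feature matches the text, while the class name and the zero-valued first frame difference are our own choices.

```python
import torch
import torch.nn as nn

class AugmentedFeature(nn.Module):
    """Joint feature encoder of Eqn. (8): a shared linear layer maps 3D
    coordinates to 256-dim position features; the concatenation of position
    and frame-difference features is fused by a shared LSTM over time."""
    def __init__(self, d_e=256):
        super().__init__()
        self.pos = nn.Linear(3, 256)                          # shared across joints
        self.lstm = nn.LSTM(2 * 256, d_e, batch_first=True)   # shared across joints

    def forward(self, joints):
        # joints: (T, N, 3) sequence of 3D coordinates for N joints.
        P = self.pos(joints)                                  # (T, N, 256) position features
        V = torch.cat([torch.zeros_like(P[:1]),
                       P[1:] - P[:-1]], dim=0)                # frame differences (V_1 = 0)
        x = torch.cat([P, V], dim=-1)                         # concat(P_ti, V_ti)
        x = x.permute(1, 0, 2)                                # (N, T, 512): joints as batch
        E, _ = self.lstm(x)                                   # shared LSTM dispels scale variance
        return E.permute(1, 0, 2)                             # (T, N, d_e) augmented features

enc = AugmentedFeature()
E = enc(torch.randn(20, 25, 3))   # e.g. T = 20 frames, N = 25 joints
print(E.shape)                    # torch.Size([20, 25, 256])
```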
Temporal Hierarchical Architecture
After the LSTM layer, the sequence $\{E_1, E_2, \ldots, E_T\}$ of augmented features is fed into the following AGC-LSTM layers as the node features, where $E_t \in \mathbb{R}^{N \times d_e}$. The proposed model stacks three AGC-LSTM layers to learn the spatial configuration and temporal dynamics. Inspired by spatial pooling in CNNs, we present a temporal hierarchical architecture of AGC-LSTM with average pooling in the temporal domain to increase the temporal receptive field of the top AGC-LSTM layers. Through this temporal hierarchical architecture, the temporal receptive field of each time-step input at the top AGC-LSTM layer grows from a single frame into a short-term clip, which makes the model more sensitive to the temporal dynamics. In addition, it significantly reduces the computational cost while improving performance.
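As an illustration of the temporal hierarchy, the sketch below applies average pooling in the temporal domain between layers; the pooling window of 2 is an assumed value chosen for illustration.

```python
import torch

def temporal_avg_pool(H, window=2):
    """Average pooling in the temporal domain between AGC-LSTM layers.
    H: (T, N, C) hidden states; returns (T // window, N, C), so each
    time step of the next layer covers a longer short-term clip."""
    T, N, C = H.shape
    T = (T // window) * window                        # drop any trailing remainder frames
    return H[:T].reshape(T // window, window, N, C).mean(dim=1)

H1 = torch.randn(20, 25, 256)                         # output of the first AGC-LSTM layer
H2_in = temporal_avg_pool(H1)                         # (10, 25, 256) -> second layer
H3_in = temporal_avg_pool(temporal_avg_pool(H1))      # (5, 25, 256)  -> third layer
```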
Learning AGC-LSTM
Finally, the global feature $F^g_t$ and local feature $F^l_t$ of each time step are transformed into scores $o^g_t$ and $o^l_t$ over the $C$ classes, where $o_t = (o_{t1}, o_{t2}, \ldots, o_{tC})$. The predicted probability of the $i^{th}$ class is then obtained as:
$$\hat{y}_{ti} = \frac{e^{o_{ti}}}{\sum_{j=1}^{C} e^{o_{tj}}}, \quad i = 1, \ldots, C \quad (9)$$
During training, considering that the hidden state of each time step on the top AGC-LSTM layer contains short-term dynamics, we supervise our model with the following loss:
$$L = -\sum_{t=1}^{T_3} \sum_{i=1}^{C} y_i \log \hat{y}^g_{ti} - \sum_{t=1}^{T_3} \sum_{i=1}^{C} y_i \log \hat{y}^l_{ti} + \lambda \sum_{j=1}^{3} \sum_{n=1}^{N} \left(1 - \frac{\sum_{t=1}^{T_j} \alpha_{tnj}}{T_j}\right)^2 + \beta \sum_{j=1}^{3} \frac{1}{T_j} \sum_{t=1}^{T_j} \sum_{n=1}^{N} \alpha_{tnj}^2 \quad (10)$$
where $\mathbf{y} = (y_1, \ldots, y_C)$ is the ground-truth label and $T_j$ denotes the number of time steps in the $j^{th}$ AGC-LSTM layer. The third term encourages the model to pay equal attention to different joints over time, and the last term limits the number of attended nodes. $\lambda$ and $\beta$ are weight decay coefficients. Note that only the sum of the probabilities $\hat{y}^g_{T_3}$ and $\hat{y}^l_{T_3}$ at the last time step is used to predict the class of the human action.
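A sketch of the loss of Eqn. 10 follows, assuming the attention scores of the three layers have been collected during the forward pass; the function name and the $\lambda$, $\beta$ defaults are illustrative, not values from the paper.

```python
import torch
import torch.nn.functional as F

def agc_lstm_loss(logits_g, logits_l, y, alphas, lam=0.01, beta=0.001):
    """Training loss of Eqn. (10).
    logits_g, logits_l: (T3, C) global/local class scores at the top layer.
    y: scalar ground-truth class index.
    alphas: list of 3 attention-score tensors, alphas[j] of shape (T_j, N)."""
    T3 = logits_g.shape[0]
    target = torch.full((T3,), y, dtype=torch.long)
    # The two cross-entropy terms, summed over the T3 time steps.
    loss = F.cross_entropy(logits_g, target, reduction='sum') \
         + F.cross_entropy(logits_l, target, reduction='sum')
    for a in alphas:                                           # a: (T_j, N)
        loss = loss + lam * ((1 - a.mean(dim=0)) ** 2).sum()   # equal attention over joints
        loss = loss + beta * (a ** 2).sum(dim=1).mean()        # limit number of attended nodes
    return loss
```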
Although the joint-based AGC-LSTM network already achieves state-of-the-art results, we also explore the performance of the proposed model at the part level. According to the physical structure of the human body, the skeleton can be divided into several parts. Similar to the joint-based AGC-LSTM network, we first capture part features with a linear layer and a shared LSTM layer. The part features are then fed into three AGC-LSTM layers as node representations to model the spatio-temporal characteristics. The results show that our model also achieves superior performance at the part level. Furthermore, a hybrid model (shown in Fig. 5) based on joints and parts leads to further performance improvement.