3. Model Architecture
3.1. Graph Convolutional Neural Network
Graph convolutional neural networks (GCNs) are a general and effective framework for learning representations of graph-structured data. Various GCN variants have achieved state-of-the-art results on many tasks. For skeleton-based action recognition, let $G_t = \{ V_t, E_t \}$ denote a graph of the human skeleton on a single frame at time $t$, where $V_t$ is the set of $N$ joint nodes and $E_t$ is the set of skeleton edges. The neighbor set of a node $v_{ti}$ is defined as:
$$N(v_{ti}) = \{ v_{tj} \mid d(v_{ti}, v_{tj}) \leq D \},$$
where $d(v_{ti}, v_{tj})$ is the minimum path length from $v_{tj}$ to $v_{ti}$. A graph labeling function $\ell : V_t \rightarrow \{ 1, 2, \ldots, K \}$ is designed to assign a label from $\{ 1, 2, \ldots, K \}$ to each graph node $v_{ti} \in V_t$, which partitions the neighbor set $N(v_{ti})$ of node $v_{ti}$ into a fixed number of $K$ subsets. The graph convolution is generally computed as:
$$Y_{\text{out}}(v_{ti}) = \sum_{v_{tj} \in N(v_{ti})} \frac{1}{Z_{ti}(v_{tj})} X(v_{tj}) W(\ell(v_{tj})) \quad (1)$$
where $X(v_{tj})$ is the feature of node $v_{tj}$, and $W(\cdot)$ is a weight function that allocates one of $K$ weight matrices indexed by the label $\ell(v_{tj})$. $Z_{ti}(v_{tj})$ is the cardinality of the subset containing $v_{tj}$, which normalizes the feature representations. $Y_{\text{out}}(v_{ti})$ denotes the output of the graph convolution at node $v_{ti}$. More specifically, with the adjacency matrix, Eqn. 1 can be represented as:
$$Y_{\text{out}} = \sum_{k=1}^{K} \Lambda_k^{-\frac{1}{2}} A_k \Lambda_k^{-\frac{1}{2}} X W_k \quad (2)$$
where $A_k$ is the adjacency matrix in spatial configuration for the label $k \in \{ 1, 2, \ldots, K \}$, and $\Lambda_k^{ii} = \sum_j A_k^{ij}$ is the corresponding degree matrix.
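To make Eqn. 2 concrete, here is a minimal PyTorch sketch of the normalized graph convolution. It is not the authors' implementation; the toy sizes, the random adjacency partitions, and the `eps` guard for isolated nodes are our own illustrative assumptions.

```python
import torch

def graph_conv(X, A_list, W_list, eps=1e-6):
    """Normalized graph convolution of Eqn. (2):
    Y = sum_k Lambda_k^{-1/2} A_k Lambda_k^{-1/2} X W_k.
    X: (N, C_in) node features; A_list: K adjacency matrices of shape (N, N),
    one per label subset; W_list: K weight matrices of shape (C_in, C_out)."""
    Y = 0.0
    for A_k, W_k in zip(A_list, W_list):
        deg = A_k.sum(dim=1)                     # diagonal of the degree matrix Lambda_k
        d = (deg + eps).rsqrt()                  # Lambda_k^{-1/2}; eps guards isolated nodes
        A_norm = d[:, None] * A_k * d[None, :]   # Lambda_k^{-1/2} A_k Lambda_k^{-1/2}
        Y = Y + A_norm @ X @ W_k
    return Y

# Toy example: N = 5 joints, K = 3 label subsets, 4 -> 8 channels.
torch.manual_seed(0)
N, K, C_in, C_out = 5, 3, 4, 8
A_list = [torch.randint(0, 2, (N, N)).float() for _ in range(K)]
W_list = [0.1 * torch.randn(C_in, C_out) for _ in range(K)]
X = torch.randn(N, C_in)
print(graph_conv(X, A_list, W_list).shape)  # torch.Size([5, 8])
```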
3.2. Attention Enhanced Graph Convolutional LSTM
For sequence modeling, many studies have demonstrated that LSTM, as a variant of RNN, has a strong ability to model long-term temporal dependencies, and various LSTM-based models have been employed to learn the temporal dynamics of skeleton sequences. However, due to the fully connected operator within LSTM, such models largely ignore the spatial correlations among joints that matter for skeleton-based action recognition. Compared with LSTM, AGC-LSTM can not only capture discriminative features in the spatial configuration and temporal dynamics, but also explore the co-occurrence relationship between the spatial and temporal domains.
Like LSTM, AGC-LSTM contains three gates: an input gate $i_t$, a forget gate $f_t$, and an output gate $o_t$. However, these gates are obtained with the graph convolution operator. The input $X_t$, hidden state $H_t$, and cell memory $C_t$ of AGC-LSTM are graph-structured data. Fig. 3 shows the structure of the AGC-LSTM unit. Due to the graph convolutional operator within AGC-LSTM, the cell memory $C_t$ and hidden state $H_t$ not only exhibit temporal dynamics but also contain spatial structural information. The functions of the AGC-LSTM unit are defined as follows:
$$\begin{aligned}
i_t &= \sigma(W_{xi} *_{g} X_t + W_{hi} *_{g} H_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} *_{g} X_t + W_{hf} *_{g} H_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} *_{g} X_t + W_{ho} *_{g} H_{t-1} + b_o) \\
u_t &= \tanh(W_{xc} *_{g} X_t + W_{hc} *_{g} H_{t-1} + b_c) \\
C_t &= f_t \odot C_{t-1} + i_t \odot u_t \\
\hat{H}_t &= o_t \odot \tanh(C_t) \\
H_t &= f_{att}(\hat{H}_t) + \hat{H}_t
\end{aligned} \quad (3)$$
where $*_{g}$ denotes the graph convolution operator and $\odot$ denotes the Hadamard product. $\sigma$ is the sigmoid activation function, $u_t$ is the modulated input, and $\hat{H}_t$ is an intermediate hidden state. $W_{xi} *_{g} X_t$ denotes a graph convolution of $X_t$ with $W_{xi}$, which can be written as Eqn. 1. $f_{att}(\cdot)$ is an attention network that selects discriminative information of key nodes. Taking the sum of $f_{att}(\hat{H}_t)$ and $\hat{H}_t$ as the output strengthens the information of key nodes without weakening the information of non-focused nodes, which maintains the integrity of the spatial information.
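The following is a minimal sketch of one AGC-LSTM step of Eqn. 3, reusing the `graph_conv` function from the sketch above. The class name `AGCLSTMCell`, the equal input/hidden channel width, and the placeholder `attention` argument (a drop-in for the $f_{att}$ network sketched after Eqns. 4-7) are our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class AGCLSTMCell(nn.Module):
    """One AGC-LSTM step (Eqn. 3): every gate replaces the fully connected
    operator of a standard LSTM with the graph convolution *_g, realized here
    by the graph_conv function above. Input and hidden channel widths are kept
    equal for brevity."""
    def __init__(self, num_subsets, channels, attention):
        super().__init__()
        def stack():  # K weight matrices, one per label subset
            return nn.ParameterList(
                [nn.Parameter(0.1 * torch.randn(channels, channels))
                 for _ in range(num_subsets)])
        self.Wx = nn.ModuleList([stack() for _ in range(4)])  # W_{xi}, W_{xf}, W_{xo}, W_{xc}
        self.Wh = nn.ModuleList([stack() for _ in range(4)])  # W_{hi}, W_{hf}, W_{ho}, W_{hc}
        self.bias = nn.Parameter(torch.zeros(4, channels))    # b_i, b_f, b_o, b_c
        self.att = attention                                  # f_att, the attention network

    def forward(self, X_t, H_prev, C_prev, A_list):
        # Gate pre-activations: graph convolutions of X_t and H_{t-1}.
        g = [graph_conv(X_t, A_list, list(self.Wx[k]))
             + graph_conv(H_prev, A_list, list(self.Wh[k])) + self.bias[k]
             for k in range(4)]
        i_t, f_t, o_t = [torch.sigmoid(v) for v in g[:3]]
        u_t = torch.tanh(g[3])                 # modulated input
        C_t = f_t * C_prev + i_t * u_t         # cell memory
        H_hat = o_t * torch.tanh(C_t)          # intermediate hidden state
        H_t = self.att(H_hat) + H_hat          # attention-enhanced output
        return H_t, C_t

# Placeholder f_att (a no-op); the SpatialAttention sketch below is a drop-in.
cell = AGCLSTMCell(num_subsets=3, channels=8,
                   attention=lambda H: torch.zeros_like(H))
H, C = torch.zeros(5, 8), torch.zeros(5, 8)
H, C = cell(torch.randn(5, 8), H, C, A_list)  # A_list from the previous sketch
```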
The attention network is employed to adaptively focus on key joints with a soft attention mechanism that automatically measures the importance of joints. An illustration of the spatial attention network is shown in Fig. 4. The intermediate hidden state $\hat{H}_t$ of AGC-LSTM contains rich spatial structural information and temporal dynamics that are beneficial in guiding the selection of key joints, so we first aggregate the information of all nodes as a query feature:
$$q_t = \mathrm{ReLU}\left(\sum_{i=1}^{N} W \hat{H}_{ti}\right) \quad (4)$$
where $W$ is a learnable parameter matrix. Then the attention scores of all nodes can be calculated as:
$$\alpha_t = \mathrm{Sigmoid}\left(U_s \tanh(W_h \hat{H}_t + W_q q_t + b_s) + b_u\right) \quad (5)$$
where $\alpha_t = (\alpha_{t1}, \alpha_{t2}, \ldots, \alpha_{tN})$, $U_s, W_h, W_q$ are learnable parameter matrices, and $b_s, b_u$ are bias terms. We use the sigmoid non-linearity because multiple key joints may exist simultaneously. The hidden state $H_{ti}$ of node $v_{ti}$ can thus also be written as $(1 + \alpha_{ti}) \cdot \hat{H}_{ti}$. The attention-enhanced hidden state $H_t$ is then fed into the next AGC-LSTM layer. Note that, at the last AGC-LSTM layer, the aggregation of all node features serves as a global feature $F^g_t$, and the weighted sum of the focused nodes serves as a local feature $F^l_t$:
$$F^g_t = \sum_{i=1}^{N} H_{ti} \quad (6)$$
$$F^l_t = \sum_{i=1}^{N} \alpha_{ti} \cdot \hat{H}_{ti} \quad (7)$$
The global feature $F^g_t$ and local feature $F^l_t$ are used to predict the class of the human action.
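Below is a minimal sketch of the spatial attention network of Eqns. 4-7. The module name `SpatialAttention`, the dimensions `d_h` and `d_q`, and the use of `nn.Linear` biases to play the roles of $b_s$ and $b_u$ are illustrative assumptions. Its forward pass returns $\alpha_t \odot \hat{H}_t$, so that $f_{att}(\hat{H}_t) + \hat{H}_t = (1+\alpha_t) \odot \hat{H}_t$ as in the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Soft spatial attention over the N joint nodes (Eqns. 4-7).
    H_hat: (N, d_h) intermediate hidden states of one time step."""
    def __init__(self, d_h, d_q):
        super().__init__()
        self.W  = nn.Linear(d_h, d_q, bias=False)  # W in Eqn. (4)
        self.Wh = nn.Linear(d_h, d_q, bias=False)  # W_h in Eqn. (5)
        self.Wq = nn.Linear(d_q, d_q, bias=True)   # W_q; its bias plays the role of b_s
        self.Us = nn.Linear(d_q, 1, bias=True)     # U_s; its bias plays the role of b_u

    def scores(self, H_hat):
        q_t = torch.relu(self.W(H_hat).sum(dim=0))              # query feature, Eqn. (4)
        e = self.Us(torch.tanh(self.Wh(H_hat) + self.Wq(q_t)))  # (N, 1)
        return torch.sigmoid(e)                                 # alpha_t, Eqn. (5)

    def forward(self, H_hat):
        # f_att(H_hat) = alpha_t * H_hat, so H_t = f_att(H_hat) + H_hat = (1 + alpha_t) * H_hat.
        return self.scores(H_hat) * H_hat

# At the last AGC-LSTM layer: global and local features, Eqns. (6)-(7).
att = SpatialAttention(d_h=8, d_q=16)
H_hat = torch.randn(5, 8)
alpha = att.scores(H_hat)                  # (N, 1)
F_g = ((1 + alpha) * H_hat).sum(dim=0)     # Eqn. (6): sum of enhanced states H_ti
F_l = (alpha * H_hat).sum(dim=0)           # Eqn. (7): weighted sum of focused nodes
```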
3.3. AGC-LSTM Network
We propose an end-to-end attention-enhanced graph convolutional LSTM network (AGC-LSTM) for skeleton-based human action recognition. The overall pipeline of our model is shown in Fig.2. In the following, we discuss the rationale behind the proposed framework in detail.
Joint Feature Representation
For the skeleton sequence, we first map the 3D coordinates of each joint into a high-dimensional feature space using a linear layer and an LSTM layer. The linear layer encodes the coordinates of the joints into 256-dimensional position features $P_t \in \mathbb{R}^{N \times 256}$, where $P_{ti} \in \mathbb{R}^{1 \times 256}$ denotes the position representation of joint $i$. Since it contains only position information, the position feature $P_{ti}$ is beneficial for learning spatially structured characteristics in the graph model, while the frame difference feature $V_{ti}$ between two consecutive frames facilitates the acquisition of dynamic information in AGC-LSTM. To take advantage of both, the two features are concatenated into an augmented feature that enriches the feature information. However, the direct concatenation of the position feature $P_{ti}$ and the frame difference feature $V_{ti}$ suffers from scale variance between the two feature vectors. We therefore adopt an LSTM layer to dispel the scale variance between the two features:
$$E_{ti} = f_{lstm}(\mathrm{concat}(P_{ti}, V_{ti})) = f_{lstm}(\mathrm{concat}(P_{ti}, P_{ti} - P_{(t-1)i})) \quad (8)$$
where $E_{ti}$ is the augmented feature of joint $i$ at time $t$. Note that the linear layer and the LSTM are shared among the different joints.
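A sketch of this augmented feature encoder (Eqn. 8) under assumed layer sizes follows; the 256-dim position feature matches the text, while the class name and the zero-valued first frame difference are our own choices.

```python
import torch
import torch.nn as nn

class AugmentedFeature(nn.Module):
    """Joint feature encoder of Eqn. (8): a shared linear layer maps 3D
    coordinates to 256-dim position features; the concatenation of position
    and frame-difference features is fused by a shared LSTM over time."""
    def __init__(self, d_e=256):
        super().__init__()
        self.pos = nn.Linear(3, 256)                          # shared across joints
        self.lstm = nn.LSTM(2 * 256, d_e, batch_first=True)   # shared across joints

    def forward(self, joints):
        # joints: (T, N, 3) sequence of 3D coordinates for N joints.
        P = self.pos(joints)                                  # (T, N, 256) position features
        V = torch.cat([torch.zeros_like(P[:1]),
                       P[1:] - P[:-1]], dim=0)                # frame differences (V_1 = 0)
        x = torch.cat([P, V], dim=-1)                         # concat(P_ti, V_ti)
        x = x.permute(1, 0, 2)                                # (N, T, 512): joints as batch
        E, _ = self.lstm(x)                                   # shared LSTM dispels scale variance
        return E.permute(1, 0, 2)                             # (T, N, d_e) augmented features

enc = AugmentedFeature()
E = enc(torch.randn(20, 25, 3))   # e.g. T = 20 frames, N = 25 joints
print(E.shape)                    # torch.Size([20, 25, 256])
```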
Temporal Hierarchical Architecture
After the LSTM layer, the sequence $\{E_1, E_2, \ldots, E_T\}$ of augmented features is fed into the following AGC-LSTM layers as the node features, where $E_t \in \mathbb{R}^{N \times d_e}$. The proposed model stacks three AGC-LSTM layers to learn the spatial configuration and temporal dynamics. Inspired by spatial pooling in CNNs, we present a temporal hierarchical architecture of AGC-LSTM with average pooling in the temporal domain to increase the temporal receptive field of the top AGC-LSTM layers. Through this temporal hierarchical architecture, the temporal receptive field of each time-step input at the top AGC-LSTM layer grows from a single frame into a short-term clip, which makes the model more sensitive to the temporal dynamics. In addition, it significantly reduces the computational cost while improving performance.
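As an illustration of the temporal hierarchy, the sketch below applies average pooling in the temporal domain between layers; the pooling window of 2 is an assumed value chosen for illustration.

```python
import torch

def temporal_avg_pool(H, window=2):
    """Average pooling in the temporal domain between AGC-LSTM layers.
    H: (T, N, C) hidden states; returns (T // window, N, C), so each
    time step of the next layer covers a longer short-term clip."""
    T, N, C = H.shape
    T = (T // window) * window                        # drop any trailing remainder frames
    return H[:T].reshape(T // window, window, N, C).mean(dim=1)

H1 = torch.randn(20, 25, 256)                         # output of the first AGC-LSTM layer
H2_in = temporal_avg_pool(H1)                         # (10, 25, 256) -> second layer
H3_in = temporal_avg_pool(temporal_avg_pool(H1))      # (5, 25, 256)  -> third layer
```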
Learning AGC-LSTM
Finally, the global feature $F^g_t$ and local feature $F^l_t$ of each time step are transformed into scores $o^g_t$ and $o^l_t$ over the $C$ classes, where $o_t = (o_{t1}, o_{t2}, \ldots, o_{tC})$. The predicted probability of the $i^{th}$ class is then obtained as:
$$\hat{y}_{ti} = \frac{e^{o_{ti}}}{\sum_{j=1}^{C} e^{o_{tj}}}, \quad i = 1, \ldots, C \quad (9)$$
During training, considering that the hidden state of each time step on the top AGC-LSTM layer contains short-term dynamics, we supervise our model with the following loss:
$$L = -\sum_{t=1}^{T_3} \sum_{i=1}^{C} y_i \log \hat{y}^g_{ti} - \sum_{t=1}^{T_3} \sum_{i=1}^{C} y_i \log \hat{y}^l_{ti} + \lambda \sum_{j=1}^{3} \sum_{n=1}^{N} \left(1 - \frac{\sum_{t=1}^{T_j} \alpha_{tnj}}{T_j}\right)^2 + \beta \sum_{j=1}^{3} \frac{1}{T_j} \sum_{t=1}^{T_j} \sum_{n=1}^{N} \alpha_{tnj}^2 \quad (10)$$
where $\mathbf{y} = (y_1, \ldots, y_C)$ is the ground-truth label and $T_j$ denotes the number of time steps in the $j^{th}$ AGC-LSTM layer. The third term encourages the model to pay equal attention to different joints over time, and the last term limits the number of attended nodes. $\lambda$ and $\beta$ are weight decay coefficients. Note that only the sum of the probabilities $\hat{y}^g_{T_3}$ and $\hat{y}^l_{T_3}$ at the last time step is used to predict the class of the human action.
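A sketch of the loss of Eqn. 10 follows, assuming the attention scores of the three layers have been collected during the forward pass; the function name and the $\lambda$, $\beta$ defaults are illustrative, not values from the paper.

```python
import torch
import torch.nn.functional as F

def agc_lstm_loss(logits_g, logits_l, y, alphas, lam=0.01, beta=0.001):
    """Training loss of Eqn. (10).
    logits_g, logits_l: (T3, C) global/local class scores at the top layer.
    y: scalar ground-truth class index.
    alphas: list of 3 attention-score tensors, alphas[j] of shape (T_j, N)."""
    T3 = logits_g.shape[0]
    target = torch.full((T3,), y, dtype=torch.long)
    # The two cross-entropy terms, summed over the T3 time steps.
    loss = F.cross_entropy(logits_g, target, reduction='sum') \
         + F.cross_entropy(logits_l, target, reduction='sum')
    for a in alphas:                                           # a: (T_j, N)
        loss = loss + lam * ((1 - a.mean(dim=0)) ** 2).sum()   # equal attention over joints
        loss = loss + beta * (a ** 2).sum(dim=1).mean()        # limit number of attended nodes
    return loss
```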
Although the joint-based AGC-LSTM network already achieves state-of-the-art results, we also explore the performance of the proposed model at the part level. According to the physical structure of the human body, the skeleton can be divided into several parts. Similar to the joint-based AGC-LSTM network, we first capture part features with a linear layer and a shared LSTM layer. The part features are then fed into three AGC-LSTM layers as node representations to model the spatio-temporal characteristics. The results show that our model also achieves superior performance at the part level. Furthermore, a hybrid model (shown in Fig. 5) based on joints and parts leads to further performance improvement.