Human gait conveys significant information that can be used for identity recognition and emotion recognition. Recent studies have focused more on gait identity recognition than emotion recognition and regarded these two recognition tasks as independent and unrelated. How to train a unified model to effectively recognize the identity and emotion from gait at the same time is a novel and challenging problem. In this paper, we propose a novel Attention Enhanced Temporal Graph Convolutional Network (AT-GCN) for gait-based recognition and motion prediction. Enhanced by spatial and temporal attention, the proposed model can capture discriminative features in spatial dependency and temporal dynamics. We also present a multi-task learning architecture, which can jointly learn representations for multiple tasks. It helps the emotion recognition task with limited data considerably benefit from the identity recognition task and helps the recognition tasks benefit from the auxiliary prediction task. Furthermore, we present a new dataset (EMOGAIT) that consists of 1,440 real gaits, annotated with identity and emotion labels. Experimental results on two datasets demonstrate the effectiveness of our approach and show that our approach achieves substantial improvements over mainstream methods for identity recognition and emotion recognition.

  1. Introduction

However, there is much less work on gait emotion recognition than gait identity recognition, mainly due to the lack of gait data or video annotated with emotion.



When we perform an identification task by observing a person’s walking video, we can also perceive the subject’s emotion and estimate the gait trend at the next moment based on the same observation.

The motivation for coupling identity and emotion also has a psychological basis. Prior works in psychology [16] discussed whether facial responses are shared between identity and emotion. They observed a strong positive correlation between face emotion recognition ability and face identity recognition ability, and concluded that the two recognition tasks share a common processing mechanism.



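Why would facial emotion recognition and facial identity recognition be so strongly related, and even share a processing mechanism? My understanding is that facial expression is subjective while identification is objective, so the two should not be compared directly. How should this sentence be translated?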

Inspired by the information-sharing mechanism of multi-task learning, we design a multi-task architecture that divides the gait-related problem into sub-tasks. This architecture can jointly learn representations for three gait-based tasks: identity recognition, emotion recognition, and motion prediction, which helps the tasks with less data benefit considerably from other tasks with richer data.



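The paper then explains that the success of deep learning has led to a surge of pose estimation algorithms, from which the skeleton graph and joint trajectory graph of a human gait can be obtained (presumably the spatio-temporal graph: temporal connections plus spatial connections). It also notes that previous skeleton-based work used conventional convolutional networks, which have no advantage when processing graph-structured data. This leads to the point that graph convolutional networks (GCNs) have extended convolutional neural networks (CNNs) to graphs of arbitrary structure. The application of GCNs to model dynamic graphs for gait sequence recognition is yet to be explored.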

It can improve the skeleton representation by synchronously learning spatial and temporal characteristics.

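Next, the paper introduces AT-GCN, which extends the recurrent unit with GCN and attention and improves the skeleton representation by learning spatial and temporal characteristics synchronously.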

  1. Related Work


2.1 Gait-based recognition

  1. Fourier transform:

B. Li, C. Zhu, S. Li, T. Zhu, Identifying emotions from non-contact gaits information based on microsoft kinects, IEEE Trans.

S. Li, L. Cui, C. Zhu, B. Li, N. Zhao, T. Zhu, Emotion recognition using kinect motion capture data of human gaits, PeerJ 4 (2016) 1–17,

  1. Used joint angles of the human body from inverse kinematics computation to recognize the five emotions of 32 observers:

Recognizing Emotions Conveyed by Human Gait

These works treated each gait-based recognition task as independent and defined separate task-specific features and models for each one.


2.2 Skeleton-based recognition


Inspired by the use of spatial-temporal GCN models in these applications, we propose a novel AT-GCN network to learn discriminative representations from skeleton sequences for gait-based recognition.


2.3 Multi-task learning

By sharing information between related tasks, multi-task learning will enable the model to generalize better on the tasks [17]. Multi-task learning in deep neural networks for gait-based recognition tasks has received a lot of attention in prior works.

Marín Jiménez et al. [37] trained a multi-task CNN model for gender recognition, age estimation, and gait recognition. They showed that when more than one gait-based task is trained jointly, the identification task converges faster than when it is trained independently, and that the recognition performance of multi-task models is equal or superior to that of more complex single-task ones. Papavasileiou et al. [38] applied multi-task feature learning to train classification tasks that classify subjects’ gait and recognize multiple gait disorders, which can benefit medical professionals and patients through improved and targeted treatment plans for rehabilitation. A recent work [39] used the modeling capabilities of the GCN model and the message sharing mechanism between tasks to achieve consistently substantial improvement in action recognition from skeleton data.

Multi-task optimization has been shown to be theoretically and practically more effective than learning tasks individually [17]. Our work is the first to treat gait-based identity recognition and emotion recognition as related tasks and to attempt to transfer the knowledge learned from the identity recognition task to the emotion recognition task by using multi-task learning models.


  1. Approach


3.1 Pipeline


3.2 Pose trajectory generation


3.3 Attention enhanced temporal GCN

3.3.1 Temporal GCN


Temporal dependence modeling is another crucial problem for gait skeleton sequences.

However, compared to the LSTM cell, the GRU cell has a relatively simple structure, fewer parameters, and faster convergence ability. Therefore, we adopt the GRU cell as the basic network to learn the implicit information from the time intervals. Based on GRU, we propose the AT-GCN cell, which can explore the temporal dynamics and spatial correlation of nodes at the same time.


The GRU obtains the gait status at time t by taking the hidden status at time t − 1 and the current gait information as inputs.


These inputs are obtained with the graph convolution operator. In a normal GRU cell, the reset gate controls the degree to which the status information from the previous moment is ignored.


The update gate determines the extent to which the status information at the previous time is brought into the current status.


$$c_t$$ is the memory content stored at time t, and H′t is the output state at time t. While capturing the gait information at the current moment, the model still retains the changing trend of historical information.


To explore the spatial structural information with different weights according to the importance of each node, we add a spatial attention operator on H′t to get the final enhanced state $$H_t$$. It has the ability to capture temporal dynamics and exploit spatial structural information at the same time. The functions of the AT-GCN unit are defined as follows:


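$$H_{t-1}$$: the hidden state at time t-1

$$X_t$$: the skeleton gait information at time t

$$G(A,X_t)$$: the graph convolution process in Eq. (2); $$u_t$$: update gate; $$r_t$$: reset gate; $$c_t$$: memory content stored at time t

$$\sigma$$: Sigmoid function; $$W$$, $$b$$: weights and biases learned during training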


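Eq. (5), Eq. (6), Eq. (7)?? Why does Eq. (6) use $$(1-u_t)$$? Is it acting as an "if not"?

I looked at the LSTM and RNN architectures. The GRU has fewer gates and a simpler structure than the LSTM, with better performance. I have not studied the formulas yet, though: do the equations here come from those network architectures?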

The proposed AT-GCN model handles the temporal dynamics through the GRU module and captures the spatial dependence of joints through the attention-enhanced GCN module.
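The paper's gate equations are not reproduced in these notes, so the following is only a minimal sketch of a GRU step whose input is first passed through a graph convolution. It assumes the standard GRU form, with $$G(A, X_t)$$ taken as a single normalized-adjacency convolution, and omits the spatial attention step. The helper names (`gcn_gru_step`, `init_params`), the toy 3-joint skeleton, and the random weights are all hypothetical. The last line of `gcn_gru_step` shows why a $$(1-u_t)$$ factor appears: the new state is a convex blend, where $$u_t$$ keeps the previous state and $$(1-u_t)$$ admits the new memory content.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (a common GCN choice, assumed here)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gate(gx, h, w, b, activation):
    """Apply activation([gx, h] W + b) row by row, i.e. independently per joint."""
    concat = [g_row + h_row for g_row, h_row in zip(gx, h)]
    return [[activation(v + b_k) for v, b_k in zip(row, b)] for row in matmul(concat, w)]

def gcn_gru_step(a_norm, x_t, h_prev, p):
    """One recurrent step: graph convolution on the input, then GRU gating."""
    gx = matmul(matmul(a_norm, x_t), p["w_g"])            # G(A, X_t)
    u = gate(gx, h_prev, p["w_u"], p["b_u"], sigmoid)      # update gate u_t
    r = gate(gx, h_prev, p["w_r"], p["b_r"], sigmoid)      # reset gate r_t
    reset_h = [[r_k * h_k for r_k, h_k in zip(r_row, h_row)]
               for r_row, h_row in zip(r, h_prev)]
    c = gate(gx, reset_h, p["w_c"], p["b_c"], math.tanh)   # memory content c_t
    # Convex blend: u_t keeps the old state, (1 - u_t) admits the new content.
    return [[u_k * h_k + (1.0 - u_k) * c_k for u_k, h_k, c_k in zip(u_row, h_row, c_row)]
            for u_row, h_row, c_row in zip(u, h_prev, c)]

def init_params(f_in, d, seed=0):
    """Random illustrative parameters; real weights would be learned."""
    rng = random.Random(seed)
    mat = lambda rows, cols: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                              for _ in range(rows)]
    p = {"w_g": mat(f_in, d)}
    for name in ("u", "r", "c"):
        p["w_" + name] = mat(2 * d, d)
        p["b_" + name] = [0.0] * d
    return p

# Toy skeleton: 3 joints in a chain (0-1-2), 2-D coordinates, 4 frames.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_norm = normalize_adjacency(adjacency)
params = init_params(f_in=2, d=4)
frames = [[[0.1 * t, 0.0], [0.1 * t, 0.5], [0.1 * t, 1.0]] for t in range(4)]
h = [[0.0] * 4 for _ in range(3)]
for x_t in frames:
    h = gcn_gru_step(a_norm, x_t, h, params)
```

After the loop, `h` holds one 4-dimensional hidden vector per joint. Because the state keeps the joint dimension, a later attention step can still weight joints individually.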


3.3.2 Spatial attention

To enhance the discriminative power of the learned features, we introduce the spatiotemporal attention mechanism to our model. It can adaptively select dominant joints within each frame through the spatial attention module, and automatically measure the importance of different frames through the temporal attention module.
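The notes do not include the attention equations, so the sketch below only illustrates the general idea with plain softmax attention. It assumes each joint (or frame) gets a scalar score from a dot product with a learned vector. The scoring vectors `v` and `q` and both function names are hypothetical, and the paper's actual scoring functions may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spatial_attention(joint_features, v):
    """Reweight each joint's feature row by a softmax weight over joints."""
    alpha = softmax([dot(row, v) for row in joint_features])
    attended = [[a * x for x in row] for a, row in zip(alpha, joint_features)]
    return attended, alpha

def temporal_attention(frame_features, q):
    """Pool T frame vectors into one fixed-length vector with softmax weights over frames."""
    beta = softmax([dot(f, q) for f in frame_features])
    dim = len(frame_features[0])
    pooled = [sum(b * f[k] for b, f in zip(beta, frame_features)) for k in range(dim)]
    return pooled, beta

# Three joints with 2-D features; the second joint has the largest score.
joint_features = [[0.1, 0.2], [2.0, 1.0], [0.0, 0.1]]
attended, alpha = spatial_attention(joint_features, v=[1.0, 1.0])

# Three frames with 2-D features pooled into a single embedding.
frame_features = [[0.2, 0.1], [1.5, 0.9], [0.3, 0.3]]
embedding, beta = temporal_attention(frame_features, q=[1.0, 0.5])
```

The weights in `alpha` and `beta` each sum to one, so dominant joints and frames contribute more while the output sizes stay fixed regardless of which ones dominate.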



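Fig. 5 uses a GCN rather than a fully connected layer in order to retain the joints dimension, so after the conventional GRU layer we obtain the intermediate hidden state H't, which contains rich spatial structural information and temporal dynamics.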


3.3.3 Temporal attention

In the temporal dimension, the importance of the information provided by different frames is generally not equal. Some of the dominant frames contain the most discriminative information. The proposed temporal attention mechanism can adaptively attach different importance to data. As shown in Fig. 6, after AT-GCN networks, we get spatially attended hidden state features, which can


3.4 Multi-task learning structure

Our goal is to learn a deep model to automatically extract discriminative features from gait sequences that can be applied to three tasks: identity recognition, emotion recognition, and motion prediction.

Multi-task learning is a methodology that can boost the performance of an individual task by jointly exploiting commonalities between multiple related tasks, especially when the training set for some tasks is limited or unbalanced. Because gait data or video annotated with emotion is scarce, we use multi-task learning to transfer the knowledge learned from the other two tasks to the emotion recognition task.



3.4.1. Identity recognition task

Given a video sequence of a walking person, the identity recognition task is to assign a class of identity to that person from a set of C predefined ones. The AT-GCN-based encoder produces fixed-length embedding vectors with excellent discriminative properties, which are finally fed into a Softmax classifier to obtain the predicted class label $$\hat{y}$$. The loss function for identity recognition is the standard cross-entropy loss (Eq. (14) in the paper).
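The cross-entropy equation is not reproduced in these notes. As a reminder of what the standard form computes, here is a small sketch of softmax cross-entropy for a C-way classifier. The logits and labels are made-up values, not results from the paper.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def mean_cross_entropy(batch_logits, labels):
    """Average the per-sample loss over a batch."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Hypothetical classifier outputs for C = 4 classes and two samples.
batch_logits = [[2.0, 0.1, -1.0, 0.3], [0.0, 0.0, 0.0, 0.0]]
labels = [0, 2]
loss = mean_cross_entropy(batch_logits, labels)
```

With all-zero logits the prediction is uniform, so the per-sample loss equals $$\log C$$. That value is a useful sanity check for an untrained classifier.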



3.4.2. Emotion recognition task

Given an input sequence, the goal of the emotion recognition task is to assign the sample one of four labels (happy, angry, sad, or normal) using gait information. It is also a multi-class classification problem, sharing the same embedding vectors as the identity recognition task after the temporal attention module. The last fully connected layer of the model contains four units, and the loss function $$L_{emo\enspace recog}$$ for this task is a multi-class cross-entropy loss similar to Eq. (14) with C = 4.


3.4.3. Motion prediction task

Given the historical trajectories of the input sequence $$X_1, X_2, \dots, X_{n-1}$$, the goal of gait motion prediction is to forecast the next-frame skeleton positions $$X'_n$$. To predict future poses, we construct a prediction module that follows the backbone network. We use an AT-GCN block to decode the high-level feature vector extracted from the historical data. The feature vector is then fed into FC layers to obtain the predicted future joint positions. The loss function for future prediction is the standard $$L_2$$ loss.


3.4.4. Multi-task loss

The proposed multi-task learning approach aims to train a unified model to optimize two gait classification tasks (identity recognition and emotion recognition) and a gait regression task (motion prediction).


When learning multiple tasks simultaneously, the traditional approach simply combines the objectives as a weighted linear sum of the per-task losses.


However, the performance of the model is extremely sensitive to the selection of the weights $$w_i$$. These weight hyper-parameters are expensive to tune, often taking many days for each trial. We therefore follow the approach of [42], which learns the optimal task weights.
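These notes do not say which scheme [42] uses. The sketch below contrasts a fixed weighted sum with one widely used learned alternative, homoscedastic-uncertainty weighting (Kendall, Gal and Cipolla, 2018), in its common simplified form $$\sum_i e^{-s_i} L_i + s_i$$. Treating this as the paper's method is an assumption, and the loss values are made up.

```python
import math

def weighted_sum(losses, weights):
    """Fixed weighting: sum_i w_i * L_i, with hand-tuned w_i."""
    return sum(w * l for w, l in zip(weights, losses))

def uncertainty_weighted(losses, log_vars):
    """Learned weighting: sum_i exp(-s_i) * L_i + s_i, with s_i a trainable scalar per task."""
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))

def log_var_grads(losses, log_vars):
    """Gradient of the learned objective with respect to each s_i."""
    return [1.0 - math.exp(-s) * l for l, s in zip(losses, log_vars)]

# Hypothetical per-task losses: identity, emotion, motion prediction.
losses = [2.0, 1.2, 0.05]
fixed_total = weighted_sum(losses, weights=[1.0, 1.0, 1.0])

# Gradient descent on s_i alone, holding the task losses fixed for illustration.
log_vars = [0.0, 0.0, 0.0]
for _ in range(500):
    grads = log_var_grads(losses, log_vars)
    log_vars = [s - 0.1 * g for s, g in zip(log_vars, grads)]

effective_weights = [math.exp(-s) for s in log_vars]
```

With the losses held fixed, each $$s_i$$ settles at $$\log L_i$$, so the effective weight $$e^{-s_i}$$ is inversely proportional to the task's loss scale. In real training the $$s_i$$ are optimized jointly with the network, and the $$+ s_i$$ term stops the weights from collapsing to zero.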




  1. Experiments

To measure the quality of the learned features, we present experiments and evaluation results compared with state-of-the-art methods.

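This section introduces the datasets related to gait recognition and presents the paper's experimental results. Experiments were run on the TUM GAID dataset and on the authors' own EMOGAIT dataset, with comparisons against SOTA methods.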

4.1.1 Dataset

  1. TUM GAID: contains 305 different subjects and two experiments.



In the test set, the recordings N1, N2, N3, N4 are used as the gallery set, and N5, N6, B1, B2, S1, S2 are used as the probe set. This dataset can be used for the identity recognition task and the motion prediction task to test the discriminative ability of the feature extracted from the AT-GCN network.



EMOGAIT consists of 1,440 real gaits, annotated with identity and emotion labels.


It can be used to verify the validity of the multi-task learning structure by simultaneously training the three tasks. We train the models on the train set and report the top-1 accuracies on the test set.


4.1.2 Implementation details

In our experiments, since each original image sequence has a different temporal length, we extract fixed-length subsequences of 32 frames by sliding a window over the full-length sequence with an interval of three frames.
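A minimal sketch of this windowing step follows. It reads "an interval of three frames" as the stride between the start frames of consecutive windows, which is an assumption. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(sequence, window=32, stride=3):
    """Return every full-length window, starting at frames 0, stride, 2 * stride, ..."""
    last_start = len(sequence) - window
    return [sequence[s:s + window] for s in range(0, last_start + 1, stride)]

# A 50-frame sequence, with frame indices standing in for per-frame skeletons.
frames = list(range(50))
windows = sliding_windows(frames)
```

For the 50-frame example, windows start at frames 0, 3, ..., 18, giving seven subsequences of 32 frames each. A sequence shorter than 32 frames yields no windows.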

We also use Batch Normalization to normalize the inputs and mitigate the internal covariate shift problem.



  1. Conclusion and future work


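  1. Personal reflections

  1. Topics covered: GRU (a variant of the LSTM recurrent neural network), the attention mechanism, graph convolutional networks (GCN), and autoencoders.

  2. Questions to ask myself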



















