Reading notes: Multi-task learning for gait-based identity recognition and emotion recognition using attention...


多放香菜和辣椒


Article link: https://blog.csdn.net/weixin_44748991/article/details/130625429


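The paper proposes a model called AT-GCN for multi-task learning of gait-based identity recognition and emotion recognition. By combining spatial and temporal attention, the model captures the spatio-temporal features of gait effectively. The multi-task architecture also lets the emotion recognition task benefit from the identity recognition task. The paper further introduces a new dataset, EMOGAIT, annotated with identity and emotion labels. Experiments show that AT-GCN outperforms existing methods on two datasets.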


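Heads-up: I'm still a beginner and this post is a work in progress with a lot left to fill in. I'm posting it mainly so I can review it later, and discussion is very welcome.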

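Multi-task learning for gait-based identity recognition and emotion recognition using an attention enhanced temporal graph convolutional network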

Abstract

Human gait conveys significant information that can be used for identity recognition and emotion recognition. Recent studies have focused more on gait identity recognition than emotion recognition and regarded these two recognition tasks as independent and unrelated. How to train a unified model to effectively recognize the identity and emotion from gait at the same time is a novel and challenging problem. In this paper, we propose a novel Attention Enhanced Temporal Graph Convolutional Network (AT-GCN) for gait-based recognition and motion prediction. Enhanced by spatial and temporal attention, the proposed model can capture discriminative features in spatial dependency and temporal dynamics. We also present a multi-task learning architecture, which can jointly learn representations for multiple tasks. It helps the emotion recognition task with limited data considerably benefit from the identity recognition task and helps the recognition tasks benefit from the auxiliary prediction task. Furthermore, we present a new dataset (EMOGAIT) that consists of 1,440 real gaits, annotated with identity and emotion labels. Experimental results on two datasets demonstrate the effectiveness of our approach and show that our approach achieves substantial improvements over mainstream methods for identity recognition and emotion recognition.

Introduction

However, there is much less work on gait emotion recognition than gait identity recognition, mainly due to the lack of gait data or video annotated with emotion.

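Gait emotion recognition is under-studied mainly because gait data or video annotated with emotion is scarce.

Paragraphs 1 and 2 first introduce where gait emotion recognition research comes from (psychology and a few earlier studies) and then compare it with facial expression recognition. In short, the authors are talking up how meaningful gait emotion recognition is, a bit of tooting their own horn.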

When we perform an identification task by observing a person’s walking video, we can also perceive the subject’s emotion and estimate the gait trend at the next moment based on the same observation.

The motivation for coupling identity and emotion also has a psychological basis. Prior works in psychology [16] discussed whether facial responses are shared between identity and emotion. They observed a strong positive correlation between face emotion recognition ability and face identity recognition ability, and the two recognition tasks share a common processing mechanism.

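From a video of a person walking we can identify them, and at the same time recognize the subject's emotion and estimate the gait trend at the next moment. The paper then adds that the motivation for coupling identity and emotion also has a basis in psychology.

This feels like a bit of a stretch to me: a psychology study on whether facial responses are shared between identity and emotion?

Why would face emotion recognition and face identity recognition be strongly correlated, let alone share a common processing mechanism? My understanding is that facial expressions are subjective while identification is objective, so the two can't be compared directly. How should this sentence be understood?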

Inspired by the information-sharing mechanism of multi-task learning, we design a multi-task setup architecture to divide the gait-related problem into sub-tasks. This architecture can jointly learn representations for three gait-based tasks: identity recognition, emotion recognition, and motion prediction, which helps the tasks with less data considerably benefit from other tasks with richer data.

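The paper then notes that in machine learning, multi-task models retain more intrinsic information than single-task ones. Inspired by the information-sharing mechanism of multi-task learning, the authors design a multi-task architecture that divides the gait-related problem into sub-tasks.

This jointly learns representations for three gait-based tasks: identity recognition (main), emotion recognition (main), and motion prediction (auxiliary), so that tasks with little data benefit considerably from tasks with richer data.

Next, the paper describes how the success of deep learning has led to a surge of pose estimation algorithms, from which skeleton graphs and joint trajectory graphs of human gait can be obtained (presumably this means the spatio-temporal graph: temporal connections plus spatial connections). It also notes that previous skeleton-based work used conventional convolutional networks, which have no advantage on graph-structured data. This leads to the point that graph convolutional networks (GCNs) have extended convolutional neural networks (CNNs) to graphs of arbitrary structure. The application of GCNs to model dynamic graphs over gait sequences recognition is yet to be explored.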

It can improve the skeleton representation by synchronously learning spatial and temporal characteristics.

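Next the paper introduces AT-GCN, which extends a recurrent unit with GCN and attention and improves the skeleton representation by learning spatial and temporal features synchronously.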

Related Work

This section reviews related work and points out the shortcomings of traditional skeleton-based methods.

2.1 Gait-based recognition

Fourier:

B. Li, C. Zhu, S. Li, T. Zhu, Identifying emotions from non-contact gaits information based on microsoft kinects, IEEE Trans.

S. Li, L. Cui, C. Zhu, B. Li, N. Zhao, T. Zhu, Emotion recognition using kinect motion capture data of human gaits, PeerJ 4 (2016) 1–17,

used joint angles of the human body from inverse kinematics computation to recognize the five emotions of 32 observers:

Recognizing Emotions Conveyed by Human Gait

These works just considered the gait-based recognition task as independent and defined separated task-specific features and models for them.

This gives a rough overview of gait recognition methods and says the work is similar to the STEP paper.

2.2 Skeleton-based recognition

CNN、LSTM、GCN、ST-GCN、GRU+GCN、LSTM+GCN

Inspired by using spatial-temporal GCN models in these multiple applications, we propose a novel AT-GCN network to learn discriminative representations from skeleton sequences for gait-based recognition.

Compared with CNNs, GCNs are much better at handling graph-structured data.

2.3 Multi-task learning

By sharing information between related tasks, multi-task learning will enable the model to generalize better on the tasks [17]. Multi-task learning in deep neural networks for gait-based recognition tasks has received a lot of attention in prior works.

Marín Jiménez et al. [37] trained a multi-task CNN model for gender recognition, age estimation, and gait recognition. It proved that by training more than one gait-based tasks jointly, the identification task converges faster than when it is trained independently. The recognition performance of multi-task models is equal or superior to more complex single-task ones. Papavasileiou et al. [38] applied multi-task feature learning to train the different classification tasks that can classify subjects’ gait and recognize multiple gait disorders, which can benefit the medical professionals and patients with improved and targeted treatment plans for rehabilitation. A recent work [39] used the powerful modeling capabilities of the GCN model and the message sharing mechanism between tasks to achieve consistently substantial improvement in action recognition from skeleton data.

Multi-tasks optimization has been shown to be theoretically and practically more effective than learning tasks individually [17]. Our work is the first to treat gait-based identity recognition and emotion recognition as related tasks and attempt to transfer the knowledge learned from the identity recognition task to the emotion recognition task by using multi-task learning models.

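This roughly covers existing multi-task learning research and then re-emphasizes its effectiveness (relative to single-task learning).

Approach

This section gives the implementation details of the proposed method and describes the architecture of the Attention Enhanced Temporal Graph Convolutional Network (AT-GCN).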

3.1 pipeline

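Skeleton information is extracted with a pose estimation algorithm, and the 2D joint coordinates are then concatenated. The network is an AT-GCN-based autoencoder whose input is the normalized skeleton graph. The encoder turns the skeleton graph into a fixed-length embedding vector, which is used for identity recognition and emotion recognition. The decoder converts this vector into the graph-structured output for motion prediction. The total loss is the sum of the per-task losses, and the whole network is trained by backpropagation.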

3.2 Pose trajectory generation

Uses OpenPose to obtain 16 joints.

3.3 Attention enhanced temporal GCN

3.3.1 Temporal GCN

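These two formulas also appear in the original ST-GCN paper, which notes that they can be traced back to GCN. Something seems odd here: the heading says temporal graph convolution, yet the formula used is the one for spatial graph convolution?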
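On the question above, one plausible reading is that the graph convolution is applied to each frame separately (so it is spatial), and the "temporal" part comes from feeding its output into the recurrent unit described next. The formulas themselves are not reproduced in these notes, so the sketch below uses the common GCN propagation rule with a symmetrically normalized adjacency matrix. That choice, the toy 3-joint skeleton, and the names `normalized_adjacency` and `graph_conv` are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric adjacency matrix A."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(A_norm, X, W):
    """One spatial graph convolution: aggregate neighbours, then project features."""
    return A_norm @ X @ W

# Toy skeleton (assumption): 3 joints in a chain 0-1-2, 2D coordinates per joint.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [0.0, 2.0]])
W = np.full((2, 4), 0.5)   # hypothetical weights: 2 input channels -> 4 output channels

A_norm = normalized_adjacency(A)
out = graph_conv(A_norm, X, W)
print(out.shape)  # (3, 4): one 4-dimensional feature per joint
```

Nothing in this operation looks across frames; each joint only mixes with its skeletal neighbours within the same frame.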

Temporal dependence modeling is another crucial problem for gait skeleton sequences.

However, compared to the LSTM cell, the GRU cell has a relatively simple structure, fewer parameters, and faster convergence ability. Therefore, we adopt the GRU cell as the basic network to learn the implicit information from the time intervals. Based on GRU, we propose the AT-GCN cell, which can explore the temporal dynamics and spatial correlation of nodes at the same time.

The paper compares LSTM and GRU, settles on GRU, and builds the AT-GCN cell on top of it.

The GRU obtains the gait status at time t by taking the hidden status at time t − 1 and the current gait information as inputs.

The GRU takes the gait information at the current time and the hidden state at time t-1 as inputs to obtain the gait state at time t.

These inputs are obtained with the graph convolution operator. For normal GRU cell, the reset gate is used to control the degree of ignoring the status information at the previous moment.

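These inputs are obtained through the graph convolution operation. In a normal GRU cell, the reset gate controls how much of the previous moment's state information is ignored.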

The update gate determines the extent to which the status information at the previous time is brought into the current status.

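The update gate determines how much of the previous moment's state information is carried into the current state.

$$c_t$$ is the memory content stored at time t, and H′t is the output state at time t. While capturing the gait information at the current moment, the model still retains the changing trend of historical information.

$$c_t$$ is the memory content stored at time t, and H′t is the output state at time t. While capturing the gait information at the current moment, the model still keeps the trend of the historical information.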

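To explore the spatial structural information with different weights according to the different importance of nodes, we add a spatial attention operator for H′t to get the final enhanced state H't. It has the ability to capture temporal dynamics and exploit spatial structural information at the same time. The functions of AT-GCN unit are defined as follows:

To explore spatial structural information with different weights according to the importance of each node, a spatial attention operator is applied to H′t to obtain the final enhanced state H't. It captures temporal dynamics and exploits spatial structural information at the same time. The functions of the AT-GCN unit are defined as follows: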

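$$H_{t-1}$$: hidden state at time t-1

$$X_t$$: skeleton gait information at time t

$$G(A,X_t)$$: the graph convolution process in Eq. 2

$$u_t$$: update gate

$$r_t$$: reset gate

$$c_t$$: memory content stored at time t

$$\sigma$$: Sigmoid function

W, b: weights and biases learned during training

$$f_{att}$$: an attention network that learns spatial weights for the graph nodes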

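Eq. 5, Eq. 6, Eq. 7?? Why does Eq. 6 use (1 - $$u_t$$)? Is it acting like an "if not"?

I looked at the LSTM and RNN architectures. GRU has fewer gates and a simpler structure than LSTM (the paper says it has fewer parameters and converges faster). I haven't studied the formulas though. Do the equations here come from those network structures?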
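A sketch may help with the (1 - $$u_t$$) question. In the standard GRU, the update gate is a value between 0 and 1 per feature, and the new state is a soft blend of the old state and the candidate memory: `u_t * H_prev + (1 - u_t) * c_t`. So it is closer to a mixing ratio than a hard "if not". The code below assumes the AT-GCN cell follows this standard form with the graph-convolved input G(A, X_t) in place of the raw input; the exact placement of `u_t` versus `(1 - u_t)`, the parameter layout, and the function name `gru_graph_step` are assumptions, since the equations are not reproduced in these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_graph_step(GX_t, H_prev, params):
    """One GRU-style update applied per joint (standard GRU form, assumed).

    GX_t:   (K, F) graph-convolved input at time t, i.e. G(A, X_t)
    H_prev: (K, D) hidden state H_{t-1}
    """
    Wu, bu, Wr, br, Wc, bc = params
    z = np.concatenate([GX_t, H_prev], axis=1)
    u_t = sigmoid(z @ Wu + bu)                      # update gate
    r_t = sigmoid(z @ Wr + br)                      # reset gate
    z_reset = np.concatenate([GX_t, r_t * H_prev], axis=1)
    c_t = np.tanh(z_reset @ Wc + bc)                # candidate memory content
    # Soft interpolation: u_t keeps the old state, (1 - u_t) lets in the new content.
    H_t = u_t * H_prev + (1.0 - u_t) * c_t
    return H_t, u_t

rng = np.random.default_rng(0)
K, F, D = 16, 8, 4                                  # joints, input features, hidden size (toy values)
params = (rng.normal(size=(F + D, D)), np.zeros(D),
          rng.normal(size=(F + D, D)), np.zeros(D),
          rng.normal(size=(F + D, D)), np.zeros(D))
GX_t = rng.normal(size=(K, F))
H_prev = rng.uniform(-1.0, 1.0, size=(K, D))
H_t, u_t = gru_graph_step(GX_t, H_prev, params)
print(H_t.shape)  # (16, 4)
```

When `u_t` is close to 1 the cell mostly copies the previous state, and when it is close to 0 it mostly overwrites it with the new content, which is how the trend of the history is retained while new gait information is absorbed.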

The proposed AT-GCN model can deal with the temporal dynamics by the GRU module and capture the spatial dependence of joints by attention enhanced GCN module.

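The AT-GCN model handles temporal dynamics with the GRU and captures the spatial dependencies between joints with the attention-enhanced GCN. In other words, the GRU handles the temporal connections and the attention-enhanced GCN handles the spatial connections. STEP, by comparison, handles both time and space with ST-GCN (plain TCN + GCN).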

3.3.2 Spatial attention

To enhance the discriminative power of the learned features, we introduce the spatiotemporal attention mechanism to our model. It can adaptively select dominant joints within each frame through the spatial attention module, and automatically measure the importance of different frames through the temporal attention module.

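To make the learned features more discriminative, a spatio-temporal attention mechanism is introduced into the model. The spatial attention module adaptively selects the dominant joints within each frame, and the temporal attention module automatically measures the importance of different frames.

The paper then notes that every frame of the skeleton graph data is represented by the 2D coordinates of the joints. The proposed spatial attention model uses different weights for the joints to focus on the discriminative ones.

Figure 5 uses a GCN rather than a fully connected layer to retain the joints dimension, so after the conventional GRU layer we obtain the intermediate hidden state H't, which contains rich spatial structural information and temporal dynamics.

I didn't follow these formulas, though reading them as WX+b makes them a little easier to grasp? Do Eq. 8 and Eq. 9 express the weights jointly over all K joints?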
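Since Eq. 8 and Eq. 9 are not reproduced in these notes, here is only a generic sketch of what a spatial attention over K joints usually looks like: each joint's hidden feature gets a score from a small "WX+b"-style network, the scores are normalized with a softmax across the K joints, and the features are rescaled by the resulting weights. The additive scoring form, the parameter names `W`, `b`, `v`, and the function `spatial_attention` are assumptions and may differ from the paper's $$f_{att}$$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # subtract the max for numerical stability
    return e / e.sum()

def spatial_attention(H, W, b, v):
    """H: (K, D) hidden features of the K joints in one frame."""
    scores = np.tanh(H @ W + b) @ v      # (K,) one unnormalized score per joint
    alpha = softmax(scores)              # (K,) weights over joints, summing to 1
    return alpha[:, None] * H, alpha     # re-weighted features keep the joints dimension

rng = np.random.default_rng(0)
K, D, A_dim = 16, 4, 8                   # joints, hidden size, attention size (toy values)
H = rng.normal(size=(K, D))
W = rng.normal(size=(D, A_dim))
b = np.zeros(A_dim)
v = rng.normal(size=A_dim)

H_att, alpha = spatial_attention(H, W, b, v)
print(H_att.shape, round(float(alpha.sum()), 6))  # (16, 4) 1.0
```

Under this reading, the softmax is what ties the K joints together: a joint's weight depends on the scores of all the others, so the weights are indeed computed jointly.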

3.3.3 Temporal attention

In the temporal dimension, the importance of the information provided by different frames is generally not equal. Some of the dominant frames contain the most discriminative information. The proposed temporal attention mechanism can adaptively attach different importance to data. As shown in Fig. 6, after AT-GCN networks, we get spatially attended hidden state features, which can

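In the temporal dimension, the information provided by different frames is generally not equally important. Some dominant frames contain the most discriminative information. The proposed temporal attention mechanism adaptively assigns different weights to the data. As shown in Fig. 6, after the AT-GCN networks we obtain spatially attended hidden state features, which can be denoted as $$H_i$$. A temporal attention module, given by $$S_i$$, is then added on top of these hidden states.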
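A minimal sketch of temporal attention, assuming each frame's spatially attended state has already been flattened or pooled into one vector: a scoring vector gives each frame a scalar, a softmax over frames turns the scalars into weights (playing the role of $$S_i$$ here), and the weighted sum of the $$H_i$$ gives a fixed-length embedding. The linear scoring form and the names `temporal_attention` and `w` are assumptions, not the paper's definition.

```python
import numpy as np

def temporal_attention(H_seq, w):
    """H_seq: (T, D), one feature vector per frame; w: (D,) scoring vector."""
    scores = H_seq @ w                   # (T,) one score per frame
    e = np.exp(scores - scores.max())    # stable softmax over the T frames
    S = e / e.sum()
    embedding = S @ H_seq                # (D,) weighted sum of the frame features
    return embedding, S

rng = np.random.default_rng(0)
T, D = 32, 4                             # frames per clip, feature size (toy values)
H_seq = rng.normal(size=(T, D))
w = rng.normal(size=D)

embedding, S = temporal_attention(H_seq, w)
print(embedding.shape)  # (4,)
```

With a zero scoring vector every frame gets weight 1/T and the embedding reduces to plain average pooling, so attention can be seen as a learned generalization of averaging over time.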

3.4 Multi-task learning structure

Our goal is to learn a deep model to automatically extract discriminative features from gait sequences that can be applied to three tasks: identity recognition, emotion recognition, and motion prediction.

Multi-task learning is a methodology that can boost the performance of an individual task by jointly exploiting commonalities between multiple related tasks, especially when the training set for some tasks is limited or unbalanced. Due to the lack of gait data or video annotated with emotion, we take advantage of multitask learning to transfer the knowledge learned from the other two tasks to the emotion recognition task.

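The goal is to learn a deep model that automatically extracts discriminative features from gait sequences, features that can be applied to three tasks: identity recognition, emotion recognition, and motion prediction.

Multi-task learning is a methodology that boosts the performance of an individual task by jointly exploiting what multiple related tasks have in common, especially when the training set for some tasks is limited or unbalanced. Because gait data or video annotated with emotion is scarce, multi-task learning is used to transfer the knowledge learned from the other two tasks to the emotion recognition task.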

3.4.1. Identity recognition task

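Given a video sequence of a walking person, the identity recognition task is to assign a class of identity to that person from a set of C predefined ones. We will get fixed length embedding vectors with excellent discriminative properties through AT-GCN-based encoder structure, which is finally fed into a Softmax classifier to obtain the predicted class-label $$\hat{y}$$. The loss function for identity recognition is the standard cross-entropy loss given by the equation

Given a video sequence of a walking person, the identity recognition task assigns that person one identity class from a set of C predefined ones. The AT-GCN-based encoder produces fixed-length embedding vectors with excellent discriminative properties, which are finally fed into a Softmax classifier to obtain the predicted class label $$\hat{y}$$. The loss function for identity recognition is the standard cross-entropy loss given by the equation

ground-truth: the true value, i.e. the correct annotation (label)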
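The equation itself is missing from these notes. The standard multi-class cross-entropy, which is presumably what Eq. (14) refers to, can be written as below, where $$y_c$$ is the one-hot ground-truth label and $$\hat{y}_c$$ is the Softmax output for class c. The symbol $$L_{id\enspace recog}$$ is an assumed name, patterned on $$L_{emo\enspace recog}$$, and not copied from the paper.

$$L_{id\enspace recog} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

Because $$y_c$$ is 1 only for the true class, the sum reduces to the negative log of the probability the classifier assigns to the correct identity.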

3.4.2. Emotion recognition task

Given an input sequence, the goal of the emotion recognition task is to assign the sample with the label: happy, angry, sad and normal by using gait information. It is also a multiple classification problem, sharing the same embedding vectors as the recognition task after the temporal attention module. The last fully connected layer of the model contains four units, and the loss function $$L_{emo\enspace recog}$$ for this task is a multi-class cross-entropy loss similar to the Eq. (14) with C = 4.

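Given an input sequence, the goal of the emotion recognition task is to label the sample as happy, angry, sad, or neutral using gait information. This is also a multi-class classification problem, and it shares the same embedding vectors (taken after the temporal attention module) as the identity recognition task. The last fully connected layer of the model has four units, and the loss function $$L_{emo\enspace recog}$$ for this task is a multi-class cross-entropy loss similar to Eq. (14) with C = 4.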

3.4.3. Motion prediction task

Given the historical trajectories of input sequence X1, X2, ..., Xn−1, the goal of gait motion prediction is to forecast the next-frame skeleton positions X′n. To predict future poses, we construct a prediction module followed by the backbone network. We use an AT-GCN block to decode the high-level feature vector extracted from the historical data. The feature vector then is fed into FC layers to obtain the predicted future joint positions. The loss function for future prediction is the standard $$L_2$$ loss.

Given the historical trajectories of the input sequence $$X^1$$, $$X^2$$, ..., $$X^{n-1}$$, the goal of gait motion prediction is to forecast the next-frame skeleton positions X′n. To predict future poses, a prediction module is built after the backbone network. An AT-GCN block decodes the high-level feature vector extracted from the historical data. The feature vector is then fed into FC layers to obtain the predicted future joint positions. The loss function for future prediction is the standard $$L_2$$ loss.

3.4.4. Multi-task loss

The proposed multi-task learning approach aims to train a unified model to optimize two gait classification tasks (identity recognition and emotion recognition) and a gait regression task (motion prediction).

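The proposed multi-task learning approach aims to train a unified model that optimizes two gait classification tasks (identity recognition and emotion recognition) and one gait regression task (motion prediction).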

When learning multi-tasks simultaneously, the previous traditional approach adopts the simple way of combining multi-objective losses, that is, performing a weighted linear sum of each task’s loss

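When learning multiple tasks at once, the traditional approach simply combines the multi-objective losses, that is, it takes a weighted linear sum of each task's loss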
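The formula is not reproduced in these notes. Written out, a weighted linear sum of task losses has the form below, with $$w_i$$ the hand-set weight of task i and $$L_i$$ its loss. The name $$L_{total}$$ is an assumption.

$$L_{total} = \sum_{i} w_i L_i$$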

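However, the performance of the model is extremely sensitive to the weight $$w_i$$ selection. These weight hyper-parameters are expensive to tune, often taking many days for each trial. So we take a more sensible approach according to the work [42], which is able to learn the optimal task weights

However, model performance is extremely sensitive to the choice of the weights $$w_i$$. These weight hyper-parameters are expensive to tune, often taking many days per trial. The authors therefore follow the more sensible approach of [42], which learns the optimal task weights

I didn't follow this part.

Here σp, σi, and σe are the scalar observation noise of each task, under the assumption that the model's output error follows a Gaussian distribution. The regression task and the classification tasks have different weight forms because their losses have different noise patterns.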
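For the part that was hard to follow, a small sketch may help. The description (per-task noise scalars, a Gaussian assumption, different forms for regression and classification) matches the widely used homoscedastic-uncertainty weighting, in which a regression loss is scaled by 1/(2σ²), a classification loss by 1/σ², and a log σ term is added per task so the model cannot simply push every σ to infinity. Whether [42] uses exactly these constants is an assumption here. The function name and the parameterization `s = log(σ²)`, a common trick for numerical stability, are also assumptions.

```python
import numpy as np

def uncertainty_weighted_loss(loss_pred, loss_id, loss_emo, s_p, s_i, s_e):
    """Combine one regression loss and two classification losses.

    s_p, s_i, s_e are learnable scalars with s = log(sigma^2), so
    1 / sigma^2 = exp(-s) and log(sigma) = 0.5 * s.
    """
    reg = 0.5 * np.exp(-s_p) * loss_pred + 0.5 * s_p    # L_p / (2 sigma_p^2) + log sigma_p
    cls_id = np.exp(-s_i) * loss_id + 0.5 * s_i         # L_i / sigma_i^2 + log sigma_i
    cls_emo = np.exp(-s_e) * loss_emo + 0.5 * s_e       # L_e / sigma_e^2 + log sigma_e
    return reg + cls_id + cls_emo

# With every sigma equal to 1 (s = 0) this is a fixed weighted sum with weights 0.5, 1, 1.
total = uncertainty_weighted_loss(2.0, 1.0, 3.0, s_p=0.0, s_i=0.0, s_e=0.0)
print(total)  # 5.0

# A larger sigma_e (noisier emotion task) shrinks that task's weight but pays a log penalty.
noisy = uncertainty_weighted_loss(2.0, 1.0, 3.0, s_p=0.0, s_i=0.0, s_e=1.0)
print(noisy < total)  # True
```

In training, the three `s` values would be optimized by gradient descent together with the network weights, which is what replaces the manual tuning of $$w_i$$.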

Experiments

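To measure the quality of the learned feature, we present experiments and evaluation results compared to the state-of-the-art methods.

This section introduces the datasets related to gait recognition and presents the paper's experimental results. Experiments are run on the TUM GAID dataset and on the authors' own EMOGAIT dataset, with comparisons against SOTA methods.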

4.1.1 Dataset

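TUM GAID: contains 305 different subjects and two experiments.

The goal of the first experiment is to identify the 305 subjects using 10 gait sequences per person. The sequences are recorded under three covariate conditions: normal walking (N), walking with a backpack (B), and walking with coating shoes (S).

The second experiment identifies 32 subjects using 20 gait sequences per person, 10 recorded in January and the other 10 in April. The last ten sequences cover three conditions: normal walking (TN), walking with a backpack (TB), and walking with coating shoes (TS). The database is also split into three partitions: 100 subjects for training, 50 for validation, and the remaining 155 for testing.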

In the test set, the recordings N1, N2, N3, N4 are used as the gallery set, and N5, N6, B1, B2, S1, S2 are used as the probe set. This dataset can be used for the identity recognition task and the motion prediction task to test the discriminative ability of the feature extracted from the AT-GCN network.

Literally, a probe is a sensing probe and a gallery is an art gallery. Here they can be understood as the verification set and the enrollment set, respectively.

EMOGAIT:

EMOGAIT consists of 1,440 real gaits, annotated with identity and emotion labels.
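To make the gallery/probe idea concrete, here is a sketch of one common way such a split is evaluated: every probe embedding is matched to its nearest gallery embedding, and the match counts as correct if the two identities agree (rank-1 accuracy). These notes do not say whether the paper evaluates exactly this way, so the nearest-neighbour rule, the Euclidean distance, the toy 2D embeddings, and the name `rank1_accuracy` are all assumptions.

```python
import numpy as np

def rank1_accuracy(gallery_emb, gallery_ids, probe_emb, probe_ids):
    """Match each probe to its nearest gallery sample and score identity agreement."""
    # Pairwise Euclidean distances, shape (num_probe, num_gallery).
    d = np.linalg.norm(probe_emb[:, None, :] - gallery_emb[None, :, :], axis=2)
    predicted = gallery_ids[d.argmin(axis=1)]
    return float((predicted == probe_ids).mean())

# Toy data (assumption): two enrolled subjects, three probe sequences.
gallery_emb = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
gallery_ids = np.array([0, 0, 1])
probe_emb = np.array([[0.2, 0.1], [4.8, 5.1], [3.5, 3.5]])
probe_ids = np.array([0, 1, 0])

acc = rank1_accuracy(gallery_emb, gallery_ids, probe_emb, probe_ids)
print(round(acc, 3))  # 0.667: the third probe lands closer to subject 1's gallery sample
```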

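It consists of 1,440 real gaits annotated with identity and emotion labels: 60 subjects, RGB video, 4 emotions. For each subject under each emotion, six skeleton sequences are collected, extracted from the video with a pose estimation method (60 × 4 × 6 = 1,440). Five of the six sequences are used for training and the remaining one for testing, so the dataset splits into a training set (1,200) and a test set (240).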

It can be used to verify the validity of the multi-task learning structure by simultaneously training the three tasks. We train the models on the train set and report the top-1 accuracies on the test set.

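Training the three tasks simultaneously verifies the effectiveness of the multi-task learning structure. The models are trained on the training set, and top-1 accuracy is reported on the test set.

4.1.2 Implementation details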

In our experiments, since each original image sequence has a different temporal length, we extract subsequences of fixed 32 frames sliding from the full-length sequences with an interval of three frames.
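The clip extraction can be sketched as a sliding window. The code assumes "an interval of three frames" means the window start advances by 3 frames each time (a stride of 3); if it instead means sampling every third frame inside a clip, the slicing would differ. The array layout `(T, K, 2)` with K = 16 joints and the function name `sliding_windows` are illustrative assumptions.

```python
import numpy as np

def sliding_windows(seq, window=32, stride=3):
    """Cut a (T, K, 2) skeleton sequence into overlapping clips of `window` frames."""
    T = seq.shape[0]
    if T < window:
        # Too short for a single clip: return an empty batch with the right shape.
        return np.empty((0, window) + seq.shape[1:])
    starts = range(0, T - window + 1, stride)
    return np.stack([seq[s:s + window] for s in starts])

seq = np.zeros((50, 16, 2))      # toy sequence: 50 frames, 16 joints, 2D coordinates
clips = sliding_windows(seq)
print(clips.shape)  # (7, 32, 16, 2): window starts at frames 0, 3, ..., 18
```

Besides giving every sample the same temporal length, the overlap multiplies the number of training clips obtained from each recording.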

We also use the Batch Normalization technique to normalize the inputs to fight the internal covariate shift problem.