[Paper Notes] ST-TR: Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks

This post summarizes ST-TR, an action-recognition method based on spatial and temporal Transformer networks that focuses on capturing joint motion patterns and their correlations from 3D skeleton data. Through Spatial Self-Attention (SSA) and Temporal Self-Attention (TSA) modules, the model dynamically models the dependencies between skeleton joints, improving recognition accuracy. SSA attends to the interactions between different body parts within a frame, while TSA attends to how each joint changes over time. ST-TR adopts a two-stream architecture in which S-TR applies SSA in the spatial stream and T-TR applies TSA in the temporal stream, further strengthening feature extraction.


Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks



Open problem: effectively encoding the latent information underlying 3D skeletons, especially when extracting useful information from joint motion patterns and their correlations. For actions such as "clapping", the correlations between body joints that are not linked in the human skeleton (e.g., the left hand and the right hand) are also underestimated.

The Spatial-Temporal Transformer network (ST-TR):

  • Transformer self-attention operator: a two-stream Transformer-based model that applies self-attention in both the spatial and the temporal dimension to model the dependencies between joints

  • Spatial Self-Attention (SSA): captures intra-frame interactions between different body parts; it dynamically builds links between skeleton joints, representing relations between body parts that depend on the action being performed and are independent of the natural human body structure

  • Temporal Self-Attention (TSA): models inter-frame correlations, studying the dynamics of each joint over time (a sketch of both modules follows this list)
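As a minimal sketch of the two modules (not the authors' exact implementation), the snippet below applies standard multi-head self-attention once over the joint axis (SSA, within each frame) and once over the time axis (TSA, per joint). The input shape (batch, channels, frames, joints) and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SkeletonSelfAttention(nn.Module):
    """Self-attention over joints (SSA) or over frames (TSA)."""
    def __init__(self, channels=64, heads=8, mode="spatial"):
        super().__init__()
        self.mode = mode  # "spatial" -> SSA, "temporal" -> TSA
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        # x: (batch N, channels C, frames T, joints V)
        n, c, t, v = x.shape
        if self.mode == "spatial":
            # SSA: joints attend to each other within every frame
            seq = x.permute(0, 2, 3, 1).reshape(n * t, v, c)
        else:
            # TSA: each joint attends to its own positions across frames
            seq = x.permute(0, 3, 2, 1).reshape(n * v, t, c)
        out, _ = self.attn(seq, seq, seq)
        if self.mode == "spatial":
            return out.reshape(n, t, v, c).permute(0, 3, 1, 2)
        return out.reshape(n, v, t, c).permute(0, 3, 2, 1)

x = torch.randn(2, 64, 30, 25)             # 30 frames, 25 joints (assumed)
ssa = SkeletonSelfAttention(mode="spatial")
tsa = SkeletonSelfAttention(mode="temporal")
print(ssa(x).shape, tsa(x).shape)          # both (2, 64, 30, 25)
```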

A modified Transformer self-attention operator is used:

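For reference, the operator follows the standard scaled dot-product self-attention of the Transformer, applied to the feature $\mathbf{f}_i$ of each skeleton node $i$ (the notation here is the generic Transformer formulation, not necessarily the paper's exact symbols):

$$\mathbf{q}_i=\mathbf{f}_i\mathbf{W}_q,\quad \mathbf{k}_i=\mathbf{f}_i\mathbf{W}_k,\quad \mathbf{v}_i=\mathbf{f}_i\mathbf{W}_v$$

$$\alpha_{ij}=\frac{\mathbf{q}_i\mathbf{k}_j^{\top}}{\sqrt{d_k}},\qquad \mathbf{z}_i=\sum_j \mathrm{softmax}_j(\alpha_{ij})\,\mathbf{v}_j$$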

Sangwoo Cho, Muhammad Maqbool, Fei Liu, and Hassan Foroosh, "Self-attention network for skeleton-based human action recognition" (2020), also proposed a Self-Attention Network (SAN) to extract long-term semantic information. However, since SAN focuses on temporally segmented clips, it only partially overcomes the limitations of convolution.

ST-GCN:

$${\bf f}_{out}=\sum_{k}^{K_s}({\bf f}_{in}{\bf A}_k){\bf W}_k$$

$${\bf A}_k={\bf D}_k^{-\frac{1}{2}}(\tilde{\bf A}_k+{\bf I}){\bf D}_k^{-\frac{1}{2}},\qquad {\bf D}_k^{ii}=\sum_{j}(\tilde{\bf A}_k^{ij}+{\bf I}^{ij})$$

where ${\bf f}_{in}$ and ${\bf f}_{out}$ are the input and output feature maps, $K_s$ is the number of partitions of the spatial kernel, $\tilde{\bf A}_k$ is the adjacency matrix of partition $k$, ${\bf W}_k$ are its learnable weights, and ${\bf D}_k$ is the degree matrix used for normalization.
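As an illustration of this spatial graph convolution, the sketch below implements ${\bf f}_{out}=\sum_k({\bf f}_{in}{\bf A}_k){\bf W}_k$ with a 1×1 convolution per partition and an einsum aggregation; the toy adjacency and dimensions are assumptions for demonstration, not ST-GCN's actual configuration.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """f_out = sum_k (f_in A_k) W_k over K_s normalized adjacency partitions."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer("A", A)          # A: (K_s, V, V), stack of normalized A_k
        # One 1x1 convolution per partition plays the role of W_k
        self.conv = nn.Conv2d(in_channels, out_channels * A.size(0), kernel_size=1)

    def forward(self, x):
        # x: (batch N, C_in, frames T, joints V)
        n, _, t, v = x.shape
        y = self.conv(x).view(n, self.A.size(0), -1, t, v)   # (N, K_s, C_out, T, V)
        # Aggregate neighbors with each A_k, then sum over partitions k
        return torch.einsum("nkctv,kvw->nctw", y, self.A)

K_s, V = 3, 25                                # e.g. 3 partitions, 25 joints (assumed)
A = torch.eye(V).repeat(K_s, 1, 1)            # toy "normalized" adjacency
layer = SpatialGraphConv(3, 64, A)
print(layer(torch.randn(2, 3, 30, V)).shape)  # (2, 64, 30, 25)
```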

### Skeleton-Based Action Recognition Using Adaptive Cross-Form Learning

In the realm of skeleton-based action recognition, adaptive cross-form learning is a sophisticated approach that integrates multiple modalities to enhance performance. It leverages both spatial and temporal information from skeletal data while adapting dynamically across different forms, or representations, of that data.

The core concept is an end-to-end trainable framework in which features extracted from joint coordinates are transformed into various intermediate representations such as graphs or sequences[^1]. These diverse forms capture distinct aspects of human motion patterns:

- **Graph representation**: models interactions between joints by treating them as nodes connected by edges that represent bones.
- **Sequence modeling**: treats each frame's pose-estimation result as an element of time-series data suitable for recurrent neural networks (RNNs).

Adaptive mechanisms allow seamless switching among these forms based on their suitability at different stages of training and inference. Specifically designed modules learn when, and with how much weight, each transformation should contribute, ensuring optimal use of the available cues without overfitting to any single modality.

For implementation, one might combine Graph Convolutional Networks (GCNs) with Long Short-Term Memory units (LSTMs): GCNs excel at capturing the structural dependencies in graphs derived from skeletons, while LSTMs efficiently handle sequential modeling and the long-range dependencies found along a video's timeline. Below is a minimal runnable sketch; the branch definitions, the softmax gating, and the Adam optimizer are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class AdaptiveCrossFormModule(nn.Module):
    def __init__(self, in_features=75, hidden_features=128, num_classes=60):
        super().__init__()
        # Components processing the individual form types:
        # a per-frame "graph-style" branch and a recurrent "sequence" branch.
        self.graph_branch = nn.Linear(in_features, hidden_features)
        self.sequence_branch = nn.LSTM(in_features, hidden_features, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(2))   # adaptive weights over the two forms
        self.classifier = nn.Linear(hidden_features, num_classes)

    def forward(self, input_data):
        # input_data: (batch, time, features) flattened joint coordinates
        g = self.graph_branch(input_data).mean(dim=1)   # pooled per-frame features
        s, _ = self.sequence_branch(input_data)
        s = s[:, -1]                                    # last LSTM hidden state
        w = torch.softmax(self.gate, dim=0)             # learned form weighting
        return self.classifier(w[0] * g + w[1] * s)

def train_model(model, dataset_loader, num_epochs=10):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(num_epochs):
        running_loss = 0.0
        for inputs, labels in dataset_loader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
```
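A quick smoke test under the same assumptions (hypothetical sizes: 25 joints × 3 coordinates = 75 features per frame, 60 classes):

```python
model = AdaptiveCrossFormModule(in_features=75, hidden_features=128, num_classes=60)
logits = model(torch.randn(4, 30, 75))   # 4 clips of 30 frames each
print(logits.shape)                      # torch.Size([4, 60])
```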