Contents
Joint Training of the Networks
Regularized Objective Function
Paper: An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data (AAAI 2017)
Download: https://arxiv.org/pdf/1611.06067v1.pdf
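Main Contributions
The authors propose a framework that uses attention mechanisms to learn spatio-temporal features from skeleton joint data for action recognition.
The framework consists of three parts: a main LSTM network, a spatial attention subnetwork, and a temporal attention subnetwork.
In the spatial attention subnetwork, an LSTM layer learns the relationship between the joints of the current frame and those of previous frames, producing an attention map over the joints of the current input frame. This automatically identifies which joints in the current frame matter most for recognizing the action.
In the temporal attention subnetwork, an LSTM layer learns the relationship between the current frame and previous frames, producing an attention value for the current input frame. This automatically learns which frames contribute most to recognizing the action.
In addition, the authors train the networks jointly with an alternating training scheme, and design a regularized loss function to keep the model from overfitting.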
Proposed Method
Spatial Attention
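At each time step t, the input is the full set of K joints $x_t = (x_{t,1}, \dots, x_{t,K})^T$, with $x_{t,k} \in \mathbb{R}^3$.
The scores $s_t = (s_{t,1}, \dots, s_{t,K})^T$ indicating the importance of the K joints are jointly obtained as
$s_t = U_s \tanh(W_{xs} x_t + W_{hs} h^s_{t-1} + b_s) + b_{us}$,
where $U_s$, $W_{xs}$, $W_{hs}$ are learnable parameter matrices, $b_s$, $b_{us}$ are bias vectors, and $h^s_{t-1}$ is the hidden variable from the LSTM layer of the spatial attention subnetwork.
For the k-th joint, the activation as the joint-selection gate is computed as
$\alpha_{t,k} = \exp(s_{t,k}) / \sum_{i=1}^{K} \exp(s_{t,i})$.
Instead of assigning equal degrees of importance to all the joints $x_t$, the input to the main LSTM network is modulated to $x'_t = (x'_{t,1}, \dots, x'_{t,K})^T$, where $x'_{t,k} = \alpha_{t,k} \cdot x_{t,k}$.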
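The joint-selection gate and the modulated input described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes the per-joint scores have already been produced by the spatial attention subnetwork, it normalizes them with a softmax, and the helper names (`softmax`, `modulate_joints`) and all numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_joints(joints, scores):
    """Scale each joint's 3D coordinates by its attention weight.

    joints: list of K (x, y, z) tuples for one frame.
    scores: list of K importance scores, assumed to come from the
            spatial attention subnetwork (made-up values below).
    Returns the attention weights and the modulated joints.
    """
    if len(joints) != len(scores):
        raise ValueError("need exactly one score per joint")
    alphas = softmax(scores)
    modulated = [tuple(a * c for c in joint) for a, joint in zip(alphas, joints)]
    return alphas, modulated


# Toy frame with K = 3 joints; equal scores give equal weights of 1/3.
joints = [(1.0, 0.0, 2.0), (0.5, 0.5, 0.5), (2.0, 1.0, 0.0)]
scores = [0.0, 0.0, 0.0]
alphas, modulated = modulate_joints(joints, scores)
```

Because the weights sum to 1 over the K joints, raising the score of one joint necessarily lowers the share given to the others, which is what lets the subnetwork emphasize the joints that are informative for the current frame.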
Temporal Attention
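The activation as the frame-selection gate can be computed as
$\beta_t = \mathrm{ReLU}(w_{\tilde{x}} x_t + w_{\tilde{h}} \tilde{h}_{t-1} + \tilde{b})$,
which depends on the current input $x_t$ and the hidden variable $\tilde{h}_{t-1}$ from the LSTM layer of the temporal attention subnetwork.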
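For the sequence-level classification, based on the output $z_t$ of the main LSTM network and the temporal attention value $\beta_t$ at each time step t, the scores for the C classes are the weighted summation of the scores at all time steps:
$o = \sum_{t=1}^{T} \beta_t \cdot z_t$,
where $o = (o_1, \dots, o_C)^T$ and T is the length of the sequence.
The predicted probability of the i-th class given a sequence X is
$p(C_i \mid X) = e^{o_i} / \sum_{j=1}^{C} e^{o_j}$, for $i = 1, \dots, C$.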
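As a rough illustration of the sequence-level classification step, the sketch below forms the class scores as a weighted sum of per-frame outputs and turns them into probabilities with a softmax. It assumes the per-frame outputs of the main LSTM network and the temporal attention values are already available. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def sequence_scores(frame_outputs, betas):
    """Weighted sum over time of per-frame class scores.

    frame_outputs: list of T lists, each holding C class scores
                   (assumed outputs of the main LSTM network).
    betas: list of T temporal attention values (assumed given).
    """
    if len(frame_outputs) != len(betas):
        raise ValueError("need exactly one attention value per frame")
    num_classes = len(frame_outputs[0])
    o = [0.0] * num_classes
    for z_t, beta_t in zip(frame_outputs, betas):
        for c in range(num_classes):
            o[c] += beta_t * z_t[c]
    return o


def class_probabilities(o):
    """Softmax over the sequence-level scores (numerically stable)."""
    m = max(o)
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sequence: T = 3 frames, C = 2 classes. The second frame gets
# weight 0, so it does not influence the final scores at all.
frame_outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
betas = [0.5, 0.0, 1.0]
o = sequence_scores(frame_outputs, betas)
probs = class_probabilities(o)
```

A frame whose attention value is zero drops out of the sum entirely, so the gate acts as a soft frame selector rather than a fixed average over time.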
Joint Training of the Networks
Regularized Objective Function
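The final objective function combines the cross-entropy loss with three regularization terms:
$L = -\sum_{i=1}^{C} y_i \log \hat{y}_i + \lambda_1 \sum_{k=1}^{K} \left(1 - \frac{\sum_{t=1}^{T} \alpha_{t,k}}{T}\right)^2 + \frac{\lambda_2}{T} \sum_{t=1}^{T} \lVert \beta_t \rVert_2 + \lambda_3 \lVert W_{uv} \rVert_1$,
where $y$ is the one-hot ground-truth label, $\hat{y}_i$ is the predicted probability of the i-th class, and $W_{uv}$ denotes the connection matrices of the networks merged into one matrix.
The scalars λ1, λ2, and λ3 balance the contribution of the three regularization terms.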
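To make the role of the three weights concrete, here is a minimal sketch of such a regularized objective. The exact form of each term is an assumption based on the paper's description: a cross-entropy loss, a penalty that pushes the time-averaged spatial attention of each joint toward 1, a norm penalty on the temporal attention values (treated here as scalars), and an L1 penalty on the network weights. All names and numbers are hypothetical.

```python
import math


def regularized_loss(y_true, y_pred, alphas, betas, weights, lam1, lam2, lam3):
    """Cross-entropy plus three weighted regularization terms.

    y_true:  one-hot label, list of C values.
    y_pred:  predicted class probabilities, list of C values.
    alphas:  T x K spatial attention weights (list of lists).
    betas:   T scalar temporal attention values.
    weights: flat list of network weights (stand-in for a merged matrix).
    """
    T = len(alphas)
    K = len(alphas[0])
    # Cross-entropy; classes with a zero label contribute nothing.
    cross_entropy = -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)
    # Spatial term: squared gap between 1 and each joint's mean attention.
    spatial = sum(
        (1.0 - sum(alphas[t][k] for t in range(T)) / T) ** 2 for k in range(K)
    )
    # Temporal term: mean magnitude of the frame-selection gates.
    temporal = sum(abs(b) for b in betas) / T
    # L1 penalty on the weights.
    l1 = sum(abs(w) for w in weights)
    return cross_entropy + lam1 * spatial + lam2 * temporal + lam3 * l1


# Toy values: C = 2 classes, T = 2 frames, K = 2 joints.
y_true = [1.0, 0.0]
y_pred = [0.5, 0.5]
alphas = [[0.5, 0.5], [0.5, 0.5]]
betas = [1.0, 1.0]
weights = [1.0, -1.0]
plain = regularized_loss(y_true, y_pred, alphas, betas, weights, 0.0, 0.0, 0.0)
full = regularized_loss(y_true, y_pred, alphas, betas, weights, 1.0, 1.0, 1.0)
```

Setting all three weights to zero recovers the plain cross-entropy loss, which is a convenient sanity check when tuning them.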