Actional-Structural Graph Convolutional Networks for Skeleton-based Action Recognition
(2019 CVPR)
Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian
Notes
Contributions
- We propose the A-link inference module (AIM) to infer actional links, which capture action-specific latent dependencies. The actional links are combined with structural links to form generalized skeleton graphs.
- We propose the actional-structural graph convolution network (AS-GCN) to extract useful spatial and temporal information based on the multiple graphs.
- We introduce an additional future pose prediction head to predict future poses, which also improves the recognition performance by capturing more detailed action patterns.
- The AS-GCN outperforms several state-of-the-art methods on two large-scale datasets; as a side product, AS-GCN is also able to precisely predict future poses.
Method
Actional Links (A-links)
To capture richer dependencies, we introduce an encoder-decoder structure, called A-link inference module, to capture action-specific latent dependencies, i.e. actional links, directly from actions.
1. Encoder. The encoder estimates the states of the A-links given the 3D joint positions across time; each candidate link is assigned a distribution over C A-link types. It produces the A-links by iteratively propagating information between joints and links to learn link features.
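The joint-to-link propagation above can be sketched as follows: a minimal NumPy toy in which each joint pair's link feature is the concatenation of its endpoint features, projected and softmaxed into a distribution over link types. The projection `W_edge` and all shapes are hypothetical stand-ins, not the paper's learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def infer_alinks(joint_feats, n_types, rng):
    """Toy AIM-encoder step: for every joint pair (i, j), build a link
    feature from the two endpoint features and softmax it into a
    distribution over n_types A-link types (random toy weights)."""
    V, d = joint_feats.shape
    W_edge = 0.1 * rng.standard_normal((2 * d, n_types))  # hypothetical projection
    links = np.zeros((V, V, n_types))
    for i in range(V):
        for j in range(V):
            h = np.concatenate([joint_feats[i], joint_feats[j]])
            links[i, j] = softmax(h @ W_edge)
    return links  # links[i, j] is a distribution over A-link types

rng = np.random.default_rng(0)
A = infer_alinks(rng.standard_normal((25, 8)), n_types=3, rng=rng)
print(A.shape)  # (25, 25, 3)
```

In the paper this joint-to-link-to-joint propagation is repeated for several rounds; one round is enough to show the data flow.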
2. Decoder. The decoder predicts the future 3D joint positions conditioned on the A-links inferred by the encoder and on the previous poses; this reconstruction objective pushes the encoder to infer links that are informative about the action.
3. AGC. Given the input features Xin, the actional graph convolution (AGC) aggregates joint features over the inferred A-links, where W is the trainable weight capturing feature importance. Note that we use the AIM to warm up the A-links in the pretraining process; during the training of action recognition and pose prediction, the A-links are further optimized by forward-passing only the encoder of the AIM.
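A minimal sketch of this aggregation, assuming each of the C link types contributes a soft adjacency and its own feature transform whose responses are summed (toy shapes, not the paper's exact formulation):

```python
import numpy as np

def actional_graph_conv(X, A_links, W):
    """AGC sketch: each A-link type c has a soft adjacency A_links[c]
    and its own feature transform W[c]; responses are summed over types.
    X: (V, d_in), A_links: (C, V, V), W: (C, d_in, d_out)."""
    return sum(A_links[c] @ X @ W[c] for c in range(A_links.shape[0]))

rng = np.random.default_rng(1)
V, d_in, d_out, C = 25, 8, 16, 3
A = rng.random((C, V, V))
A /= A.sum(axis=-1, keepdims=True)  # row-normalize each soft adjacency
X_out = actional_graph_conv(rng.standard_normal((V, d_in)), A,
                            rng.standard_normal((C, d_in, d_out)))
print(X_out.shape)  # (25, 16)
```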
Structural Links (S-links)
With an L-order polynomial of the skeleton adjacency, we define the structural graph convolution (SGC), which can directly reach L-hop neighbors to increase the receptive field; here M and W are the trainable weights that capture edge weights and feature importance, respectively.
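The polynomial form can be sketched as summing one masked adjacency power per order, each with its own feature transform; `M` and `W` below are random toy stand-ins for the trainable weights:

```python
import numpy as np

def structural_graph_conv(X, A_hat, M, W):
    """SGC sketch: an L-order polynomial of the normalized skeleton
    adjacency A_hat, so one layer reaches up to L-hop neighbors.
    M[l]: (V, V) edge-importance mask; W[l]: (d_in, d_out) transform."""
    V = A_hat.shape[0]
    out = np.zeros((V, W.shape[-1]))
    A_pow = np.eye(V)  # order-0 term: the joint itself
    for l in range(len(W)):  # orders 0 .. L
        out += (M[l] * A_pow) @ X @ W[l]
        A_pow = A_pow @ A_hat  # next power of the adjacency
    return out

rng = np.random.default_rng(2)
V, d_in, d_out, L = 25, 8, 16, 2
A_hat = rng.random((V, V))
A_hat /= A_hat.sum(axis=-1, keepdims=True)  # toy row-normalization
out = structural_graph_conv(rng.standard_normal((V, d_in)), A_hat,
                            np.ones((L + 1, V, V)),
                            rng.standard_normal((L + 1, d_in, d_out)))
print(out.shape)  # (25, 16)
```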
Actional-Structural Graph Convolution Block
AS-GCN (Backbone network)
Multitasking of AS-GCN
1. Action recognition head. To classify actions, we construct a recognition head following the backbone network. We apply global average pooling over the joint and temporal dimensions of the feature maps output by the backbone network to obtain a feature vector, which is fed into a softmax classifier to produce the predicted class label. The loss function for action recognition is the standard cross-entropy loss.
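The pooling-classifier-loss pipeline above can be sketched in a few lines; `W_cls` is a hypothetical toy classifier weight, and the feature-map shape `(channels, frames, joints)` is an assumption:

```python
import numpy as np

def recognition_head(feat, W_cls, label):
    """Recognition-head sketch: global average pooling over the temporal
    and joint dimensions, a linear softmax classifier (toy weights
    W_cls), and the cross-entropy loss. feat: (channels, frames, joints)."""
    v = feat.mean(axis=(1, 2))        # pooled feature vector, (channels,)
    logits = v @ W_cls                # (n_classes,)
    p = np.exp(logits - logits.max())
    p /= p.sum()                      # softmax probabilities
    return p, -np.log(p[label])      # predicted distribution, CE loss

rng = np.random.default_rng(3)
p, loss = recognition_head(rng.standard_normal((64, 30, 25)),
                           rng.standard_normal((64, 10)), label=4)
print(p.sum())  # 1.0 (softmax normalizes the scores)
```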
2. Future pose prediction head. To predict future poses, we construct a prediction module following the backbone network. We use several AS-GCN blocks to decode the high-level feature maps extracted from the historical data and obtain the predicted future 3D joint positions.
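As a shape-level sketch of this head, a per-joint linear decoder can stand in for the AS-GCN decoding blocks: it maps each joint's pooled backbone feature to `t_future` future 3D positions. `W_dec` and all shapes are hypothetical.

```python
import numpy as np

def prediction_head(feat, W_dec, t_future):
    """Prediction-head sketch: pool backbone features over time, then map
    each joint's feature to t_future future 3D positions with a shared
    linear decoder W_dec (a toy stand-in for the AS-GCN decoding blocks).
    feat: (channels, frames, joints) -> output: (3, t_future, joints)."""
    c, _, v = feat.shape
    pooled = feat.mean(axis=1)       # (channels, joints)
    out = W_dec.T @ pooled           # (3 * t_future, joints)
    return out.reshape(3, t_future, v)

rng = np.random.default_rng(4)
pred = prediction_head(rng.standard_normal((64, 30, 25)),
                       rng.standard_normal((64, 3 * 10)), t_future=10)
print(pred.shape)  # (3, 10, 25)
```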
When we train the recognition head and the future prediction head together, recognition performance improves.
Results