[Paper Reading Notes] (2015 CVPR) Hierarchical recurrent neural network for skeleton based action recognition


Authors

Notes

Contributions

We propose an end-to-end hierarchical RNN for skeleton based action recognition. Instead of taking the whole skeleton as the input, we divide the human skeleton into five parts according to human physical structure, and then separately feed them to five subnets. As the number of layers increases, the representations extracted by the subnets are hierarchically fused to be the inputs of higher layers. The final representations of the skeleton sequences are fed into a single-layer perceptron, and the temporally accumulated output of the perceptron is the final decision. We compare with five other deep RNN architectures derived from our model to verify the effectiveness of the proposed network, and also compare with several other methods on three publicly available datasets. Experimental results demonstrate that our model achieves the state-of-the-art performance with high computational efficiency.


Method

Preliminaries. The output of a single hidden layer RNN can be derived as:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad o_t = W_{ho} h_t + b_o$$
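As a concrete illustration of the recurrence above, here is a minimal plain-Python sketch of one RNN step (list-based vectors; the weight names follow the equation and are not tied to any framework):

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    Vectors are Python lists, weight matrices are lists of row lists."""
    h = []
    for i in range(len(b_h)):
        s = b_h[i]
        s += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        s += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(s))
    return h
```

Iterating this step over the frames of a sequence, feeding each `h` back in as `h_prev`, yields the hidden representation at every time step.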

The output of a single hidden layer LSTM can be derived as:

$$\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
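The gating can be sketched as a scalar LSTM step (a toy one-unit cell; the weight-dictionary keys are placeholders chosen here, not the paper's notation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step with input gate i, forget gate f,
    cell candidate g and output gate o."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])
    c = f * c_prev + i * g          # gated cell-state update
    h = o * math.tanh(c)            # exposed hidden state
    return h, c
```

The cell state `c` is what lets the LSTM retain information over the long frame sequences that a tanh-RNN forgets.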

The bidirectional recurrent neural network (BRNN) presents the sequence forwards and backwards to two separate recurrent hidden layers.

It should be noted that an LSTM-BRNN can be obtained simply by replacing the nonlinear units in the figure above with LSTM blocks.
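The forwards-and-backwards idea can be sketched with a toy scalar cell (default weights here are arbitrary illustration values):

```python
import math

def tanh_step(x, h_prev, w_x, w_h, b):
    # scalar-state RNN cell for brevity: h_t = tanh(w_x*x + w_h*h_prev + b)
    return math.tanh(w_x * x + w_h * h_prev + b)

def brnn(seq, w_x=0.5, w_h=0.5, b=0.0):
    """Run the same cell forward and backward over `seq` and pair the
    two hidden states at every time step, as a BRNN layer does."""
    T = len(seq)
    hf, hb = [0.0] * T, [0.0] * T
    h = 0.0
    for t in range(T):               # forward pass: t = 0 .. T-1
        h = tanh_step(seq[t], h, w_x, w_h, b)
        hf[t] = h
    h = 0.0
    for t in reversed(range(T)):     # backward pass: t = T-1 .. 0
        h = tanh_step(seq[t], h, w_x, w_h, b)
        hb[t] = h
    return [(hf[t], hb[t]) for t in range(T)]
```

Each output pair sees both the past (forward state) and the future (backward state) of the sequence, which is why the bidirectional variants outperform the unidirectional ones in the experiments.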

Architecture. According to human physical structure, the human skeleton can be decomposed into five parts, i.e., two arms, two legs and one trunk.
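The part split can be sketched as a joint-index grouping; the indices below assume a hypothetical 20-joint skeleton and are illustrative only (the real grouping depends on the capture device's joint layout):

```python
# Hypothetical joint-index grouping for a 20-joint skeleton.
PARTS = {
    "trunk":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def split_skeleton(frame):
    """frame: list of 20 (x, y, z) joint coordinates -> dict of five
    per-part coordinate lists, one input per subnet."""
    return {name: [frame[j] for j in idx] for name, idx in PARTS.items()}
```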

1. In the first layer bl1, the five skeleton parts are fed into five corresponding bidirectionally recurrently connected subnets (BRNNs).

2. To model the neighboring skeleton parts, e.g., left arm–trunk, right arm–trunk, left leg–trunk, and right leg–trunk, we combine the representation of the trunk subnet with that of each of the other four subnets to obtain four new representations in the fusion layer fl1. For a fusion layer fl_i at time t, the j-th newly concatenated representation v_j^{i+1}(t), which serves as the input of the (i+1)-th BRNN layer bl_{i+1}, is

$$v_j^{i+1}(t) = \left[\, h_{jf}^{i}(t) \oplus h_{trf}^{i}(t),\; h_{jb}^{i}(t) \oplus h_{trb}^{i}(t) \,\right]$$

where ⊕ denotes the concatenation operator, h_{jf}^{i}(t) and h_{jb}^{i}(t) are the hidden representations of the forward layer and backward layer of the j-th part in the i-th BRNN layer, and h_{trf}^{i}(t) and h_{trb}^{i}(t) are those of the trunk part tr in the i-th layer.
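The fusion itself is plain concatenation, which can be sketched as (the limb/trunk vectors below are made-up example values):

```python
def fuse_with_trunk(part_f, part_b, trunk_f, trunk_b):
    """Concatenate a limb subnet's forward/backward hidden vectors with the
    trunk subnet's, giving the input of the next BRNN layer."""
    return part_f + trunk_f, part_b + trunk_b  # list concatenation

# e.g. one of the four new representations in the fusion layer:
limbs = {"left_arm": ([0.1, 0.2], [0.3, 0.4])}
trunk = ([0.5], [0.6])
fused = {name: fuse_with_trunk(f, b, *trunk) for name, (f, b) in limbs.items()}
```

Note that the forward streams and backward streams are concatenated separately, so the next BRNN layer keeps its forward and backward inputs distinct.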

3. Similar to the layer bl1, these resulting four representations are separately fed into four BRNNs in the layer bl2.

4. To model the upper and lower body, the representations of the left arm–trunk and right arm–trunk BRNNs are further combined to obtain the upper body representation, while the representations of the left leg–trunk and right leg–trunk BRNNs are combined to obtain the lower body representation, in the fusion layer fl2.

5. Finally, the two newly obtained representations are fed into two BRNNs in the layer bl3, and the representations of these two BRNNs are fused again to represent the whole body in the fusion layer fl3.

6. The temporal dynamics of the whole body representation are further modelled by another BRNN in the layer bl4. Note that LSTM neurons are adopted only in this last recurrent layer (bl4); the first three BRNN layers (bl1–bl3) all use the tanh activation function.

7. From the viewpoint of feature learning, these stacked BRNNs can be considered to extract the spatial and temporal features of the skeleton sequences. After obtaining the final features of a skeleton sequence, a fully connected layer fc and a softmax layer sm are applied to classify the action. Combining the forward and backward hidden representations h_f(t) and h_b(t) of the last recurrent layer as the input to the fully connected layer fc, the output o(t) of the layer fc is

$$o(t) = W_{fo}\, h_f(t) + W_{bo}\, h_b(t)$$

where W_{fo} and W_{bo} are the connection weights from the forward and backward layers of bl4 to the layer fc.

 

8. Finally, the outputs of the layer fc are accumulated across the T-frame sequence, and the accumulated results O are normalized by the softmax function to get the probability p(C_k) of each class:

$$O = \sum_{t=1}^{T} o(t), \qquad p(C_k) = \frac{e^{O_k}}{\sum_{j=1}^{C} e^{O_j}}$$

Here there are C classes of human actions.
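The accumulate-then-softmax decision rule can be sketched as follows (a minimal version with a max-subtraction trick added for numerical stability, which the paper does not discuss):

```python
import math

def class_probabilities(per_frame_outputs):
    """Sum the fc outputs o(t) over all T frames, then softmax the
    accumulated scores into C class probabilities."""
    C = len(per_frame_outputs[0])
    acc = [sum(o[k] for o in per_frame_outputs) for k in range(C)]
    m = max(acc)                         # subtract max before exp (stability)
    exps = [math.exp(a - m) for a in acc]
    z = sum(exps)
    return [e / z for e in exps]
```

Accumulating before the softmax means every frame votes on the action class, rather than only the last frame's output deciding.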

The objective function of our model is to minimize the negative log-likelihood loss function L:

$$\mathcal{L} = -\sum_{m=1}^{M} \sum_{k=1}^{C} \delta(k - c_m)\, \ln p(C_k \mid S_m)$$

where δ(·) is the Kronecker delta function, and c_m denotes the groundtruth label of the sequence S_m. There are M sequences in the training set.
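Since the Kronecker delta simply picks out the probability assigned to the groundtruth class of each sequence, the loss reduces to a sum of negative log probabilities, sketched as:

```python
import math

def nll_loss(probs_per_sequence, labels):
    """Negative log-likelihood over M sequences: for each sequence, take
    -ln of the predicted probability of its groundtruth class."""
    loss = 0.0
    for p, label in zip(probs_per_sequence, labels):
        loss -= math.log(p[label])
    return loss
```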


Results

  • D: Deep bidirectional RNN (DBRNN-L), which is directly stacked from several BRNNs, taking the whole human skeleton as the input
  • H: Hierarchical
  • U: Unidirectional
  • B: Bidirectional
  • L: LSTM
  • T: with the tanh activation function in all layers
  • The number 30 × 5 (LL1, HU.L) means that each unidirectional subnet in the first learnable layer of HURNN-L has 30 neurons
  • In the Hidden size column of the bl rows, the numbers 5, 4, 2, 1 for HB.L and HU.L refer to the number of BRNNs in each hierarchical layer

  • Bidirectional models beat unidirectional ones
  • The hierarchical model beats taking the whole skeleton as a single input
  • Using LSTM in the last layer beats using tanh-RNN in all layers

  • The numbers correspond to the accuracies in the last row of Table 2
  • The dataset is divided into three action sets AS1, AS2 and AS3
  • https://www.researchgate.net/figure/The-list-of-actions-in-three-subsets-AS1-AS2-and-AS3-of-the-MSR-Action-3D-dataset-59_tbl2_323631964
  • The dataset contains 20 actions in total; each action set contains only 8 actions

  • Motion Capture Dataset HDM05:
  • As stated in [4], some samples of these 130 actions should be classified into the same category, e.g., jogging starting from air and jogging starting from floor are the same action, and jogging 2 steps and jogging 4 steps belong to the same "jogging" action. After sample combination, the actions are reduced to 65 categories.
  • This dataset is evaluated with 10-fold cross-validation; the figure shows the results of 2 of the folds
