[Paper Reading Notes] (2015 CVPR) Hierarchical recurrent neural network for skeleton based action recognition


Authors

Notes

Contributions

We propose an end-to-end hierarchical RNN for skeleton based action recognition. Instead of taking the whole skeleton as the input, we divide the human skeleton into five parts according to human physical structure, and then separately feed them to five subnets. As the number of layers increases, the representations extracted by the subnets are hierarchically fused to be the inputs of higher layers. The final representations of the skeleton sequences are fed into a single-layer perceptron, and the temporally accumulated output of the perceptron is the final decision. We compare with five other deep RNN architectures derived from our model to verify the effectiveness of the proposed network, and also compare with several other methods on three publicly available datasets. Experimental results demonstrate that our model achieves the state-of-the-art performance with high computational efficiency.


Method

Preliminaries. The output of a single hidden layer RNN can be derived as:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad o_t = W_{ho} h_t + b_o$$
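As a concrete illustration of the recurrence above, here is a minimal plain-Python sketch of one RNN step (list-based vectors; the weight names follow the equation and are not tied to any framework):

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    Vectors are Python lists, weight matrices are lists of row lists."""
    h = []
    for i in range(len(b_h)):
        s = b_h[i]
        s += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        s += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(s))
    return h
```

Iterating this step over the frames of a sequence, feeding each `h` back in as `h_prev`, yields the hidden representation at every time step.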

The output of a single hidden layer LSTM can be derived as:

$$\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$
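The gating can be sketched as a scalar LSTM step (a toy one-unit cell; the weight-dictionary keys are placeholders chosen here, not the paper's notation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step with input gate i, forget gate f,
    cell candidate g and output gate o."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])
    c = f * c_prev + i * g          # gated cell-state update
    h = o * math.tanh(c)            # exposed hidden state
    return h, c
```

The cell state `c` is what lets the LSTM retain information over the long frame sequences that a tanh-RNN forgets.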

The bidirectional recurrent neural network (BRNN) presents the sequence forwards and backwards to two separate recurrent hidden layers.

It should be noted that an LSTM-BRNN can be obtained simply by replacing the nonlinear units in the figure above with LSTM blocks.
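The forwards-and-backwards idea can be sketched with a toy scalar cell (default weights here are arbitrary illustration values):

```python
import math

def tanh_step(x, h_prev, w_x, w_h, b):
    # scalar-state RNN cell for brevity: h_t = tanh(w_x*x + w_h*h_prev + b)
    return math.tanh(w_x * x + w_h * h_prev + b)

def brnn(seq, w_x=0.5, w_h=0.5, b=0.0):
    """Run the same cell forward and backward over `seq` and pair the
    two hidden states at every time step, as a BRNN layer does."""
    T = len(seq)
    hf, hb = [0.0] * T, [0.0] * T
    h = 0.0
    for t in range(T):               # forward pass: t = 0 .. T-1
        h = tanh_step(seq[t], h, w_x, w_h, b)
        hf[t] = h
    h = 0.0
    for t in reversed(range(T)):     # backward pass: t = T-1 .. 0
        h = tanh_step(seq[t], h, w_x, w_h, b)
        hb[t] = h
    return [(hf[t], hb[t]) for t in range(T)]
```

Each output pair sees both the past (forward state) and the future (backward state) of the sequence, which is why the bidirectional variants outperform the unidirectional ones in the experiments.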

Architecture. According to human physical structure, the human skeleton can be decomposed into five parts, i.e., two arms, two legs and one trunk.
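The part split can be sketched as a joint-index grouping; the indices below assume a hypothetical 20-joint skeleton and are illustrative only (the real grouping depends on the capture device's joint layout):

```python
# Hypothetical joint-index grouping for a 20-joint skeleton.
PARTS = {
    "trunk":     [0, 1, 2, 3],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}

def split_skeleton(frame):
    """frame: list of 20 (x, y, z) joint coordinates -> dict of five
    per-part coordinate lists, one input per subnet."""
    return {name: [frame[j] for j in idx] for name, idx in PARTS.items()}
```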

1. In the first layer bl1, the five skeleton parts are fed into five corresponding bidirectionally recurrently connected subnets (BRNNs).

2. To model the neighboring skeleton parts, e.g., left arm–trunk, right arm–trunk, left leg–trunk, and right leg–trunk, we combine the representation of the trunk subnet with that of each of the other four subnets to obtain four new representations in the fusion layer fl1. For a fusion layer fl_i at time t, the j-th newly concatenated representation v_j^{i+1}(t), which serves as the input of the (i+1)-th BRNN layer bl_{i+1}, is

$$v_j^{i+1}(t) = \left[\, h_{jf}^{i}(t) \oplus h_{trf}^{i}(t),\; h_{jb}^{i}(t) \oplus h_{trb}^{i}(t) \,\right]$$

where ⊕ denotes the concatenation operator, h_{jf}^{i}(t) and h_{jb}^{i}(t) are the hidden representations of the forward layer and backward layer of the j-th part in the i-th BRNN layer, and h_{trf}^{i}(t) and h_{trb}^{i}(t) are those of the trunk part tr in the i-th layer.
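The fusion itself is plain concatenation, which can be sketched as (the limb/trunk vectors below are made-up example values):

```python
def fuse_with_trunk(part_f, part_b, trunk_f, trunk_b):
    """Concatenate a limb subnet's forward/backward hidden vectors with the
    trunk subnet's, giving the input of the next BRNN layer."""
    return part_f + trunk_f, part_b + trunk_b  # list concatenation

# e.g. one of the four new representations in the fusion layer:
limbs = {"left_arm": ([0.1, 0.2], [0.3, 0.4])}
trunk = ([0.5], [0.6])
fused = {name: fuse_with_trunk(f, b, *trunk) for name, (f, b) in limbs.items()}
```

Note that the forward streams and backward streams are concatenated separately, so the next BRNN layer keeps its forward and backward inputs distinct.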

3. Similar to the layer bl1, these resulting four representations are separately fed into four BRNNs in the layer bl2.

4. To model the upper and lower body, the representations of the left arm–trunk and right arm–trunk BRNNs are further combined to obtain the upper body representation, while the representations of the left leg–trunk and right leg–trunk BRNNs are combined to obtain the lower body representation, in the fusion layer fl2.

5. Finally, the two newly obtained representations are fed into two BRNNs in the layer bl3, and the representations of these two BRNNs are fused again to represent the whole body in the fusion layer fl3.

6. The temporal dynamics of the whole body representation are further modelled by another BRNN in the layer bl4. Note that LSTM neurons are adopted only in this last recurrent layer (bl4); the first three BRNN layers (bl1–bl3) all use the tanh activation function.

7. From the viewpoint of feature learning, these stacked BRNNs can be considered to extract the spatial and temporal features of the skeleton sequences. After obtaining the final features of a skeleton sequence, a fully connected layer fc and a softmax layer sm are applied to classify the action. Combining the forward and backward hidden representations h_f(t) and h_b(t) of the last recurrent layer as the input to the fully connected layer fc, the output o(t) of the layer fc is

$$o(t) = W_{fo}\, h_f(t) + W_{bo}\, h_b(t)$$

where W_{fo} and W_{bo} are the connection weights from the forward and backward layers of bl4 to the layer fc.

 

8. Finally, the outputs of the layer fc are accumulated across the T-frame sequence, and the accumulated results O are normalized by the softmax function to get the probability p(C_k) of each class:

$$O = \sum_{t=1}^{T} o(t), \qquad p(C_k) = \frac{e^{O_k}}{\sum_{j=1}^{C} e^{O_j}}$$

Here there are C classes of human actions.
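The accumulate-then-softmax decision rule can be sketched as follows (a minimal version with a max-subtraction trick added for numerical stability, which the paper does not discuss):

```python
import math

def class_probabilities(per_frame_outputs):
    """Sum the fc outputs o(t) over all T frames, then softmax the
    accumulated scores into C class probabilities."""
    C = len(per_frame_outputs[0])
    acc = [sum(o[k] for o in per_frame_outputs) for k in range(C)]
    m = max(acc)                         # subtract max before exp (stability)
    exps = [math.exp(a - m) for a in acc]
    z = sum(exps)
    return [e / z for e in exps]
```

Accumulating before the softmax means every frame votes on the action class, rather than only the last frame's output deciding.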

The objective function of our model is to minimize the negative log-likelihood loss function L:

$$\mathcal{L} = -\sum_{m=1}^{M} \sum_{k=1}^{C} \delta(k - c_m)\, \ln p(C_k \mid S_m)$$

where δ(·) is the Kronecker delta function, and c_m denotes the groundtruth label of the sequence S_m. There are M sequences in the training set.
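Since the Kronecker delta simply picks out the probability assigned to the groundtruth class of each sequence, the loss reduces to a sum of negative log probabilities, sketched as:

```python
import math

def nll_loss(probs_per_sequence, labels):
    """Negative log-likelihood over M sequences: for each sequence, take
    -ln of the predicted probability of its groundtruth class."""
    loss = 0.0
    for p, label in zip(probs_per_sequence, labels):
        loss -= math.log(p[label])
    return loss
```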


Results

  • D: Deep bidirectional RNN (DBRNN-L), which is directly stacked from several BRNNs, taking the whole human skeleton as the input
  • H: Hierarchical
  • U: Unidirectional
  • B: Bidirectional
  • L: LSTM
  • T: with the tanh activation function in all layers
  • The number 30 × 5 (LL1, HU.L) means that each unidirectional subnet in the first learnable layer of HURNN-L has 30 neurons
  • In the Hidden size column of the bl rows, the numbers 5, 4, 2, 1 for HB.L and HU.L refer to the number of BRNNs in each hierarchical layer

  • Bidirectional models beat unidirectional ones
  • The hierarchical model beats taking the whole skeleton as a single input
  • Using LSTM in the last layer beats using tanh-RNN in all layers

  • The numbers correspond to the accuracies in the last row of Table 2
  • The dataset is divided into three action sets AS1, AS2 and AS3
  • https://www.researchgate.net/figure/The-list-of-actions-in-three-subsets-AS1-AS2-and-AS3-of-the-MSR-Action-3D-dataset-59_tbl2_323631964
  • The dataset contains 20 actions in total; each action set contains only 8 actions

  • Motion Capture Dataset HDM05:
  • As stated in [4], some samples of these 130 actions should be classified into the same category, e.g., jogging starting from air and jogging starting from floor are the same action, and jogging 2 steps and jogging 4 steps belong to the same "jogging" action. After sample combination, the actions are reduced to 65 categories.
  • This dataset is evaluated with 10-fold cross-validation; the figure shows the results of 2 of the folds
