Paper reading: A new representation of skeleton sequences for 3d action recognition

A new representation of skeleton sequences for 3d action recognition

(2017 CVPR)

Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, Farid Boussaid

Notes

 

Contributions

  1. We propose to transform each skeleton sequence into a new representation, i.e., three clips, which allows global long-term temporal modelling of the skeleton sequence by using deep CNNs to learn hierarchical features from frame images.
  2. We introduce a Multi-Task Learning Network (MTLN) that processes all the CNN features of the frames in the generated clips to learn the spatial structure and temporal information of the skeleton sequence. The MTLN improves performance by exploiting the intrinsic relationships among the different frames of the generated clips.

 


 

Method

Clip Generation

We propose to represent the temporal dynamics of the skeleton sequence in a frame image, and then use multiple frames to incorporate different spatial relationships between the joints.

Specifically, for a skeleton sequence:

  1. The skeleton joints of each frame are first arranged as a chain by concatenating the joints of each body part.
  2. Then, four reference joints, namely the left shoulder, the right shoulder, the left hip and the right hip, are respectively used to compute the relative positions of the other joints, thereby incorporating different spatial relationships between joints and providing useful structural information of the skeleton.
  3. By combining the relative joints of all the frames, four 2D arrays with dimension (m−1) × t are generated (m is the number of skeleton joints in each frame and t is the number of frames of the skeleton sequence). The relative positions of joints in the 2D arrays are originally described with 3D Cartesian coordinates.
  4. Since cylindrical coordinates are more useful for analysing the motions, as the human body performs actions through pivotal joint movements, the 3D Cartesian coordinates are transformed to cylindrical coordinates in the proposed representation of skeleton sequences.
  5. The four 2D arrays corresponding to the same channel of the 3D cylindrical coordinates are transformed to four gray images by scaling the coordinate values between 0 and 255 using a linear transformation. A clip is then constructed from the four gray images. Consequently, three clips are generated from the three channels of the 3D coordinates of the four 2D arrays (a code sketch of these steps follows the list).
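A minimal NumPy sketch of steps 1–5, assuming the input sequence is a (t, m, 3) array of Cartesian joint coordinates. The reference-joint indices, chain order, and per-image min–max scaling are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np

def generate_clips(seq, ref_ids=(4, 8, 12, 16), chain_order=None):
    """Sketch of clip generation.

    seq:         (t, m, 3) array -- t frames, m joints, 3D Cartesian coordinates.
    ref_ids:     indices of the four reference joints (left/right shoulder,
                 left/right hip); placeholder values, dataset-dependent.
    chain_order: optional joint permutation that concatenates the joints of
                 each body part into a chain.

    Returns three clips (one per cylindrical channel), each a list of four
    gray images of shape (m-1, t).
    """
    t, m, _ = seq.shape
    if chain_order is not None:
        seq = seq[:, chain_order, :]

    clips = [[], [], []]                       # one clip per coordinate channel
    for r in ref_ids:
        # Relative positions of the other m-1 joints w.r.t. reference joint r.
        rel = np.delete(seq - seq[:, r:r + 1, :], r, axis=1)   # (t, m-1, 3)
        x, y, z = rel[..., 0], rel[..., 1], rel[..., 2]

        # Cartesian -> cylindrical coordinates (rho, theta, z).
        rho = np.sqrt(x ** 2 + y ** 2)
        theta = np.arctan2(y, x)

        for c, ch in enumerate((rho, theta, z)):
            arr = ch.T                          # (m-1, t): rows = joints, columns = frames
            lo, hi = arr.min(), arr.max()
            gray = 255.0 * (arr - lo) / (hi - lo + 1e-8)        # linear scaling to [0, 255]
            clips[c].append(gray.astype(np.uint8))
    return clips
```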

 

 

Clip Learning

Each frame of the generated clips describes the temporal dynamics of all frames of the skeleton sequence and one particular spatial relationship between the skeleton joints in one channel of the cylindrical coordinates. Different frames of a generated clip describe different spatial relationships, and there exist intrinsic relationships among them.

A deep CNN is first leveraged to extract a compact representation from each frame of the generated clips to exploit the long-term temporal information of the skeleton sequence. Given the generated clips, the CNN feature of each frame is extracted with the pre-trained VGG19 model. Each frame image of the three clips is scaled to 224 × 224 and then duplicated three times to form a color image, so that it can be fed to the network. The output of the convolutional layer conv5_1 is used as the representation of the input frame, which is a 3D tensor of size 14 × 14 × 512, i.e., 512 feature maps of size 14 × 14.
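A rough PyTorch/torchvision sketch of this feature-extraction step. It assumes a recent torchvision, in which conv5_1 sits at index 28 of `vgg19().features`; the ImageNet mean/std normalization that the original pre-trained model would normally apply is omitted for brevity:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Truncate VGG19 right after conv5_1 (index 28 in torchvision's `features`).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:29].eval()

def frame_features(gray_frame):
    """gray_frame: float tensor of shape (m-1, t) with values in [0, 255]."""
    img = gray_frame.unsqueeze(0).unsqueeze(0) / 255.0           # (1, 1, H, W)
    img = F.interpolate(img, size=(224, 224), mode='bilinear',
                        align_corners=False)                     # scale to 224 x 224
    img = img.repeat(1, 3, 1, 1)                                 # duplicate to 3 channels
    with torch.no_grad():                                        # ImageNet normalization omitted
        fmap = vgg(img)                                          # (1, 512, 14, 14)
    return fmap.squeeze(0)                                       # (512, 14, 14)
```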

 

 

Temporal Pooling of CNN Feature Maps

The rows of the generated frame correspond to different frames of the skeleton sequence, so the temporal information of the sequence can be extracted from the row features of the feature maps. More specifically, the feature maps are processed with temporal mean pooling with kernel size 14 × 1, i.e., the pooling is applied over the temporal (row) dimension, generating a compact fused representation of all temporal stages of the skeleton sequence. The outputs of all 512 feature maps are concatenated to form a 7168-D (14 × 512 = 7168) feature vector, which represents the temporal dynamics of the skeleton sequence in one channel of the cylindrical coordinates.
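Continuing the sketch above, and assuming the (512, 14, 14) tensor returned by `frame_features`, the temporal mean pooling and concatenation can be written as:

```python
def temporal_pool(fmap):
    """fmap: (512, 14, 14) conv5_1 feature maps of one frame image.

    Mean-pools each 14 x 14 map over its rows (the temporal dimension),
    leaving a 1 x 14 vector per map, then concatenates the 512 vectors
    into a single 7168-D descriptor (14 * 512 = 7168).
    """
    pooled = fmap.mean(dim=1)        # (512, 14): average over the row dimension
    return pooled.reshape(-1)        # (7168,)
```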

 

 

Multi-Task Learning Network (MTLN)

The four feature vectors have intrinsic relationships with one another. An MTLN is therefore proposed to process the four feature vectors jointly, exploiting these intrinsic relationships for action recognition. The classification of each feature vector is treated as a separate task with the same classification label as the skeleton sequence.

During training, the loss value of each task is individually computed using its own class scores. The loss values of all tasks are then summed up to define the total loss of the network, which is used to learn the network parameters. During testing, the class scores of all tasks are averaged to form the final prediction of the action class.
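A minimal sketch of this multi-task training/testing logic. The two-layer classifier and the weight sharing across the four tasks are assumptions made for illustration; the summed training loss and averaged test scores follow the text above:

```python
import torch
import torch.nn as nn

class MTLN(nn.Module):
    """Sketch of the multi-task network: one classifier applied to each of
    the four per-reference-joint feature vectors (one task per vector).
    Layer sizes and weight sharing are illustrative assumptions."""

    def __init__(self, in_dim, num_classes, hidden=512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feats):                 # feats: list of 4 tensors, each (B, in_dim)
        return [self.classifier(f) for f in feats]

def mtln_loss(scores, labels, criterion=nn.CrossEntropyLoss()):
    # Training: each task computes its own loss with the shared sequence label;
    # the per-task losses are summed to form the total loss.
    return sum(criterion(s, labels) for s in scores)

def mtln_predict(scores):
    # Testing: the class scores of the four tasks are averaged.
    return torch.stack(scores, dim=0).mean(dim=0).argmax(dim=1)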

 


 

Results

 

 
