Paper reading: A new representation of skeleton sequences for 3d action recognition

A new representation of skeleton sequences for 3d action recognition

(2017 CVPR)

Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Sohel, Farid Boussaid

Notes

 

Contributions

  1. We propose to transform each skeleton sequence into a new representation, i.e., three clips, which allows global long-term temporal modelling of the skeleton sequence by using deep CNNs to learn hierarchical features from frame images.
  2. We introduce a Multi-Task Learning Network (MTLN) that processes all the CNN features of the frames in the generated clips to learn the spatial structure and temporal information of the skeleton sequence. The MTLN improves performance by exploiting the intrinsic relationships among the different frames of the generated clips.

 


 

Method

Clip Generation

We propose to represent the temporal dynamics of the skeleton sequence in a frame image, and then use multiple frames to incorporate different spatial relationships between the joints.

Specifically, for a skeleton sequence:

  1. The skeleton joints of each frame are first arranged as a chain by concatenating the joints of each body part.
  2. Then, four reference joints, namely the left shoulder, the right shoulder, the left hip and the right hip, are respectively used to compute the relative positions of the other joints, thereby incorporating different spatial relationships between joints and providing useful structural information of the skeleton.
  3. By combining the relative joints of all the frames, four 2D arrays with dimension (m−1) × t are generated (m is the number of skeleton joints in each frame and t is the number of frames of the skeleton sequence). The relative positions of joints in the 2D arrays are originally described with 3D Cartesian coordinates.
  4. Since cylindrical coordinates are more useful for analysing the motions, as the human body performs actions through pivotal joint movements, the 3D Cartesian coordinates are transformed to cylindrical coordinates in the proposed representation of skeleton sequences.
  5. The four 2D arrays corresponding to the same channel of the 3D cylindrical coordinates are transformed to four gray images by scaling the coordinate values between 0 and 255 using a linear transformation. A clip is then constructed from the four gray images. Consequently, three clips are generated from the three channels of the 3D coordinates of the four 2D arrays (a code sketch of these steps follows the list).
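A minimal NumPy sketch of steps 1–5, assuming the input sequence is a (t, m, 3) array of Cartesian joint coordinates. The reference-joint indices, chain order, and per-image min–max scaling are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np

def generate_clips(seq, ref_ids=(4, 8, 12, 16), chain_order=None):
    """Sketch of clip generation.

    seq:         (t, m, 3) array -- t frames, m joints, 3D Cartesian coordinates.
    ref_ids:     indices of the four reference joints (left/right shoulder,
                 left/right hip); placeholder values, dataset-dependent.
    chain_order: optional joint permutation that concatenates the joints of
                 each body part into a chain.

    Returns three clips (one per cylindrical channel), each a list of four
    gray images of shape (m-1, t).
    """
    t, m, _ = seq.shape
    if chain_order is not None:
        seq = seq[:, chain_order, :]

    clips = [[], [], []]                       # one clip per coordinate channel
    for r in ref_ids:
        # Relative positions of the other m-1 joints w.r.t. reference joint r.
        rel = np.delete(seq - seq[:, r:r + 1, :], r, axis=1)   # (t, m-1, 3)
        x, y, z = rel[..., 0], rel[..., 1], rel[..., 2]

        # Cartesian -> cylindrical coordinates (rho, theta, z).
        rho = np.sqrt(x ** 2 + y ** 2)
        theta = np.arctan2(y, x)

        for c, ch in enumerate((rho, theta, z)):
            arr = ch.T                          # (m-1, t): rows = joints, columns = frames
            lo, hi = arr.min(), arr.max()
            gray = 255.0 * (arr - lo) / (hi - lo + 1e-8)        # linear scaling to [0, 255]
            clips[c].append(gray.astype(np.uint8))
    return clips
```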

 

 

Clip Learning

Each frame of the generated clips describes the temporal dynamics of all frames of the skeleton sequence and one particular spatial relationship between the skeleton joints in one channel of the cylindrical coordinates. Different frames of a generated clip describe different spatial relationships, and there exist intrinsic relationships among them.

A deep CNN is first leveraged to extract a compact representation from each frame of the generated clips to exploit the long-term temporal information of the skeleton sequence. Given the generated clips, the CNN feature of each frame is extracted with the pre-trained VGG19 model. Each frame image of the three clips is scaled to 224 × 224 and then duplicated three times to form a color image, so that it can be fed to the network. The output of the convolutional layer conv5_1 is used as the representation of the input frame, which is a 3D tensor of size 14 × 14 × 512, i.e., 512 feature maps of size 14 × 14.
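A rough PyTorch/torchvision sketch of this feature-extraction step. It assumes a recent torchvision, in which conv5_1 sits at index 28 of `vgg19().features`; the ImageNet mean/std normalization that the original pre-trained model would normally apply is omitted for brevity:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Truncate VGG19 right after conv5_1 (index 28 in torchvision's `features`).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:29].eval()

def frame_features(gray_frame):
    """gray_frame: float tensor of shape (m-1, t) with values in [0, 255]."""
    img = gray_frame.unsqueeze(0).unsqueeze(0) / 255.0           # (1, 1, H, W)
    img = F.interpolate(img, size=(224, 224), mode='bilinear',
                        align_corners=False)                     # scale to 224 x 224
    img = img.repeat(1, 3, 1, 1)                                 # duplicate to 3 channels
    with torch.no_grad():                                        # ImageNet normalization omitted
        fmap = vgg(img)                                          # (1, 512, 14, 14)
    return fmap.squeeze(0)                                       # (512, 14, 14)
```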

 

 

Temporal Pooling of CNN Feature Maps

The rows of the generated frame correspond to different frames of the skeleton sequence, so the temporal information of the sequence can be extracted from the row features of the feature maps. More specifically, the feature maps are processed with temporal mean pooling with kernel size 14 × 1, i.e., the pooling is applied over the temporal (row) dimension, generating a compact fused representation of all temporal stages of the skeleton sequence. The outputs of all 512 feature maps are concatenated to form a 7168-D (14 × 512 = 7168) feature vector, which represents the temporal dynamics of the skeleton sequence in one channel of the cylindrical coordinates.
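Continuing the sketch above, and assuming the (512, 14, 14) tensor returned by `frame_features`, the temporal mean pooling and concatenation can be written as:

```python
def temporal_pool(fmap):
    """fmap: (512, 14, 14) conv5_1 feature maps of one frame image.

    Mean-pools each 14 x 14 map over its rows (the temporal dimension),
    leaving a 1 x 14 vector per map, then concatenates the 512 vectors
    into a single 7168-D descriptor (14 * 512 = 7168).
    """
    pooled = fmap.mean(dim=1)        # (512, 14): average over the row dimension
    return pooled.reshape(-1)        # (7168,)
```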

 

 

Multi-Task Learning Network (MTLN)

The four feature vectors have intrinsic relationships with one another. An MTLN is therefore proposed to process the four feature vectors jointly, exploiting these intrinsic relationships for action recognition. The classification of each feature vector is treated as a separate task with the same classification label as the skeleton sequence.

During training, the loss value of each task is individually computed using its own class scores. The loss values of all tasks are then summed up to define the total loss of the network, which is used to learn the network parameters. During testing, the class scores of all tasks are averaged to form the final prediction of the action class.
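A minimal sketch of this multi-task training/testing logic. The two-layer classifier and the weight sharing across the four tasks are assumptions made for illustration; the summed training loss and averaged test scores follow the text above:

```python
import torch
import torch.nn as nn

class MTLN(nn.Module):
    """Sketch of the multi-task network: one classifier applied to each of
    the four per-reference-joint feature vectors (one task per vector).
    Layer sizes and weight sharing are illustrative assumptions."""

    def __init__(self, in_dim, num_classes, hidden=512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, feats):                 # feats: list of 4 tensors, each (B, in_dim)
        return [self.classifier(f) for f in feats]

def mtln_loss(scores, labels, criterion=nn.CrossEntropyLoss()):
    # Training: each task computes its own loss with the shared sequence label;
    # the per-task losses are summed to form the total loss.
    return sum(criterion(s, labels) for s in scores)

def mtln_predict(scores):
    # Testing: the class scores of the four tasks are averaged.
    return torch.stack(scores, dim=0).mean(dim=0).argmax(dim=1)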

 


 

Results

 

 
