Paper: Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles (AAAI 2019)
Authors: Dahun Kim, Donghyeon Cho, In So Kweon
Link: https://ojs.aaai.org/index.php/AAAI/article/view/4873
Contributions
In this paper, we introduce a new self-supervised task called Space-Time Cubic Puzzles to train 3D CNNs (3D ResNet-18) on a large-scale video dataset. Given randomly permuted 3D spatio-temporal crops extracted from each video clip, we train a network to predict their original spatio-temporal arrangement. By solving Space-Time Cubic Puzzles, the network learns a self-supervised video representation from unlabeled videos. In experiments, we demonstrate that the learned 3D representation transfers well to action recognition tasks and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.
Method
1、Pretext Task: Space-Time Cubic Puzzles
To generate the puzzle pieces, we consider a spatio-temporal cuboid consisting of 2 × 2 × 4 grid cells for each video. Then, instead of all 16 cells, we sample only 4 crops along either the spatial or the temporal dimension. More specifically, the 3D crops are extracted from a 4-cell grid of shape 2 × 2 × 1 (all four spatial cells at one time slot) or 1 × 1 × 4 (all four time slots at one spatial cell), respectively.
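The cell selection above can be sketched as follows. This is a minimal illustration, not the authors' code; the coordinate convention `(h, w, t)` and the helper names are assumptions.

```python
import random

# The full grid has 2 (H) x 2 (W) x 4 (T) = 16 cells; only 4 are sampled.
GRID_H, GRID_W, GRID_T = 2, 2, 4

def sample_puzzle_cells(mode, rng=random):
    """Return the 4 grid-cell coordinates (h, w, t) for one puzzle.

    mode='spatial'  -> a 2x2x1 slab: all spatial cells at one random time slot
    mode='temporal' -> a 1x1x4 column: all time slots at one random spatial cell
    """
    if mode == "spatial":
        t = rng.randrange(GRID_T)
        return [(h, w, t) for h in range(GRID_H) for w in range(GRID_W)]
    if mode == "temporal":
        h, w = rng.randrange(GRID_H), rng.randrange(GRID_W)
        return [(h, w, t) for t in range(GRID_T)]
    raise ValueError(mode)

def make_puzzle(mode, rng=random):
    """Shuffle the 4 cells; the shuffled order is what the network must recover."""
    cells = sample_puzzle_cells(mode, rng)
    order = list(range(4))
    rng.shuffle(order)
    return [cells[i] for i in order], order
```

Sampling along only one dimension at a time keeps the task well-posed: within a puzzle, the four crops differ in exactly one of space or time.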
2、Network
We use a late-fusion architecture: a 4-tower siamese network whose towers share the same parameters and follow the 3D ResNet architecture. Each 3D crop is processed separately up to the fully-connected layer. Furthermore, each tower is agnostic of whether the input crop was sampled along the spatial or the temporal dimension. As in the 2D jigsaw puzzle problem, we formulate the rearrangement problem as a multi-class classification task. In practice, for each tuple of four crops, we flip all the frames upside-down with 50% probability, doubling the number of classes to 48 (that is, 2 × 4!) to further boost performance.
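The 48-way label space can be encoded as below. This is a minimal sketch of one possible label encoding, assuming permutations are indexed in lexicographic order; the paper does not specify the mapping.

```python
from itertools import permutations

# 4! = 24 orderings of the four crops; the upside-down flip doubles this to 48.
PERMS = list(permutations(range(4)))   # lexicographic: (0,1,2,3) first
NUM_CLASSES = 2 * len(PERMS)           # 48

def puzzle_class(order, flipped):
    """Map a crop ordering (4-tuple) plus the flip flag to a class id in [0, 48)."""
    return PERMS.index(tuple(order)) + (len(PERMS) if flipped else 0)

def class_to_puzzle(c):
    """Inverse mapping: class id -> (ordering, flipped)."""
    return PERMS[c % len(PERMS)], c >= len(PERMS)
```

The classifier head on top of the concatenated tower features then simply outputs `NUM_CLASSES` logits.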
3、Avoiding Trivial Learning
When designing a pretext task, it is crucial to ensure that the task forces the network to learn the desired semantic structure rather than bypass it by exploiting low-level clues that reveal the location of a video crop. We therefore apply channel replication as our data preprocessing, suppressing low-level color cues. Another often-cited concern in context-based works is trivial low-level boundary pattern completion. Thus, we apply spatio-temporal jittering when extracting each video crop from its grid cell, so that crop boundaries never align and these trivial solutions are unavailable.
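Both anti-shortcut measures can be sketched as follows, using the crop/cell sizes given in the implementation details. The frame representation (nested lists) and helper names are illustrative assumptions.

```python
import random

CELL = (112, 112, 32)   # grid-cell size: H x W x frames
CROP = (80, 80, 16)     # jittered crop size

def jittered_offsets(rng=random):
    """Random corner of the crop inside its grid cell.

    Because the crop is strictly smaller than the cell, neighboring crops
    never share a boundary, so the network cannot solve the puzzle by
    low-level boundary pattern continuation.
    """
    return tuple(rng.randrange(c - k + 1) for c, k in zip(CELL, CROP))

def replicate_channel(frame_rgb, rng=random):
    """Channel replication: copy one randomly chosen color channel to all
    three, removing color cues the network could otherwise exploit.
    `frame_rgb` is a nested list of shape [H][W][3] for illustration."""
    c = rng.randrange(3)
    return [[[px[c]] * 3 for px in row] for row in frame_rgb]
```

With these sizes, the jitter has 33 × 33 × 17 possible placements per cell, so adjacent crops almost never line up.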
4、Implementation Details
We use video clips with 224 × 224 pixel frames and convert every video file into PNG images in our experiments. We sample 128 consecutive frames from each clip and split them into a 2 × 2 × 4-cell grid; that is, one grid cell consists of 112 × 112 × 32 pixels, and from each cell we sample 80 × 80 × 16 pixels with random jittering to generate a 3D video crop. During fine-tuning and testing, we randomly sample 16 consecutive frames from each clip and spatially resize the frames to 112 × 112 pixels. In testing, we adopt a sliding-window manner to generate input clips, so that each video is split into non-overlapping 16-frame clips. The clip class scores are averaged over all clips of the video.
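The test-time protocol above can be sketched as follows. This is a minimal sketch, not the authors' code; in particular, dropping a tail shorter than 16 frames is an assumption the paper does not state.

```python
def split_into_clips(num_frames, clip_len=16):
    """Non-overlapping test clips of clip_len frames.

    A tail shorter than clip_len is dropped here (an assumption).
    """
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]

def video_score(clip_scores):
    """Average per-clip class-score vectors into one video-level prediction."""
    n = len(clip_scores)
    num_classes = len(clip_scores[0])
    return [sum(s[c] for s in clip_scores) / n for c in range(num_classes)]
```

For a 128-frame video this yields 8 clips, whose scores are averaged into the final video-level prediction.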
Results