Paper Reading: Self-supervised video representation learning with space-time cubic puzzles

Paper: Self-supervised video representation learning with space-time cubic puzzles (AAAI 2019)

Authors: Dahun Kim, Donghyeon Cho, In So Kweon

Link: https://ojs.aaai.org/index.php/AAAI/article/view/4873

 


 

Contributions

In this paper, we introduce a new self-supervised task called Space-Time Cubic Puzzles to train 3D CNNs (3D ResNet-18) on a large-scale video dataset. Given randomly permuted 3D spatio-temporal crops extracted from each video clip, we train a network to predict their original spatio-temporal arrangement. By completing Space-Time Cubic Puzzles, the network learns a self-supervised video representation from unlabeled videos. In experiments, we demonstrate that the learned 3D representation transfers well to action recognition tasks and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.

 


 

Method

1. Pretext Task: Space-Time Cubic Puzzles

To generate the puzzle pieces, we consider a spatio-temporal cuboid consisting of 2 × 2 × 4 grid cells for each video. Then, we sample 4 crops instead of all 16, along either the spatial or the temporal dimension. More specifically, the 3D crops are extracted from a 4-cell group of shape 2 × 2 × 1 (a spatial slab at one temporal index, colored blue in the paper's figure) or 1 × 1 × 4 (a temporal column at one spatial position, colored red), respectively.
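The sampling scheme above can be sketched as follows. This is a minimal numpy illustration, not the authors' code: the function names, the coin flip between spatial and temporal groups, and the label encoding are all assumptions for clarity.

```python
import itertools
import random
import numpy as np

PERMS = list(itertools.permutations(range(4)))  # 4! = 24 possible arrangements

def make_puzzle(video, rng=random):
    # video: (T, H, W, C) array; split into a 2 x 2 x 4 grid of cells
    T, H, W, _ = video.shape
    h2, w2, t4 = H // 2, W // 2, T // 4
    if rng.random() < 0.5:
        # spatial group: a 2x2x1 slab of cells at one random temporal index
        t = rng.randrange(4)
        cells = [video[t*t4:(t+1)*t4, i*h2:(i+1)*h2, j*w2:(j+1)*w2]
                 for i in range(2) for j in range(2)]
    else:
        # temporal group: a 1x1x4 column of cells at one random spatial cell
        i, j = rng.randrange(2), rng.randrange(2)
        cells = [video[t*t4:(t+1)*t4, i*h2:(i+1)*h2, j*w2:(j+1)*w2]
                 for t in range(4)]
    # shuffle the 4 cells; the permutation index is the classification target
    label = rng.randrange(len(PERMS))
    shuffled = [cells[k] for k in PERMS[label]]
    return shuffled, label

video = np.zeros((128, 224, 224, 3), dtype=np.uint8)
crops, y = make_puzzle(video)
```

Either branch yields four 32 × 112 × 112 cells, so the network cannot tell from crop shape alone which dimension was sampled.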

 

 

2. Network

We use a late-fusion architecture: a 4-tower siamese network, where the towers share the same parameters and follow the 3D ResNet architecture. Each 3D crop is processed separately until the fully-connected layer. Furthermore, each tower is agnostic of whether the input crop was sampled along the spatial or the temporal dimension. As in the jigsaw puzzle problem, we formulate the rearrangement problem as a multi-class classification task. In practice, for each tuple of four crops, we flip all the frames upside-down with 50% probability, doubling the number of classes to 48 (that is, 2 × 4!) to further boost performance.
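The late-fusion head can be sketched in a few lines. This is a shape-level numpy sketch under strong assumptions: the shared backbone is stood in for by a pooling stub, the 512-d feature width matches 3D ResNet-18's final pooled layer, and the single linear layer over the concatenated features is illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT, NUM_CLASSES = 512, 48  # 48 = 2 * 4! (flip bit x permutation index)
W = rng.standard_normal((4 * FEAT, NUM_CLASSES)) * 0.01
b = np.zeros(NUM_CLASSES)

def tower(crop):
    # stand-in for the shared 3D ResNet-18 tower: pool the crop to a 512-d
    # feature; in the real model this is the conv stack up to the FC layer
    return crop.reshape(-1, FEAT).mean(axis=0)

def late_fusion_forward(crops):
    # every crop passes through the *same* tower (shared parameters);
    # features are concatenated and one FC layer scores all 48 arrangements
    feats = np.concatenate([tower(c) for c in crops])  # (4 * 512,)
    return feats @ W + b                               # (48,) logits

crops = [rng.standard_normal((8, FEAT)) for _ in range(4)]
logits = late_fusion_forward(crops)
```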

 

 

3. Avoiding Trivial Learning

When designing a pretext task, it is crucial to ensure that the task forces the network to learn the desired semantic structure, rather than bypassing that understanding by finding low-level clues that reveal the location of a video crop. To suppress low-level color cues such as chromatic aberration, we choose channel replication as our data preprocessing. Another often-cited concern in context-based works is trivial low-level boundary pattern completion. Thus, we apply spatio-temporal jittering when extracting each video crop from its grid cell to avoid such trivial cases.
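Both anti-shortcut tricks are simple array operations; a hedged numpy sketch (function names and the choice of height/width/time axes are assumptions, and the crop sizes follow the implementation details below):

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_replicate(crop, rng):
    # duplicate one randomly chosen color channel across all three channels,
    # removing color cues the network could otherwise exploit
    c = rng.integers(crop.shape[-1])
    return np.repeat(crop[..., c:c+1], 3, axis=-1)

def jittered_crop(cell, out_t=16, out_h=80, out_w=80, rng=rng):
    # sample an 80x80x16 crop at a random offset inside the 112x112x32 cell,
    # so adjacent crops no longer share matching boundary patterns
    T, H, W, _ = cell.shape
    t0 = rng.integers(T - out_t + 1)
    y0 = rng.integers(H - out_h + 1)
    x0 = rng.integers(W - out_w + 1)
    return cell[t0:t0+out_t, y0:y0+out_h, x0:x0+out_w]

cell = rng.integers(0, 255, size=(32, 112, 112, 3)).astype(np.uint8)
crop = jittered_crop(channel_replicate(cell, rng))
```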

 

 

4. Implementation Details

We use video clips with 224 × 224-pixel frames and convert every video file into PNG images in our experiments. We sample 128 consecutive frames from each clip and split them into a 2 × 2 × 4-cell grid; that is, one grid cell consists of 112 × 112 × 32 pixels, and from each cell we sample 80 × 80 × 16 pixels with random jittering to generate a 3D video crop. During fine-tuning and testing, we randomly sample 16 consecutive frames from each clip and spatially resize the frames to 112 × 112 pixels. In testing, we adopt a sliding-window manner to generate input clips, so that each video is split into non-overlapping 16-frame clips. The clip class scores are averaged over all clips of the video.
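The test-time protocol reduces to splitting the video into non-overlapping 16-frame windows and averaging per-clip scores. A minimal sketch, where `clip_scores` is a placeholder for the fine-tuned network and the 101-way output assumes UCF101:

```python
import numpy as np

def clip_scores(clip, num_classes=101):
    # placeholder for the fine-tuned 3D CNN's softmax output on one clip
    return np.ones(num_classes) / num_classes

def video_score(video, clip_len=16, num_classes=101):
    # non-overlapping sliding windows; leftover frames at the end are dropped
    n = video.shape[0] // clip_len
    scores = [clip_scores(video[i*clip_len:(i+1)*clip_len], num_classes)
              for i in range(n)]
    # the video-level prediction is the mean of the per-clip scores
    return np.mean(scores, axis=0)

video = np.zeros((130, 112, 112, 3))
scores = video_score(video)
```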

 


 

Results

 

 
