Paper: Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles (AAAI 2019)
Authors: Dahun Kim, Donghyeon Cho, In So Kweon
Link: https://ojs.aaai.org/index.php/AAAI/article/view/4873
Contributions
In this paper, we introduce a new self-supervised task called Space-Time Cubic Puzzles to train 3D CNNs (3D ResNet-18) on a large-scale video dataset. Given randomly permuted 3D spatio-temporal crops extracted from each video clip, we train a network to predict their original spatio-temporal arrangement. By solving Space-Time Cubic Puzzles, the network learns a self-supervised video representation from unlabeled videos. In experiments, we demonstrate that the learned 3D representation transfers well to action recognition tasks and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.
Method
1、Pretext Task: Space-Time Cubic Puzzles
To generate the puzzle pieces, we consider a spatio-temporal cuboid consisting of 2 × 2 × 4 grid cells for each video. Then, instead of all 16 cells, we sample only 4 crops along either the spatial or the temporal dimension. More specifically, the 3D crops are extracted from a 4-cell grid of shape 2 × 2 × 1 (all four spatial cells at one time slot) or 1 × 1 × 4 (all four time slots at one spatial cell), respectively.
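The cell selection above can be sketched as follows. This is a minimal illustration, not the authors' code; the coordinate convention `(h, w, t)` and the helper names are assumptions.

```python
import random

# The full grid has 2 (H) x 2 (W) x 4 (T) = 16 cells; only 4 are sampled.
GRID_H, GRID_W, GRID_T = 2, 2, 4

def sample_puzzle_cells(mode, rng=random):
    """Return the 4 grid-cell coordinates (h, w, t) for one puzzle.

    mode='spatial'  -> a 2x2x1 slab: all spatial cells at one random time slot
    mode='temporal' -> a 1x1x4 column: all time slots at one random spatial cell
    """
    if mode == "spatial":
        t = rng.randrange(GRID_T)
        return [(h, w, t) for h in range(GRID_H) for w in range(GRID_W)]
    if mode == "temporal":
        h, w = rng.randrange(GRID_H), rng.randrange(GRID_W)
        return [(h, w, t) for t in range(GRID_T)]
    raise ValueError(mode)

def make_puzzle(mode, rng=random):
    """Shuffle the 4 cells; the shuffled order is what the network must recover."""
    cells = sample_puzzle_cells(mode, rng)
    order = list(range(4))
    rng.shuffle(order)
    return [cells[i] for i in order], order
```

Sampling along only one dimension at a time keeps the task well-posed: within a puzzle, the four crops differ in exactly one of space or time.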
2、Network
We use a late-fusion architecture: a 4-tower siamese network whose towers share the same parameters and follow the 3D ResNet architecture. Each 3D crop is processed separately up to the fully-connected layer. Furthermore, each tower is agnostic of whether the input crop was sampled along the spatial or the temporal dimension. As in the 2D jigsaw puzzle problem, we formulate the rearrangement problem as a multi-class classification task. In practice, for each tuple of four crops, we flip all the frames upside-down with 50% probability, doubling the number of classes to 48 (that is, 2 × 4!) to further boost performance.
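The 48-way label space can be encoded as below. This is a minimal sketch of one possible label encoding, assuming permutations are indexed in lexicographic order; the paper does not specify the mapping.

```python
from itertools import permutations

# 4! = 24 orderings of the four crops; the upside-down flip doubles this to 48.
PERMS = list(permutations(range(4)))   # lexicographic: (0,1,2,3) first
NUM_CLASSES = 2 * len(PERMS)           # 48

def puzzle_class(order, flipped):
    """Map a crop ordering (4-tuple) plus the flip flag to a class id in [0, 48)."""
    return PERMS.index(tuple(order)) + (len(PERMS) if flipped else 0)

def class_to_puzzle(c):
    """Inverse mapping: class id -> (ordering, flipped)."""
    return PERMS[c % len(PERMS)], c >= len(PERMS)
```

The classifier head on top of the concatenated tower features then simply outputs `NUM_CLASSES` logits.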
3、Avoiding Trivial Learning
When designing a pretext task, it is crucial to ensure that the task forces the network to learn the desired semantic structure rather than bypass it by exploiting low-level clues that reveal the location of a video crop. We therefore apply channel replication as our data preprocessing, suppressing low-level color cues. Another often-cited concern in context-based works is trivial low-level boundary pattern completion. Thus, we apply spatio-temporal jittering when extracting each video crop from its grid cell, so that crop boundaries never align and these trivial solutions are unavailable.
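Both anti-shortcut measures can be sketched as follows, using the crop/cell sizes given in the implementation details. The frame representation (nested lists) and helper names are illustrative assumptions.

```python
import random

CELL = (112, 112, 32)   # grid-cell size: H x W x frames
CROP = (80, 80, 16)     # jittered crop size

def jittered_offsets(rng=random):
    """Random corner of the crop inside its grid cell.

    Because the crop is strictly smaller than the cell, neighboring crops
    never share a boundary, so the network cannot solve the puzzle by
    low-level boundary pattern continuation.
    """
    return tuple(rng.randrange(c - k + 1) for c, k in zip(CELL, CROP))

def replicate_channel(frame_rgb, rng=random):
    """Channel replication: copy one randomly chosen color channel to all
    three, removing color cues the network could otherwise exploit.
    `frame_rgb` is a nested list of shape [H][W][3] for illustration."""
    c = rng.randrange(3)
    return [[[px[c]] * 3 for px in row] for row in frame_rgb]
```

With these sizes, the jitter has 33 × 33 × 17 possible placements per cell, so adjacent crops almost never line up.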
4、Implementation Details
We use video clips with 224 × 224 pixel frames and convert every video file into PNG images in our experiments. We sample 128 consecutive frames from each clip and split them into a 2 × 2 × 4-cell grid; that is, one grid cell consists of 112 × 112 × 32 pixels, and from each cell we sample 80 × 80 × 16 pixels with random jittering to generate a 3D video crop. During fine-tuning and testing, we randomly sample 16 consecutive frames from each clip and spatially resize the frames to 112 × 112 pixels. In testing, we adopt a sliding-window manner to generate input clips, so that each video is split into non-overlapping 16-frame clips. The clip class scores are averaged over all clips of the video.
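The test-time protocol above can be sketched as follows. This is a minimal sketch, not the authors' code; in particular, dropping a tail shorter than 16 frames is an assumption the paper does not state.

```python
def split_into_clips(num_frames, clip_len=16):
    """Non-overlapping test clips of clip_len frames.

    A tail shorter than clip_len is dropped here (an assumption).
    """
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]

def video_score(clip_scores):
    """Average per-clip class-score vectors into one video-level prediction."""
    n = len(clip_scores)
    num_classes = len(clip_scores[0])
    return [sum(s[c] for s in clip_scores) / n for c in range(num_classes)]
```

For a 128-frame video this yields 8 clips, whose scores are averaged into the final video-level prediction.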
Results