Paper Reading: Self-Supervised Video Representation Learning With Odd-One-Out Networks

Contents

Contributions

Method

1. Model

2. Three sampling strategies

3. Video frame encoding

Results

More References to Follow

 


 

Paper: Self-Supervised Video Representation Learning With Odd-One-Out Networks (CVPR 2017)

Authors: Basura Fernando, Hakan Bilen, Efstratios Gavves, Stephen Gould

Download: https://openaccess.thecvf.com/content_cvpr_2017/html/Fernando_Self-Supervised_Video_Representation_CVPR_2017_paper.html

 


 

Contributions

We propose a new self-supervised CNN pre-training technique based on a novel auxiliary task called odd-one-out learning. In this task, we sample subsequences from videos and ask the network to predict the odd video subsequence. The odd video subsequence is sampled such that it has the wrong temporal order of frames, while the even ones have the correct temporal order. Our learning machine is implemented as a multi-stream convolutional neural network, which is trained end-to-end. Using odd-one-out networks, we learn temporal representations for videos that generalize to other related tasks such as action recognition.

 


 

Method

1. Model

O3N is composed of (N+1) input branches, each of which contains five convolutional layers; weights are shared across the input branches. The configuration of each input branch is identical to the AlexNet architecture up to the first fully connected layer. We then introduce a fusion layer which merges the information from the (N+1) branches after the first fully connected layer. We experiment with two fusion models, the concatenation model and the sum of difference model, leading to two different network architectures as shown in Fig. 2.

  1. Concatenation model: The first fully connected layers from each branch are concatenated to give a (N + 1) × d dimensional vector, where d is the dimensionality of the first fully connected layer.
  2. Sum of difference model: The first fully connected layers from each branch are summed after taking the pair-wise activation differences, leading to a d-dimensional vector, where d is the dimensionality of the first fully connected layer. Mathematically, let v_i be the activation vector of the i-th branch of the network. The output of the sum of difference layer is given by

     o = ∑_{i<j} (v_j − v_i),   1 ≤ i < j ≤ N + 1.
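Below is a minimal PyTorch sketch of the two fusion variants. This is an assumed implementation, not the authors' code: the class name, `feat_dim`, and the placeholder trunk standing in for the AlexNet conv1–conv5 + fc6 stack are all hypothetical; `n_elements` denotes N+1.

```python
import torch
import torch.nn as nn

class O3N(nn.Module):
    """Sketch of an odd-one-out network (hypothetical implementation).

    A single trunk (placeholder for AlexNet conv layers + first fc layer)
    is shared across the N+1 input branches; a fusion layer merges the
    branch activations and a classifier predicts which element is odd.
    """

    def __init__(self, n_elements=6, feat_dim=4096, fusion="sum_of_diff"):
        super().__init__()
        self.fusion = fusion
        # Shared-weight branch: one module applied to every input element.
        self.branch = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(feat_dim),  # placeholder for the conv stack + fc6
            nn.ReLU(inplace=True),
        )
        in_dim = feat_dim * n_elements if fusion == "concat" else feat_dim
        self.classifier = nn.Linear(in_dim, n_elements)  # odd position logits

    def forward(self, elements):
        # elements: list of N+1 tensors, one encoded clip per branch
        v = [self.branch(x) for x in elements]            # shared weights
        if self.fusion == "concat":
            fused = torch.cat(v, dim=1)                   # (N+1) * d vector
        else:
            # sum of pairwise activation differences: sum_{i<j} (v_j - v_i)
            fused = sum(v[j] - v[i]
                        for i in range(len(v))
                        for j in range(i + 1, len(v)))    # d-dim vector
        return self.classifier(fused)
```

Training then reduces to (N+1)-way classification: the target is the index of the odd element, optimized with cross-entropy over the classifier logits.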

 

 

2. Three sampling strategies

  1. Consecutive sampling: We sample W consecutive frames N times from video X to generate N even (related) elements. Each sampled even element of the odd-one-out question is a valid video sub-clip consisting of W consecutive frames from the original video. The odd video sequence of length W, however, is constructed by randomly ordering frames and therefore does not satisfy the order constraints.
  2. Random sampling: We randomly sample W frames N times from the video X to generate N even (related) elements. Each of these N elements is a sequence that has the correct temporal order and satisfies the original order constraints of X; however, the frames are not consecutive as in the case of consecutive sampling. The odd video sequence of length W is also constructed by randomly sampling frames, and, as in the consecutive sampling strategy, it does not satisfy the order constraints: specifically, we randomly shuffle the frames of the odd element (sequence).
  3. Constrained consecutive sampling: First we select a video clip of size 1.5 × W from the original video, which we denote by X̂. We randomly sample W consecutive frames N times from X̂ to generate N even (related) elements. Each of these N elements is a subsequence that has the correct temporal order and satisfies the original order constraints of X. At the same time, each pair of the sampled even video clips of size W overlaps by more than 50%. The odd video sequence of length W is also constructed by randomly sampling frames from X̂; as in the other sampling strategies, it does not satisfy the order constraints: specifically, we randomly shuffle the frames of the odd element (sequence).
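A sketch of the three strategies, assuming a video is given as a simple list of frames. The function name and the exact construction of the odd clip per strategy are assumptions inferred from the descriptions above, not the authors' code.

```python
import random

def sample_o3n_question(video, W, N, strategy="consecutive"):
    """Hypothetical O3N sampling helper.

    Returns (elements, odd_index): N even (correctly ordered) clips of
    length W plus one shuffled odd clip inserted at position odd_index.
    Assumes len(video) >= 1.5 * W for the constrained strategy.
    """
    T = len(video)
    if strategy == "constrained":
        # restrict all sampling to a random window of length 1.5 * W
        start = random.randint(0, T - int(1.5 * W))
        video = video[start:start + int(1.5 * W)]
        T = len(video)

    def even_clip():
        if strategy == "random":
            idx = sorted(random.sample(range(T), W))  # ordered, not consecutive
        else:  # consecutive / constrained
            s = random.randint(0, T - W)
            idx = list(range(s, s + W))
        return [video[i] for i in idx]

    def odd_clip():
        if strategy == "consecutive":
            s = random.randint(0, T - W)
            idx = list(range(s, s + W))
        else:
            idx = sorted(random.sample(range(T), W))
        # Break the temporal order; a real implementation would reject
        # a permutation that happens to remain sorted.
        random.shuffle(idx)
        return [video[i] for i in idx]

    elements = [even_clip() for _ in range(N)]
    odd_index = random.randint(0, N)  # position among the N+1 elements
    elements.insert(odd_index, odd_clip())
    return elements, odd_index
```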

 

 

3. Video frame encoding

Each element (video clip or subsequence) in an odd-one-out question is encoded to extract temporal information before being presented to the first convolutional filters of the network. There are several ways to capture the temporal structure of a video sequence: for example, one can use 3D convolutions, recurrent encoders, rank-pooling encoders, or simply concatenate frames. Odd-one-out networks can use any of the above methods to learn video representations in a self-supervised manner from video data. Next, we discuss the three techniques used in our experiments to encode video frame clips, using the differences of RGB frames, into a single tensor Xd (a code sketch of all three encoders follows the list below).

  • Sum of differences of frames video-clip encoder: We take the differences of frames and then sum the differences to obtain a single image Xd. This single image captures the structure of the sequence. Precisely, this is exactly the same as Equation 2 but now applied over frames. It is interesting to note that this equation boils down to a weighted average of frames, Xd = ∑_t w_t X_t, where the weight of the frame at index t is given by

     w_t = 2t − W − 1.

 

  • Dynamic image encoder: This method is similar to the sum of differences of frames method; the only difference is that the input sequence is first pre-processed to obtain a smoothed sequence M = <M1, M2, ⋯, MW>. Smoothing is obtained by taking the mean of the frames up to index t. The smoothed frame at index t, denoted by Mt, is given by

     Mt = (1/t) ∑_{j=1}^{t} Xj,

     where Xj is the frame at index j of the sub-video.

 

  • Stack of differences of frames video-clip encoder: We also stack the differences of frames. However, the resulting image is no longer a standard RGB image with three channels. Instead, we obtain a (W − 1) × 3 channel image.
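A NumPy sketch of the three encoders, for a clip stored as an array of shape (W, H, W_img, 3). The function names are hypothetical, and the weighted-average identity w_t = 2t − W − 1 from above is used directly for the sum-of-differences encoder.

```python
import numpy as np

def sum_of_differences(clip):
    """Sum of pairwise frame differences, sum_{i<j} (X_j - X_i).

    Equivalent to the weighted average sum_t (2t - W - 1) * X_t, which
    collapses a (W, H, W_img, 3) clip into a single 3-channel image.
    """
    W = clip.shape[0]
    weights = 2 * np.arange(1, W + 1) - W - 1       # w_t = 2t - W - 1
    return np.tensordot(weights, clip, axes=1)       # (H, W_img, 3)

def dynamic_image(clip):
    """Same weighting, but applied to the running-mean-smoothed sequence
    M_t = (1/t) * sum_{j<=t} X_j."""
    W = clip.shape[0]
    cums = np.cumsum(clip.astype(np.float64), axis=0)
    means = cums / np.arange(1, W + 1).reshape(-1, 1, 1, 1)
    return sum_of_differences(means)

def stack_of_differences(clip):
    """Stack the W-1 consecutive frame differences along the channel
    axis, giving a (W - 1) * 3 channel image."""
    diffs = clip[1:] - clip[:-1]                     # (W-1, H, W_img, 3)
    return np.concatenate(list(diffs), axis=-1)      # (H, W_img, (W-1)*3)
```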

 


 

Results

 


 

More References to Follow

  1. Traditional unsupervised feature learning (e.g. [6, 20]) [look into how unsupervised learning is done for video]
  2. There have also been unsupervised temporal feature encoding methods to capture the structure of videos for action classification [13, 14, 15, 29, 36].