Paper title: Unsupervised Representation Learning by Sorting Sequences
Summary
- The pretext (upstream) task of this paper is to recover the correct temporal order of 4 shuffled frames sampled from a video.
- Tricks used (the paper supports each of them with extensive ablation experiments showing that it helps):
- Data sampling strategies. (a) We use a sliding-window approach on the optical flow fields to extract patch tuples with large motion magnitude.
- Apply spatial jittering and channel splitting on selected patches to guide the network to focus on the semantics of the images rather than fixating on low-level features.
- The proposed Order Prediction Network (OPN) consists of three main components: (1) feature extraction, (2) pairwise feature extraction, and (3) order prediction. Features for each frame (fc6) are encoded by convolutional layers. The pairwise feature extraction stage then extracts features from every pair of frames. We then have a final layer that takes these extracted features to predict order.
- Downstream tasks: action recognition (UCF101), image classification (VOC), and object detection (VOC).
- Backbone: CaffeNet [16], a slight modification of AlexNet.
Details
1、Task
Specifically, we use up to four randomly shuffled frames sampled from a video as our input.
Similar to the jigsaw puzzle problem in the spatial domain [27], we formulate the sequence sorting problem as a multi-class classification task.
For each tuple of four frames, there are 4! = 24 possible permutations.
However, as some actions are coherent both forward and backward (e.g., opening/closing a door), we group both forward and backward permutations into the same class (e.g., 24/2 = 12 classes for four frames).
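The class count above can be checked with a short sketch. It enumerates all permutations of the frame indices and treats a permutation and its reverse as one class. The function names `order_classes` and `label_for` are made up for this illustration, and the numbering of the classes is arbitrary; the paper may assign labels differently.

```python
from itertools import permutations


def order_classes(n_frames=4):
    """Map each canonical permutation to a class id.

    A permutation and its reverse share one class, so the canonical
    representative is the smaller of the two in tuple ordering.
    """
    class_of = {}
    for perm in permutations(range(n_frames)):
        key = min(perm, perm[::-1])
        if key not in class_of:
            class_of[key] = len(class_of)
    return class_of


def label_for(perm, class_of):
    """Return the class id of a shuffled frame order such as (2, 0, 3, 1)."""
    perm = tuple(perm)
    return class_of[min(perm, perm[::-1])]


classes = order_classes(4)
print(len(classes))  # 12
```

Here `label_for((0, 1, 2, 3), classes)` and `label_for((3, 2, 1, 0), classes)` return the same id, which is exactly the forward/backward grouping described above.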
2、The proposed Order Prediction Network(OPN)
3、Data sampling strategies
4、Ablation analysis
Trick 1
Data sampling strategies. (a) We use a sliding-window approach on the optical flow fields to extract patch tuples with large motion magnitude.
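One possible reading of the sliding-window step is sketched below: slide a square window over a per-pixel flow-magnitude map and keep the location with the largest summed magnitude. This is only an illustration. The helper names, the `stride` parameter, and the use of a plain sum as the score are assumptions, not details taken from the paper, and the flow field itself is assumed to be computed elsewhere.

```python
def flow_magnitude(flow_u, flow_v):
    """Per-pixel magnitude of a flow field given as two 2D lists (u and v)."""
    return [
        [(u * u + v * v) ** 0.5 for u, v in zip(row_u, row_v)]
        for row_u, row_v in zip(flow_u, flow_v)
    ]


def best_motion_window(magnitude, patch, stride=1):
    """Return (top, left, score) of the patch x patch window with the
    largest summed magnitude. Assumes the patch fits inside the map."""
    height, width = len(magnitude), len(magnitude[0])
    best = None
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            score = sum(
                magnitude[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            # Strict comparison keeps the first window on ties.
            if best is None or score > best[2]:
                best = (top, left, score)
    return best


# Toy magnitude map: all motion sits in the bottom-right corner.
grid = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(best_motion_window(grid, 2))  # (2, 2, 4)
```

In a real pipeline the same window location would presumably be reused across all frames of a tuple so that the patches stay spatially aligned; that, too, is an assumption here.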
Trick 2
Apply spatial jittering and channel splitting on selected patches to guide the network to focus on the semantics of the images rather than fixating on low-level features.
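A minimal sketch of the two augmentations follows, assuming a frame is a list of rows of `(r, g, b)` tuples. Spatial jittering is modeled as a small random shift of the crop origin, and channel splitting as keeping one randomly chosen color channel and copying it into all three. The function names, the `max_shift` parameter, and the clamping behavior are illustrative assumptions; the exact settings used in the paper are not given in these notes.

```python
import random


def jittered_crop(frame, top, left, size, max_shift, rng):
    """Crop a size x size patch whose origin is shifted by up to
    max_shift pixels in each direction. Assumes the patch fits in the frame."""
    height, width = len(frame), len(frame[0])
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    # Clamp so the shifted window stays inside the frame.
    y0 = min(max(top + dy, 0), height - size)
    x0 = min(max(left + dx, 0), width - size)
    return [row[x0:x0 + size] for row in frame[y0:y0 + size]]


def channel_split(patch, rng):
    """Keep one randomly chosen channel and replicate it across all three."""
    k = rng.randrange(3)
    return [[(px[k], px[k], px[k]) for px in row] for row in patch]


rng = random.Random(0)
# Toy 8x8 frame whose pixel values encode their own position.
frame = [[(r, c, r + c) for c in range(8)] for r in range(8)]
patch = jittered_crop(frame, top=2, left=2, size=4, max_shift=1, rng=rng)
split = channel_split(patch, rng)
```

Applying an independent jitter to each frame of a tuple removes the option of ordering frames by pixel-exact alignment, and channel splitting removes color cues that could act as shortcuts, which matches the stated goal of steering the network away from low-level features.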
Trick 3
The proposed Order Prediction Network (OPN) consists of three main components: (1) feature extraction, (2) pairwise feature extraction, and (3) order prediction. Features for each frame (fc6) are encoded by convolutional layers. The pairwise feature extraction stage then extracts features from every pair of frames. We then have a final layer that takes these extracted features to predict order.
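The data flow through stages (2) and (3) can be sketched in plain Python. The sketch assumes the per-frame features from stage (1) are already available as vectors, and it uses tiny made-up dimensions and random weights. For brevity a single layer is shared across all frame pairs; whether the paper shares these weights is not stated in these notes, so treat that as an assumption. All names are hypothetical.

```python
import random
from itertools import combinations


def linear(x, weights, bias):
    """Compute W x + b for a plain-list vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(out_dim, in_dim, rng):
    """Random weights and zero bias, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def opn_forward(frame_features, pair_layer, order_layer):
    """frame_features: one fc6-like vector per frame, from stage (1)."""
    pair_features = []
    # Stage (2): extract a feature from every pair of frames.
    for i, j in combinations(range(len(frame_features)), 2):
        joined = frame_features[i] + frame_features[j]  # list concatenation
        pair_features.extend(relu(linear(joined, *pair_layer)))
    # Stage (3): a final layer maps all pair features to order-class scores.
    return linear(pair_features, *order_layer)


rng = random.Random(0)
n_frames, feat_dim, pair_dim, n_classes = 4, 8, 4, 12
n_pairs = n_frames * (n_frames - 1) // 2  # 6 pairs for four frames
frames = [[rng.random() for _ in range(feat_dim)] for _ in range(n_frames)]
pair_layer = init_layer(pair_dim, 2 * feat_dim, rng)
order_layer = init_layer(n_classes, n_pairs * pair_dim, rng)
scores = opn_forward(frames, pair_layer, order_layer)
print(len(scores))  # 12
```

The 12 output scores correspond to the 24/2 order classes for four frames; training would apply a softmax cross-entropy loss over them, which is omitted here.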
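Thoughts & Reflections
I originally assumed that the unit of temporal order recognition was video clips, so the backbone would be something like C3D. To my surprise, the unit of temporal order verification here is individual frames, so the backbone is a CNN.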