Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Summary
This paper introduces the Kinetics Human Action Video dataset, which covers 400 human action classes with over 400 clips per class. It also proposes a Two-Stream Inflated 3D ConvNet (I3D), which can learn spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their pre-trained parameters.
Research Background
Deep ConvNets pre-trained on the ImageNet dataset provide a warm start for, or even suffice entirely for, many image-based computer vision tasks. This article studies whether training an action classification network on a sufficiently large video dataset brings a similar performance boost when the network is applied to a different temporal task or dataset.
Old 1: ConvNet + LSTM
Use a ConvNet to independently extract features from each frame, then add a recurrent layer to encode state and capture temporal order and long-range dependencies.
Old 2: 3D ConvNets
Equipped with spatio-temporal filters, 3D ConvNets are a natural approach to video processing. However, 3D ConvNets contain many more parameters than their 2D counterparts, making them harder to train.
Old 3: Two-Stream network
The network consists of two branches that make predictions separately: one from a single RGB frame and one from a stack of 10 externally computed optical flow frames. The two predictions are then averaged to obtain the final decision.
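The late-fusion step above can be sketched in a few lines; the logits below are hypothetical placeholder scores, not values from the paper:

```python
import numpy as np

# Minimal sketch of two-stream late fusion: the RGB stream and the
# optical-flow stream each produce class scores, and the final
# prediction averages the two softmax outputs.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rgb_logits = np.array([2.0, 0.5, 0.1])   # hypothetical RGB-stream scores
flow_logits = np.array([1.0, 1.5, 0.2])  # hypothetical flow-stream scores

fused = 0.5 * (softmax(rgb_logits) + softmax(flow_logits))
pred = int(np.argmax(fused))  # index of the predicted action class
```

Averaging softmax outputs (rather than logits) keeps each stream's contribution on the same probability scale.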
Proposed method
Two-Stream Inflated 3D ConvNets
- Convert 2D ConvNets into 3D by inflation: inflate all the 2D filters and pooling kernels, converting N×N kernels into N×N×N.
- Bootstrapping 3D filters from 2D filters: repeat the 2D filters N times along the time dimension, then rescale them by dividing by N.
- Pacing receptive field growth in space, time and network depth: verified by experiment.
- Two 3D streams. The inputs for two streams are RGB and flow, respectively. The two networks are trained separately. In evaluation, the two predictions are averaged.
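The inflation and bootstrapping steps above can be sketched with plain numpy. The check at the end reflects the paper's motivation: on a "boring video" (the same frame repeated), the inflated filter should reproduce the original 2D filter's response. The helper functions are simple loop-based convolutions written for illustration, not an efficient implementation:

```python
import numpy as np

def inflate_2d_filter(w2d, n):
    """Repeat a (k, k) filter n times along time, rescale by 1/n."""
    return np.repeat(w2d[None, :, :], n, axis=0) / n

def conv2d_valid(img, w):
    """Naive 'valid' 2D cross-correlation."""
    k = w.shape[0]
    h, wd = img.shape
    out = np.zeros((h - k + 1, wd - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * w)
    return out

def conv3d_valid(vid, w):
    """Naive 'valid' 3D cross-correlation over (time, h, w)."""
    t, k = w.shape[0], w.shape[1]
    T, h, wd = vid.shape
    out = np.zeros((T - t + 1, h - k + 1, wd - k + 1))
    for s in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[s, i, j] = np.sum(vid[s:s + t, i:i + k, j:j + k] * w)
    return out

rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8))      # stand-in for an image
w2d = rng.standard_normal((3, 3))        # stand-in for a pre-trained 2D filter

N = 3
w3d = inflate_2d_filter(w2d, N)          # bootstrapped 3D filter
boring_video = np.repeat(frame[None, :, :], 5, axis=0)  # same frame 5 times

resp_2d = conv2d_valid(frame, w2d)
resp_3d = conv3d_valid(boring_video, w3d)

# Each temporal slice of the 3D response equals the 2D response,
# because the N identical filter slices each contribute 1/N.
assert np.allclose(resp_3d[0], resp_2d)
```

Dividing by N is exactly what preserves the pre-trained activations: the N identical temporal slices sum back to the original 2D filter.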
Takeaways
3D convolution and inflation may be helpful for VOS tasks. This paper shows that, when extracting features from video, one can reuse network architectures designed for image processing on ImageNet, and even their pre-trained parameters; worth a read.
Learning to Adapt Structured Output Space for Semantic Segmentation
Summary
Relying on pixel-level supervision, CNN-based methods can achieve good semantic segmentation performance, but their generalization to unseen image domains is limited. This work proposes to learn adaptation in the segmentation output space. The motivation is that, given two images with different appearance, their segmentation outputs are structured and share many similarities, e.g., spatial layout and local texture. The proposed method constructs a multi-level adversarial network to effectively perform output-space domain adaptation at different feature levels.
Path Aggregation Network for Instance Segmentation
Summary
Information propagation is important in neural networks. Building on Mask R-CNN, this work adds a bottom-up feature propagation path so that low-level features (which carry accurate localization signals) enhance the entire feature hierarchy. A new layer, named adaptive feature pooling, is also proposed.
Unsupervised Learning and Segmentation of Complex Activities from Video
Summary
This work proposes an iterative discriminative-generative approach for video segmentation. It consists of two parts: one discriminatively learns the appearance of sub-activities, mapping the videos' visual features to activity labels; the other generatively models the temporal structure of sub-activities using a Generalized Mallows Model.
Takeaway: consider both appearance features and temporal structure.
Translating and Segmenting Multimodal Medical Volumes with Cycle- and Shape-Consistency Generative Adversarial Network
Summary
This paper studies synthesizing medical images from CT and MRI data. There are three goals: (1) synthesizing realistic-looking 3D images using unpaired training data; (2) ensuring consistent anatomical structure; (3) improving volume segmentation. The proposed framework consists of mutually beneficial generators and segmentors: the former are responsible for image synthesis, while the latter are responsible for segmentation.
Takeaways: (1) Use 3D CNNs for segmentation; medical data tend to have a relatively large number of channels. (2) Perform synthesis and segmentation jointly so that each improves the other.