CVPR 2018 -- Video Object Segmentation -- 5

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset


Summary

This paper introduces the Kinetics Human Action Video dataset, covering 400 human action classes with over 400 clips per class. It also proposes the Two-Stream Inflated 3D ConvNet (I3D), which learns spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their pretrained parameters.

Research Background

Deep ConvNets pre-trained on the ImageNet dataset provide a warm start for, or even entirely suffice for, many image-based computer vision tasks. This article studies whether training an action classification network on a sufficiently large video dataset brings a similar performance boost when the network is applied to a different temporal task or dataset.
Old 1: ConvNet + LSTM
Use a ConvNet to independently extract features from each frame, then add a recurrent layer that encodes state and captures temporal order and long-range dependencies.
Old 2: 3D ConvNets
Equipped with spatio-temporal filters, 3D ConvNets are a natural approach to video processing. However, they contain many more parameters than their 2D counterparts, which makes them harder to train.
Old 3: Two-Stream network
The network consists of two branches that make predictions separately from a single RGB frame and from a stack of 10 externally computed optical flow frames; the two predictions are then averaged to obtain the final decision.
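This late fusion can be sketched in a few lines (a minimal sketch with made-up logits; `softmax` here is the standard definition, not code from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-stream class scores for one clip (made-up numbers).
rgb_logits  = np.array([2.0, 0.5, -1.0])   # from the single-RGB-frame branch
flow_logits = np.array([1.5, 1.0, -0.5])   # from the stacked-optical-flow branch

# Late fusion: average the two streams' predictions, then pick the class.
fused = (softmax(rgb_logits) + softmax(flow_logits)) / 2
pred = int(np.argmax(fused))   # class 0 wins for these numbers
```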

Proposed method

Two-Stream Inflated 3D ConvNets

  1. Convert 2D ConvNets into 3D by inflation: inflate all 2D filters and pooling kernels from N×N to N×N×N.
  2. Bootstrap 3D filters from 2D filters: repeat each 2D filter N times along the time dimension, then rescale by dividing by N.
  3. Pace receptive field growth in space, time and network depth: the best temporal strides depend on frame rate and are determined by experiment.
  4. Two 3D streams: the inputs are RGB frames and optical flow, respectively. The two networks are trained separately, and their predictions are averaged at test time.
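Steps 1 and 2 can be illustrated in plain NumPy (a toy sketch with a single channel; `corr2d`/`corr3d` are helpers written here, not framework APIs). On a "boring" video made of N identical frames, the bootstrapped 3D filter reproduces the original 2D response exactly, which is the property the 1/N rescaling preserves:

```python
import numpy as np

def corr2d(x, k):
    # valid cross-correlation of a 2D map with a 2D kernel
    H, W = x.shape; kh, kw = k.shape
    return np.array([[(x[i:i+kh, j:j+kw] * k).sum()
                      for j in range(W - kw + 1)]
                     for i in range(H - kh + 1)])

def corr3d(x, k):
    # valid cross-correlation of a video (T, H, W) with a 3D kernel
    T, H, W = x.shape; kt, kh, kw = k.shape
    return np.array([[[(x[t:t+kt, i:i+kh, j:j+kw] * k).sum()
                       for j in range(W - kw + 1)]
                      for i in range(H - kh + 1)]
                     for t in range(T - kt + 1)])

k2d = np.random.rand(3, 3)                  # a pretrained 2D filter (stand-in)
N = 3
k3d = np.repeat(k2d[None], N, axis=0) / N   # bootstrap: repeat N times, rescale by 1/N

frame = np.random.rand(6, 6)
video = np.stack([frame] * N)               # "boring" video: N identical frames

# The inflated filter gives the same activations as the 2D filter did.
assert np.allclose(corr3d(video, k3d)[0], corr2d(frame, k2d))
```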
Takeaways

3D convolution, inflation, and related ideas may be helpful for the VOS task. This paper shows that, when extracting video features, one can reuse network architectures designed for ImageNet images and even their pretrained parameters; worth a closer look.

Learning to Adapt Structured Output Space for Semantic Segmentation


Summary

Relying on pixel-level supervision, CNN-based methods achieve good semantic segmentation performance, but their generalization to unseen image domains is limited. This work proposes to learn the adaptation in the segmentation output space. The motivation is that, even for two images with different appearance, the segmentation outputs are structured and share many similarities, e.g., spatial layout and local texture. The proposed method constructs a multi-level adversarial network to effectively perform output-space domain adaptation at different feature levels.
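One way to picture output-space adaptation is a fully convolutional discriminator applied to segmentation softmax maps (a toy NumPy sketch; the 1×1-conv discriminator, its weights `w`, and the random "outputs" are all stand-ins, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def bce(p, y):
    # binary cross-entropy between discriminator prediction p and domain label y
    eps = 1e-7
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean()

# Stand-in segmentation softmax outputs (C x H x W) for a source and a target
# image; the key idea is that the discriminator sees these structured output
# maps rather than intermediate features.
P_src = rng.dirichlet(np.ones(4), size=(8, 8)).transpose(2, 0, 1)
P_tgt = rng.dirichlet(np.ones(4), size=(8, 8)).transpose(2, 0, 1)

def discriminator(P, w):
    # toy fully convolutional discriminator: 1x1 conv + sigmoid, per pixel
    logit = np.tensordot(w, P, axes=([0], [0]))
    return 1 / (1 + np.exp(-logit))

w = rng.normal(size=4)
# Discriminator objective: tell source (label 1) from target (label 0) outputs.
d_loss = bce(discriminator(P_src, w), 1.0) + bce(discriminator(P_tgt, w), 0.0)
# Adversarial term for the segmentation net: push target outputs to look
# source-like (label 1). The paper applies this at multiple feature levels.
adv_loss = bce(discriminator(P_tgt, w), 1.0)
```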

Path Aggregation Network for Instance Segmentation


Summary

Information propagation is important in neural networks. Building on Mask R-CNN, this work adds a bottom-up feature propagation path so that low-level features, which carry accurate localization signals, enhance the entire feature hierarchy. A new adaptive feature pooling layer is also proposed.
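The bottom-up augmentation path can be sketched as follows (a toy NumPy sketch; `downsample2x` stands in for the stride-2 3×3 convolution, and the FPN levels `P` are random stand-ins rather than real network outputs):

```python
import numpy as np

def downsample2x(x):
    # stride-2 subsampling standing in for the stride-2 3x3 conv in the paper
    return x[:, ::2, ::2]

# Stand-in FPN outputs P2..P5 (channels x H x W), highest resolution first.
rng = np.random.default_rng(0)
P = [rng.normal(size=(4, 32 // 2**i, 32 // 2**i)) for i in range(4)]

# Bottom-up path: start from the highest-resolution level and propagate the
# accurate low-level localization signal upward, fusing with each FPN level.
N = [P[0]]
for i in range(1, len(P)):
    N.append(downsample2x(N[-1]) + P[i])
```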

Unsupervised Learning and Segmentation of Complex Activities from Video

Summary

This work proposes an iterative discriminative-generative approach to video segmentation. It consists of two parts: one discriminatively learns the appearance of sub-activities, mapping the videos' visual features to activity labels; the other generatively models the temporal structure of sub-activities using a Generalized Mallows Model.
Takeaway: consider both appearance features and temporal structure.

Translating and Segmenting Multimodal Medical Volumes with Cycle- and Shape-Consistency Generative Adversarial Network


Summary

This paper studies synthesizing medical images from CT and MRI data. There are three goals: (1) synthesize realistic-looking 3D images using unpaired training data; (2) ensure consistent anatomical structure; (3) improve volume segmentation. The proposed framework consists of mutually beneficial generators and segmentors: the former are responsible for image synthesis, the latter for segmentation.
Takeaways: (1) use a 3D CNN for segmentation; medical data tends to have many channels. (2) Doing synthesis and segmentation jointly improves both.
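The cycle- and shape-consistency terms can be spelled out on toy arrays (a sketch only; `G_ab`, `G_ba`, and `S` are made-up stand-ins for the two generators and the segmentor, not the paper's networks):

```python
import numpy as np

# Toy "generators" standing in for G: CT -> MRI and its inverse, plus a toy
# threshold "segmentor" S; plain functions on arrays, just to spell out the losses.
def G_ab(x): return x * 0.9 + 0.1
def G_ba(y): return (y - 0.1) / 0.9
def S(x):    return (x > 0.5).astype(float)

x = np.random.rand(8, 8)   # one unpaired CT slice (toy data)
seg_x = S(x)               # its (toy) segmentation

# Cycle consistency: translating A -> B -> A should recover the input.
cycle_loss = np.abs(G_ba(G_ab(x)) - x).mean()

# Shape consistency: segmenting the synthesized image should match the
# original segmentation, so anatomical structure is preserved.
shape_loss = np.abs(S(G_ab(x)) - seg_x).mean()
```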
