Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Summary
This paper introduces the Kinetics Human Action Video dataset, which covers 400 human action classes with over 400 clips per class. It also proposes a Two-Stream Inflated 3D ConvNet (I3D), which can learn spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their pre-trained parameters.
Research Background
Deep ConvNets pre-trained on the ImageNet dataset provide a warm start for, or even suffice entirely for, many image-based computer vision tasks. This article studies whether training an action classification network on a sufficiently large video dataset brings a similar performance boost when the network is applied to a different temporal task or dataset.
Old 1: ConvNet + LSTM
Use a ConvNet to independently extract features from each frame, then add a recurrent layer to encode state and capture temporal order and long-range dependencies.
Old 2: 3D ConvNets
Equipped with spatio-temporal filters, 3D ConvNets are a natural approach to video processing. However, 3D ConvNets contain many more parameters than their 2D counterparts, making them harder to train.
Old 3: Two-Stream network
The network consists of two branches that make predictions separately: one from a single RGB frame and one from a stack of 10 externally computed optical flow frames. The two predictions are then averaged to obtain the final decision.
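The late-fusion step above can be sketched in a few lines; the logits below are hypothetical placeholder scores, not values from the paper:

```python
import numpy as np

# Minimal sketch of two-stream late fusion: the RGB stream and the
# optical-flow stream each produce class scores, and the final
# prediction averages the two softmax outputs.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rgb_logits = np.array([2.0, 0.5, 0.1])   # hypothetical RGB-stream scores
flow_logits = np.array([1.0, 1.5, 0.2])  # hypothetical flow-stream scores

fused = 0.5 * (softmax(rgb_logits) + softmax(flow_logits))
pred = int(np.argmax(fused))  # index of the predicted action class
```

Averaging softmax outputs (rather than logits) keeps each stream's contribution on the same probability scale.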
Proposed method
Two-Stream Inflated 3D ConvNets
- Convert 2D ConvNets into 3D by inflation: inflate all the 2D filters and pooling kernels, converting N×N kernels into N×N×N.
- Bootstrapping 3D filters from 2D filters: repeat the 2D filters N times along the time dimension, then rescale them by dividing by N.
- Pacing receptive field growth in space, time and network depth: verified by experiment.
- Two 3D streams. The inputs for two streams are RGB and flow, respectively. The two networks are trained separately. In evaluation, the two predictions are averaged.
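The inflation and bootstrapping steps above can be sketched with plain numpy. The check at the end reflects the paper's motivation: on a "boring video" (the same frame repeated), the inflated filter should reproduce the original 2D filter's response. The helper functions are simple loop-based convolutions written for illustration, not an efficient implementation:

```python
import numpy as np

def inflate_2d_filter(w2d, n):
    """Repeat a (k, k) filter n times along time, rescale by 1/n."""
    return np.repeat(w2d[None, :, :], n, axis=0) / n

def conv2d_valid(img, w):
    """Naive 'valid' 2D cross-correlation."""
    k = w.shape[0]
    h, wd = img.shape
    out = np.zeros((h - k + 1, wd - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * w)
    return out

def conv3d_valid(vid, w):
    """Naive 'valid' 3D cross-correlation over (time, h, w)."""
    t, k = w.shape[0], w.shape[1]
    T, h, wd = vid.shape
    out = np.zeros((T - t + 1, h - k + 1, wd - k + 1))
    for s in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[s, i, j] = np.sum(vid[s:s + t, i:i + k, j:j + k] * w)
    return out

rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8))      # stand-in for an image
w2d = rng.standard_normal((3, 3))        # stand-in for a pre-trained 2D filter

N = 3
w3d = inflate_2d_filter(w2d, N)          # bootstrapped 3D filter
boring_video = np.repeat(frame[None, :, :], 5, axis=0)  # same frame 5 times

resp_2d = conv2d_valid(frame, w2d)
resp_3d = conv3d_valid(boring_video, w3d)

# Each temporal slice of the 3D response equals the 2D response,
# because the N identical filter slices each contribute 1/N.
assert np.allclose(resp_3d[0], resp_2d)
```

Dividing by N is exactly what preserves the pre-trained activations: the N identical temporal slices sum back to the original 2D filter.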
Takeaways
3D convolution and inflation may be helpful for VOS tasks. This paper shows that, when extracting features from video, one can reuse network architectures designed for image processing on ImageNet, and even their pre-trained parameters; worth a read.
Learning to Adapt Structured Output Space for Semantic Segmentation
Summary
Relying on pixel-level supervision, CNN-based methods can achieve good semantic segmentation performance, but their generalization to unseen image domains is limited. This work proposes to learn adaptation in the segmentation output space. The motivation is that, given two images with different appearance, their segmentation outputs are structured and share many similarities, e.g., spatial layout and local texture. The proposed method constructs a multi-level adversarial network to effectively perform output-space domain adaptation at different feature levels.
Path Aggregation Network for Instance Segmentation
Summary
Information propagation is important in neural networks. Building on Mask R-CNN, this work adds a bottom-up feature propagation path so that low-level features (which carry accurate localization signals) enhance the entire feature hierarchy. A new layer, named adaptive feature pooling, is also proposed.
Unsupervised Learning and Segmentation of Complex Activities from Video
Summary
This work proposes an iterative discriminative-generative approach for video segmentation. It consists of two parts: one discriminatively learns the appearance of sub-activities, mapping the videos' visual features to activity labels; the other generatively models the temporal structure of sub-activities using a Generalized Mallows Model.
Takeaway: consider both appearance features and temporal structure.
Translating and Segmenting Multimodal Medical Volumes with Cycle- and Shape-Consistency Generative Adversarial Network
Summary
This paper studies synthesizing medical images from CT and MRI data. There are three goals: (1) synthesizing realistic-looking 3D images using unpaired training data; (2) ensuring consistent anatomical structure; (3) improving volume segmentation. The proposed framework consists of mutually beneficial generators and segmentors: the former are responsible for image synthesis, while the latter are responsible for segmentation.
Takeaways: (1) Use 3D CNNs for segmentation; medical data tend to have a relatively large number of channels. (2) Perform synthesis and segmentation jointly so that each improves the other.