CVPR2018 - Video Object Segmentation - 4

SeGAN: Segmenting and Generating the Invisible

Skimmed; motivation only.

This work strives to complete the appearance of occluded objects in two steps: segmenting the invisible parts of the objects, then generating the appearance of those parts. It combines a segmentation network with a GAN to achieve this goal.

Dynamic Video Segmentation Network


Research Background

This work tackles fast and efficient semantic video segmentation. Contemporary state-of-the-art CNN models for semantic image segmentation usually rely on deep network architectures to achieve high accuracy. The tremendous computational burden keeps these methods far from real-time applications.

Motivation and proposed method

This work proposes to adaptively apply two different neural networks to different regions of the frames, exploiting spatial and temporal continuity as much as possible. The authors elaborate on this motivation from two perspectives. The first is that only a small portion of each frame differs noticeably between consecutive frames, implying that a large portion of the feature maps across these frames is invariant, or varies only slightly. Intuitively, we should reuse features for similar frame regions, while extracting new features for regions with distinct differences. The second perspective is argued from the view of temporal correlation.
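The region-routing idea above can be sketched as follows. This is a minimal illustration, not the paper's method: DVSNet uses a learned decision network to score each region's expected confidence, whereas here a simple mean-absolute-difference against the previous frame stands in for that score, and the grid size and threshold are arbitrary assumptions.

```python
import numpy as np

def route_regions(prev_frame, curr_frame, threshold=0.02):
    """Split each frame into a 2x2 grid and decide, per region, whether
    the expensive (deep) or cheap (shallow) path should process it,
    based on how much the region changed since the previous frame."""
    h, w = curr_frame.shape[:2]
    decisions = {}
    for i in range(2):
        for j in range(2):
            ys = slice(i * h // 2, (i + 1) * h // 2)
            xs = slice(j * w // 2, (j + 1) * w // 2)
            # Mean absolute difference: a crude stand-in for the paper's
            # learned decision-network confidence score.
            diff = np.abs(curr_frame[ys, xs] - prev_frame[ys, xs]).mean()
            decisions[(i, j)] = "deep" if diff > threshold else "shallow"
    return decisions
```

Regions that barely changed would then reuse (flow-warped) features from the previous keyframe instead of re-running the deep network.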

Strengths
  1. The writing has some truly brilliant passages.
Potential weaknesses
  1. Dividing every frame evenly into 4 regions is likely to cut target objects apart crudely. (1) Existing segmentation models cannot handle such partial objects well and would need retraining; (2) segmenting an object when only part of it is visible is inherently difficult.
  2. Segmentation quality still depends heavily on the deeper network. Although the shallow network has fewer parameters, running a network multiple times per frame makes data I/O and merging the partial results time-consuming.

[Transcribed]: Due to real-time requirements, the applications of video object segmentation methods typically require high frame rates (fps), necessitating short inference latency in the perception modules. Unfortunately, contemporary state-of-the-art CNN models usually employ deep network architectures to extract high-level features from raw data, leading to exceptionally long inference time. The well-known models proposed for semantic image segmentation, including fully convolutional networks (FCN), DeepLab, PSPNet, ResNet-38, RefineNet, dense upsampling convolution (DUC), etc., are not suitable for real-time video object segmentation due to their usage of deep network architectures. These models usually incorporate extra layers for boosting their accuracies, such as spatial pyramid pooling (SPP), multi-scale dilated convolution, multi-scale input paths, multi-scale feature paths, global pooling, and conditional random field (CRF). These additional layers consume a tremendous amount of computational resources to process every pixel in an image, leading to impractical execution time.
Writing technique: the authors are clearly familiar with the influential works in the field and can distill what they have in common. They state the methods' shared traits, argue with concrete examples, and conclude with the common shortcoming of current approaches, which makes the writing coherent and more persuasive.

Semantic Video Segmentation by Gated Recurrent Flow Propagation


Research background
  1. Although single-frame, static segmentation models have shown impressive results, fully trainable approaches for semantic video segmentation are rare, limited by the scarcity of detailed annotations and the computational burden.
  2. Models based on 3D convolutions have been used to jointly learn video processing and temporal matching. These methods do not require explicit connections among neighboring frames. 3D convolutions have been applied to action recognition, but not to video segmentation.
Motivation and proposed method

To model long-range temporal structure in semantic video segmentation, this work proposes to sparsely label frames and then leverage temporal dependencies to propagate and aggregate information, thereby reducing uncertainty. Specifically, it proposes a spatial transformer structure with an optical-flow warping operation, which propagates information among nearby frames. A single-frame CNN and the spatial transformer structure are combined via adaptive recurrent units that fuse the single-frame estimate with the propagated one.
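The warp-then-fuse step can be sketched as follows. This is a toy illustration under simplifying assumptions: nearest-neighbor backward warping instead of bilinear spatial-transformer sampling, and a given per-pixel gate instead of the paper's learned gated recurrent unit.

```python
import numpy as np

def warp(prev_pred, flow):
    """Backward-warp the previous frame's prediction map with optical
    flow (nearest-neighbor sampling for simplicity); flow[..., 0] is the
    horizontal and flow[..., 1] the vertical displacement."""
    h, w = prev_pred.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    return prev_pred[src_y, src_x]

def gated_fuse(warped_pred, curr_pred, gate):
    """Adaptive recurrent unit in spirit: a per-pixel gate in [0, 1]
    decides how much to trust the propagated estimate versus the
    single-frame one."""
    return gate * warped_pred + (1.0 - gate) * curr_pred
```

In the actual model the gate is predicted from the data, so reliable propagated evidence is kept while unreliable flow is overridden by the single-frame CNN.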

Strengths
  1. Integrates GRU modules with the VOS task and achieves solid results.
  2. Instead of directly reusing the previous frame's mask, it projects the mask via optical flow.
Questions
  1. How well does it perform when trained and tested on the DAVIS dataset?
  2. How does this work differ from ICCV17: Learning Video Object Segmentation With Visual Memory? Sparse annotations and mask projection, presumably.
  3. How exactly are the sparsely annotated data used? How sparse are the annotations?

Instance Embedding Transfer to Unsupervised Video Object Segmentation

Skimmed; motivation only.

This work learns instance embeddings from static images, then combines the embeddings with objectness and optical-flow features, and finally segments video objects frame by frame.
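The combination of embedding similarity with objectness and motion priors could be sketched as below. All names and the scoring rule are hypothetical simplifications: the paper's actual seed selection and ranking are more involved, and the 0.5 threshold and equal weights are assumptions for illustration.

```python
import numpy as np

def select_foreground(embeddings, seed, objectness, motion,
                      w_obj=0.5, w_mot=0.5):
    """Score each pixel by (a) cosine similarity of its instance
    embedding to a seed pixel's embedding and (b) a weighted
    objectness/motion prior, then threshold the product."""
    # Normalize embeddings (H, W, D) and the seed vector (D,).
    e = embeddings / (np.linalg.norm(embeddings, axis=-1, keepdims=True) + 1e-8)
    s = e @ (seed / (np.linalg.norm(seed) + 1e-8))  # cosine similarity map
    prior = w_obj * objectness + w_mot * motion
    return (s * prior) > 0.5
```

Because the embeddings are trained on static images, the motion and objectness priors are what ties the per-frame grouping to the video object.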

Questions
  1. This work is quite similar to the ICCV 2017 work Segmentation-Aware Convolutional Networks Using Local Attention Masks; both learn embeddings.
  2. This work trains on image data, yet claims to be an unsupervised video object segmentation method.