Video Object Segmentation: See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks

有为少年

Article link: https://blog.csdn.net/P_LarT/article/details/105859057

See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks

Original link: https://www.yuque.com/lart/papers/yncvdo

This paper addresses unsupervised video object segmentation and was published at CVPR 2019.

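Highlights

Pairs of input frames are sampled at random for training, which both teaches the network inter-frame relations and augments the data
Global reference frames assist inference, giving better predictions from a global view
A Non-Local-like structure uses inter-frame relations to assist and guide each frame's own features
The feature extraction network is pre-trained on salient object detection, which helps localize the target in the unsupervised video object segmentation task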

Task introduction

The DAVIS Challenge website (https://davischallenge.org/index.html) gives the following definitions:

Semi-supervised and Unsupervised refer to the level of human interaction at test time, not during the training phase.

In Semi-supervised, better called human guided, the segmentation mask for the objects of interest is provided in the first frame.
In Unsupervised, better called human non-guided, no human input is provided.

Video Object Segmentation

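UVOS: unsupervised video object segmentation
- The network must identify the target object on its own
- During training, the ground truth (a binary mask) is given for every frame; at test time, no ground truth is provided at all
- Key questions: how to determine the target, and how to keep hold of it across frames
- This paper's approach:
  - Pre-train the feature extraction network on salient object detection datasets to localize the target better
  - Exploit a global view and the correlations inherent in the video to maintain and refine the segmentation of the target
SVOS: semi-supervised video object segmentation
- The object is given in the first frame
- During training, the ground truth (a binary mask) is given for every frame; at test time, no ground truth is provided except for the first frame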

Method

The method consists of five parts:

Frame Pair
Feature Embedding
Co-attention
Segmentation
Loss

Frame Pair

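The number of frames grouped together differs between stages
- During training, frames are randomly selected from the same video and paired as input
- During testing, several reference frames are selected and fed in together with each test frame
Purpose and benefits of pairing
- It is an effective way to enlarge the training data, since a large number of arbitrary frame pairs from the same video can be used for training
- Training can be viewed as learning the correlation between arbitrary frame pairs within the same video, which can also be seen as a form of context learning
- During testing, the network infers the primary object from a global view, i.e., it uses the co-attention information between the test frame and multiple reference frames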
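
The two sampling regimes described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and skipping the query frame when choosing references is an assumption made here, not something these notes specify.

```python
import random

def sample_training_pair(num_frames, rng=random):
    """Training: pick two distinct frame indices from the same video."""
    if num_frames < 2:
        raise ValueError("need at least two frames to form a pair")
    a, b = rng.sample(range(num_frames), 2)
    return a, b

def sample_reference_frames(num_frames, query_idx, n_refs):
    """Testing: pick n_refs roughly evenly spaced reference indices.

    Assumption: the query frame itself is not used as a reference.
    """
    candidates = [i for i in range(num_frames) if i != query_idx]
    if n_refs >= len(candidates):
        return candidates
    step = len(candidates) / n_refs
    return [candidates[int(k * step)] for k in range(n_refs)]
```

Because any two frames of a video can form a training pair, a video with T frames offers on the order of T² ordered pairs, which is where the data-augmentation effect comes from.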

Feature Embedding

DeepLab V3
- the first five convolution blocks from ResNet
- an atrous spatial pyramid pooling (ASPP) module
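During training
- DeepLab V3 is fine-tuned on the MSRA10K and DUT datasets
- Extra layers for predicting saliency maps are attached at this stage; they are discarded after training, and only the feature extraction part is kept
- Cross-entropy is used as the supervision at this stage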

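(Key) Co-attention

This module mines and exploits the correlations between frames.

The paper designs three different ways of computing the similarity (affinity) matrix S:

Vanilla co-attention
- In Eq. 3 of the paper, P and D are obtained by diagonalizing the square matrix W: P is an invertible matrix and D is a diagonal matrix.
- Eq. 3 states that the feature representation of each frame first undergoes a linear transformation, and the distance between any pair of locations is then computed.
- Implementation: a fully connected layer with 512×512 parameters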
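
As a rough NumPy illustration of the vanilla variant, assume the affinity takes the bilinear form S = V_b^T W V_a on features flattened to shape (C, H·W), with W = P^{-1} D P. The toy sizes and random values below are placeholders (the notes mention a 512×512 weight), and the function name is hypothetical.

```python
import numpy as np

def vanilla_affinity(v_a, v_b, w):
    """Bilinear affinity S = V_b^T W V_a; entry (i, j) compares
    location i of frame b with location j of frame a."""
    return v_b.T @ w @ v_a

rng = np.random.default_rng(0)
C, N = 4, 6                                   # toy channel count and H*W
v_a = rng.standard_normal((C, N))
v_b = rng.standard_normal((C, N))

# Build W from an invertible P and a diagonal D, i.e. W = P^{-1} D P.
p = rng.standard_normal((C, C)) + C * np.eye(C)
d = np.diag(rng.standard_normal(C))
w = np.linalg.inv(p) @ d @ p

s = vanilla_affinity(v_a, v_b, w)
# Same result when each frame's features are linearly transformed first.
s_factored = (v_b.T @ np.linalg.inv(p)) @ d @ (p @ v_a)
```

The factored form makes the reading in the bullets concrete: both feature maps are mapped by a linear transformation, and similarities are computed afterwards.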

Symmetric co-attention
- If we further constrain the weight matrix to be a symmetric matrix, the project matrix P becomes an orthogonal matrix.
- Eq. 4 indicates that we project the feature embeddings Va and Vb into an orthogonal common space and maintain their norm of Va and Vb. This property has proved valuable for eliminating the correlation between different channels (i.e., C-dimension) [Svdnet for pedestrian retrieval] and improving the network’s generalization ability [Neural photo editing with introspective adversarial networks, Regularizing cnns with locally constrained decorrelations].
- Implementation: an orthogonal regularization term on W is added to the segmentation loss; the structure is the same as the vanilla variant (this is my guess, as the paper does not state it explicitly)
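
A minimal sketch of such an orthogonality penalty is shown below. The Frobenius norm and the weight `lam` are assumptions chosen for illustration; the notes only say that an orthogonal regularization term on W is added to the segmentation loss.

```python
import numpy as np

def orthogonal_penalty(w):
    """Penalty ||W W^T - I||_F, which is zero exactly when W is orthogonal."""
    eye = np.eye(w.shape[0])
    return np.linalg.norm(w @ w.T - eye, ord="fro")

def total_loss(seg_loss, w, lam=1e-4):
    """Segmentation loss plus the weighted penalty (lam is a placeholder)."""
    return seg_loss + lam * orthogonal_penalty(w)

theta = 0.3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
penalty_rotation = orthogonal_penalty(rotation)       # orthogonal matrix
penalty_scaled = orthogonal_penalty(2.0 * np.eye(2))  # not orthogonal
```

A rotation matrix incurs essentially no penalty, while a scaled identity does, so minimizing the total loss pushes W toward an orthogonal matrix.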

Channel-wise co-attention
- Furthermore, the project matrix P can be simplified into an identity matrix I (i.e., without space transformation), and then the weight matrix W becomes a diagonal matrix. In this case, W (i.e., D) can be further diagonalized into two diagonal matrices Da and Db.
- This operation in Eq. 5 is equal to applying a channel-wise weight to Va and Vb before computing the similarity. This helps to alleviate channel-wise redundancy, which shares a similar spirit to Squeeze-and-Excitation mechanism [SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning, Squeeze-and-excitation networks].
- Implementation: it is built on a Squeeze-and-Excitation (SE)-like module
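
The channel-wise variant can be illustrated as below. The SE-style weighting (global average pooling, a bottleneck MLP, then a sigmoid), the reduction ratio, and the sharing of the MLP weights between the two frames are all assumptions made for this sketch; the notes only say that the module is SE-like.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_weights(v, w1, w2):
    """SE-style weights: global average pool -> bottleneck MLP -> sigmoid."""
    squeezed = v.mean(axis=1)                 # (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # ReLU, shape (C // R,)
    return sigmoid(w2 @ hidden)               # (C,), each value in (0, 1)

def channelwise_affinity(v_a, v_b, d_a, d_b):
    """Scale the channels of each frame, then take plain dot products."""
    return (d_b[:, None] * v_b).T @ (d_a[:, None] * v_a)

rng = np.random.default_rng(0)
C, N, R = 8, 5, 2                             # toy sizes, R = reduction ratio
v_a = rng.standard_normal((C, N))
v_b = rng.standard_normal((C, N))
w1 = rng.standard_normal((C // R, C))
w2 = rng.standard_normal((C, C // R))

d_a = se_channel_weights(v_a, w1, w2)
d_b = se_channel_weights(v_b, w1, w2)
s = channelwise_affinity(v_a, v_b, d_a, d_b)
# Equivalent to a bilinear form with the diagonal weight D = Da Db.
s_diag = v_b.T @ np.diag(d_a * d_b) @ v_a
```

Compared with the full C×C weight of the vanilla variant, only C scaling factors per frame are involved here.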

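The similarity matrix is then normalized so that it can be applied to the original features.

Applying it to the original features yields a set of features that carry information from the other frame.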
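
These two steps can be sketched in NumPy under the following assumptions, recalled from the paper rather than taken from these notes: the normalization is a column-wise softmax, S^c = softmax(S) and S^r = softmax(S^T), and the summaries are Z_a = V_b S^c and Z_b = V_a S^r. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention_summaries(v_a, v_b, s):
    """v_a, v_b: (C, N) features; s: (N, N) affinity with s = V_b^T W V_a."""
    s_c = softmax(s, axis=0)      # every column sums to 1
    s_r = softmax(s.T, axis=0)    # same normalization on the transpose
    z_a = v_b @ s_c               # each column mixes locations of frame b
    z_b = v_a @ s_r               # each column mixes locations of frame a
    return z_a, z_b

rng = np.random.default_rng(0)
C, N = 4, 6
v_a = rng.standard_normal((C, N))
v_b = rng.standard_normal((C, N))
# With a constant affinity every location gets equal weight, so each
# column of z_a is just the spatial mean of v_b.
z_a, z_b = co_attention_summaries(v_a, v_b, np.zeros((N, N)))
```

Each column of Z_a is a convex combination of the feature vectors of the other frame, which is why the structure resembles a Non-Local block applied across two frames.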

Considering the underlying appearance variations between input pairs, occlusions, and background noise, it is better to weight the information from different input frames, instead of treating all the co-attention information equally.

In effect, Z computes a weight between 0 and 1 from itself and uses that weight to gate itself.
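
A small sketch of such a self-gate is given below. It assumes the gate is a single linear map over channels (equivalent to a 1×1 convolution with one output channel) followed by a sigmoid, giving one weight per spatial location; the parameter names `w_f` and `b_f` and the toy values are illustrative.

```python
import numpy as np

def self_gate(z, w_f, b_f):
    """Gate a (C, N) summary with a per-location weight computed from itself."""
    logits = w_f @ z + b_f               # (N,), one scalar per location
    g = 1.0 / (1.0 + np.exp(-logits))    # sigmoid keeps the weight in (0, 1)
    return z * g, g                      # broadcast the gate over channels

rng = np.random.default_rng(0)
C, N = 4, 6
z = rng.standard_normal((C, N))
gated, g = self_gate(z, rng.standard_normal(C), 0.1)

# With zero weights and zero bias the gate is 0.5 at every location.
half_gated, half_g = self_gate(z, np.zeros(C), 0.0)
```

Locations whose summary looks unreliable (for example under occlusion) can thus be suppressed before the summary is combined with the frame's own features.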

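Training and testing use different structures: frames are fed in pairs during training, while several strategies are available at test time. For each test frame:

Use only a single reference frame
Prediction segmentation fusion: uniformly sample N frames from the same video, process each one paired with the test frame, and average the resulting predictions
Attention summary fusion: uniformly sample N frames from the same video, process each one paired with the test frame in feature space, and average the resulting features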
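
The difference between the two fusion strategies is where the averaging happens, which the sketch below tries to make explicit. `attend` and `segment` are hypothetical callables standing in for the co-attention module and the segmentation head; the toy versions exist only so the example runs.

```python
import numpy as np

def prediction_fusion(query, refs, attend, segment):
    """Segment the query once per reference, then average the N masks."""
    masks = [segment(query, attend(query, ref)) for ref in refs]
    return np.mean(masks, axis=0)

def attention_summary_fusion(query, refs, attend, segment):
    """Average the N attention summaries, then segment the query once."""
    summaries = [attend(query, ref) for ref in refs]
    return segment(query, np.mean(summaries, axis=0))

# Toy stand-ins (assumptions, not the real network components).
def toy_attend(query, ref):
    return 0.5 * (query + ref)

def toy_segment(query, summary):
    return 1.0 / (1.0 + np.exp(-(query + summary).sum(axis=0)))

rng = np.random.default_rng(0)
query = rng.standard_normal((3, 4))                     # (C, N) features
refs = [rng.standard_normal((3, 4)) for _ in range(5)]  # N = 5 references

mask_pred = prediction_fusion(query, refs, toy_attend, toy_segment)
mask_attn = attention_summary_fusion(query, refs, toy_attend, toy_segment)
```

With a single reference the two strategies coincide; with several references they generally differ because the segmentation head is nonlinear, and attention summary fusion runs that head only once per test frame.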

Segmentation

two 3×3 convolutional layers (with 256 filters and batch norm)
a 1×1 convolutional layer (with 1 filter and sigmoid activation) for final segmentation prediction