A Summary of Representative Matching-based Video Object Segmentation Algorithms (JYZhang_CVML)


JYZhang_sh


Article link: https://blog.csdn.net/JYZhang_CVML/article/details/101384614


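I recently read a number of papers on semi-supervised video object segmentation (VOS) and noticed that several algorithms share a clear common trait, which I group under the name Matching-based Methods. This post briefly summarizes the main contribution of each of these methods; I hope it is useful for your research as well.

What are Matching-based Methods?

To explain the concept, we first need to understand another family of VOS algorithms: Propagation-based Methods. Classic examples include MaskTrack¹ and RGMP². Besides the image of the current frame, the network also takes the segmentation result of the previous frame as input, and the output is supervised by the segmentation of the current frame. Intuitively, this amounts to propagating the previous frame's segmentation to the next frame.

The Matching-based Methods discussed here generally rely on a reference frame (for example, the annotated first frame) rather than the frame immediately preceding the current one. The current frame is matched against the reference frame at the pixel level to obtain a similarity measure, from which the segmentation result is derived.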

PML: Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning³

1. Contribution

PML recasts the matching problem as pixel-wise retrieval in a learned embedding space. A CNN is trained as the embedding model; at test time, every pixel of the current frame is embedded, and its k nearest neighbors (KNN) among the embedded pixels of the reference frame determine the segmentation result.

2. Algorithm Overview

[Figure: overview of the PML algorithm]

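2.1 Testing

2.1.1 Test procedure

- Use the embedding network to extract a per-pixel embedding feature for both the reference image and the test image.
- For each pixel of the test image, find its nearest neighbor among the reference-image pixels in the embedding space and assign it the corresponding label.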
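
The retrieval step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the embeddings are taken as given (no network is involved), labels are binary, and the function name `knn_segment` and the majority-vote rule are hypothetical choices, not the authors' code.

```python
import numpy as np

def knn_segment(ref_emb, ref_labels, test_emb, k=1):
    """Label each test pixel by majority vote over its k nearest reference pixels.

    ref_emb:    (N_ref, d)  embeddings of reference-frame pixels
    ref_labels: (N_ref,)    binary labels (0 = background, 1 = foreground)
    test_emb:   (N_test, d) embeddings of test-frame pixels
    """
    # Squared Euclidean distance between every test pixel and every reference pixel.
    d2 = ((test_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest reference pixels for each test pixel.
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    # Majority vote over the neighbours' labels (ties go to background).
    votes = ref_labels[nn_idx]                      # shape (N_test, k)
    return (votes.mean(axis=1) > 0.5).astype(int)

# Toy example: two background and two foreground reference pixels in a 2-D embedding space.
ref_emb = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
ref_labels = np.array([0, 0, 1, 1])
test_emb = np.array([[0.05, 0.1], [0.95, 0.9]])
pred = knn_segment(ref_emb, ref_labels, test_emb, k=1)
```

The brute-force distance matrix costs memory proportional to the number of test pixels times the number of reference pixels, so a real system would likely use an approximate nearest-neighbor index instead.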

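2.1.2 Retrieval with online adaptation (note the difference from OnAVOS and OSVOS)

Pixels whose predictions are sufficiently certain are continually added to the reference pool, and pixels of later frames are retrieved against this growing pool. Unlike OSVOS and OnAVOS, which adapt by fine-tuning network weights, only the reference pool changes here.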
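
A minimal sketch of such a pool update is shown below. The confidence score `frame_conf` and the threshold `tau` are assumptions made for illustration; the text above only says that "sufficiently certain" pixels are added, and the paper's actual selection rule may differ.

```python
import numpy as np

def update_reference_pool(pool_emb, pool_labels, frame_emb, frame_pred, frame_conf, tau=0.9):
    """Append confidently labelled pixels of the current frame to the reference pool.

    pool_emb:    (N, d) embeddings already in the pool
    pool_labels: (N,)   their labels
    frame_emb:   (M, d) embeddings of the current frame's pixels
    frame_pred:  (M,)   predicted labels for those pixels
    frame_conf:  (M,)   hypothetical confidence score in [0, 1] for each prediction
    tau:         hypothetical confidence threshold
    """
    keep = frame_conf >= tau                        # only trust confident predictions
    new_emb = np.concatenate([pool_emb, frame_emb[keep]], axis=0)
    new_labels = np.concatenate([pool_labels, frame_pred[keep]], axis=0)
    return new_emb, new_labels

# Toy example: a pool of two pixels, and a frame where two of three pixels are confident.
pool_emb = np.zeros((2, 3))
pool_labels = np.array([0, 1])
frame_emb = np.ones((3, 3))
frame_pred = np.array([1, 0, 1])
frame_conf = np.array([0.95, 0.50, 0.99])
pool_emb, pool_labels = update_reference_pool(
    pool_emb, pool_labels, frame_emb, frame_pred, frame_conf
)
```

Because no gradient step is taken, the cost of adaptation is just the growth of the pool that later frames are matched against.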

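2.2 Training

2.2.1 Embedding model

Goal: embeddings of pixels from the same object should be as close as possible, while embeddings of pixels from different objects should be as far apart as possible.
Network architecture:
- Base Feature Extractor: a pretrained segmentation model such as DeepLab-v2.
- Embedding Head: the final classification layer of the segmentation model is removed and replaced by two new convolutional layers that produce a $d$-dimensional output.
- Spatial and temporal information: the pixel coordinates and the frame index are fed into the embedding head as extra inputs, so the embedding mapping is $e_{j, i}=f\left(x_{j, i}, i, j\right)$, where $i$ is the pixel location and $j$ is the frame index.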
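
One simple way to give the embedding head access to position and time is to append coordinate channels to the base features. The sketch below assumes exactly that; the normalization to [0, 1] and the helper name `add_spatiotemporal_channels` are illustrative choices, not details confirmed by the text.

```python
import numpy as np

def add_spatiotemporal_channels(features, frame_idx, num_frames):
    """Append normalised (row, col, frame) channels to an (h, w, c) feature map."""
    h, w, _ = features.shape
    # Row and column coordinates, each scaled to [0, 1].
    rows, cols = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    # Frame index scaled to [0, 1]; the max() guards against single-frame videos.
    t = np.full((h, w), frame_idx / max(num_frames - 1, 1))
    extra = np.stack([rows, cols, t], axis=-1)      # shape (h, w, 3)
    return np.concatenate([features, extra], axis=-1)

# Toy example: an 8-channel feature map from frame 5 of an 11-frame video.
feats = np.random.rand(4, 6, 8)
out = add_spatiotemporal_channels(feats, frame_idx=5, num_frames=11)
```

The embedding head would then consume `out` (here with 8 + 3 channels) instead of the raw base features.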

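2.2.2 Loss design

[Figure: illustration motivating the loss design]
Even within the same foreground, as the figure above shows, the embeddings of the car and of the person may not fall into a single cluster. Forcing all embedding features of one object to be as similar as possible can therefore hurt metric learning, so the loss only compares each anchor with its nearest positive and nearest negative sample: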
$\sum_{x^{a} \in \mathcal{A}}\left\{\min _{x^{p} \in \mathcal{P}}\left\|f\left(x^{a}\right)-f\left(x^{p}\right)\right\|_{2}^{2}-\min _{x^{n} \in \mathcal{N}}\left\|f\left(x^{a}\right)-f\left(x^{n}\right)\right\|_{2}^{2}+\alpha\right\}$

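Here $x^{a}$ denotes an anchor sample, $\mathcal{P}$ is the set of samples with the same label as the anchor, and $\mathcal{N}$ is the set with a different label. The loss pushes each anchor to be closer to its nearest same-label sample than to its nearest different-label sample, with margin $\alpha$. In practice, the anchors $x^{a}$ (256 of them) are sampled from one frame of a video, and $\mathcal{P}$ and $\mathcal{N}$ are drawn from two other frames; these frames should be separated in time so that temporal information can be learned.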
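
The loss can be evaluated directly from its definition. The sketch below works on already-embedded samples, so `anchors`, `positives`, and `negatives` stand for $f(x^a)$, $f(x^p)$, and $f(x^n)$. The optional `hinge` flag is an assumption added here: the formula as written has no clamp, while standard triplet losses clamp each term at zero.

```python
import numpy as np

def modified_triplet_loss(anchors, positives, negatives, alpha=1.0, hinge=False):
    """Sum over anchors of: min positive distance - min negative distance + alpha.

    anchors:   (A, d) embedded anchor samples
    positives: (P, d) embedded samples with the anchors' label
    negatives: (N, d) embedded samples with a different label
    hinge:     if True, clamp each anchor's term at zero (not in the formula as written)
    """
    def min_sq_dist(a, pool):
        # Squared Euclidean distance from each anchor to its nearest pool sample.
        d2 = ((a[:, None, :] - pool[None, :, :]) ** 2).sum(axis=-1)
        return d2.min(axis=1)

    per_anchor = min_sq_dist(anchors, positives) - min_sq_dist(anchors, negatives) + alpha
    if hinge:
        per_anchor = np.maximum(per_anchor, 0.0)
    return per_anchor.sum()

# Toy example: the nearest positive is at squared distance 1, the nearest negative at 4.
anchors = np.array([[0.0, 0.0]])
positives = np.array([[0.0, 1.0], [3.0, 0.0]])
negatives = np.array([[2.0, 0.0], [0.0, 5.0]])
loss = modified_triplet_loss(anchors, positives, negatives, alpha=1.0)   # 1 - 4 + 1 = -2
```

Note that the far positive at squared distance 9 does not contribute at all, which is exactly the point of taking the minimum over $\mathcal{P}$: an anchor only needs one nearby positive, not a compact cluster.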

3. Summary

Pixel-wise metric learning turns the matching problem into an embedding problem, which avoids the time cost of online fine-tuning.

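PLM: Pixel-Level Matching for Video Object Segmentation using Convolutional Neural Networks⁴

1. Contribution

A classic Siamese network framework performs pixel-level matching. It also includes a feature compression module that is worth studying.

2. Algorithm Overview

2.1 Two-stage training strategy

Training from scratch directly on a single frame of the test video easily leads to overfitting. The model is therefore first trained on a large set of training data (video sequences with per-frame annotations) and then fine-tuned on one frame of the test video.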

2.2 Network Architecture

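[Figure: PLM network architecture, with the Siamese encoder in blue, the MLP in red, and the object decoder in green]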

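Training input: the reference frame and the current frame; the supervision is the segmentation of the current frame.
The Siamese network (blue) extracts multi-scale features from the reference frame and the current frame separately. The multi-scale features are compressed by compression layers and then fed jointly into an MLP (red), which encodes them into a similarity map. Finally, the similarity map goes through object decoding (green) to produce the segmentation result.
Pay close attention to how the similarity measure is generated here: producing the similarity is the core of matching-based methods.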
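
To make the data flow concrete, here is a rough sketch of a PLM-style similarity encoder: flatten both compressed feature maps, concatenate them, pass them through two fully connected layers, and reshape. The layer sizes, the ReLU and sigmoid activations, and the random (untrained) weights are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def plm_style_similarity(ref_feat, cur_feat, w1, b1, w2, b2, out_hw):
    """Encode two feature maps into a similarity map with a two-layer MLP."""
    # Vectorise both feature maps and concatenate them into one long vector.
    z = np.concatenate([ref_feat.ravel(), cur_feat.ravel()])
    hidden = np.maximum(w1 @ z + b1, 0.0)           # fully connected layer + ReLU
    logits = w2 @ hidden + b2                       # fully connected layer
    sim = 1.0 / (1.0 + np.exp(-logits))             # squash to (0, 1); an assumption
    return sim.reshape(out_hw)

# Toy example with random, untrained weights (hypothetical sizes).
rng = np.random.default_rng(0)
ref_feat = rng.standard_normal((4, 4, 2))           # compressed reference features
cur_feat = rng.standard_normal((4, 4, 2))           # compressed current-frame features
in_dim, hid, out_hw = 2 * 4 * 4 * 2, 16, (4, 4)
w1 = rng.standard_normal((hid, in_dim)) * 0.1
b1 = np.zeros(hid)
w2 = rng.standard_normal((out_hw[0] * out_hw[1], hid)) * 0.1
b2 = np.zeros(out_hw[0] * out_hw[1])
sim_map = plm_style_similarity(ref_feat, cur_feat, w1, b1, w2, b2, out_hw)
```

Nothing in this computation compares a specific current-frame pixel with a specific reference pixel; whatever matching happens is buried in the learned weights.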

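3. Summary

Intuitively, this pixel-level matching strategy is rather brute-force: the multi-scale features of the two images are vectorized, concatenated, passed through fully connected layers, and reshaped into the similarity map. That is simple to the point of being crude, and it offers little interpretability.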

VideoMatch: Matching based Video Object Segmentation⁵

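1. Contribution

Like PLM, VideoMatch uses a Siamese network to extract features from the reference frame and the current frame. In my opinion, however, the way it then computes similarity between features is more interpretable. More elegant than PLM.

2. Algorithm Overview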

[Figure: overview of the VideoMatch algorithm]

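2.1 Training procedure

- A Siamese network extracts features $\mathbf{x}_{1} \in \mathbb{R}^{h \times w \times c}$ and $\mathbf{x}_{t} \in \mathbb{R}^{h \times w \times c}$ from the reference image $I_1$ and the current image $I_t$, respectively.
- For the reference image, the foreground features $\mathbf{m}_F$ and the background features $\mathbf{m}_{B}$ are obtained by collecting the entries of $\mathbf{x}_1$ at foreground and background locations: $\mathbf{m}_{F}=\left\{\mathbf{x}_{1}^{i} : i \in g\left(y_{1}^{*}\right)\right\} \quad \text{and} \quad \mathbf{m}_{B}=\left\{\mathbf{x}_{1}^{i} : i \notin g\left(y_{1}^{*}\right)\right\}$, where $g\left(y_{1}^{*}\right)$ is the set of foreground pixel locations given by the first-frame annotation $y_{1}^{*}$.
- The most important step: compute the similarity maps with the Soft-Matching Layer, described below.
- The two similarity maps are then concatenated and normalized to give the segmentation result.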
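
The last step can be sketched as follows, assuming that "normalized" means a softmax over the two stacked similarity maps. The foreground and background similarity maps are taken as given inputs, and the function name `foreground_probability` is hypothetical.

```python
import numpy as np

def foreground_probability(s_fg, s_bg):
    """Softmax over the stacked (background, foreground) similarity maps.

    s_fg, s_bg: (h, w) similarity of each current-frame pixel to m_F and m_B
    returns:    (h, w) foreground probability
    """
    stacked = np.stack([s_bg, s_fg], axis=-1)                     # shape (h, w, 2)
    stacked = stacked - stacked.max(axis=-1, keepdims=True)       # numerical stability
    exp = np.exp(stacked)
    prob = exp / exp.sum(axis=-1, keepdims=True)
    return prob[..., 1]

# Toy example on a 2x2 frame.
s_fg = np.array([[0.9, 0.1], [0.5, 0.2]])
s_bg = np.array([[0.1, 0.8], [0.5, 0.7]])
p_fg = foreground_probability(s_fg, s_bg)
mask = p_fg > 0.5            # a pixel is foreground only if it matches m_F better than m_B
```

With two classes the softmax reduces to comparing the two similarities, so the mask is simply where the foreground similarity is strictly larger.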

2.2 Soft-Matching Layer (the highlight)

[Figure: the Soft-Matching Layer]

- Input: $\mathbf{m}$ (either $\mathbf{m}_{F}$ or $\mathbf{m}_{B}$) and $\mathbf{x}_t$.
- Output: $S_{t} \in \mathbb{R}^{h \times w}$, where $S_{t}^{i}$ measures how similar the feature of pixel $i$ in the current frame is to $\mathbf{m}$.
- First, compute the cosine similarity $f\left(\mathbf{x}_{t}^{i}, \mathbf{m}^{j}\right)=\frac{\mathbf{x}_{t}^{i} \cdot \mathbf{m}^{j}}{\left\|\mathbf{x}_{t}^{i}\right\|\left\|\mathbf{m}^{j}\right\|}$ between pixel $i$ of the current frame and pixel $j$ of $\mathbf{m}$. This gives the matrix $A \in [-1,1]^{hw \times |\mathbf{m}|}$ with entries $A_{i j}=f\left(\mathbf{x}_{t}^{i}, \mathbf{m}^{j}\right)$.
- For each row of $A$ (one row per pixel of the current frame), average the $K$ largest values: $S_{t}^{i}=\frac{1}{K} \sum_{j \in \operatorname{Top}\left(A_{i}, K\right)} A_{i j}$
- Why average? A pixel of the current frame usually corresponds not to a single pixel of the reference frame but to an image region. Matching against a single point would amplify noise and produce overly fragmented foreground/background segmentation.
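
The two steps of the layer (cosine similarity, then top-$K$ averaging) translate almost line for line into NumPy. This is a sketch rather than the authors' implementation: the default `k`, the small `eps` added to the norms, and the clamping of `k` to the size of $\mathbf{m}$ are illustrative choices.

```python
import numpy as np

def soft_match(x_t, m, k=20, eps=1e-8):
    """Soft-matching layer: mean of the top-k cosine similarities per pixel.

    x_t: (h, w, c) features of the current frame
    m:   (n, c)    reference features (foreground set m_F or background set m_B)
    returns S_t of shape (h, w)
    """
    h, w, c = x_t.shape
    x = x_t.reshape(-1, c)
    # L2-normalise so that a dot product equals the cosine similarity.
    x_n = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    m_n = m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)
    A = x_n @ m_n.T                                  # shape (h*w, n), entries in [-1, 1]
    k = min(k, A.shape[1])                           # cannot average more entries than exist
    top_k = np.sort(A, axis=1)[:, -k:]               # the k largest values of each row
    return top_k.mean(axis=1).reshape(h, w)

# Toy example: a 1x2 frame with 2-D features and three reference features.
x_t = np.array([[[1.0, 0.0], [0.0, 1.0]]])
m = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S = soft_match(x_t, m, k=2)
```

In the toy example the first pixel has two perfect matches in $\mathbf{m}$ and scores about 1.0, while the second has only one and scores about 0.5, which shows how averaging rewards agreement with a region rather than a single point.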

3. Summary

Compared with PLM above, I find VideoMatch's way of computing the similarity map highly interpretable; this similarity computation can be reused in follow-up work.

RANet: Ranking Attention Network for Fast Video Object Segmentation⁶

1. Contribution

- Combines the matching-based and propagation-based frameworks.
- Proposes a Ranking Attention Module for computing the similarity maps.

2. Motivation

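[Figure: the two semi-supervised VOS frameworks compared with RANet]
The schematic above is quite illustrative: it makes the difference between the two semi-supervised video object segmentation frameworks easy to see, and shows how this method differs from the traditional ones.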

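3. Algorithm Overview

[Figure: overview of the RANet algorithm]
Here I mainly want to summarize the differences from VideoMatch:

- VideoMatch computes similarity separately against the foreground features and the background features; RANet computes it against all reference features together and applies a mask afterwards.
- RANet uses all of the similarity maps, whereas VideoMatch only averages the $K$ largest values.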

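4. Summary

The pipeline of this work feels overly complex, and the intermediate operations, the Ranking Attention Module in particular, lack interpretability, as does the final Merge step. The motivation section of the paper is nonetheless well written and worth referring to.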

¹ PERAZZI, Federico, et al. Learning video object segmentation from static images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. p. 2663-2672.
² OH, Seoung Wug, et al. Fast video object segmentation by reference-guided mask propagation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. p. 7376-7385.
³ CHEN, Yuhua, et al. Blazingly fast video object segmentation with pixel-wise metric learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. p. 1189-1198.
⁴ YOON, Jae Shin, et al. Pixel-level matching for video object segmentation using convolutional neural networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2017. p. 2167-2176.
⁵ HU, Yuan-Ting; HUANG, Jia-Bin; SCHWING, Alexander G. VideoMatch: Matching based video object segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018. p. 54-70.
⁶ WANG, Ziqin, et al. RANet: Ranking attention network for fast video object segmentation. arXiv preprint arXiv:1908.06647, 2019.
