Paper Reading: SiamMask


贾小树

Article link: https://blog.csdn.net/j879159541/article/details/99122445

I. A Brief Understanding of the Paper

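1. SiamMask combines the tasks of two kinds of networks: an object tracking network and an object segmentation network. On VOT benchmarks SiamMask wins on accuracy; on VOS benchmarks it wins on speed. Some earlier video segmentation networks ran at roughly 1 fps or less, whereas this network reaches 55 fps. Impressive!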

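2. Most earlier VOT methods learn a classifier online and then, in later frames, update the template as needed and classify again. This is tracking-by-detection; KCF is a typical example. Siamese trackers instead learn the similarity between the first-frame template and the search region, i.e. a response map whose spatial elements are each the response of a candidate window (RoW). The template's feature map is treated as a convolution kernel and convolved with the feature map of the search region. Here a depth-wise convolution is used, which produces multi-channel RoWs that can encode richer information.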
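To make point 2 concrete, here is a minimal pure-Python sketch of depth-wise cross-correlation on toy data. The function names `depthwise_xcorr` and `row_at` and the tiny feature values are my own assumptions for illustration, not code from the paper; a real tracker would run this on CNN features, typically as a grouped convolution in a deep learning framework.

```python
def depthwise_xcorr(search, template):
    """Cross-correlate each channel of `search` with the same channel of `template`.

    Both inputs are nested lists shaped [channels][height][width].
    Returns a response map shaped [channels][H - h + 1][W - w + 1].
    """
    if len(search) != len(template):
        raise ValueError("search and template need the same number of channels")
    response = []
    for s_ch, t_ch in zip(search, template):
        H, W = len(s_ch), len(s_ch[0])
        h, w = len(t_ch), len(t_ch[0])
        # Slide the template channel over the search channel (no padding, stride 1).
        out = [
            [
                sum(t_ch[u][v] * s_ch[i + u][j + v]
                    for u in range(h) for v in range(w))
                for j in range(W - w + 1)
            ]
            for i in range(H - h + 1)
        ]
        response.append(out)
    return response


def row_at(response, i, j):
    """The RoW at spatial position (i, j): one value per channel."""
    return [channel[i][j] for channel in response]


# Toy features (hypothetical values): 2 channels, 3x3 search region, 2x2 template.
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
template = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
response = depthwise_xcorr(search, template)
print(row_at(response, 0, 0))  # [6, 2]
```

Because the channels are never summed together, every spatial position keeps one value per channel. That per-position vector is what the notes above call a multi-channel RoW.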

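3. Each RoW predicts only one mask. This differs from Mask R-CNN, which predicts k masks, where k is the number of classes. Note also that the box branch predicts k boxes per RoW, but there k is the number of preset anchor boxes with different sizes and aspect ratios.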
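The second meaning of k in point 3 (the number of preset anchors) can be sketched as follows. The helper `make_anchors` and the concrete base size, scale and ratio values are illustrative assumptions only, not the settings used in the paper.

```python
import math


def make_anchors(base_size, scales, ratios):
    """Return a (width, height) pair for every scale/ratio combination.

    `ratio` is height / width. Each anchor keeps the area (base_size * scale) ** 2,
    so only its shape changes with the ratio.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:
            width = math.sqrt(area / ratio)
            height = width * ratio
            anchors.append((width, height))
    return anchors


# Hypothetical settings: one scale and three aspect ratios give k = 3 anchors.
anchors = make_anchors(base_size=8, scales=[8], ratios=[0.5, 1.0, 2.0])
k = len(anchors)
print(k)  # 3
```

With such a set, the box branch regresses k boxes at every RoW (one per anchor), while the mask branch still outputs a single mask for that RoW.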

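4. How the box used for VOT evaluation is derived from the generated mask also affects the scores. Balancing accuracy and speed, the paper chooses MBR (the rotated minimum bounding rectangle).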
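As a rough illustration of point 4, the sketch below compares an axis-aligned min-max box with a rotated minimum-area rectangle for a toy mask. It is a pure-Python approximation written for these notes, not the paper's code: pixels are treated as points at their integer coordinates, and all function names are my own. In practice a library routine such as OpenCV's `cv2.minAreaRect` would normally be used.

```python
import math


def mask_points(mask):
    """(x, y) coordinates of all foreground pixels of a binary mask."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def min_max_box(points):
    """Axis-aligned box as (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_area_rect(points):
    """Smallest-area rotated rectangle, returned as (area, angle_in_radians).

    Relies on the fact that one side of the minimum rectangle lies along
    an edge of the convex hull, so only hull-edge directions are tried.
    """
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0, 0.0  # degenerate: a single point or a line segment
    best = None
    for (x0, y0), (x1, y1) in zip(hull, hull[1:] + hull[:1]):
        angle = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(angle), math.sin(angle)
        # Project the hull onto the edge direction and onto its normal.
        us = [x * c + y * s for x, y in hull]
        vs = [-x * s + y * c for x, y in hull]
        area = (max(us) - min(us)) * (max(vs) - min(vs))
        if best is None or area < best[0]:
            best = (area, angle)
    return best


# Toy diamond-shaped mask (hypothetical data).
mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
points = mask_points(mask)
x0, y0, x1, y1 = min_max_box(points)
min_max_area = (x1 - x0) * (y1 - y0)
mbr_area, mbr_angle = min_area_rect(points)
```

For this diamond the min-max box has area 16, while the rotated rectangle has area of about 8, which shows why a rotated box can fit an oriented object much more tightly.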

5. Training datasets: COCO [31], ImageNet-VID [47] and YouTube-VOS [58].

6. On VOT, it outperforms DaSiamRPN and KCF. Its decay is small, so it is better suited to long videos.

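II. Performance Comparison (Data Given in the Paper)

1. The network architecture. In practice it is not as simple as Figure 2 of the paper suggests: there are also a refine module and adjust layers, which are shown in detail in the appendix and were reproduced here as well:

[Images omitted: network architecture, including the refine module and adjust layers]

2. Comparison with state-of-the-art VOT methods

[Image omitted]

3. Comparison with state-of-the-art VOS methods

[Image omitted]

4. Ablation studies

[Image omitted]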

III. Original Sentences Worth Keeping

1. It finds use in a wide range of scenarios such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition.
(Here "it" refers to visual tracking.)

2. In this paper, we aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask, a simple multi-task learning approach that can be used to address both problems.

3. To achieve this goal, we simultaneously train a Siamese network on three tasks, each corresponding to a different strategy to establish correspondances between the target object and candidate regions in the new frames.

4. Performance of Correlation Filter-based trackers has then been notably improved with the adoption of multi-channel formulations [24, 20], spatial constraints [25, 13, 33, 29] and deep features (e.g. [12, 51]).

5. I do not fully understand this one yet and need to study it further:
In order to exploit consistency between video frames, several methods propagate the supervisory segmentation mask of the first frame to the temporally adjacent ones via graph labeling approaches (e.g. [55, 41, 50, 36, 1]). In particular, Bao et al. [1] recently proposed a very accurate method that makes use of a spatio-temporal MRF in which temporal dependencies are modelled by optical flow, while spatial dependencies are expressed by a CNN.

6. The loss function L_mask (Eq. 3) for the mask prediction task is a binary logistic regression loss over all RoWs:
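The excerpt stops just before the equation, so here is a small sketch of how I read such a loss: for every RoW, a per-pixel logistic loss log(1 + exp(-c * m)) is averaged over the w x h mask and weighted by (1 + y) / 2, so that only positive RoWs (y = +1) contribute. The function name, argument layout and toy numbers are my own assumptions; please check the exact form against Eq. 3 in the paper.

```python
import math


def mask_loss(rows):
    """Binary logistic loss summed over RoWs (a sketch, not numerically hardened).

    Each entry of `rows` is a tuple (y, target, logits):
      y      : +1 for a positive RoW, -1 for a negative one
      target : ground-truth mask with entries in {+1, -1}, shape [h][w]
      logits : predicted mask scores, same shape as `target`
    """
    total = 0.0
    for y, target, logits in rows:
        h, w = len(target), len(target[0])
        pixel_sum = sum(
            math.log1p(math.exp(-c * m))
            for c_row, m_row in zip(target, logits)
            for c, m in zip(c_row, m_row)
        )
        # (1 + y) / 2 is 1 for positive RoWs and 0 for negative ones.
        total += (1 + y) / (2 * w * h) * pixel_sum
    return total


# Toy example (hypothetical values): one positive and one negative RoW.
positive = (+1, [[1, -1]], [[0.0, 0.0]])
negative = (-1, [[1, -1]], [[5.0, 5.0]])
loss = mask_loss([positive, negative])
```

With all-zero logits the positive RoW contributes log(2) per pixel on average, and the negative RoW contributes nothing, so `loss` here equals log(2).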

7. In contrast to semantic segmentation methods in the style of FCN [32] and Mask RCNN [17], which maintain explicit spatial information throughout the network, our approach follows the spirit of [43, 44] and generates masks starting from a flattened representation of the object.

8. I do not understand this one either:
Similarly to most VOS methods, in case of multiple objects in the same video (DAVIS-2017) we simply perform multiple inferences.

9. Interestingly, the refinement approach of Pinheiro et al. [44] is very important for the contour accuracy F_M, but less so for the other metrics.
