Paper Reading: SiamMask

I. A brief understanding of the paper

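1. SiamMask combines the tasks of two kinds of network: an object tracking network and an object segmentation network. On VOT benchmarks SiamMask wins on accuracy; on VOS benchmarks it wins on speed. Earlier video segmentation networks mostly ran at under 1 fps, whereas this network reaches 55 fps. Impressive!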

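2. Most earlier VOT methods learn a classifier online and then, for later frames, update the template as needed and classify again. This is tracking-by-detection, with KCF as a typical example. Siamese trackers instead learn the similarity between the first-frame template and the search region, i.e. a response map. The template's feature map is treated as a convolution kernel and convolved (strictly speaking, cross-correlated) with the search region's feature map. SiamMask uses a depth-wise convolution here, which yields a multi-channel response map; each spatial element of that map is a RoW (response of a candidate window), and the extra channels let it encode richer information.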
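To make the depth-wise step concrete, here is a minimal NumPy sketch of channel-by-channel cross-correlation. It is only an illustration of the idea, not the authors' implementation: the function name `depthwise_xcorr` and the toy tensor sizes are my own assumptions, and a real network would do this with a grouped convolution on the GPU.

```python
import numpy as np

def depthwise_xcorr(search, template):
    """Channel-by-channel cross-correlation (illustrative sketch).

    search:   (C, Hs, Ws) feature map of the search region
    template: (C, Ht, Wt) feature map of the exemplar, used as the kernel
    returns:  (C, Hs-Ht+1, Ws-Wt+1) multi-channel response map
    """
    c, hs, ws = search.shape
    ct, ht, wt = template.shape
    assert c == ct, "search and template must have the same channel count"
    out = np.zeros((c, hs - ht + 1, ws - wt + 1), dtype=np.float64)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search[:, i:i + ht, j:j + wt]
            # Sum over the spatial dims only, so channels stay separate.
            out[:, i, j] = (window * template).sum(axis=(1, 2))
    return out

# Toy sizes chosen for readability (hypothetical, not the paper's sizes).
rng = np.random.default_rng(0)
search = rng.standard_normal((4, 8, 8))
template = rng.standard_normal((4, 3, 3))
response = depthwise_xcorr(search, template)
print(response.shape)  # (4, 6, 6)

# Each spatial position holds one C-dimensional vector: one RoW.
row = response[:, 2, 3]
```

A plain (non-depth-wise) cross-correlation would also sum over the channel axis and return a single-channel map, which is why the depth-wise variant keeps more information per candidate window.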

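3. Each RoW predicts only one mask. This differs from Mask R-CNN, which predicts k masks, where k is the number of classes. One more point: the box branch predicts k boxes per RoW, but that k is the number of preset anchor boxes with different sizes and aspect ratios.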
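As a small illustration of what "k preset boxes" means for the box branch, the sketch below enumerates anchor shapes from a list of scales and aspect ratios. The helper `anchor_shapes` and the particular numbers are hypothetical, chosen in the usual RPN style rather than taken from the paper.

```python
import math

def anchor_shapes(base_size, scales, ratios):
    """Return one (width, height) pair per scale/ratio combination.

    k = len(scales) * len(ratios) anchors share the same centre;
    ratio is defined here as height / width.
    """
    shapes = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:
            w = math.sqrt(area / r)
            h = w * r
            shapes.append((w, h))
    return shapes

# Illustrative values only (assumed, not from the paper).
shapes = anchor_shapes(base_size=8, scales=[8], ratios=[0.33, 0.5, 1, 2, 3])
k = len(shapes)
print(k)  # 5
```

Every anchor in this sketch has the same area and only the aspect ratio changes, so the box branch outputs k box regressions per RoW while the mask branch outputs a single mask.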

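4. How the box used for the VOT metrics is derived from the generated mask also affects the evaluation. Weighing accuracy against speed, the paper chooses the MBR (the rotated minimum bounding rectangle).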
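The difference between a plain min-max box and the MBR is easy to see on a toy mask. The sketch below is my own NumPy illustration, not the paper's code: it treats pixel centres as points, builds their convex hull, and tries each hull edge direction, since one side of the minimum-area rectangle lies along a hull edge. In practice a library routine such as OpenCV's `cv2.minAreaRect` would normally be used instead.

```python
import numpy as np

def axis_aligned_box(mask):
    """Min-max box (x_min, y_min, x_max, y_max) around the nonzero pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def min_area_rect(mask):
    """Rotated minimum bounding rectangle of the mask's pixel centres.

    Returns (area, corners), corners being a (4, 2) array of (x, y),
    or None if the mask has fewer than two distinct pixels.
    """
    ys, xs = np.nonzero(mask)
    hull = np.array(convex_hull(zip(xs.tolist(), ys.tolist())), dtype=float)
    best = None
    for i in range(len(hull)):
        edge = hull[(i + 1) % len(hull)] - hull[i]
        norm = np.linalg.norm(edge)
        if norm == 0:
            continue
        u = edge / norm                 # direction along the hull edge
        v = np.array([-u[1], u[0]])     # perpendicular direction
        pu, pv = hull @ u, hull @ v     # hull projected on both axes
        area = (pu.max() - pu.min()) * (pv.max() - pv.min())
        if best is None or area < best[0]:
            corners = np.array([
                pu.min() * u + pv.min() * v,
                pu.max() * u + pv.min() * v,
                pu.max() * u + pv.max() * v,
                pu.min() * u + pv.max() * v,
            ])
            best = (area, corners)
    return best

# Toy mask: a diamond centred at (10, 10) with "radius" 5.
yy, xx = np.mgrid[0:21, 0:21]
mask = (np.abs(xx - 10) + np.abs(yy - 10)) <= 5

box = axis_aligned_box(mask)
mbr_area, mbr_corners = min_area_rect(mask)
aa_area = (box[2] - box[0]) * (box[3] - box[1])
print(box, aa_area, round(mbr_area, 3))
```

For this diamond the axis-aligned box covers roughly twice the area of the rotated rectangle, which gives a feel for why the choice of box-from-mask strategy moves an overlap-based tracking score.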

5. Training datasets: COCO [31], ImageNet-VID [47] and YouTube-VOS [58].

6. On VOT it outperforms DaSiamRPN and KCF; its decay is small, which makes it better suited to long videos.

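II. Performance comparison, using the data given in the paper

1. Network architecture. The real network is not as simple as Figure 2 in the paper suggests: there are also a refine module and adjust layers, which are shown in detail in the appendix and reproduced here: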
[Image: network architecture diagrams]

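2. Comparison with SOTA work on VOT

[Image: comparison with SOTA VOT methods]

3. Comparison with SOTA work on VOS

[Image: comparison with SOTA VOS methods]

4. Ablation studies

[Image: ablation study results]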

III. Sentences from the paper that I found useful

1.
It finds use in a wide range of scenarios such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition.
("It" here refers to visual object tracking.)

2.
In this paper, we aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask, a simple multi-task learning approach that can be used to address both problems.

3.
To achieve this goal, we simultaneously train a Siamese network on three tasks, each corresponding to a different strategy to establish correspondences between the target object and candidate regions in the new frames.

4.
Performance of Correlation Filter-based trackers has then been notably improved with the adoption of multi-channel formulations [24, 20], spatial constraints [25, 13, 33, 29] and deep features (e.g. [12, 51]).

5. I don't fully understand this one yet and need to keep studying it.
In order to exploit consistency between video frames, several methods propagate the supervisory segmentation mask of the first frame to the temporally adjacent ones via graph labeling approaches (e.g. [55, 41, 50, 36, 1]). In particular, Bao et al. [1] recently proposed a very accurate method that makes use of a spatio-temporal MRF in which temporal dependencies are modelled by optical flow, while spatial dependencies are expressed by a CNN.

6.
The loss function L_mask (Eq. 3) for the mask prediction task is a binary logistic regression loss over all RoWs:
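To check my reading of this sentence, here is a small NumPy sketch of such a loss. As I read Eq. 3, the loss is L_mask = sum_n ((1 + y_n) / (2wh)) * sum_ij log(1 + exp(-c_n^ij * m_n^ij)), where y_n in {-1, +1} is the label of the n-th RoW, c_n is its w x h ground-truth mask with pixel labels in {-1, +1}, and m_n is the predicted mask. The function name and array layout below are my own assumptions, not the authors' code.

```python
import numpy as np

def mask_loss(pred_logits, gt_masks, labels):
    """Binary logistic loss over all RoWs (illustrative sketch).

    pred_logits: (N, h, w) real-valued mask predictions m_n
    gt_masks:    (N, h, w) ground-truth pixel labels c_n in {-1, +1}
    labels:      (N,)      RoW labels y_n in {-1, +1}
    """
    n, h, w = pred_logits.shape
    # log(1 + exp(-c * m)), computed stably as logaddexp(0, -c * m).
    per_pixel = np.logaddexp(0.0, -gt_masks * pred_logits)
    per_row = per_pixel.sum(axis=(1, 2))
    # (1 + y_n) / (2wh) is 1/(wh) for positive RoWs and 0 for negative ones,
    # so only positive RoWs contribute to the mask loss.
    weights = (1 + labels) / (2.0 * w * h)
    return float((weights * per_row).sum())

# Toy check: all-zero logits cost log(2) per pixel, averaged over the mask.
pred = np.zeros((2, 4, 4))
gt = np.ones((2, 4, 4))
labels = np.array([1, -1])
loss = mask_loss(pred, gt, labels)
print(loss)
```

With one positive and one negative RoW and all-zero logits, the result is log(2), since the negative RoW is weighted out entirely.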

7.
In contrast to semantic segmentation methods in the style of FCN [32] and Mask R-CNN [17], which maintain explicit spatial information throughout the network, our approach follows the spirit of [43, 44] and generates masks starting from a flattened representation of the object.

8. I don't understand this one either.
Similarly to most VOS methods, in case of multiple objects in the same video (DAVIS-2017) we simply perform multiple inferences.

9.
Interestingly, the refinement approach of Pinheiro et al. [44] is very important for the contour accuracy F_M, but less so for the other metrics.
