DiMP: Learning Discriminative Model Prediction for Tracking

xwmwanjy666

Category: Paper Reading. Tags: DIMP

Article link: https://blog.csdn.net/xwmwanjy666/article/details/98500578

Contents

ATOM, DiMP, SiamFC
Abstract
Introduction
Related Work
Method
Baseline Analysis

The authors include Goutam Bhat and Martin Danelljan from ETH Zurich, Switzerland. Martin Danelljan's homepage
DiMP paper link

ATOM, DiMP, SiamFC

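Start with the title: why is it called discriminative model prediction? My understanding is that the method brings in the background instead of attending only to the target. The input is the full image, which is then cropped (ATOM also takes the full image and crops around the bbox, to a region 5 times the bbox size, then resizes it to a fixed size). The most important piece is the discriminative loss. Compared with ATOM, DiMP also trains the classification branch offline, takes a training set rather than a single image as input, and defines a discriminative loss. That said, I think ATOM has background information too, since it also uses a region 5 times the bbox size as input. DiMP is faster than ATOM; perhaps learning more offline helps both accuracy and speed? There are a lot of loss formulas. Also, ATOM combines offline training with online training; DiMP likewise trains the classifier and the IoU-Net offline and then updates the classification layer online.
SiamFC's template branch takes a single image as input, whereas here multiple data samples are used, which removes some of SiamFC's limitations.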

Abstract

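End-to-end training is important.
The Siamese paradigm possesses limited discriminative power due to its inability to integrate background information. For example, SiamFC first crops around the target center according to $s(w+2p) \times s(h+2p) = A$: the box is padded and the crop is then resized to a 127x127 image, so it contains very little background.
So the paper develops an end-to-end tracking architecture capable of fully exploiting both target and background appearance information for target model prediction. (DiMP takes the original image as input, crops a region 5 times the target size, extracts features for the whole region, and then uses PrPooling to extract the target features.)
Most importantly, the paper designs a discriminative learning loss and an optimization process.
40 FPS, VOT2018 EAO 0.440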
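To make the crop-size comparison concrete, here is a minimal sketch of the two crop rules. It assumes the SiamFC context margin $p = (w+h)/4$ (taken from the SiamFC paper, not from this post) and assumes that "5 times the target size" means a square search region whose side is 5 times $\sqrt{wh}$; the function names and the 60x40 box are made up for illustration.

```python
import math

def siamfc_exemplar_crop(w, h, out_size=127):
    """Solve s(w + 2p) * s(h + 2p) = out_size^2 for the SiamFC exemplar.

    Returns the context margin p, the resize factor s and the side length
    of the square crop in original-image pixels.
    Assumption: p = (w + h) / 4, as in the SiamFC paper.
    """
    p = (w + h) / 4.0
    side = math.sqrt((w + 2 * p) * (h + 2 * p))
    s = out_size / side
    return p, s, side

def dimp_search_side(w, h, scale=5.0):
    """Side of a square search region `scale` times the target size.

    Assumption: target size is measured as sqrt(w * h).
    """
    return scale * math.sqrt(w * h)

# Hypothetical 60x40 target box
w, h = 60.0, 40.0
p, s, siamfc_side = siamfc_exemplar_crop(w, h)
dimp_side = dimp_search_side(w, h)

# Fraction of each crop that is covered by the target box itself
siamfc_target_fraction = (w * h) / siamfc_side ** 2
dimp_target_fraction = (w * h) / dimp_side ** 2
print(f"SiamFC exemplar: target covers {siamfc_target_fraction:.2%} of the crop")
print(f"5x search region: target covers {dimp_target_fraction:.2%} of the crop")
```

Under these assumptions the target fills about 24% of the SiamFC exemplar but only 4% of the 5x region, so the larger crop carries far more background for the model to learn from.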

Introduction

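Most current approaches address the tracking problem by constructing a target model that distinguishes the target from the background, but target-specific information only becomes available at test time and cannot be learned offline.
The Siamese learning framework suffers from severe limitations:
1. It only uses the target appearance when inferring the model and ignores background information.
2. A learned similarity measure is not very reliable for objects that are not included in the offline training set, so it generalizes poorly.
3. It provides no model update strategy.
This paper addresses the limitations above.
The authors take inspiration from discriminative learning procedures such as MDNet and ATOM, and build the tracker around a target model prediction network.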

Related Work

To be read closely when time permits.

Method

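[Figure: the overall DiMP framework]
The figure above shows the whole framework. Note that the training set is not a single image but several. The inputs are the cropped images (5 times the target size, resized to a fixed size) together with their bboxes. The backbone extracts features, which then pass through one more conv block (it extracts classification-specific features; the experiments later show that this improves results). The resulting features are fed into the Model Predictor D. D first contains the model initializer (an initializer network that efficiently provides an initial estimate of the model weights, using only the target appearance), which consists of a convolutional layer followed by PrRoI pooling; the pooled features give the initial $f_0$.
D then contains the model update (taking both target and background samples into account), which produces the final f. f is applied as the filter on the test features to obtain the final score map. This is how the target model is built. Bbox estimation is the same as in ATOM; see ATOM for details.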
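The goal is to predict a target model: $f = D(S_{train})$
Here $S_{train} = \{(x_j, c_j)\}_{j=1}^{n}$ is the training set of feature maps $x_j$ and target center coordinates $c_j$, i.e. the two arrows going into the Model Initializer.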
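As a small illustration of how a predicted filter f is used, the sketch below slides a filter over a feature map to produce a score map, $s = x * f$. It is plain NumPy with hypothetical sizes (4 channels, a 3x3 filter, a 10x12 feature map); an actual tracker would run this as a convolution layer in a deep learning framework.

```python
import numpy as np

def score_map(x, f):
    """Valid cross-correlation of a feature map x (C, H, W) with a filter f (C, k, k)."""
    _, H, W = x.shape
    _, k, _ = f.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Inner product between the filter and the window at (i, j)
            out[i, j] = np.sum(x[:, i:i + k, j:j + k] * f)
    return out

# Toy check: plant the filter pattern in an otherwise empty feature map
rng = np.random.default_rng(0)
f = rng.standard_normal((4, 3, 3))   # hypothetical target model
x = np.zeros((4, 10, 12))            # hypothetical test features
x[:, 3:6, 5:8] = f
scores = score_map(x, f)
peak = np.unravel_index(np.argmax(scores), scores.shape)
print("score map shape:", scores.shape, "peak at:", peak)
```

The peak of the score map lands where the features match the filter, which is how the tracker reads off the target location.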
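What about the loss used for the model update? It is the discriminative learning loss, a least-squares objective:
$L(f) = \frac{1}{|S_{train}|} \sum_{(x,c) \in S_{train}} \| r(x * f, c) \|^2 + \| \lambda f \|^2$
$r(s, c)$ computes the residual at every spatial location, where $s = x * f$ is the score map and $c$ is the ground-truth target center coordinate.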
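Following the philosophy of Support Vector Machines, the paper employs a hinge-like loss in $r$:
$r(s, c) = v_c \cdot (m_c s + (1 - m_c) \max(0, s) - y_c)$
Here $v_c$ is the spatial weight function (larger at the target center, smaller in the ambiguous transition region), $m_c$ is the target region mask, and $y_c$ is the regression target.
Typically $m_c \approx 1$ at the target and $m_c \approx 0$ in the background region. But how should it be defined in the transition region between target and background? Such quantities are usually set by hand from experience and trial and error, whereas here they are learned from data.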
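A toy version of the hinge-like residual may help show what the mask does. The sketch assumes the residual has the form $r = v_c \cdot (m_c s + (1 - m_c)\max(0, s) - y_c)$ from the paper; the 1-D score map and the hand-set values of the target scores, mask and weights are invented for illustration, whereas in DiMP these functions are learned.

```python
import numpy as np

def hinge_residual(s, y, m, v):
    """r = v * (m*s + (1-m)*max(0, s) - y), evaluated elementwise."""
    return v * (m * s + (1.0 - m) * np.maximum(0.0, s) - y)

# Toy 1-D score map with the target at index 2 (hand-set, not learned)
s = np.array([-0.5, 0.3, 0.9, 0.2, -1.0])   # predicted scores
m = np.array([0.0, 0.0, 1.0, 0.0, 0.0])     # target mask
y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])     # desired scores
v = np.ones(5)                              # spatial weights
r = hinge_residual(s, y, m, v)
print(r)
```

At background positions (m = 0) a negative score gives zero residual, so the model is not forced to regress easy negatives exactly to zero; at the target (m = 1) the residual is the plain regression error.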

How is the loss optimized?
Update filter:
$f^{(i+1)} = f^{(i)} - \alpha \nabla L(f^{(i)})$
The straightforward option is to employ gradient descent with a fixed step length $\alpha$, but this is slow and needs many iterations. So the core idea is to compute the step length $\alpha$ with the steepest descent methodology, which is a common optimization technique.
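First, the loss is approximated by a quadratic function at the current estimate $f^{(i)}$:
$L(f) \approx \tilde{L}(f) = \frac{1}{2} (f - f^{(i)})^T Q^{(i)} (f - f^{(i)}) + (f - f^{(i)})^T \nabla L(f^{(i)}) + L(f^{(i)})$
where $Q^{(i)} = (J^{(i)})^T J^{(i)}$ and $J^{(i)}$ is the Jacobian of the residuals at $f^{(i)}$.
Minimizing this approximation along the gradient direction gives the step length $\alpha = \frac{\nabla L(f^{(i)})^T \nabla L(f^{(i)})}{\nabla L(f^{(i)})^T Q^{(i)} \nabla L(f^{(i)})}$.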

This is summarized in the following algorithm:
[Image: algorithm listing]
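The steepest descent idea can be tried on a small problem. The sketch below is not the DiMP predictor: it swaps the hinge-like residual for a plain linear one, minimizing $\frac{1}{2}\|Af - b\|^2 + \frac{1}{2}\|\lambda f\|^2$, so that $J$ and $Q = J^T J$ are easy to write down. The matrix sizes, the seed and the value of `lam` are arbitrary choices.

```python
import numpy as np

def steepest_descent(A, b, lam, f0, n_iter):
    """Minimise 0.5*||A f - b||^2 + 0.5*||lam * f||^2.

    The stacked residual is r(f) = [A f - b; lam * f], so J = [A; lam * I]
    and the step length is alpha = g^T g / (g^T Q g) with Q = J^T J.
    """
    f = f0.copy()
    losses = []
    for _ in range(n_iter):
        res = A @ f - b
        losses.append(0.5 * res @ res + 0.5 * lam ** 2 * (f @ f))
        g = A.T @ res + lam ** 2 * f               # gradient J^T r
        Jg = np.concatenate([A @ g, lam * g])      # J g, so g^T Q g = ||J g||^2
        denom = Jg @ Jg
        if denom == 0.0:                           # zero gradient: already optimal
            break
        alpha = (g @ g) / denom                    # optimal step along -g
        f = f - alpha * g
    return f, losses

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
f_hat, losses = steepest_descent(A, b, lam, np.zeros(5), n_iter=200)

# Closed-form ridge solution for comparison
f_star = np.linalg.solve(A.T @ A + lam ** 2 * np.eye(5), A.T @ b)
print("max abs error vs closed form:", np.max(np.abs(f_hat - f_star)))
```

Because the step length is recomputed from the local curvature at every iteration, no learning rate needs tuning, which is what lets the predictor reach a useful model in only a few recursions.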
What about the loss for training the classification branch?

Here, the regression label $z_c$ is set to a Gaussian function centered at the target center $c$.

$s$: predicted confidence score
$z$: label

Adding the mean-squared-error loss of the bbox estimation branch gives the total loss:
$L_{tot} = \beta L_{cls} + L_{bb}$
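Details: the backbone networks use the ResNet architecture and are initialized with ImageNet weights. Training uses the TrackingNet, LaSOT, GOT10k and COCO datasets and runs for 50 epochs, sampling 20,000 videos per epoch.
We set the base scale to 5 times the target size to incorporate significant background information.
Online tracking:
Given the annotated first frame, data augmentation is used to obtain 15 samples. Their features are extracted and fed into the Model Predictor D to obtain f, which then produces the predicted scores. During tracking, f is refined with 2 optimizer recursions every 20 frames, or with a single recursion whenever a distractor peak is detected.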
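The update schedule described above can be written as a small helper. This is only a sketch of the rule as stated in these notes: the function name is hypothetical, how a distractor peak is detected is left out, and giving the periodic update precedence when both conditions hold on the same frame is my assumption.

```python
def recursions_for_frame(frame_idx, distractor_detected, interval=20):
    """Number of optimizer recursions to run on a given tracking frame.

    Frame 0 is the annotated first frame, where the model is initialised
    rather than refined, so no refinement is scheduled for it here.
    """
    if frame_idx > 0 and frame_idx % interval == 0:
        return 2   # periodic refinement every `interval` frames
    if frame_idx > 0 and distractor_detected:
        return 1   # single recursion when a distractor peak shows up
    return 0       # otherwise keep the current filter

# Hypothetical run: a distractor appears on frames 7 and 33
distractor_frames = {7, 33}
schedule = {
    t: recursions_for_frame(t, t in distractor_frames)
    for t in range(41)
}
updated = {t: n for t, n in schedule.items() if n > 0}
print(updated)
```

Most frames trigger no optimization at all, which is part of why the tracker can stay fast while still adapting the filter online.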

Baseline Analysis

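The three datasets OTB-100, NFS and UAV123 are combined for testing.
Each added module is then analyzed experimentally. Finally, here are the results measured on VOT2018:
[Image: VOT2018 results]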
