Video Tracking -- First Impressions

CREST: Convolutional Residual Learning for Visual Tracking


Research Background

Given a bounding box annotation in the first frame, developing a tracker that is robust to a wide variety of challenges is a task of great interest. The Discriminative Correlation Filter (DCF) approach has two appealing properties. First, it is well suited to fast tracking, because spatial correlation can be computed as an element-wise product in the Fourier domain. Second, in contrast to tracking-by-detection methods, DCFs regress circularly shifted versions of the input features to soft labels ranging from zero to one, so they generate dense response scores over the entire search region rather than sparse ones.
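
The speed claim follows from the correlation theorem: cross-correlation over all spatial shifts reduces to an element-wise product in the Fourier domain. Below is a minimal NumPy sketch of this idea; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def correlate_fft(search_region, template):
    """Cross-correlate a template with a search region via the FFT.

    Equivalent to sliding the template over the search region
    (with circular boundary conditions), but computed in
    O(N log N) instead of O(N^2).
    """
    F = np.fft.fft2(search_region)
    H = np.fft.fft2(template, s=search_region.shape)  # zero-pad template
    # Correlation theorem: correlation <-> F * conj(H) in frequency
    response = np.real(np.fft.ifft2(F * np.conj(H)))
    return response  # dense response map over all shifts

# Toy example: the peak of the response recovers the target shift.
patch = np.random.rand(64, 64)
template = patch[10:30, 20:40]
resp = correlate_fft(patch, template)
dy, dx = np.unravel_index(np.argmax(resp), resp.shape)  # ~ (10, 20)
```

The peak of the dense response map directly gives the most likely target location, which is exactly the property DCF trackers exploit.
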

Motivation & Proposed approach
  1. DCF-based tracking algorithms are usually limited in two respects. First, learning the DCF and extracting features are two independent processes, which prevents end-to-end training. Second, the tracking filters are usually updated by a moving average with an empirical weight, which makes it hard to balance model adaptivity against stability.
  2. This work interprets DCFs as counterparts of convolution filters and reformulates them as a one-layer convolutional neural network. The response map can then be generated directly as the spatial correlation between two consecutive frames.
  3. As for the model update, this work adopts residual learning to capture the difference between the output of the convolutional layer and the ground-truth soft label.
  4. Scale estimation
    After obtaining the target center location, this work extracts search patches at different scales, computes a response map for each, and then selects the one with the maximum response score as the target object. A soft update is used to estimate the target size (see the sketch after this list):
    $(w_{t},h_{t})=\beta (w_{t}^{*},h_{t}^{*})+(1-\beta )(w_{t-1}^{*},h_{t-1}^{*})$
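
As referenced above, here is a minimal sketch of the multi-scale search and soft size update. The hooks `extract_patch` and `compute_response` are hypothetical placeholders for the tracker's patch cropping and response computation, and the scale set and `beta` are illustrative values, not taken from the paper.

```python
import numpy as np

def estimate_scale(extract_patch, compute_response, w_prev, h_prev,
                   scales=(0.95, 1.0, 1.05), beta=0.6):
    """Multi-scale search followed by a soft size update.

    extract_patch(w, h)     -> search patch at that size (hypothetical hook)
    compute_response(patch) -> response map for the patch (hypothetical hook)
    """
    best_score, best_w, best_h = -np.inf, w_prev, h_prev
    for s in scales:
        patch = extract_patch(w_prev * s, h_prev * s)
        score = compute_response(patch).max()
        if score > best_score:  # keep the scale with the highest peak
            best_score, best_w, best_h = score, w_prev * s, h_prev * s
    # Soft update: (w_t, h_t) = beta*(w*, h*) + (1 - beta)*(w_{t-1}, h_{t-1})
    return (beta * best_w + (1 - beta) * w_prev,
            beta * best_h + (1 - beta) * h_prev)
```
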
Strengths & Insights
  1. The correlation-filter formulation is fast.
  2. Open question: how exactly are the spatial residual and temporal residual implemented, and why do both of them work?
Potential weaknesses
  1. Judging from a test on an Ode to Joy (欢乐颂) video, the tracking quality is mediocre.
  2. The features are taken directly from a VGG16 model pre-trained on ImageNet: conv4-3 features are extracted and reduced to 64 channels via PCA (see the sketch below). Features obtained directly from ImageNet may not suit the tracking task.
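
As a rough illustration of that feature pipeline, PCA over channels might look like the sketch below. The shapes, names, and plain-NumPy PCA are assumptions for illustration, not the authors' code.

```python
import numpy as np

def pca_reduce_channels(features, n_out=64):
    """Reduce an (H, W, C) feature map to n_out channels with PCA.

    Treats each spatial location as a C-dimensional sample and
    projects onto the top n_out principal directions over channels.
    """
    H, W, C = features.shape
    X = features.reshape(-1, C)              # (H*W, C) samples
    X = X - X.mean(axis=0, keepdims=True)    # center each channel
    cov = X.T @ X / X.shape[0]               # (C, C) channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, -n_out:]                # top n_out components
    return (X @ top).reshape(H, W, n_out)

# e.g. a conv4-3 output for a 46x46 search patch: 512 -> 64 channels
feat = np.random.rand(46, 46, 512)
reduced = pca_reduce_channels(feat)  # (46, 46, 64)
```
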

Visual Object Tracking using Adaptive Correlation Filters

This summary mainly follows StayFoolishAndHappy's blog post.

Motivation & Proposed approach

Note: this method only tracks the center point of the target in the image.

  1. Learning the filter
    For an input signal $f$ and a filter $h$, the output signal $g$ is
    $g = f * h$
    In the frequency domain this can be written as
    $G = F\cdot H^{*}$, where $H^{*}$ is the complex conjugate of $H$
    The filter we wish to obtain is therefore
    $H^{*}=\frac{G}{F}$
    Given a set of training samples, $H^{*}$ is found by minimizing the squared-error loss
    $\min\limits_{H^{*}}\sum_{i}\left | F_{i}\cdot H^{*}-G_{i} \right |^{2}$
    which yields a closed-form solution for $H$:
    $H=\frac{\sum_{i}F_{i}\cdot G_{i}^{*}}{\sum_{i}F_{i}\cdot F_{i}^{*}}$
    That is, the filter is fully determined by the training samples.
  2. Updating the filter
    To retain the old memory while incorporating the newly learned model, the filters obtained from consecutive training rounds are blended; the update can be written simply as (a combined sketch follows this list):
    $A = (1-\eta)A + \eta A^{'}$
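
Combining the closed-form solution with the running-average update gives the minimal MOSSE-style sketch referenced above. The regularizing epsilon and the learning rate `eta` are assumed values, and the original MOSSE paper maintains the numerator and denominator of $H$ as separate running averages, which this simplified version glosses over.

```python
import numpy as np

def train_filter(samples, targets):
    """Closed-form filter from training pairs, per the formula above:
    H = (sum_i F_i * conj(G_i)) / (sum_i F_i * conj(F_i)), element-wise.
    """
    num = np.zeros(samples[0].shape, dtype=complex)
    den = np.zeros(samples[0].shape, dtype=complex)
    for f, g in zip(samples, targets):
        F, G = np.fft.fft2(f), np.fft.fft2(g)
        num += F * np.conj(G)
        den += F * np.conj(F)
    return num / (den + 1e-5)  # small epsilon avoids division by zero

def update_filter(A, A_new, eta=0.125):
    """Running-average update A = (1 - eta) * A + eta * A'."""
    return (1 - eta) * A + eta * A_new

# Toy usage: one training pair (image patch f, Gaussian soft label g)
f = np.random.rand(32, 32)
yy, xx = np.mgrid[0:32, 0:32]
g = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2 * 2.0 ** 2))
H = train_filter([f], [g])
```
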
Strengths & Insights
  1. Fast.
  2. Has a closed-form solution, so it is mathematically easy to interpret.
Potential weaknesses
  1. Strong dependence on the appearance annotated in the first frame; the update assumes the current tracking result is reliable, which limits the method's robustness.
  2. Compared with deep-learning methods, it lacks the advantages of a model trained on large-scale data (ImageNet).