CREST: Convolutional Residual Learning for Visual Tracking
Research Background
Given a bounding-box annotation in the first frame, developing a tracker that is robust to a variety of challenges is a task of broad interest. The Discriminative Correlation Filter (DCF) approach has two appealing properties. First, it is well suited to fast tracking because spatial correlation can be computed as an element-wise product in the Fourier domain. Second, in contrast to tracking-by-detection methods, DCFs regress circularly shifted versions of the input features to soft labels ranging from zero to one, so they produce dense response scores over the whole search region rather than sparse ones.
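As a concrete illustration of the Fourier-domain property (a minimal NumPy sketch, not code from the paper), circular cross-correlation over all shifts reduces to a single element-wise product of spectra:

```python
import numpy as np

def fft_correlate(patch, filt):
    """Circular cross-correlation via an element-wise product in the
    Fourier domain; conj() turns convolution into correlation."""
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * np.conj(np.fft.fft2(filt))))

rng = np.random.default_rng(0)
patch = rng.standard_normal((8, 8))
filt = rng.standard_normal((8, 8))
resp = fft_correlate(patch, filt)  # dense response over all 8x8 shifts
# At zero shift the score equals the direct inner product of patch and filter.
assert np.allclose(resp[0, 0], np.sum(patch * filt))
```

One FFT pair thus evaluates the correlation score at every shift simultaneously, which is where the speed of DCF trackers comes from.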
Motivation & Proposed approach
- DCF-based tracking algorithms are usually limited in two respects. First, learning the DCF and extracting features are two independent processes, which prevents end-to-end training. Second, the tracking filters are usually updated by a moving average with an empirically chosen weight, which makes it hard to balance model adaptivity against stability.
- This work interprets DCFs as counterparts of convolution filters and reformulates the DCF as a one-layer convolutional neural network. The response map can then be generated directly as the spatial correlation between two consecutive frames.
- For model updating, this work adopts residual learning to capture the difference between the output of the convolutional layer and the ground-truth soft label.
- Scale estimation
After obtaining the target center location, this work extracts search patches at multiple scales, computes a response map for each, and selects the one with the maximum response score as the target. A soft update then estimates the target size:
$$(w_{t},h_{t}) = \beta\,(w_{t}^{*},h_{t}^{*}) + (1-\beta)\,(w_{t-1}^{*},h_{t-1}^{*})$$
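The scale step above can be sketched as follows (β and the candidate sizes are illustrative values, not the paper's settings):

```python
def update_size(prev_wh, candidate_whs, scores, beta=0.6):
    """Pick the candidate (w*, h*) with the highest response score, then
    soft-update: (w_t, h_t) = beta*(w*, h*) + (1 - beta)*(w*_{t-1}, h*_{t-1})."""
    w_star, h_star = candidate_whs[scores.index(max(scores))]
    w_prev, h_prev = prev_wh
    return (beta * w_star + (1 - beta) * w_prev,
            beta * h_star + (1 - beta) * h_prev)

# Three scale candidates; the middle one scores highest, so the new size is a
# blend of (110, 55) and the previous (100, 50): (106.0, 53.0).
size = update_size((100.0, 50.0),
                   [(90.0, 45.0), (110.0, 55.0), (121.0, 60.5)],
                   scores=[0.4, 0.9, 0.5])
```

The soft update damps frame-to-frame jitter in the size estimate instead of jumping straight to the best-scoring scale.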
Strengths & Insights
- The correlation-filter formulation is fast.
- Open question: how exactly are the spatial residual and the temporal residual implemented, and why do both of them work?
Potential Weaknesses
- Judging from a test on the Ode to Joy (欢乐颂) video, the tracking quality is mediocre.
- The features are taken directly from a VGG16 model pre-trained on ImageNet: conv4-3 activations, reduced to 64 channels by PCA. Features learned for ImageNet classification may not suit the tracking task.
Visual Object Tracking using Adaptive Correlation Filters
This write-up mainly follows StayFoolishAndHappy's blog.
Motivation & Proposed approach
Note: this method only tracks the target's center point in the image.
- Learning the filter
For an input signal $f$ and a filter $h$, correlating the two gives the output signal $g$:

$$g = f \star h$$

In the frequency domain, correlation becomes an element-wise product with the conjugate:

$$G = F \cdot H^{*}$$

where $H^{*}$ is the complex conjugate of $H$. The desired filter is therefore

$$H^{*} = \frac{G}{F}$$

Given a set of training samples, $H^{*}$ is obtained by minimizing the squared error

$$\min_{H^{*}} \sum_{i} \left| F_{i} \cdot H^{*} - G_{i} \right|^{2}$$

which has the closed-form solution

$$H = \frac{\sum_{i} F_{i} \cdot G_{i}^{*}}{\sum_{i} F_{i} \cdot F_{i}^{*}}$$
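The closed-form solution can be sketched in NumPy (an assumed minimal setup, not the paper's code: a Gaussian soft label and a small ε added to the denominator for numerical stability):

```python
import numpy as np

def train_filter(patches, labels, eps=1e-5):
    """H = (sum_i F_i * conj(G_i)) / (sum_i F_i * conj(F_i)) in the
    Fourier domain; eps guards against division by zero."""
    num, den = 0.0, 0.0
    for f, g in zip(patches, labels):
        F, G = np.fft.fft2(f), np.fft.fft2(g)
        num = num + F * np.conj(G)
        den = den + F * np.conj(F)
    return num / (den + eps)

def respond(patch, H):
    """Response map G = F . H*, transformed back to the spatial domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * np.conj(H)))

# Gaussian soft label peaked at the patch center.
yy, xx = np.mgrid[0:32, 0:32]
label = np.exp(-((xx - 16) ** 2 + (yy - 16) ** 2) / (2 * 2.0 ** 2))
patch = np.random.default_rng(1).standard_normal((32, 32))
H = train_filter([patch], [label])
resp = respond(patch, H)
# On the training patch itself, the response peaks where the label peaks.
assert np.argmax(resp) == np.argmax(label)
```

With a single training sample the response on that sample reproduces the soft label almost exactly, which is a quick sanity check on the closed form.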
That is, the filter is fully determined by the training samples.
- Updating the filter
To retain the old memory while absorbing the newly learned model, the filter from the previous frame is blended with the one just trained; the update can be written simply as

$$A = (1-\eta)\,A + \eta\,A'$$
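The running-average update is a one-liner; η here is just an illustrative learning rate, and in practice the numerator and denominator terms of $H$ are each updated this way:

```python
def soft_update(old, new, eta=0.125):
    # A = (1 - eta) * A + eta * A': keeps old memory, absorbs the new model
    return (1.0 - eta) * old + eta * new

A = soft_update(10.0, 2.0)  # 0.875 * 10 + 0.125 * 2 = 9.0
```

A small η makes the filter adapt slowly and resist occlusions; a large η adapts quickly but risks drifting onto distractors.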
Strengths & Insights
- Very fast.
- Has a closed-form solution, so it is easy to interpret mathematically.
Potential Weaknesses
- Relies heavily on the appearance annotated in the first frame, and the update assumes the current tracking result is reliable, so robustness is limited.
- Compared with deep-learning methods, it lacks the advantages of models trained on large datasets such as ImageNet.