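The authors try to add the correlation filter (CF) to a deep learning framework, treating the CF as a special layer of the CNN, in the hope of obtaining fast, high-accuracy tracking. The efficient online learning of the CF is combined with the discriminative power of features that the CNN learns offline, and the two are trained end-to-end.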


The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame.


Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task.


This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter.


Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.



Deep neural networks are a powerful tool for learning image representations in computer vision applications. However, training deep networks online, in order to capture previously unseen object classes from one or few examples, is challenging. This problem emerges naturally in applications such as visual object tracking, where the goal is to re-detect an object over a video with the sole supervision of a bounding box at the beginning of the sequence. The main challenge is the lack of a-priori knowledge of the target object, which can be of any class.


The simplest approach is to disregard the lack of a-priori knowledge and adapt a pre-trained deep convolutional neural network (CNN) to the target, for example by using stochastic gradient descent (SGD), the workhorse of deep network optimization [31, 25, 35]. The extremely limited training data and large number of parameters make this a difficult learning problem. Furthermore, SGD is quite expensive for online adaptation [31, 25].


A possible answer to these shortcomings is to have no online adaptation of the network. Recent works have focused on learning deep embeddings that can be used as universal object descriptors [3, 12, 28, 17, 5]. These methods use a Siamese CNN, trained offline to discriminate whether two image patches contain the same object or not. The idea is that a powerful embedding will allow the detection (and thus tracking) of objects via similarity, bypassing the online learning problem. However, using a fixed metric to compare appearance prevents the learning algorithm from exploiting any video-specific cues that could be helpful for discrimination.


An alternative strategy is to use instead an online learning method such as the Correlation Filter (CF). The CF is an efficient algorithm that learns to discriminate an image patch from the surrounding patches by solving a large ridge regression problem extremely efficiently [4, 13]. It has proved to be highly successful in object tracking (e.g. [6, 18, 22, 2]), where its efficiency enables a tracker to adapt its internal model of the object on the fly at every frame. It owes its speed to a Fourier domain formulation, which allows the ridge regression problem to be solved with only a few applications of the Fast Fourier Transform (FFT) and cheap element-wise operations. Such a solution is, by design, much more efficient than an iterative solver like SGD, and still allows the discriminator to be tailored to a specific video, contrary to the embedding methods.
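The Fourier-domain solution described above can be sketched for a single-channel 2-D patch. This is a minimal illustration, not the paper's implementation: the function names, the regularisation value `lam`, the Gaussian label and the convention r[u] = Σ_t w[t]·z[t+u] for circular cross-correlation are all assumptions made for the example.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Desired response: a Gaussian peak at index (0, 0), wrapping around the edges."""
    h, w = shape
    dy = np.minimum(np.arange(h), h - np.arange(h))
    dx = np.minimum(np.arange(w), w - np.arange(w))
    return np.exp(-(dy[:, None] ** 2 + dx[None, :] ** 2) / (2.0 * sigma ** 2))

def train_cf(x, y, lam=1e-2):
    """Ridge regression over all circular shifts of x, solved per frequency."""
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    # Element-wise closed form: no iterative solver is needed.
    w_hat = x_hat * np.conj(y_hat) / (np.abs(x_hat) ** 2 + lam)
    return np.real(np.fft.ifft2(w_hat))

def respond(w, z):
    """Circular cross-correlation r[u] = sum_t w[t] * z[t + u], via the FFT."""
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(w)) * np.fft.fft2(z)))
```

Training on a patch `x` and evaluating `respond(w, np.roll(x, (3, 5), axis=(0, 1)))` should place the peak of the response at the applied shift, which is how the filter localises a translated target.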


The challenge, then, is to combine the online learning efficiency of the CF with the discriminative power of CNN features trained offline. This has been done in several works (e.g. [21, 7, 9, 31]), which have shown that CNNs and CFs are complementary and their combination results in improved performance.


However, in the aforementioned works, the CF is simply applied on top of pre-trained CNN features, without any deep integration of the two methods. End-to-end training of deep architectures is generally preferable to training individual components separately. The reason is that in this manner the free parameters in all components can co-adapt and cooperate to achieve a single objective. Thus it is natural to ask whether a CNN-CF combination can also be trained end-to-end with similar benefits.


The key step in achieving such integration is to interpret the CF as a differentiable CNN layer, so that errors can be propagated through the CF back to the CNN features. This is challenging, because the CF itself is the solution of a learning problem. Hence, this requires differentiating the solution of a large linear system of equations. This paper provides a closed-form expression for the derivative of the Correlation Filter. Moreover, we demonstrate the practical utility of our approach in training CNN architectures end-to-end.


We present an extensive investigation into the effect of incorporating the CF into the fully-convolutional Siamese framework of Bertinetto et al. [3]. We find that the CF does not improve results for networks that are sufficiently deep. However, our method enables ultra-lightweight networks of a few thousand parameters to achieve state-of-the-art performance on multiple benchmarks while running at high framerates.


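Overall architecture of CFNet. It is an asymmetric Siamese network, and the asymmetry comes from the correlation filter block. The authors' aim is to bring the CF into the neural network so that the whole system can be optimized end-to-end. The training image and the test image first pass through the same convolutional feature extractor. The features of the training image then go through the correlation filter to produce a linear template, which is cross-correlated with the feature map of the test image to give the response map used to localize the target.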


Fully-convolutional Siamese networks (SiamFC): the baseline

Tracking algorithm

The network itself only provides a function to measure the similarity of two image patches. To apply this network to object tracking, it is necessary to combine this with a procedure that describes the logic of the tracker. Similar to [3], we employ a simplistic tracking algorithm to assess the utility of the similarity function.

Online tracking is performed by simply evaluating the network in forward-mode. The feature representation of the target object is compared to that of the search region, which is obtained in each new frame by extracting a window centred at the previously estimated position, with an area that is four times the size of the object. The new position of the object is taken to be the location with the highest score.
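One step of this procedure can be sketched as follows. The function names and the `stride` parameter (the assumed number of image pixels per score-map cell) are hypothetical and only serve the illustration; the text above fixes only the window area (four times the object) and the arg-max rule.

```python
import numpy as np

def search_window(center, target_size, area_factor=4.0):
    """Window (top, left, height, width) centred at `center` with
    `area_factor` times the area of the target box."""
    scale = np.sqrt(area_factor)  # area x4 means each side x2
    h, w = target_size[0] * scale, target_size[1] * scale
    cy, cx = center
    return (cy - h / 2.0, cx - w / 2.0, h, w)

def update_position(center, score_map, stride=1.0):
    """Shift `center` by the offset of the score peak from the map centre."""
    peak = np.unravel_index(np.argmax(score_map), score_map.shape)
    mid = (np.array(score_map.shape) - 1) / 2.0
    dy, dx = (np.array(peak) - mid) * stride
    return (center[0] + dy, center[1] + dx)
```

In each new frame the tracker would crop `search_window(...)` around the previous estimate, evaluate the network on it, and pass the resulting score map to `update_position`.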

The original fully-convolutional Siamese network simply compared every frame to the initial appearance of the object. In contrast, we compute a new template in each frame and then combine this with the previous template in a moving average.
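The moving-average update can be written in one line. The name `update_template` and the default `rate` are assumptions for illustration; the text above does not state the value that is actually used.

```python
import numpy as np

def update_template(prev_template, new_template, rate=0.01):
    """Exponential moving average of templates; `rate` is a hypothetical
    learning rate in [0, 1]. rate=0 keeps the old template unchanged."""
    return (1.0 - rate) * prev_template + rate * new_template
```

Setting `rate=0` after the first frame reproduces the behaviour of the original fully-convolutional Siamese tracker, which always compares against the initial appearance.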



Correlation Filter networks

We propose to modify the baseline Siamese network of eq. 1 with a Correlation Filter block between x and the cross-correlation operator. The resulting architecture is illustrated in Figure 1. This change can be formalized as:

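In what follows, CF stands for Correlation Filter. As the architecture diagram shows, the paper inserts a CF block on one input branch, between x and the cross-correlation operation. On top of eq. (1) it also introduces two parameters, a scale s and a bias b, so that the range of the response-map values is better suited to logistic regression.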
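For reference, the baseline and the modified network can be written as below. This is reconstructed from the definitions in the surrounding text; the symbols f_ρ (the convolutional feature extractor with parameters ρ), x' and z' (the raw training and test images) and ⋆ (cross-correlation) are assumed to follow the paper's notation.

```latex
% Baseline Siamese network (eq. 1): cross-correlate the two feature maps.
g_{\rho}(x', z') = f_{\rho}(x') \star f_{\rho}(z')

% CFNet: a CF block \omega is applied to the training features x = f_{\rho}(x'),
% and the response is rescaled by s and shifted by b.
h_{\rho,s,b}(x', z') = s\,\omega\bigl(f_{\rho}(x')\bigr) \star f_{\rho}(z') + b
```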

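w = ω(x) is the CF block. It computes a standard CF template (the w that KCF obtains by solving a ridge regression problem in the Fourier domain), which can be understood as a discriminative template that is robust to translation. In the forward pass, CFNet is simply a CF tracker that uses CNN features. Earlier algorithms, however, could not train the CF end-to-end; this paper derives the derivative of the CF template with respect to its input, so that the CF can also be trained end-to-end.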
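The forward pass can be sketched for multi-channel features as follows. This is a simplified illustration under stated assumptions: `x` and `z` stand in for CNN feature maps of equal size with shape (C, H, W), the correlation is circular, and the names `cf_block`, `cfnet_response` and `lam` are hypothetical. Details of the paper's network such as differently sized training and search regions are not modelled here.

```python
import numpy as np

def cf_block(x, y, lam=1e-2):
    """w = omega(x): multi-channel CF template for features x of shape
    (C, H, W) and desired response y of shape (H, W)."""
    x_hat = np.fft.fft2(x)                          # FFT over the last two axes
    y_hat = np.fft.fft2(y)
    k_hat = np.sum(np.abs(x_hat) ** 2, axis=0) + lam  # energy summed over channels
    w_hat = x_hat * np.conj(y_hat) / k_hat
    return np.real(np.fft.ifft2(w_hat))

def cfnet_response(x, z, y, s=1.0, b=0.0, lam=1e-2):
    """h = s * (omega(x) cross-correlated with z) + b, summed over channels."""
    w_hat = np.fft.fft2(cf_block(x, y, lam))
    z_hat = np.fft.fft2(z)
    corr = np.real(np.fft.ifft2(np.sum(np.conj(w_hat) * z_hat, axis=0)))
    return s * corr + b
```

Every operation here is an FFT, a sum or an element-wise product or division, which is what makes it plausible to back-propagate through the block in closed form.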


