End-to-end representation learning for Correlation Filter based tracking: reading notes

deason_yuan

Published 2020-07-11 15:22:27

Category: Siamese based tracker

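The authors integrate the Correlation Filter (CF) into a deep network by treating it as a special CNN layer, aiming for tracking that is both fast and accurate. The CF's efficient online learning is combined with the discriminative power of CNN features learned offline, and the whole system is trained end-to-end.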

Abstract:

The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame.

The emphasis here is that the CF formulation admits a fast solution, which is what lets the tracker run at high speed.

Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task.

Earlier CF-based trackers mostly used hand-crafted features, or feature extractors that were trained for other tasks (e.g. VGGNet).

This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter.

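The main contribution is to treat the CF, which has a closed-form solution, as a differentiable layer of a neural network, so that deep features tightly coupled to the CF can be learned.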

Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.

Experiments show that the resulting lightweight architectures reach SOTA accuracy while tracking at high frame rates.

Introduction

Deep neural networks are a powerful tool for learning image representations in computer vision applications. However, training deep networks online, in order to capture previously unseen object classes from one or few examples, is challenging. This problem emerges naturally in applications such as visual object tracking, where the goal is to re-detect an object over a video with the sole supervision of a bounding box at the beginning of the sequence. The main challenge is the lack of a-priori knowledge of the target object, which can be of any class.

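Deep networks are powerful for learning image representations, but training them online to capture a previously unseen object class from one or a few examples is hard. Visual tracking is the typical case: the target is given only in the first frame and must be located in every later frame, with no prior knowledge of what it is.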

The simplest approach is to disregard the lack of a-priori knowledge and adapt a pre-trained deep convolutional neural network (CNN) to the target, for example by using stochastic gradient descent (SGD), the workhorse of deep network optimization [31, 25, 35]. The extremely limited training data and large number of parameters make this a difficult learning problem. Furthermore, SGD is quite expensive for online adaptation [31, 25].

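The simplest approach ignores the missing prior knowledge and fine-tunes a pre-trained CNN on the target with stochastic gradient descent. Scarce training data and a large number of parameters make this hard, and SGD is expensive for online adaptation (in other words, it makes the tracker slow).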

A possible answer to these shortcomings is to have no online adaptation of the network. Recent works have focused on learning deep embeddings that can be used as universal object descriptors [3, 12, 28, 17, 5]. These methods use a Siamese CNN, trained offline to discriminate whether two image patches contain the same object or not. The idea is that a powerful embedding will allow the detection (and thus tracking) of objects via similarity, bypassing the online learning problem. However, using a fixed metric to compare appearance prevents the learning algorithm from exploiting any video-specific cues that could be helpful for discrimination.

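One way around these shortcomings is to skip online updates of the network altogether. Recent work learns deep embeddings that serve as universal object descriptors: a Siamese CNN is trained offline to decide whether two image patches contain the same object, and a strong embedding then lets the tracker detect the target by similarity, bypassing online learning. The drawback is that a fixed comparison metric cannot exploit video-specific cues that would help discrimination.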

An alternative strategy is to use instead an online learning method such as the Correlation Filter (CF). The CF is an efficient algorithm that learns to discriminate an image patch from the surrounding patches by solving a large ridge regression problem extremely efficiently [4, 13]. It has proved to be highly successful in object tracking (e.g. [6, 18, 22, 2]), where its efficiency enables a tracker to adapt its internal model of the object on the fly at every frame. It owes its speed to a Fourier domain formulation, which allows the ridge regression problem to be solved with only a few applications of the Fast Fourier Transform (FFT) and cheap element-wise operations. Such a solution is, by design, much more efficient than an iterative solver like SGD, and still allows the discriminator to be tailored to a specific video, contrary to the embedding methods.

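The alternative is an online learner such as the Correlation Filter (CF). The CF learns to discriminate an image patch from the surrounding patches by solving a large ridge regression problem very efficiently, which lets a tracker adapt its model of the target at every frame. Its speed comes from the Fourier-domain formulation: the ridge regression is solved with a few FFTs and cheap element-wise operations, so it is far cheaper than an iterative solver such as SGD.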
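
To make the Fourier-domain claim concrete, here is a minimal single-channel, 1-D sketch in NumPy. It assumes the objective (1/2n)||w ⋆ x - y||² + (λ/2)||w||² over all circular shifts of x; the function names, the Gaussian label y and the value of lam are illustrative choices, not taken from the post or the paper's code. The dense solver is included only to compare against the FFT shortcut.

```python
import numpy as np

def cf_template_fft(x, y, lam):
    """Ridge-regression template over all circular shifts of x, solved with FFTs."""
    n = x.size
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    # Power spectrum of x plus the regulariser (one scalar equation per frequency).
    k_hat = (np.conj(x_hat) * x_hat).real / n + lam
    alpha_hat = y_hat / (n * k_hat)
    w_hat = np.conj(alpha_hat) * x_hat
    return np.fft.ifft(w_hat).real

def cf_template_dense(x, y, lam):
    """Same problem solved as an explicit n x n ridge regression (slow reference)."""
    n = x.size
    shifts = (np.arange(n)[:, None] + np.arange(n)[None, :]) % n
    X = x[shifts]                          # X[u, t] = x[(u + t) mod n]
    A = X.T @ X / n + lam * np.eye(n)
    return np.linalg.solve(A, X.T @ y / n)

rng = np.random.default_rng(0)
n = 32
x = rng.standard_normal(n)                       # stand-in for a feature signal
d = np.minimum(np.arange(n), n - np.arange(n))   # circular distance to shift 0
y = np.exp(-0.5 * (d / 2.0) ** 2)                # desired response, peaked at shift 0
w_fft = cf_template_fft(x, y, lam=0.1)
w_dense = cf_template_dense(x, y, lam=0.1)
print(np.max(np.abs(w_fft - w_dense)))           # difference between the two solvers
```

The FFT version costs O(n log n) instead of solving an n x n system, which is why the template can be re-trained in every frame.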

The challenge, then, is to combine the online learning efficiency of the CF with the discriminative power of CNN features trained offline. This has been done in several works (e.g. [21, 7, 9, 31]), which have shown that CNNs and CFs are complementary and their combination results in improved performance.

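The challenge is therefore to combine the CF's online learning efficiency with the discriminative power of CNN features trained offline. Several works have shown that CNNs and CFs are complementary and that combining them improves performance.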

However, in the aforementioned works, the CF is simply applied on top of pre-trained CNN features, without any deep integration of the two methods. End-to-end training of deep architectures is generally preferable to training individual components separately. The reason is that in this manner the free parameters in all components can co-adapt and cooperate to achieve a single objective. Thus it is natural to ask whether a CNN-CF combination can also be trained end-to-end with similar benefits.

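In those works, however, the CF is simply applied on top of pre-trained CNN features, with no deep integration of the two methods. End-to-end training is usually preferable to training components separately, because the free parameters of all components can co-adapt toward a single objective. Hence the question of whether a CNN-CF combination can be trained end-to-end with the same benefit.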

The key step in achieving such integration is to interpret the CF as a differentiable CNN layer, so that errors can be propagated through the CF back to the CNN features. This is challenging, because the CF itself is the solution of a learning problem. Hence, this requires to differentiate the solution of a large linear system of equations. This paper provides a closed-form expression for the derivative of the Correlation Filter. Moreover, we demonstrate the practical utility of our approach in training CNN architectures end-to-end.

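The key step is to interpret the CF as a differentiable CNN layer so that errors propagate through the CF back to the CNN features. This is hard because the CF is itself the solution of a learning problem, so one has to differentiate the solution of a large linear system of equations. The paper gives a closed-form expression for the derivative of the Correlation Filter, and the authors demonstrate its practical value for training CNN architectures end-to-end.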
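
What "differentiating the solution of a linear system" means can be sketched on a small dense ridge regression. This is only an analogue under stated assumptions: the paper works with the Fourier-domain CF, whereas the code below uses a generic data matrix X, and the names ridge_solve, ridge_backward and the linear loss L = c·w are hypothetical. The analytic gradient follows from differentiating (XᵀX + λI) w = Xᵀy and is compared with finite differences.

```python
import numpy as np

def ridge_solve(X, y, lam):
    """Return w solving (X^T X + lam I) w = X^T y, together with the system matrix."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y), A

def ridge_backward(X, y, lam, grad_w):
    """Back-propagate grad_w = dL/dw through the ridge solution to X and y."""
    w, A = ridge_solve(X, y, lam)
    u = np.linalg.solve(A, grad_w)     # A is symmetric, so one more solve suffices
    r = y - X @ w                      # residual of the fitted model
    grad_X = np.outer(r, u) - np.outer(X @ u, w)
    grad_y = X @ u
    return grad_X, grad_y

def numerical_grad_X(X, y, lam, c, eps=1e-6):
    """Central finite differences of L = c . w(X) with respect to each entry of X."""
    grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Xp, Xm = X.copy(), X.copy()
            Xp[i, j] += eps
            Xm[i, j] -= eps
            lp = c @ ridge_solve(Xp, y, lam)[0]
            lm = c @ ridge_solve(Xm, y, lam)[0]
            grad[i, j] = (lp - lm) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
y = rng.standard_normal(6)
c = rng.standard_normal(3)             # loss is L = c . w, so dL/dw = c
lam = 0.5
grad_X, grad_y = ridge_backward(X, y, lam, c)
num_X = numerical_grad_X(X, y, lam, c)
print(np.max(np.abs(grad_X - num_X)))  # gap between analytic and numeric gradients
```

The backward pass needs only one extra linear solve with the same matrix, which is the property that makes a learner with a closed-form solution usable as a network layer.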

We present an extensive investigation into the effect of incorporating the CF into the fully-convolutional Siamese framework of Bertinetto et al. [3]. We find that the CF does not improve results for networks that are sufficiently deep. However, our method enables ultra-lightweight networks of a few thousand parameters to achieve state-of-the-art performance on multiple benchmarks while running at high framerates.

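The paper studies extensively the effect of adding the CF to the fully-convolutional Siamese framework. The authors find that the CF does not improve results for sufficiently deep networks. Their method does, however, let ultra-lightweight networks with only a few thousand parameters reach state-of-the-art results on multiple benchmarks while running at high frame rates.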

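Overall architecture of CFNet. It is an asymmetric Siamese network; the asymmetry comes from the correlation filter block, which the authors insert into the network so that the whole system can be optimised end-to-end. The training image and the test image first pass through the same convolutional feature extractor. The features of the training image then go through the correlation filter to produce a linear template, and this template is cross-correlated with the feature map of the test image to give the response map used to localise the target.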

Method

Fully convolutional Siamese networks (SiamFC, the baseline)

Tracking algorithm

The network itself only provides a function to measure the similarity of two image patches. To apply this network to object tracking, it is necessary to combine this with a procedure that describes the logic of the tracker. Similar to [3], we employ a simplistic tracking algorithm to assess the utility of the similarity function.

Online tracking is performed by simply evaluating the network in forward-mode. The feature representation of the target object is compared to that of the search region, which is obtained in each new frame by extracting a window centred at the previously estimated position, with an area that is four times the size of the object. The new position of the object is taken to be the location with the highest score.

The original fully-convolutional Siamese network simply compared every frame to the initial appearance of the object. In contrast, we compute a new template in each frame and then combine this with the previous template in a moving average.

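The network itself only measures the similarity of two image patches; online tracking just evaluates it in forward mode. In each new frame, a search region centred at the previously estimated target position is extracted, the target's features are compared with those of the search region, and the new target position is the location with the highest score. The original Siamese-FC network compared every frame only with the initial appearance of the target; here a new template is computed in every frame and combined with the previous template in a moving average.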
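
One step of this tracking logic can be sketched as follows. It is a toy version under loud assumptions: the arrays stand in for single-channel feature maps, the function names are hypothetical, and the update rate lr is an arbitrary illustrative value, not one reported in the post.

```python
import numpy as np

def cross_correlate(template, search):
    """Slide the template over the search region and return the score map."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out

def locate(response):
    """The new target position is the location with the highest score."""
    idx = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(v) for v in idx)

def update_template(previous, current, lr=0.01):
    """Moving average of the previous template and the one from the new frame."""
    return (1.0 - lr) * previous + lr * current

rng = np.random.default_rng(0)
template = rng.standard_normal((4, 4))
search = np.zeros((12, 12))
search[3:7, 5:9] = template            # plant the target at offset (3, 5)
response = cross_correlate(template, search)
row, col = locate(response)
print(row, col)

# Template estimated at the new position, blended into the running template.
new_template = search[row:row + 4, col:col + 4]
template = update_template(template, new_template, lr=0.01)
```

With lr close to zero the tracker behaves like the original Siamese-FC baseline (initial appearance only); larger values adapt faster but can drift.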

Correlation Filter networks

We propose to modify the baseline Siamese network of eq. 1 with a Correlation Filter block between x and the cross-correlation operator. The resulting architecture is illustrated in Figure 1. This change can be formalized as:

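Hereafter CF means Correlation Filter. As the architecture figure shows, a CF block is inserted on one input branch, between x and the cross-correlation operation. Relative to eq. 1, two scalar parameters s and b are also introduced so that the range of the response map is better suited to logistic regression.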

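w = ω(x) is the CF block. It computes a standard CF template (the w obtained in KCF by solving the ridge regression problem in the Fourier domain), which can be understood as a discriminative template that is robust to translation. In the forward pass, CFNet is simply a CF tracker with CNN features. Earlier methods, however, could not train through the CF end-to-end; this paper derives the derivative of the CF template with respect to its input, which makes end-to-end training possible.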
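
The equations referred to in this section did not survive in the text. As far as I recall from the paper (reference 1), they have the form below; the notation (x' for the training image, z' for the test image, f_ρ for the CNN feature extractor with parameters ρ, ⋆ for cross-correlation, y for the desired response, λ for the regularisation weight) is taken from memory and should be checked against the original.

```latex
% Eq. (1): baseline fully-convolutional Siamese network (SiamFC)
g_{\rho}(x', z') = f_{\rho}(x') \star f_{\rho}(z')

% Modified network (CFNet): CF block \omega on the x branch, scalars s and b
h_{\rho,s,b}(x', z') = s \, \omega\big(f_{\rho}(x')\big) \star f_{\rho}(z') + b

% CF block: ridge regression over all circular shifts of the feature map x
w = \omega(x) = \arg\min_{w} \; \frac{1}{2n} \lVert w \star x - y \rVert^{2}
                + \frac{\lambda}{2} \lVert w \rVert^{2}
```

Setting ω to the identity, s = 1 and b = 0 recovers the baseline of eq. 1, which is why the comparison between the two networks isolates the effect of the CF block.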

References

1. End-to-end representation learning for Correlation Filter based tracking (https://openaccess.thecvf.com/content_cvpr_2017/papers/Valmadre_End-To-End_Representation_Learning_CVPR_2017_paper.pdf)

2. https://blog.csdn.net/cuclxt/article/details/72231430

3. https://blog.csdn.net/shenziheng1/article/details/80943494

4. https://www.jianshu.com/p/1fc6e796e17b