Fully-Convolutional Siamese Networks for Object Tracking (ECCVW 2016): Reading Notes

The paper proposes a basic tracking algorithm built on a fully-convolutional Siamese network.

Abstract.

The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself.

Traditionally, arbitrary-object tracking has been handled by learning an appearance model of the target online, using only the video itself as training data.

Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn.

Despite the success of these methods, their online-only nature inherently limits the richness of the model they can learn. (In other words, a model could in principle capture much more than what a single video provides.)

Recently, several attempts have been made to exploit the expressive power of deep convolutional networks.

Recently, people have started trying to exploit the expressive power of deep convolutional networks.

However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system.

However, when the target to track is not known beforehand, stochastic gradient descent must be performed online to adapt the network weights to the target, which severely compromises the speed of the tracking system.

In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video.

In this paper, the authors equip a basic tracking algorithm with a novel fully-convolutional Siamese network, trained end-to-end on the ILSVRC15 video object detection dataset.

Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.

Despite its extreme simplicity, the tracker runs at frame rates beyond real time and achieves state-of-the-art performance on multiple benchmarks.

 

Introduction.

We consider the problem of tracking an arbitrary object in video, where the object is identified solely by a rectangle in the first frame. Since the algorithm may be requested to track any arbitrary object, it is impossible to have already gathered data and trained a specific detector.

The object to be tracked is arbitrary and is identified only by a rectangle in the first frame, so it is impossible to gather data in advance and train a specific detector; the model must be generic.

For several years, the most successful paradigm for this scenario has been to learn a model of the object’s appearance in an online fashion using examples extracted from the video itself. However, a clear deficiency of using data derived exclusively from the current video is that only comparatively simple models can be learnt. While other problems in computer vision have seen an increasingly pervasive adoption of deep convolutional networks (conv-nets) trained from large supervised datasets, the scarcity of supervised data and the constraint of real-time operation prevent the naive application of deep learning within this paradigm of learning a detector per video.

The obvious drawback of using only data from the current video is that only comparatively simple models can be learned. Other areas of computer vision have widely adopted deep conv-nets trained on large supervised datasets, but the scarcity of supervised data and the real-time constraint prevent the naive application of deep learning to this paradigm of learning a detector per video.

Several recent works have aimed to overcome this limitation using a pretrained deep conv-net that was learnt for a different but related task. These approaches either apply “shallow” methods (e.g. correlation filters) using the network’s internal representation as features [5,6] or perform SGD (stochastic gradient descent) to fine-tune multiple layers of the network [7–9]. While the use of shallow methods does not take full advantage of the benefits of end-to-end learning, methods that apply SGD during tracking to achieve state-of-the-art results have not been able to operate in real-time.

Pretrained deep conv-nets have been used to overcome the shortage of data, but these methods either apply shallow methods (e.g. correlation filters) on top of deep features, or fine-tune the network online with stochastic gradient descent. Shallow methods do not take full advantage of end-to-end learning, while SGD fine-tuning cannot run in real time.

We advocate an alternative approach in which a deep conv-net is trained to address a more general similarity learning problem in an initial offline phase, and then this function is simply evaluated online during tracking. The key contribution of this paper is to demonstrate that this approach achieves very competitive performance in modern tracking benchmarks at speeds that far exceed the frame-rate requirement. Specifically, we train a Siamese network to locate an exemplar image within a larger search image. A further contribution is a novel Siamese architecture that is fully-convolutional with respect to the search image: dense and efficient sliding-window evaluation is achieved with a bilinear layer that computes the cross-correlation of its two inputs.

This paper trains a deep conv-net in an initial offline phase to solve a general similarity-learning problem, and then simply evaluates this similarity function online during tracking. The first contribution is strong performance at speeds far beyond the frame-rate requirement: concretely, a Siamese network is trained to locate an exemplar image within a larger search image. The second contribution is a novel Siamese architecture that is fully convolutional with respect to the search image: a bilinear layer that computes the cross-correlation of its two inputs yields dense and efficient sliding-window evaluation.

 

Deep Similarity Learning for Tracking

Learning to track arbitrary objects can be addressed using similarity learning. We propose to learn a function f(z, x) that compares an exemplar image z to a candidate image x of the same size and returns a high score if the two images depict the same object and a low score otherwise. To find the position of the object in a new image, we can then exhaustively test all possible locations and choose the candidate with the maximum similarity to the past appearance of the object. In experiments, we will simply use the initial appearance of the object as the exemplar. The function f will be learnt from a dataset of videos with labelled object trajectories.

Learn a similarity function f(z, x) that returns a high score if the two images depict the same object and a low score otherwise; tracking then amounts to exhaustively scoring all candidate locations in the new frame and choosing the one most similar to the exemplar.
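As a rough illustration of this formulation (a sketch, not the paper's implementation), tracking with a learned f reduces to a brute-force search over candidate windows; `f` and `extract_window` below are hypothetical placeholders:

```python
import numpy as np

def locate_target(f, exemplar, frame, positions, extract_window):
    """Exhaustively score every candidate position and return the best one.

    f              -- learned similarity function; high score means "same object" (placeholder)
    exemplar       -- target patch z taken from the first frame
    frame          -- current video frame to search
    positions      -- list of (row, col) centres to test
    extract_window -- crops a patch of the exemplar's size centred at a position (placeholder)
    """
    scores = [f(exemplar, extract_window(frame, p)) for p in positions]
    return positions[int(np.argmax(scores))]
```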

Given their widespread success in computer vision [13–16], we will use a deep conv-net as the function f. Similarity learning with deep conv-nets is typically addressed using Siamese architectures [17–19]. Siamese networks apply an identical transformation ϕ to both inputs and then combine their representations using another function g according to f(z, x) = g(ϕ(z), ϕ(x)). When the function g is a simple distance or similarity metric, the function ϕ can be considered an embedding. Deep Siamese conv-nets have previously been applied to tasks such as face verification [14,18,20], keypoint descriptor learning [19,21] and one-shot character recognition [22].

 

Fig.1. Fully-convolutional Siamese architecture. Our architecture is fully convolutional with respect to the search image x. The output is a scalar-valued score map whose dimension depends on the size of the search image. This enables the similarity function to be computed for all translated sub-windows within the search image in one evaluation. In this example, the red and blue pixels in the score map contain the similarities for the corresponding sub-windows.
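A minimal PyTorch-style sketch of this architecture (not the authors' released code), assuming `phi` is the shared embedding conv-net applied to both inputs; since `conv2d` in PyTorch computes cross-correlation, using the exemplar embedding as the kernel scores every translated sub-window of the search image in a single call:

```python
import torch.nn.functional as F

def siamese_score_map(phi, z, x):
    """f(z, x) = cross-correlation of phi(z) with phi(x), evaluated densely.

    z -- exemplar image, e.g. shape (1, 3, 127, 127)
    x -- search image,   e.g. shape (1, 3, 255, 255)
    """
    kernel = phi(z)   # exemplar embedding, e.g. (1, C, 6, 6)
    feats = phi(x)    # search-image embedding, e.g. (1, C, 22, 22)
    # The exemplar embedding acts as a convolution kernel, so one conv2d call
    # yields one similarity score per translated sub-window (e.g. a 17x17 map).
    return F.conv2d(feats, kernel)
```

The position of the maximum in this score map, scaled by the total stride of the embedding network, then gives the displacement of the target within the search image.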

References

1. Fully-Convolutional Siamese Networks for Object Tracking (https://link.springer.com/content/pdf/10.1007%2F978-3-319-48881-3_56.pdf)

2. https://blog.csdn.net/weixin_39467358/article/details/83858569

3. https://zhuanlan.zhihu.com/p/48249914

4. https://www.pianshen.com/article/9001324660/

5. https://blog.csdn.net/sgfmby1994/article/details/79865866

 

 
