CVPR2021：单目标跟踪

咸咸咸虾

已于 2022-03-11 19:45:43 修改

阅读量6.5k

点赞数 9

分类专栏：论文阅读文章标签：计算机视觉目标跟踪

于 2021-06-23 19:00:25 首次发布

本文链接：https://blog.csdn.net/cheetah_buyu/article/details/118156399

版权

论文阅读专栏收录该内容

7 篇文章 6 订阅

订阅专栏

2021.10.8（ps：不是我整理的）
Github上一个单目标跟踪的论文记录：Single Object Tracking Paper List

只做一个记录，暂时不全面，持续更新→CVPR2021接收论文

1.Learning to Filter: Siamese Relation Network for Robust Tracking

论文：CVPR 、arxiv
代码：siamrn
摘要：
Despite the great success of Siamese-based trackers,their performance under complicated scenarios is still not satisfying, especially when there are distractors. To this end, we propose a novel Siamese relation network, which introduces two efficient modules, i.e. Relation Detector (RD) and Refinement Module (RM). RD performs in a metalearning way to obtain a learning ability to filter the distractors from the background while RM aims to effectively integrate the proposed RD into the Siamese framework to generate accurate tracking result. Moreover, to further improve the discriminability and robustness of the tracker, we introduce a contrastive training strategy that attempts not only to learn matching the same target but also to learn how to distinguish the different objects. Therefore, our tracker can achieve accurate tracking results when facing background clutters, fast motion, and occlusion. Experimental results on five popular benchmarks, including VOT2018, VOT2019, OTB100, LaSOT, and UAV123, show that the proposed method is effective and can achieve state-of-the-art results. The code will be available at https://github.com/hqucv/siamrn

2.STMTrack: Template-free Visual Tracking with Space-time Memory Networks

论文地址：CVPR 、arxiv
代码地址：STMTrack
摘要：
Boosting performance of the offline trained siamese trackers is getting harder nowadays since the fixed information of the template cropped from the first frame has been almost thoroughly mined, but they are poorly capable of resisting target appearance changes. Existing trackers with template updating mechanisms rely on time-consuming numerical optimization and complex hand-designed strategies to achieve competitive performance, hindering them from real-time tracking and practical applications. In this paper, we propose a novel tracking framework built on top of a space-time memory network that is competent to make full use of historical information related to the target for better adapting to appearance variations during tracking. Specifically, a novel memory mechanism is introduced, which stores the historical information of the target to guide the tracker to focus on the most informative regions in the current frame.Furthermore, the pixel-level similarity computation of the memory network enables our tracker to generate much more accurate bounding boxes of the target. Extensive experiments and comparisons with many competitive trackers on challenging large-scale benchmarks, OTB-2015, TrackingNet, GOT-10k, LaSOT, UAV123, and VOT2018, show that, without bells and whistles, our tracker outperforms all previous state-of-the-art real-time methods while running at 37 FPS. The code is available at https://github.com/fzh0917/STMTrack.

3.Transformer Tracking

论文：CVPR 、arxiv
代码：TransT
摘要：
Correlation acts as a critical role in the tracking field, especially in recent popular Siamese-based trackers. The correlation operation is a simple fusion manner to consider the similarity between the template and the search region. However, the correlation operation itself is a local linear matching process, leading to lose semantic information and fall into local optimum easily, which may be the bottleneck of designing high-accuracy tracking algorithms. Is there any better feature fusion method than correlation? To address this issue, inspired by Transformer, this work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention. Specifically, the proposed method includes an ego-context augment module based on self-attention and a cross-feature augment module based on cross-attention. Finally, we present a Transformer tracking (named TransT) method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression head. Experiments show that our TransT achieves very promising results on six challenging datasets, especially on large-scale LaSOT, TrackingNet, and GOT-10k benchmarks. Our tracker runs at approximatively 50 f ps on GPU. Code and models are available at https://github.com/chenxindlut/TransT.

4.LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search

论文：CVPR 、arxiv
代码：LightTrack
摘要：
Object tracking has achieved significant progress over the past few years. However, state-of-the-art trackers become increasingly heavy and expensive, which limits their deployments in resource-constrained applications. In this work, we present LightTrack, which uses neural architecture search (NAS) to design more lightweight and efficient object trackers. Comprehensive experiments show that our LightTrack is effective. It can find trackers that achieve superior performance compared to handcrafted SOTA trackers, such as SiamRPN++ [30] and Ocean [56], while using much fewer model Flops and parameters. Moreover, when deployed on resource-constrained mobile chipsets, the discovered trackers run much faster. For example, on Snapdragon 845 Adreno GPU, LightTrack runs 12× faster than Ocean, while using 13× fewer parameters and 38×fewer Flops. Such improvements might narrow the gap between academic models and industrial deployments in object tracking task. LightTrack is released at here.

5.Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation

论文：CVPR
代码：AlphaRefine
摘要：
Visual object tracking aims to precisely estimate the bounding box for the given target, which is a challenging problem due to factors such as deformation and occlusion. Many recent trackers adopt the multiple-stage strategy to improve bounding box estimation. These methods first coarsely locate the target and then refine the initial prediction in the following stages. However, existing approaches still suffer from limited precision, and the coupling of different stages severely restricts the method’s transferability. This work proposes a novel, flexible, and accurate refinement module called Alpha-Refine (AR), which can significantly improve the base trackers’ box estimation quality. By exploring a series of design options, we conclude that the key to successful refinement is extracting and maintaining detailed spatial information as much as possible. Following this principle, Alpha-Refine adopts a pixel-wise correlation, a corner prediction head, and an auxiliary mask head as the core components. Comprehensive experiments on TrackingNet, LaSOT, GOT-10K, and VOT2020 benchmarks with multiple base trackers show that our approach significantly improves the base tracker’s performance with little extra latency. The proposed Alpha-Refine method leads to a series of strengthened trackers, among which the ARSiamRPN (AR strengthened SiamRPNpp) and the ARDiMP50 (AR strengthened DiMP50) achieve good efficiency-precision trade-off, while the ARDiMPsuper (AR strengthened DiMP-super) achieves very competitive performance at a real-time speed. Code and pretrained models are available at https://github.com/MasterBin-IIAU/AlphaRefine.

6.Graph Attention Tracking

论文：CVPR 、arxiv
代码：SiamGAT
摘要：
Siamese network based trackers formulate the visual tracking task as a similarity matching problem. Almost all popular Siamese trackers realize the similarity learning via convolutional feature cross-correlation between a target branch and a search branch. However, since the size of target feature region needs to be pre-fixed, these cross-correlation base methods suffer from either reserving much adverse background information or missing a great deal of foreground information. Moreover, the global matching between the target and search region also largely neglects the target structure and part-level information.
In this paper, to solve the above issues, we propose a simple target-aware Siamese graph attention network for general object tracking. We propose to establish part-to-part correspondence between the target and the search region with a complete bipartite graph, and apply the graph attention mechanism to propagate target information from the template feature to the search feature. Further, instead of using the pre-fixed region cropping for template-feature-area selection, we investigate a target-aware area selection mechanism to fit the size and aspect ratio variations of different objects. Experiments on challenging benchmarks including GOT-10k, UAV123, OTB-100 and LaSOT demonstrate that the proposed SiamGAT outperforms many state-of-the-art trackers and achieves leading performance. Code is available at: https://git.io/SiamGAT

7.Transformer Meets Tracker:Exploiting Temporal Context for Robust Visual Tracking

论文：CVPR 、arxiv
代码：
摘要：
In video object tracking, there exist rich temporal contexts among successive frames, which have been largely overlooked in existing trackers. In this work, we bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking. Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches and carefully design them within the Siamese-like tracking pipelines. The transformer encoder promotes the target templates via attention-based feature reinforcement, which benefits the high-quality tracking model generation. The transformer decoder prop-agates the tracking cues from previous templates to the current frame, which facilitates the object searching process. Our transformer-assisted tracking framework is neat and trained in an end-to-end manner. With the proposed transformer, a simple Siamese matching approach is able to out-perform the current top-performing trackers. By combining our transformer with the recent discriminative tracking pipeline, our method sets several new state-of-the-art records on prevalent tracking benchmarks.

8.CapsuleRRT: Relationships-aware Regression Tracking via Capsules

论文：CVPR
代码：
摘要：
Regression tracking has gained more and more attention thanks to its easy-to-implement characteristics, while existing regression trackers rarely consider the relationships between the object parts and the complete object. This would ultimately result in drift from the target object when missing some parts of the target object. Recently, Capsule Network(CapsNet) has shown promising results for image classification benefits from its part-object relationships mechanism, while CapsNet is known for its high computational demand even when carrying out simple tasks. Therefore, a primitive adaptation of CapsNet to regression tracking does not make sense, since this will seriously affect speed of a tracker. To solve these problems, we first explore the spatial-temporal relationships endowed by the CapsNet for regression tracking. The entire regression framework, dubbed CapsuleRRT, consists of three parts. One is S-Caps, which captures the spatial relationships between the parts and the object. Meanwhile, a T-Caps module is designed to exploit the temporal relationships within the target. The response of the target is obtained by STCaps Learning. Further, a prior-guided capsule routing algorithm is proposed to generate more accurate capsule assignments for subsequent frames. Apart from this, the heavy computation burden in CapsNet is addressed with a knowledge distillation pose matrix compression strategy that exploits more tight and discriminative representation with few samples. Extensive experimental results show that CapsuleRRT performs favorably against state-of-the-art methods in terms of accuracy and speed.

9.Progressive Unsupervised Learning for Visual Object Tracking

论文：CVPR
代码：
摘要：
In this paper, we propose a progressive unsupervised learning (PUL) framework, which entirely removes the need for annotated training videos in visual tracking. Specifically, we first learn a background discrimination (BD) model that effectively distinguishes an object from background in a contrastive learning way. We then employ the BD model to progressively mine temporal corresponding patches (i.e., patches connected by a track) in sequential frames. As the BD model is imperfect and thus the mined patch pairs are noisy, we propose a noise-robust loss function to more effectively learn temporal correspondences from this noisy data. We use the proposed noise robust loss to train backbone networks of Siamese trackers. Without online fine-tuning or adaptation, our unsupervised real-time Siamese trackers can outperform state-of-the-art unsupervised deep trackers and achieve competitive results to the supervised baselines.

10.Towards More Flexible and Accurate Object Tracking with Natural Language:Algorithms and Benchmark

论文：CVPR 、arxiv
代码：
摘要：
Tracking by natural language specification is a new rising research topic that aims at locating the target object in the video sequence based on its language description. Compared with traditional bounding box (BBox) based tracking, this setting guides object tracking with high-level semantic information, addresses the ambiguity of BBox, and links local and global search organically together. Those benefits may bring more flexible, robust and accurate tracking performance in practical scenarios. However, existing natural language initialized trackers are developed and compared on benchmark datasets proposed for tracking-by-BBox, which can’t reflect the true power of tracking-by-language. In this work, we propose a new benchmark specifically dedicated to the tracking-by-language, including a large scale dataset, strong and diverse baseline methods. Specifically, we collect 2k video sequences (contains a total of 1,244,340 frames, 663 words) and split 1300/700 for the train/testing respectively. We densely annotate one sentence in English and corresponding bounding boxes of the target object for each video. We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch. A strong baseline method based on an adaptive local-global-search scheme is proposed for future works to compare. We believe this benchmark will greatly boost related researches on natural language guided tracking.

11.Siamese Natural Language Tracker: Tracking by Natural Language Descriptions with Siamese Trackers

论文：CVPR 、arxiv
代码：SNLT
摘要：
We propose a novel Siamese Natural Language Tracker (SNLT), which brings the advancements in visual tracking to the tracking by natural language (NL) descriptions task. The proposed SNLT is applicable to a wide range of Siamese trackers, providing a new class of baselines for the tracking by NL task and promising future improvements from the advancements of Siamese trackers. The carefully designed architecture of the Siamese Natural Language Region Proposal Network (SNL-RPN), together with the Dynamic Aggregation of vision and language modalities, is introduced to perform the tracking by NL task. Empirical results over tracking benchmarks with NL annotations show that the proposed SNLT improves Siamese trackers by 3 to 7 percentage points with a slight tradeoff of speed. The proposed SNLT outperforms all NL trackers to-date and is competitive among state-of-the-art real-time trackers on LaSOT benchmarks while running at 50 frames per second on a single GPU. Code for this work is available at https://github.com/fredfung007/snlt .

12.Rotation Equivariant Siamese Networks for Tracking

论文：CVPR 、arxiv
代码：re-siamnet
摘要：
Rotation is among the long prevailing, yet still unresolved, hard challenges encountered in visual object tracking. The existing deep learning-based tracking algorithms use regular CNNs that are inherently translation equivariant, but not designed to tackle rotations. In this paper, we first demonstrate that in the presence of rotation instances in videos, the performance of existing trackers is severely affected. To circumvent the adverse effect of rotations, we present rotation-equivariant Siamese networks (RE-SiamNets), built through the use of group-equivariant convolutional layers comprising steerable filters. SiamNets allow estimating the change in orientation of the object in an unsupervised manner, thereby facilitating its use in relative 2D pose estimation as well. We further show that this change in orientation can be used to impose an additional motion constraint in Siamese tracking through imposing restriction on the change in orientation between two consecutive frames. For benchmarking, we present Rotation Tracking Benchmark (RTB), a dataset comprising a set of videos with rotation instances. Through experiments on two popular Siamese architectures, we show that RE-SiamNets handle the problem of rotation very well and out-perform their regular counterparts. Further, RE-SiamNets can accurately estimate the relative change in pose of the target in an unsupervised fashion, namely the in-plane rotation the target has sustained with respect to the reference frame. Code and data can be accessed at https://github.com/dkgupta90/re-siamnet.

咸咸咸虾

关注

9
点赞
踩
67

收藏

觉得还不错? 一键收藏
2
评论
CVPR2021：单目标跟踪

只做一个记录，暂时不全面，持续更新→CVPR2021接收论文1.Learning to Filter: Siamese Relation Network for Robust Tracking论文：CVPR 、arxiv代码：siamrn摘要：Despite the great success of Siamese-based trackers,their performance under complicated scenarios is still notsatisfying, especi
复制链接

扫一扫

专栏目录