1、Background
A survey of Transformer-based methods for single object tracking (SOT).
Two main categories: CNN-Transformer based and Fully-Transformer based.
A summary of existing SOT-related surveys.
The Transformer was first proposed in NLP.
Applications of the Transformer in computer vision.
In general, the Transformer architecture needs a large number of training samples. Since the target is only given in the first frame of a tracking sequence, it is impossible to obtain a large number of samples in VOT, so all fully-Transformer based and CNN-Transformer based trackers use pre-trained networks and treat them as backbone models.
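A minimal sketch of that backbone reuse, assuming PyTorch/torchvision; the ResNet-50 choice and the crop sizes are illustrative assumptions, not details from the survey:

```python
import torch
import torchvision

# Illustrative only: a tracker typically reuses an ImageNet pre-trained backbone
# instead of learning feature extraction from scratch.
backbone = torchvision.models.resnet50(weights="DEFAULT")
# Keep only the convolutional stages; the classification head is not needed.
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])

# Template (z) and search (x) crops are passed through the same backbone.
z = torch.randn(1, 3, 127, 127)   # target template crop (assumed size)
x = torch.randn(1, 3, 255, 255)   # search region crop (assumed size)
f_z, f_x = backbone(z), backbone(x)
print(f_z.shape, f_x.shape)       # [1, 2048, 4, 4] and [1, 2048, 8, 8]
```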
CNN-Transformer based methods are all two-stream and two-stage.
Feature extraction and feature fusion of the target template and search region are done in two distinguishable stages (two-stage).
Feature fusion: e.g., the correlation operation used in SiamRPN.
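For concreteness, a simplified sketch of this kind of correlation-based fusion, written as the depth-wise cross-correlation popularized by SiamRPN++; the toy feature shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search_feat: torch.Tensor, template_feat: torch.Tensor) -> torch.Tensor:
    """Depth-wise cross-correlation: the template feature map is used as a
    per-channel convolution kernel that slides over the search feature map."""
    b, c, h, w = search_feat.shape
    kernel = template_feat.reshape(b * c, 1, *template_feat.shape[2:])
    out = F.conv2d(search_feat.reshape(1, b * c, h, w), kernel, groups=b * c)
    return out.reshape(b, c, out.shape[-2], out.shape[-1])

# Toy features: template 4x4, search 8x8, 256 channels.
f_z = torch.randn(2, 256, 4, 4)
f_x = torch.randn(2, 256, 8, 8)
response = depthwise_xcorr(f_x, f_z)
print(response.shape)  # torch.Size([2, 256, 5, 5])
```

The output is a response map on which a prediction head localizes the target; the operation itself is a fixed linear matching with no learnable parameters.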
2、CNN-Transformer based trackers
The first CNN-Transformer based SOT method:
Wang N, Zhou W, Wang J, et al. Transformer meets tracker: Exploiting temporal context for robust visual tracking[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 1571-1580.
Chen X, Yan B, Zhu J, et al. Transformer tracking[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 8126-8135.
Uses a Transformer to perform feature fusion.
An ego-context augment (ECA) module and a cross-feature augment (CFA) module enhance self-attention and cross-attention, respectively.
CNN-Transformer based trackers successfully outperform Siamese trackers because they use a learnable Transformer instead of the linear cross-correlation operation.
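A rough sketch of what "learnable Transformer fusion instead of correlation" looks like; this simplified module is only in the spirit of ECA/CFA, not the exact TransT implementation, and the dimensions are assumptions:

```python
import torch
import torch.nn as nn

class CrossFeatureFusion(nn.Module):
    """Toy attention-based fusion: search-region tokens attend to template tokens,
    so the matching is learned instead of being a fixed linear correlation."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # ECA-like
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # CFA-like
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, search_tokens, template_tokens):
        # Self-attention augments the search features with their own context.
        s, _ = self.self_attn(search_tokens, search_tokens, search_tokens)
        search_tokens = self.norm1(search_tokens + s)
        # Cross-attention: queries from the search region, keys/values from the template.
        c, _ = self.cross_attn(search_tokens, template_tokens, template_tokens)
        return self.norm2(search_tokens + c)

fusion = CrossFeatureFusion()
f_x = torch.randn(2, 64, 256)   # 8x8 search tokens (flattened CNN features)
f_z = torch.randn(2, 16, 256)   # 4x4 template tokens
fused = fusion(f_x, f_z)
print(fused.shape)              # torch.Size([2, 64, 256])
```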
3、Fully-Transformer based trackers
CNN-Transformer based trackers struggle to capture global feature representations.
3.1、Two-stream two-stage trackers
Two-stream two-stage trackers consist of two identical and individual Transformer-based tracking pipelines to extract the features of the target template and the search region. Another Transformer network is then employed to find the relationship between these features. Finally, a prediction head locates the target by using the attended features.
The first:
Xie F, Wang C, Wang G, et al. Learning tracking representations via dual-branch fully transformer networks[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 2688-2697.
Lin L, Fan H, Zhang Z, et al. Swintrack: A simple and strong baseline for transformer tracking[J]. Advances in Neural Information Processing Systems, 2022, 35: 16743-16754.
3.2、One-stream one-stage trackers
No correlation operation any more; the result is output directly.
The Mixed Attention Module (MAM) is computationally inefficient, so it is relatively slow.
Ye B, Chang H, Ma B, et al. Joint feature learning and relation modeling for tracking: A one-stream framework[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 341-357.
Eliminates unnecessary background features.
Lan J P, Cheng Z Q, He J Y, et al. Procontext: Exploring progressive context transformer for tracking[C]//ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023: 1-5.
Introduces temporal information on top of OSTrack.
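To contrast with the two-stream design, a minimal sketch of the one-stream one-stage idea: template and search tokens go through a single Transformer, so feature extraction and relation modeling happen jointly and no separate fusion step is needed. The token counts and the simple per-token score head are illustrative assumptions, not OSTrack's actual architecture:

```python
import torch
import torch.nn as nn

class OneStreamTracker(nn.Module):
    """Toy one-stream one-stage tracker: template and search tokens are concatenated
    and processed by one Transformer encoder, so no separate correlation step exists."""
    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)  # per-token foreground score (stand-in for a box head)

    def forward(self, template_tokens, search_tokens):
        n_z = template_tokens.shape[1]
        tokens = torch.cat([template_tokens, search_tokens], dim=1)
        tokens = self.encoder(tokens)   # joint feature learning + relation modeling
        search_out = tokens[:, n_z:]    # keep only search tokens for prediction
        return self.head(search_out).squeeze(-1)

tracker = OneStreamTracker()
f_z = torch.randn(2, 16, 256)   # embedded template patches
f_x = torch.randn(2, 64, 256)   # embedded search patches
scores = tracker(f_z, f_x)
print(scores.shape)             # torch.Size([2, 64]): one score per search location
```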
4、Benchmark datasets and evaluation metrics
On the OTB datasets, fully-Transformer based trackers do not perform as well as CNN based and CNN-Transformer based methods.
Most OTB videos have few frames, so the target's appearance remains almost unchanged in many sequences; as a result, CNN based feature extraction and matching already yields excellent tracking results.
The performance of fully-Transformer based methods mainly depends on their ability to learn temporal cues and capture global features.
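For reference, OTB-style one-pass evaluation reports a success plot (fraction of frames whose IoU exceeds a threshold, summarized by its AUC) and a precision plot (fraction of frames whose centre-location error is within 20 pixels). A small sketch of these two standard metrics; the example boxes are made up:

```python
import numpy as np

def iou(pred, gt):
    """IoU for boxes in (x, y, w, h) format, vectorized over frames."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """Success plot AUC: average success rate over overlap thresholds in [0, 1]."""
    overlaps = iou(pred, gt)
    return np.mean([np.mean(overlaps > t) for t in thresholds])

def precision(pred, gt, threshold=20.0):
    """Precision: fraction of frames whose centre-location error is <= threshold pixels."""
    pred_c = pred[:, :2] + pred[:, 2:] / 2
    gt_c = gt[:, :2] + gt[:, 2:] / 2
    dist = np.linalg.norm(pred_c - gt_c, axis=1)
    return np.mean(dist <= threshold)

# Toy example: predicted vs. ground-truth boxes for two frames.
pred = np.array([[10, 10, 40, 40], [100, 100, 50, 60]], dtype=float)
gt = np.array([[12, 11, 40, 42], [120, 110, 50, 55]], dtype=float)
print(success_auc(pred, gt), precision(pred, gt))
```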
5、Tracking efficiency analysis
6、Conclusion (own)
- Treating object tracking as a sequence learning problem rather than template matching.
- Our experimental comparison study clearly shows that one-stream one-stage fully-Transformer trackers significantly outperform other types of trackers and are expected to dominate the single object tracking community for the next couple of years.
7、Reference
- Kugarajeevan J, Kokul T, Ramanan A, et al. Transformers in single object tracking: an experimental survey[J]. IEEE Access, 2023.