[Paper Reading, CVPR 2021 Oral, Object Tracking] Transformer Meets Tracker Exploiting Temporal Context for Robust Visual

lingqing97

Category: Paper Reading. Tags: object tracking, computer vision, machine learning, artificial intelligence

Article link: https://blog.csdn.net/qq_39621037/article/details/115189929

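Introduction

Paper: Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking

Code: 594422814/TransformerTrack

It's finally here! This paper brings the Transformer to single-object tracking and achieves strong results. It proposes a Transformer-based intermediate module that noticeably improves the quality of the extracted features.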

Main Content

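Unlike the original Transformer, this paper separates the encoder and the decoder into two parts. The encoder uses attention to reinforce the Template Feature extracted by the backbone, while the Search Feature is processed by the decoder.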

Transformer Encoder

The input to the encoder is a set of template features, $\mathbf{T}=\left(\mathbf{T}_{1}, \cdots, \mathbf{T}_{n}\right) \in \mathbb{R}^{n \times C \times H \times W}$. For ease of computation, the authors reshape $\mathbf{T}$ into $\mathbf{T}^{\prime} \in \mathbb{R}^{N_{T} \times C}$, where $N_{T}=n \times H \times W$.

The encoder can then be written as:
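The reshape described above can be sketched in NumPy as follows. This is only an illustration with toy sizes; the helper name `flatten_templates` and the channel-last flattening order are assumptions of this note, not something taken from the paper's code.

```python
import numpy as np

def flatten_templates(T):
    """Reshape (n, C, H, W) template features into (n*H*W, C)."""
    n, C, H, W = T.shape
    # Move channels last, then merge the template and spatial axes.
    return T.transpose(0, 2, 3, 1).reshape(n * H * W, C)

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 8, 4, 4))   # toy sizes: n=3, C=8, H=W=4
T_prime = flatten_templates(T)
print(T_prime.shape)                    # (48, 8)
```

Each row of `T_prime` is the C-dimensional feature of one spatial position in one template, so attention can later be computed between any pair of positions across all templates.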

$\mathbf{A}_{\mathrm{T} \rightarrow \mathrm{T}}=\operatorname{Atten}\left(\varphi\left(\mathbf{T}^{\prime}\right), \varphi\left(\mathbf{T}^{\prime}\right)\right) \in \mathbb{R}^{N_{T} \times N_{T}}$

$\hat{\mathbf{T}}=\text { Ins. } \operatorname{Norm}\left(\mathbf{A}_{\mathrm{T} \rightarrow \mathrm{T}} \mathbf{T}^{\prime}+\mathbf{T}^{\prime}\right)$

where $\varphi(\cdot)$ is a 1 × 1 linear transformation that reduces the embedding channel from $C$ to $C/4$.

Through self-attention, the encoder ends up with high-quality template features. As the paper puts it:

Thanks to the self-attention, multiple temporally diverse template features aggregate each other to generate high quality $\hat{\mathbf{T}}$
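A minimal sketch of the encoder equations, under stated assumptions: this note does not spell out the exact form of $\operatorname{Atten}$, so the code assumes a row-wise softmax over dot products with a hypothetical temperature `tau`. `ins_norm` is a simplified, affine-free instance normalization that normalizes each channel over all positions, and `W_phi` is a random stand-in for the learned weights of $\varphi$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def atten(q, k, tau=1.0):
    """Assumed form of Atten(Q, K): softmax over keys of scaled dot products."""
    return softmax(q @ k.T / tau, axis=-1)     # shape (N_q, N_k)

def ins_norm(x, eps=1e-5):
    """Simplified instance norm: zero mean, unit variance per channel."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder(T_prime, W_phi):
    proj = T_prime @ W_phi                     # phi: C -> C/4
    A = atten(proj, proj)                      # A_{T->T}, shape (N_T, N_T)
    return ins_norm(A @ T_prime + T_prime), A  # residual connection, then norm

rng = np.random.default_rng(0)
N_T, C = 48, 8
T_prime = rng.standard_normal((N_T, C))
W_phi = rng.standard_normal((C, C // 4))       # hypothetical weights of phi
T_hat, A_TT = encoder(T_prime, W_phi)
print(T_hat.shape, A_TT.shape)
```

Because every row of `A_TT` sums to one, each output position is a weighted average over all template positions plus its own feature, which is how the different templates get to aggregate one another.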

Transformer Decoder

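Like the encoder, the decoder starts with a Self-Attention block, and this block shares its weights with the encoder's Self-Attention. The purpose is to map the Template Feature and the Search Feature into the same feature space.

As in the encoder, the search feature $\mathbf{S} \in \mathbb{R}^{C \times H \times W}$ is reshaped into $\mathbf{S}^{\prime} \in \mathbb{R}^{N_{S} \times C}$, where $N_{S}=H \times W$, so the first Self-Attention block can be written as: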

$\hat{\mathbf{S}}=\text { Ins. } \operatorname{Norm}\left(\mathbf{A}_{\mathbf{S} \rightarrow \mathbf{S}} \mathbf{S}^{\prime}+\mathbf{S}^{\prime}\right)$

where $\mathbf{A}_{\mathbf{S} \rightarrow \mathbf{S}}=\operatorname{Atten}\left(\varphi\left(\mathbf{S}^{\prime}\right), \varphi\left(\mathbf{S}^{\prime}\right)\right) \in \mathbb{R}^{N_{S} \times N_{S}}$ is the self-attention matrix of the search feature.
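The weight sharing can be illustrated by applying one function, with one set of projection weights, to both inputs. This is a sketch under the same assumptions as before: a softmax dot-product attention stands in for $\operatorname{Atten}$, the normalization is a simplified affine-free instance norm, and `W_phi` is a hypothetical random matrix rather than trained weights.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_attention_block(X, W_phi, eps=1e-5):
    """Ins.Norm(A X + X), with A = softmax(phi(X) phi(X)^T) (assumed form)."""
    proj = X @ W_phi
    A = softmax_rows(proj @ proj.T)
    Y = A @ X + X
    return (Y - Y.mean(axis=0)) / np.sqrt(Y.var(axis=0) + eps)

rng = np.random.default_rng(0)
C, H, W, n = 8, 4, 4, 3
W_phi = rng.standard_normal((C, C // 4))        # a single set of weights

T_prime = rng.standard_normal((n * H * W, C))   # flattened templates, (N_T, C)
S = rng.standard_normal((C, H, W))              # search feature
S_prime = S.reshape(C, H * W).T                 # (N_S, C) with N_S = H * W

T_hat = self_attention_block(T_prime, W_phi)    # encoder side
S_hat = self_attention_block(S_prime, W_phi)    # decoder side, same W_phi
print(T_hat.shape, S_hat.shape)
```

Since both calls go through the same `W_phi`, the projected template and search features live in one embedding space, which is what makes the cross-attention in the next step meaningful.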

The paper then splits the rest of the decoder into two branches: Feature Transformation and Mask Transformation.

Mask Transformation

As I understand it, Mask Transformation learns an attention over spatial positions.

In Mask Transformation, the input K is the $\hat{\mathbf{T}}$ produced by the encoder, the input Q is the $\hat{\mathbf{S}}$ output by the previous layer, and the input V is a Gaussian-shaped mask, $\mathbf{M}=\operatorname{Concat}\left(\mathbf{m}_{1}, \cdots, \mathbf{m}_{n}\right) \in \mathbb{R}^{n \times H \times W}$. Each $\mathbf{m}$ comes from a Gaussian function, $\mathbf{m}(y)=\exp \left(-\frac{\|y-c\|^{2}}{2 \sigma^{2}}\right)$, where $c$ is the target position in the template. As before, $\mathbf{M}$ is reshaped into $\mathbf{M}^{\prime} \in \mathbb{R}^{N_{T} \times 1}$.

Mask Transformation can then be written as:
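The Gaussian mask is straightforward to build. The sketch below assumes that $y$ and $c$ are (row, col) coordinates on the feature grid; the centers and `sigma` are made-up values chosen only for illustration.

```python
import numpy as np

def gaussian_mask(H, W, center, sigma):
    """m(y) = exp(-||y - c||^2 / (2 sigma^2)) evaluated on an H x W grid."""
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

centers = [(2, 2), (1, 3), (2, 1)]    # hypothetical target centers, one per template
M = np.stack([gaussian_mask(4, 4, c, sigma=1.0) for c in centers])  # (n, H, W)
M_prime = M.reshape(-1, 1)            # (N_T, 1), same ordering as the flattened templates
print(M.shape, M_prime.shape)
```

The mask equals 1 at the target center and decays smoothly with distance, so it acts as a soft foreground label for each template position.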

$\mathbf{A}_{\mathrm{T} \rightarrow \mathrm{S}}=\operatorname{Atten}(\varphi(\hat{\mathbf{S}}), \varphi(\hat{\mathbf{T}})) \in \mathbb{R}^{N_{S} \times N_{T}}$

$\hat{\mathbf{S}}_{\text {mask }}=\text { Ins. } \operatorname{Norm}\left(\mathbf{A}_{\mathrm{T} \rightarrow \mathrm{S}} \mathbf{M}^{\prime} \otimes \hat{\mathbf{S}}\right)$

where $\otimes$ is the broadcasting element-wise multiplication
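The two Mask Transformation equations can be sketched as below. As in the earlier snippets, the softmax dot-product attention and the affine-free `ins_norm` are assumptions of this note, and all inputs are random placeholders rather than real features.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ins_norm(x, eps=1e-5):
    """Simplified instance norm: zero mean, unit variance per channel."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def mask_transformation(S_hat, T_hat, M_prime, W_phi):
    # Cross-attention: queries from the search region, keys from the templates.
    A_TS = softmax_rows((S_hat @ W_phi) @ (T_hat @ W_phi).T)   # (N_S, N_T)
    propagated = A_TS @ M_prime                                 # (N_S, 1) mask on the search grid
    # Broadcasting multiply: the (N_S, 1) mask scales every channel of S_hat.
    return ins_norm(propagated * S_hat), propagated

rng = np.random.default_rng(0)
N_S, N_T, C = 16, 48, 8
S_hat = rng.standard_normal((N_S, C))
T_hat = rng.standard_normal((N_T, C))
M_prime = rng.uniform(0.0, 1.0, size=(N_T, 1))   # placeholder for the Gaussian mask
W_phi = rng.standard_normal((C, C // 4))          # hypothetical weights of phi
S_mask, propagated = mask_transformation(S_hat, T_hat, M_prime, W_phi)
print(S_mask.shape, propagated.shape)
```

Under the softmax assumption, `propagated` is a convex combination of template mask values, so each search position receives a weight between 0 and 1 that reflects how strongly it matches the target regions of the templates.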

Feature Transformation

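As I understand it, Feature Transformation uses attention to enhance the extracted Search Feature.

It is almost identical to Mask Transformation. The difference is that the input V is $\hat{\mathbf{T}}$ after element-wise multiplication with $\mathbf{M}^{\prime}$, i.e. $\hat{\mathbf{T}} \otimes \mathbf{M}^{\prime}$.

Feature Transformation can therefore be written as: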

$\hat{\mathbf{S}}_{\mathrm{feat}}=\text { Ins. } \operatorname{Norm}\left(\mathbf{A}_{\mathrm{T} \rightarrow \mathrm{S}}\left(\hat{\mathbf{T}} \otimes \mathbf{M}^{\prime}\right)+\hat{\mathbf{S}}\right)$
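A sketch of the Feature Transformation equation, under the same assumptions as the earlier snippets (softmax dot-product attention, affine-free `ins_norm`, random placeholder inputs, and a hypothetical `W_phi`):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ins_norm(x, eps=1e-5):
    """Simplified instance norm: zero mean, unit variance per channel."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def feature_transformation(S_hat, T_hat, M_prime, W_phi):
    A_TS = softmax_rows((S_hat @ W_phi) @ (T_hat @ W_phi).T)   # (N_S, N_T)
    masked_T = T_hat * M_prime          # (N_T, C): suppress background template features
    return ins_norm(A_TS @ masked_T + S_hat)                    # residual add, then norm

rng = np.random.default_rng(0)
N_S, N_T, C = 16, 48, 8
S_hat = rng.standard_normal((N_S, C))
T_hat = rng.standard_normal((N_T, C))
M_prime = rng.uniform(0.0, 1.0, size=(N_T, 1))   # placeholder for the Gaussian mask
W_phi = rng.standard_normal((C, C // 4))          # hypothetical weights of phi
S_feat = feature_transformation(S_hat, T_hat, M_prime, W_phi)
print(S_feat.shape)
```

Compared with Mask Transformation, the mask is applied on the template side before the attention-weighted sum, and the result is added to $\hat{\mathbf{S}}$ instead of multiplied with it. With an all-zero mask the branch reduces to a normalized copy of $\hat{\mathbf{S}}$.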