Source
paper: https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Transformer_Tracking_CVPR_2021_paper.html
code: https://github.com/chenxin-dlut/TransT
Motivation
Most previous trackers compute the similarity between the template and the search frame with a correlation (cross-correlation) fusion. However, this kind of fusion loses semantic information and easily falls into local optima.
The authors therefore propose a better fusion method: the attention mechanism originating from the Transformer.
Method
The proposed architecture consists of three parts: a backbone feature extractor, a feature fusion network, and a prediction head.
(1) Backbone
The authors adopt ResNet-50 as the backbone, but remove the last stage (block5) of the standard ResNet-50 and take block4's output as the final feature. In addition, the stride of block4's downsampling convolution is changed from 2 to 1. This yields the template feature $f_z$ and the search-region feature $f_x$:
$$f_x \in \mathbb{R}^{H_x \times W_x \times 1024}, \qquad f_z \in \mathbb{R}^{H_z \times W_z \times 1024}$$
(2) Feature fusion
- The backbone features first pass through a 1×1 conv to reduce the channel dimension, giving $f_x \in \mathbb{R}^{H_x \times W_x \times 256}$; the feature map is then flattened into a sequence of vectors $f_x \in \mathbb{R}^{N_x \times 256}$, where $N_x = H_x \times W_x$ (and likewise for $f_z$).
- The features then pass through the feature fusion network:

2.1 ECA (Ego-Context Augment) module
The input $X$ denotes $f_x$ or $f_z$.
The output of the ECA module is:

$$X_{\mathrm{ECA}} = X + \mathrm{MultiHead}(X + P_x,\ X + P_x,\ X)$$

where $P_x \in \mathbb{R}^{N_x \times 256}$ denotes the positional encoding of $X$ (similar in principle to the positional encoding in the Transformer).
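The ECA step, together with the 1×1 channel reduction and flattening described above, might look like the following sketch. The module name is illustrative, and the random positional encoding here is only a stand-in (the paper follows Transformer-style spatial positional encodings).

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Illustrative Ego-Context Augment: self-attention where query and key
    carry the positional encoding, with a residual connection."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)

    def forward(self, x, pos):
        # x, pos: (N, batch, 256), N = H*W flattened spatial positions
        q = k = x + pos
        out, _ = self.attn(q, k, value=x)
        return x + out  # X_ECA = X + MultiHead(X+P, X+P, X)

# reduce backbone channels 1024 -> 256 with a 1x1 conv, then flatten
reduce = nn.Conv2d(1024, 256, kernel_size=1)
feat = torch.randn(1, 1024, 16, 16)               # backbone output
x = reduce(feat).flatten(2).permute(2, 0, 1)      # (N=256, batch=1, 256)
pos = torch.randn(x.shape[0], 1, 256)             # stand-in positional encoding
y = ECA()(x, pos)                                 # same shape as x
```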
2.2 CFA (Cross-Feature Augment) module
The output of the CFA module is:

$$\tilde{X} = X_q + \mathrm{MultiHead}(X_q + P_q,\ X_{kv} + P_{kv},\ X_{kv}), \qquad X_{\mathrm{CFA}} = \tilde{X} + \mathrm{FFN}(\tilde{X})$$

FFN denotes the feed-forward network, implemented as FC -> ReLU -> FC.
$P_q, P_{kv}$ denote the positional encodings.
$X_q, X_{kv}$ come from the ECA outputs of the two different branches, respectively.
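A minimal sketch of the CFA step: cross-attention from one branch's ECA output (query) to the other branch's (key/value), followed by an FFN, each with a residual connection. The exact placement of the normalization layers here is an assumption, as are the module name and the hidden dimension of the FFN.

```python
import torch
import torch.nn as nn

class CFA(nn.Module):
    """Illustrative Cross-Feature Augment: cross-attention + FFN."""
    def __init__(self, d_model=256, nhead=8, d_ff=2048):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)  # norm placement is an assumption
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x_q, x_kv, p_q, p_kv):
        # x_q: (N_q, batch, 256) query branch; x_kv: (N_kv, batch, 256)
        out, _ = self.cross_attn(x_q + p_q, x_kv + p_kv, value=x_kv)
        x = self.norm1(x_q + out)           # residual after cross-attention
        return self.norm2(x + self.ffn(x))  # residual after FFN

x_q, p_q = torch.randn(2, 1024, 1, 256)    # search branch, N_x = 32*32
x_kv, p_kv = torch.randn(2, 64, 1, 256)    # template branch, N_z = 8*8
y = CFA()(x_q, x_kv, p_q, p_kv)            # query-branch shape is preserved
```

Note the output keeps the query branch's length, which is why the final CFA (search branch as query) yields one feature vector per search-region position.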
2.3 Benefits of ECA and CFA
First, two ego-context augment (ECA) modules focus on the useful semantic context adaptively by multi-head self-attention, to enhance the feature representation. Then, two cross-feature augment (CFA) modules receive the feature maps of their own and the other branch at the same time and fuse these two feature maps through multi-head cross-attention. In this way, two ECAs and two CFAs form a fusion layer, as shown in the dotted box in Figure 2. The fusion layer repeats N times, followed by an additional CFA to fuse the feature map of two branches, decoding a feature map $f \in \mathbb{R}^{d \times H_x W_x}$ (we employ N = 4 in this work).
Differences from the traditional Transformer (the main point is that the designed modules fit the tracking task closely, rather than the detection setting of earlier work):
The method draws on the core idea of Transformer, i.e., employing the attention mechanism. But we do not directly adopt the structure of the Transformer in DETR [4]. Instead, we design a new structure to make it more suitable for tracking framework. The cross-attention operation in our method plays a more important role than that in DETR, since the tracking task focuses on fusing the template and search region features.
(3) Prediction Head Network
(1) It consists of a classification branch and a regression branch, each implemented as a 3-layer perceptron.
(2) For the final fused feature $f \in \mathbb{R}^{256 \times N_x}$, the head network produces $N_x$ foreground/background classification results and $N_x$ predicted coordinate results.
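A minimal sketch of such a head, assuming the fused feature has 256 channels: two 3-layer perceptrons applied position-wise, one for foreground/background classification and one for box regression. The class/function names and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    # 3-layer perceptron: FC -> ReLU -> FC -> ReLU -> FC
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class PredictionHead(nn.Module):
    """Illustrative head: per-position classification and box regression."""
    def __init__(self, d_model=256):
        super().__init__()
        self.cls = mlp(d_model, d_model, 2)  # foreground / background logits
        self.reg = mlp(d_model, d_model, 4)  # box coordinates

    def forward(self, f):
        # f: (batch, N_x, 256) fused feature, one vector per search position
        return self.cls(f), torch.sigmoid(self.reg(f))  # boxes normalized to [0,1]

f = torch.randn(1, 1024, 256)        # N_x = 1024 search positions
scores, boxes = PredictionHead()(f)  # (1, 1024, 2) and (1, 1024, 4)
```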
Experimental details
The model is trained on the training splits of the COCO [24], TrackingNet [30], LaSOT [14], and GOT-10k [19] datasets. The backbone parameters are initialized with ImageNet-pretrained [35] ResNet-50 [18]; the other parameters are initialized with Xavier init [15]. The model is trained with AdamW [25], setting the backbone's learning rate to 1e-5, the other parameters' learning rate to 1e-4, and the weight decay to 1e-4. The network is trained on two Nvidia Titan RTX GPUs with a batch size of 38, for a total of 1000 epochs with 1000 iterations per epoch. The learning rate decreases by a factor of 10 after 500 epochs.
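The optimizer setup described above can be sketched as follows; the stand-in `model` and the name-based parameter-group split are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

# stand-in model: real code would use the ResNet-50 backbone + fusion + head
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "fusion": nn.Linear(8, 8),
})

# backbone parameters get lr 1e-5, everything else 1e-4; weight decay 1e-4
backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]

optimizer = torch.optim.AdamW(
    [{"params": backbone_params, "lr": 1e-5},
     {"params": other_params, "lr": 1e-4}],
    weight_decay=1e-4,
)
# divide the learning rate by 10 after 500 epochs (scheduler stepped per epoch)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.1)
```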
Conclusion
The Transformer-based fusion in this paper is fairly novel and could be embedded into one's own method.