Source
paper: https://openaccess.thecvf.com/content/CVPR2021/html/Chen_Transformer_Tracking_CVPR_2021_paper.html
code: https://github.com/chenxin-dlut/TransT
Motivation
Most previous trackers compute the similarity between the template and the search frame with a correlation (cross-correlation) fusion. However, this kind of fusion loses semantic information and easily falls into local optima.
The authors therefore propose a better fusion method: the attention mechanism originating from the Transformer.
Method
The proposed architecture consists of three parts: a backbone feature extractor, a feature fusion network, and a prediction head.
(1) Backbone
The authors adopt ResNet-50 as the backbone, but remove the last stage (block5) of the standard ResNet-50 and take block4's output as the final feature. In addition, the stride of block4's downsampling convolution is changed from 2 to 1. This yields the template feature $f_z$ and the search-region feature $f_x$:
$$f_x \in \mathbb{R}^{H_x \times W_x \times 1024}, \qquad f_z \in \mathbb{R}^{H_z \times W_z \times 1024}$$
(2) Feature fusion
- The backbone features first pass through a 1×1 conv to reduce the channel dimension, giving $f_x \in \mathbb{R}^{H_x \times W_x \times 256}$; the feature map is then flattened into a sequence of vectors $f_x \in \mathbb{R}^{N_x \times 256}$, where $N_x = H_x \times W_x$ (and likewise for $f_z$).
- The features then pass through the feature fusion network:

2.1 ECA (Ego-Context Augment) module
The input $X$ denotes $f_x$ or $f_z$.
The output of the ECA module is:

$$X_{\mathrm{ECA}} = X + \mathrm{MultiHead}(X + P_x,\ X + P_x,\ X)$$

where $P_x \in \mathbb{R}^{N_x \times 256}$ denotes the positional encoding of $X$ (similar in principle to the positional encoding in the Transformer).
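The ECA step, together with the 1×1 channel reduction and flattening described above, might look like the following sketch. The module name is illustrative, and the random positional encoding here is only a stand-in (the paper follows Transformer-style spatial positional encodings).

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Illustrative Ego-Context Augment: self-attention where query and key
    carry the positional encoding, with a residual connection."""
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)

    def forward(self, x, pos):
        # x, pos: (N, batch, 256), N = H*W flattened spatial positions
        q = k = x + pos
        out, _ = self.attn(q, k, value=x)
        return x + out  # X_ECA = X + MultiHead(X+P, X+P, X)

# reduce backbone channels 1024 -> 256 with a 1x1 conv, then flatten
reduce = nn.Conv2d(1024, 256, kernel_size=1)
feat = torch.randn(1, 1024, 16, 16)               # backbone output
x = reduce(feat).flatten(2).permute(2, 0, 1)      # (N=256, batch=1, 256)
pos = torch.randn(x.shape[0], 1, 256)             # stand-in positional encoding
y = ECA()(x, pos)                                 # same shape as x
```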
2.2 CFA (Cross-Feature Augment) module
The output of the CFA module is:

$$\tilde{X} = X_q + \mathrm{MultiHead}(X_q + P_q,\ X_{kv} + P_{kv},\ X_{kv}), \qquad X_{\mathrm{CFA}} = \tilde{X} + \mathrm{FFN}(\tilde{X})$$

FFN denotes the feed-forward network, implemented as FC -> ReLU -> FC.
$P_q, P_{kv}$ denote the positional encodings.
$X_q, X_{kv}$ come from the ECA outputs of the two different branches, respectively.
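A minimal sketch of the CFA step: cross-attention from one branch's ECA output (query) to the other branch's (key/value), followed by an FFN, each with a residual connection. The exact placement of the normalization layers here is an assumption, as are the module name and the hidden dimension of the FFN.

```python
import torch
import torch.nn as nn

class CFA(nn.Module):
    """Illustrative Cross-Feature Augment: cross-attention + FFN."""
    def __init__(self, d_model=256, nhead=8, d_ff=2048):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)  # norm placement is an assumption
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x_q, x_kv, p_q, p_kv):
        # x_q: (N_q, batch, 256) query branch; x_kv: (N_kv, batch, 256)
        out, _ = self.cross_attn(x_q + p_q, x_kv + p_kv, value=x_kv)
        x = self.norm1(x_q + out)           # residual after cross-attention
        return self.norm2(x + self.ffn(x))  # residual after FFN

x_q, p_q = torch.randn(2, 1024, 1, 256)    # search branch, N_x = 32*32
x_kv, p_kv = torch.randn(2, 64, 1, 256)    # template branch, N_z = 8*8
y = CFA()(x_q, x_kv, p_q, p_kv)            # query-branch shape is preserved
```

Note the output keeps the query branch's length, which is why the final CFA (search branch as query) yields one feature vector per search-region position.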
2.3 Benefits of ECA and CFA
First, two ego-context augment (ECA) modules focus on the useful semantic context adaptively by multi-head self-attention, to enhance the feature representation. Then, two cross-feature augment (CFA) modules receive the feature maps of their own and the other branch at the same time and fuse these two feature maps through multi-head cross-attention. In this way, two ECAs and two CFAs form a fusion layer, as shown in the dotted box in Figure 2. The fusion layer repeats N times, followed by an additional CFA to fuse the feature map of two branches, decoding a feature map $f \in \mathbb{R}^{d \times H_x W_x}$ (we employ N = 4 in this work).
Differences from the traditional Transformer (the main point is that the designed modules fit the tracking task closely, rather than the detection setting of earlier work):
The method draws on the core idea of Transformer, i.e., employing the attention mechanism. But we do not directly adopt the structure of the Transformer in DETR [4]. Instead, we design a new structure to make it more suitable for tracking framework. The cross-attention operation in our method plays a more important role than that in DETR, since the tracking task focuses on fusing the template and search region features.
(3) Prediction Head Network
(1) It consists of a classification branch and a regression branch, each implemented as a 3-layer perceptron.
(2) For the final fused feature $f \in \mathbb{R}^{256 \times N_x}$, the head network produces $N_x$ foreground/background classification results and $N_x$ predicted coordinate results.
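A minimal sketch of such a head, assuming the fused feature has 256 channels: two 3-layer perceptrons applied position-wise, one for foreground/background classification and one for box regression. The class/function names and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    # 3-layer perceptron: FC -> ReLU -> FC -> ReLU -> FC
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class PredictionHead(nn.Module):
    """Illustrative head: per-position classification and box regression."""
    def __init__(self, d_model=256):
        super().__init__()
        self.cls = mlp(d_model, d_model, 2)  # foreground / background logits
        self.reg = mlp(d_model, d_model, 4)  # box coordinates

    def forward(self, f):
        # f: (batch, N_x, 256) fused feature, one vector per search position
        return self.cls(f), torch.sigmoid(self.reg(f))  # boxes normalized to [0,1]

f = torch.randn(1, 1024, 256)        # N_x = 1024 search positions
scores, boxes = PredictionHead()(f)  # (1, 1024, 2) and (1, 1024, 4)
```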
Experimental details
The model is trained on the training splits of the COCO [24], TrackingNet [30], LaSOT [14], and GOT-10k [19] datasets. The backbone parameters are initialized with ImageNet-pretrained [35] ResNet-50 [18]; the other parameters are initialized with Xavier init [15]. The model is trained with AdamW [25], setting the backbone's learning rate to 1e-5, the other parameters' learning rate to 1e-4, and the weight decay to 1e-4. The network is trained on two Nvidia Titan RTX GPUs with a batch size of 38, for a total of 1000 epochs with 1000 iterations per epoch. The learning rate decreases by a factor of 10 after 500 epochs.
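The optimizer setup described above can be sketched as follows; the stand-in `model` and the name-based parameter-group split are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

# stand-in model: real code would use the ResNet-50 backbone + fusion + head
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "fusion": nn.Linear(8, 8),
})

# backbone parameters get lr 1e-5, everything else 1e-4; weight decay 1e-4
backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]

optimizer = torch.optim.AdamW(
    [{"params": backbone_params, "lr": 1e-5},
     {"params": other_params, "lr": 1e-4}],
    weight_decay=1e-4,
)
# divide the learning rate by 10 after 500 epochs (scheduler stepped per epoch)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.1)
```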
Conclusion
The Transformer-based fusion in this paper is fairly novel and could be embedded into one's own method.