Introduction
paper:SiamFC++: Towards Robust and Accurate Visual Tracking with Target Estimation Guidelines
code:MegviiDetection/video_analyst
This paper improves on SiamFC and proposes SiamFC++, which achieves state-of-the-art results on OTB2015, VOT2018, LaSOT, GOT-10k, and TrackingNet.
SiamFC++ is designed around the following guidelines:
- G1: (decomposition of classification and state estimation) The tracker should perform two sub-tasks: classification and state estimation.
- G2: (non-ambiguous scoring) The classification score should directly represent the confidence of target existence in the "field of view", i.e. the sub-window of the corresponding pixel, rather than in pre-defined settings such as anchor boxes.
- G3: (prior knowledge-free) Tracking approaches should be free of prior knowledge such as the scale/ratio distribution.
- G4: (estimation quality assessment) An estimation quality score independent of classification should be used.
Related Work
Target state estimation in most current trackers falls into three categories:
- Rescaling the search patch into multiple scales and assembling a mini-batch of scaled images, which handles scale change by scaling the search region (e.g. DCF and SiamFC).
- The coarse initial location of the target obtained by classification is iteratively refined for accurate box estimation (e.g. ATOM).
- RPN regresses the location shift and size difference between pre-defined anchor boxes and the target location (e.g. SiamRPN).
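Main Content
The figure above shows the main framework of SiamFC++; the blue and red parts are the branches added on top of the original SiamFC. Unlike SiamFC, SiamFC++ adds a quality assessment branch and a bounding-box regression branch.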
Siamese-based Feature Extraction and Matching
The embedding features are computed by cross-correlation, which can be written as:
$$f_{i}(z, x)=\psi_{i}(\phi(z)) \star \psi_{i}(\phi(x)), \quad i \in\{\mathrm{cls}, \mathrm{reg}\}$$
Here $\star$ denotes cross-correlation. Both $\psi_{cls}$ and $\psi_{reg}$ are applied after the common feature extraction $\phi$ to adjust the common features into a task-specific feature space. Note that the features extracted by $\psi_{cls}$ and $\psi_{reg}$ are of the same size.
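As a rough illustration of the matching step, the sketch below runs a single-channel cross-correlation in plain Python. The function name `cross_correlate` and the toy feature maps are assumptions made for illustration only; real features are multi-channel tensors produced by the learned layers.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (both 2-D lists) and return the response map.

    No padding, stride 1: the output size is (H - h + 1) x (W - w + 1).
    """
    H, W = len(search), len(search[0])
    h, w = len(template), len(template[0])
    response = []
    for i in range(H - h + 1):
        row = []
        for j in range(W - w + 1):
            # Inner product between the template and the window at (i, j).
            acc = 0.0
            for di in range(h):
                for dj in range(w):
                    acc += template[di][dj] * search[i + di][j + dj]
            row.append(acc)
        response.append(row)
    return response


# Toy stand-ins for psi_i(phi(z)) and psi_i(phi(x)) (hypothetical values).
template_feat = [[1.0, 0.0],
                 [0.0, 1.0]]
search_feat = [[0.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 0.0]]
response = cross_correlate(search_feat, template_feat)
```

The response peaks where the search features best match the template, which is the location the heads then score and refine.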
Application of Design Guidelines in Head Network
After the embedding features are extracted, a classification head and a regression head split the model into a classification task and a regression task (following guideline G1).
The regression head outputs a 4-dimensional vector for each location. Its target $\boldsymbol{t}^{*}=\left(l^{*}, t^{*}, r^{*}, b^{*}\right)$ is defined element by element as follows (where $s$ is the stride of the backbone network, 8 in this paper):
$$\begin{aligned} l^{*} &=\left(\left\lfloor\frac{s}{2}\right\rfloor+x s\right)-x_{0}, & t^{*} &=\left(\left\lfloor\frac{s}{2}\right\rfloor+y s\right)-y_{0} \\ r^{*} &=x_{1}-\left(\left\lfloor\frac{s}{2}\right\rfloor+x s\right), & b^{*} &=y_{1}-\left(\left\lfloor\frac{s}{2}\right\rfloor+y s\right) \end{aligned}$$
where $(x_0, y_0)$ and $(x_1, y_1)$ denote the left-top and right-bottom corners of the ground-truth bounding box $B^{*}$ associated with point $(x, y)$.
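The regression target can be computed directly from these definitions. The following is a minimal sketch; the function name `regression_target` and the example box are hypothetical, and the stride default of 8 follows the value quoted above.

```python
def regression_target(x, y, box, stride=8):
    """Return (l*, t*, r*, b*) for feature-map location (x, y).

    `box` is the ground-truth box (x0, y0, x1, y1) in input-image pixels.
    """
    x0, y0, x1, y1 = box
    # Map the feature-map location back onto the input image.
    cx = stride // 2 + x * stride
    cy = stride // 2 + y * stride
    # Distances from that point to the four sides of the box.
    return (cx - x0, cy - y0, x1 - cx, y1 - cy)


# Location (5, 4) maps to image point (44, 36) with stride 8.
target = regression_target(x=5, y=4, box=(20, 10, 80, 70))
```

Because the target is just four distances from a point to the box sides, no anchor boxes or scale/ratio priors are involved.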
One branch of the classification head outputs the classification score $\psi_{cls}$. A location $(x, y)$ on feature map $\psi_{cls}$ is considered a positive sample if its corresponding location $\left(\left\lfloor\frac{s}{2}\right\rfloor+x s,\left\lfloor\frac{s}{2}\right\rfloor+y s\right)$ on the input image falls into the ground-truth bounding box. Otherwise, it is a negative sample.
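The positive/negative assignment described here can be sketched as follows. The names `classification_label` and `label_map` are hypothetical, and treating points exactly on the box border as positive is an assumption, since the text only says the point "falls into" the box.

```python
def classification_label(x, y, box, stride=8):
    """Return 1 if feature-map location (x, y) maps inside `box`, else 0."""
    x0, y0, x1, y1 = box
    cx = stride // 2 + x * stride
    cy = stride // 2 + y * stride
    # Assumption: points on the box border count as inside.
    inside = x0 <= cx <= x1 and y0 <= cy <= y1
    return 1 if inside else 0


def label_map(size, box, stride=8):
    """Build a size x size map of labels, indexed as labels[y][x]."""
    return [[classification_label(x, y, box, stride) for x in range(size)]
            for y in range(size)]


# With stride 8, a 4x4 map covers image points 4, 12, 20, 28 on each axis.
labels = label_map(size=4, box=(10, 10, 22, 22))
```

Each score is tied to one image point and its sub-window, so there is no ambiguity about which anchor a score refers to (guideline G2).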
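The other branch predicts PSS (the paper notes that IoU can be used instead). This branch assesses the quality of the predicted bbox and is used to suppress bboxes far from the target center.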
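To give a feel for how such a quality target behaves, the sketch below assumes PSS takes the centerness-style form $\sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \times \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}}$. That formula is not given in these notes and should be checked against the paper; the function name `prior_spatial_score` is hypothetical.

```python
import math


def prior_spatial_score(l, t, r, b):
    """Centerness-style quality target from the four regression distances.

    Assumed form: sqrt(min(l, r) / max(l, r) * min(t, b) / max(t, b)).
    """
    if min(l, t, r, b) <= 0:
        # The point lies on or outside the box border.
        return 0.0
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


center_score = prior_spatial_score(30, 30, 30, 30)    # point at the box center
offset_score = prior_spatial_score(10, 30, 40, 30)    # point shifted to the left
```

Under this assumption the score is 1 at the box center and decays toward 0 near the border, so multiplying it into the classification score down-weights boxes predicted from off-center locations.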
Training Objective
The final loss function to optimize is:
$$\begin{aligned} L\left(\left\{p_{x, y}\right\}, q_{x, y},\left\{\boldsymbol{t}_{x, y}\right\}\right) &=\frac{1}{N_{\mathrm{pos}}} \sum_{x, y} L_{\mathrm{cls}}\left(p_{x, y}, c_{x, y}^{*}\right) \\ &+\frac{\lambda}{N_{\mathrm{pos}}} \sum_{x, y} 1_{\left\{c_{x, y}^{*}>0\right\}} L_{\mathrm{quality}}\left(q_{x, y}, q_{x, y}^{*}\right) \\ &+\frac{\lambda}{N_{\mathrm{pos}}} \sum_{x, y} 1_{\left\{c_{x, y}^{*}>0\right\}} L_{\mathrm{reg}}\left(\boldsymbol{t}_{x, y}, \boldsymbol{t}_{x, y}^{*}\right) \end{aligned}$$
where $L_{cls}$ is the focal loss (see Focal Loss for Dense Object Detection), $L_{quality}$ is the BCE loss, and $L_{reg}$ is the IoU loss.
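The way the three terms combine can be sketched in plain Python. Everything below is an illustrative assumption rather than the authors' implementation: the helper names, the focal-loss defaults `alpha=0.25` and `gamma=2.0`, the `1 - IoU` form of the IoU loss, and `lam=1.0` for $\lambda$.

```python
import math

EPS = 1e-7


def focal_loss(p, label, alpha=0.25, gamma=2.0):
    """Binary focal loss for one location; p is the predicted probability."""
    p = min(max(p, EPS), 1 - EPS)
    if label == 1:
        return -alpha * (1 - p) ** gamma * math.log(p)
    return -(1 - alpha) * p ** gamma * math.log(1 - p)


def bce_loss(q, q_star):
    """Binary cross-entropy between predicted quality q and target q_star."""
    q = min(max(q, EPS), 1 - EPS)
    return -(q_star * math.log(q) + (1 - q_star) * math.log(1 - q))


def iou_loss(pred, target):
    """1 - IoU of two (l, t, r, b) boxes that share the same anchor point."""
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    inter = (min(pl, tl) + min(pr, tr)) * (min(pt, tt) + min(pb, tb))
    union = (pl + pr) * (pt + pb) + (tl + tr) * (tt + tb) - inter
    return 1.0 - inter / union


def total_loss(samples, lam=1.0):
    """samples: dicts with keys p, c_star, q, q_star, t, t_star."""
    positives = [s for s in samples if s["c_star"] > 0]
    n_pos = max(1, len(positives))  # guard against division by zero
    cls = sum(focal_loss(s["p"], s["c_star"]) for s in samples)
    # Quality and regression terms are summed over positive locations only.
    quality = sum(bce_loss(s["q"], s["q_star"]) for s in positives)
    reg = sum(iou_loss(s["t"], s["t_star"]) for s in positives)
    return (cls + lam * quality + lam * reg) / n_pos


samples = [
    {"p": 0.8, "c_star": 1, "q": 0.7, "q_star": 0.9,
     "t": (22, 25, 35, 30), "t_star": (24, 26, 36, 34)},
    {"p": 0.1, "c_star": 0, "q": 0.2, "q_star": 0.0,
     "t": (5, 5, 5, 5), "t_star": (0, 0, 0, 0)},
]
loss = total_loss(samples)
```

Note that the classification term runs over every location, while the indicator $1_{\{c^{*}_{x,y}>0\}}$ restricts the other two terms to positives, which is why the negative sample's regression values never enter the sum.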
Supplementary Notes
In Appendix B, the authors discuss the processing at prediction time in more detail.
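The model takes the element-wise product of cls_score and the quality assessment output to obtain a score map. This score map then goes through a series of operations (multiplication by a Hanning window, among others) to produce the final score map $\tilde{s}[x]$.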
An argmax operation then gives the predicted bbox:
$$\begin{aligned} x^{*} &=\arg \max _{x \in[0 . . N-1] \otimes 2} \tilde{s}[x] \\ B_{\text{curr}} &=B\left[x^{*}\right] \end{aligned}$$
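The scoring and selection steps can be sketched as below. The notes only say that the fused map is multiplied by a Hanning window "among other operations", so this sketch applies just that multiplication and omits everything else in Appendix B; the function names and the toy 5x5 maps are hypothetical.

```python
import math


def hann_window(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def fuse_scores(cls_score, quality):
    """Element-wise product of the two maps, weighted by a 2-D Hann window."""
    n = len(cls_score)
    w = hann_window(n)
    return [[cls_score[y][x] * quality[y][x] * w[y] * w[x] for x in range(n)]
            for y in range(n)]


def argmax_2d(score):
    """Return the (x, y) index of the largest entry of score[y][x]."""
    best_xy, best_val = (0, 0), score[0][0]
    for y, row in enumerate(score):
        for x, val in enumerate(row):
            if val > best_val:
                best_xy, best_val = (x, y), val
    return best_xy


n = 5
cls_score = [[0.1] * n for _ in range(n)]
quality = [[0.5] * n for _ in range(n)]
cls_score[2][3], quality[2][3] = 0.9, 0.9    # strong, well-localized response
cls_score[1][1], quality[1][1] = 0.95, 0.6   # higher raw score, lower quality
score_map = fuse_scores(cls_score, quality)
x_star = argmax_2d(score_map)
# B_curr would be the regressed box stored at the same index, B[x_star].
```

In this toy case the location with the highest raw classification score loses to a neighbor with better quality and a more central position, which is the intended effect of the fusion.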
Finally, the bbox is updated according to the following equations ($\alpha$ is a hyperparameter):
$$\begin{aligned} \alpha^{\prime} &=\bar{s}\left[x^{*}\right] \cdot \alpha \\ B_{\text{pred}}.\text{size} &=\left(1-\alpha^{\prime}\right) \cdot B_{\text{prev}}.\text{size}+\alpha^{\prime} \cdot B_{\text{curr}}.\text{size} \end{aligned}$$
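The size update is a score-weighted moving average, which the following sketch makes concrete. The function name `smooth_size`, the example sizes, and the value `alpha=0.5` are assumptions for illustration.

```python
def smooth_size(prev_size, curr_size, score, alpha=0.5):
    """Blend the previous and current box sizes, weighted by the peak score.

    prev_size and curr_size are (width, height); `score` is the value of the
    score map at the selected location x*.
    """
    a = score * alpha  # alpha' in the equation
    prev_w, prev_h = prev_size
    curr_w, curr_h = curr_size
    return ((1 - a) * prev_w + a * curr_w,
            (1 - a) * prev_h + a * curr_h)


# A confident prediction (score 0.8) moves the size 40% of the way.
new_size = smooth_size((100.0, 50.0), (120.0, 60.0), score=0.8, alpha=0.5)
```

A low score shrinks the effective step, so an unreliable frame changes the box size only slightly.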
Experimental Results
Summary
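SiamFC++ keeps the simplicity and speed of SiamFC, and the experimental results in the paper show that the newly added regression branch substantially improves tracking performance. The paper also borrows several practices from object detection. Detection and tracking appear to be growing ever closer, so developments in object detection are worth following closely.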