Introduction
Paper: Distractor-aware Siamese Networks for Visual Object Tracking
Code: foolwood/DaSiamRPN
Reference: the comment section of the blog post "ECCV视觉目标跟踪之DaSiamRPN"
This paper builds on SiamRPN. Its motivation is twofold. First, previous Siamese trackers can only separate the foreground from non-semantic background (here "non-semantic" means not a real object, just background), which is mainly caused by the imbalance between non-semantic and semantic samples in the training data. Second, Siamese trackers only search the region near the previous target location, so tracking degrades when the target moves by a large amount.
Main Content
Distractor-aware Training
As shown in the figure above, SiamRPN still produces high classification scores after losing the target. The authors infer that SiamRPN has only learned to distinguish objectness from non-objectness, and attribute this to two causes: first, the positive samples do not cover enough categories; second, non-semantic samples and semantic samples are imbalanced.
To address these two problems, the authors improve training in three ways:
- Increase the number of training categories: besides the VID and Youtube-BB datasets, ImageNet and COCO are also used during training to enlarge the category diversity of the training samples.
- Add negative pairs: negatives are added in two ways. One is adding semantic background samples; the other is adding same-category and different-category negative pairs, where same-category negatives make the tracker more robust to similar distractors, and different-category negatives make it more robust when the target is occluded or leaves the frame.
- Customized data augmentation: augmentations such as rotation, scale change, illumination change, and motion blur make the trained model more robust.
Except the common translation, scale variations and illumination changes, we observe that the motion pattern can be easily modeled by the shallow layers in the network. We explicitly introduce motion blur in the data augmentation.
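The paper does not give implementation details for the motion-blur augmentation, but the idea is simple: convolve the image with a directional averaging kernel. A minimal sketch, assuming NumPy and a horizontal blur direction (the kernel length `ksize` is an illustrative parameter, not from the paper):

```python
import numpy as np

def motion_blur(image: np.ndarray, ksize: int = 9) -> np.ndarray:
    """Apply a simple horizontal motion-blur kernel to an H x W image.

    `ksize` is the blur length in pixels (illustrative, not a value
    taken from the paper).
    """
    kernel = np.ones(ksize) / ksize  # 1-D averaging kernel
    # Convolve each row independently; mode="same" keeps the image size.
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), axis=1, arr=image
    )

img = np.zeros((5, 15))
img[:, 7] = 1.0                      # a single bright vertical line
blurred = motion_blur(img, ksize=5)  # the line is smeared horizontally
```

A rotated kernel (or a library routine such as OpenCV's `filter2D`) would give blur along an arbitrary motion direction.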
Distractor-aware Incremental Learning
During online tracking, distractors are exploited to improve robustness, as follows:
In each frame, $17 \times 17 \times 5$ proposals are obtained, and NMS is used to remove redundant candidates; the proposal with the highest score is selected as the target $z_t$. Among the rest, proposals whose score exceeds a threshold are selected as distractors, forming the set $\mathcal{D}:=\left\{\forall d_{i} \in \mathcal{D}, f\left(z, d_{i}\right)>h \cap d_{i} \neq z_{t}\right\}$ with $|\mathcal{D}|=n$.
Specifically, we get $17 \times 17 \times 5$ proposals in each frame at first, and then we use NMS to reduce redundant candidates. The proposal with the highest score will be selected as the target $z_t$. For the remaining, the proposals with scores greater than a threshold are selected as distractors.
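The selection step can be sketched as follows, assuming NMS has already been applied and that `proposals`/`scores` hold the surviving proposal features and their classification scores (these names are illustrative, not from the released code):

```python
import numpy as np

def select_target_and_distractors(proposals, scores, threshold):
    """Pick the top-scoring proposal as the target z_t, and keep the
    remaining proposals whose score exceeds `threshold` as the
    distractor set D.
    """
    order = np.argsort(scores)[::-1]       # highest score first
    target = proposals[order[0]]           # z_t
    rest = order[1:]
    keep = rest[scores[rest] > threshold]  # score > h
    distractors = proposals[keep]          # D, with |D| = n
    return target, distractors

props = np.arange(10, dtype=float).reshape(5, 2)
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.8])
z_t, D = select_target_and_distractors(props, scores, threshold=0.5)
# z_t is props[0]; D holds props[4] and props[2] (scores 0.8 and 0.7)
```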
The target is then selected by:
$$q=\underset{p_{k} \in \mathcal{P}}{\operatorname{argmax}}\, f\left(z, p_{k}\right)-\frac{\hat{\alpha} \sum_{i=1}^{n} \alpha_{i} f\left(d_{i}, p_{k}\right)}{\sum_{i=1}^{n} \alpha_{i}}$$
Since $f(z,x)$ is a linear operation, this can be expanded as:
$$q=\underset{p_{k} \in \mathcal{P}}{\operatorname{argmax}}\left(\varphi(z)-\frac{\hat{\alpha} \sum_{i=1}^{n} \alpha_{i} \varphi\left(d_{i}\right)}{\sum_{i=1}^{n} \alpha_{i}}\right) \star \varphi\left(p_{k}\right)$$
Finally, the target in frame $T+1$ is predicted with the formula below, which is the previous one with the exemplar and distractor terms additionally averaged over all past frames using time weights $\beta_t$:
$$q_{T+1}=\underset{p_{k} \in \mathcal{P}}{\operatorname{argmax}}\left(\frac{\sum_{t=1}^{T} \beta_{t} \varphi\left(z_{t}\right)}{\sum_{t=1}^{T} \beta_{t}}-\frac{\sum_{t=1}^{T} \beta_{t} \hat{\alpha} \sum_{i=1}^{n} \alpha_{i} \varphi\left(d_{i, t}\right)}{\sum_{t=1}^{T} \beta_{t} \sum_{i=1}^{n} \alpha_{i}}\right) \star \varphi\left(p_{k}\right)$$
The weight factor $\alpha_i$ can be viewed as the dual variables with sparse regularization, and the exemplars and distractors can be viewed as positive and negative samples in correlation filters.
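The incremental scoring formula above can be sketched numerically. A minimal version, assuming features $\varphi(\cdot)$ are plain vectors and a dot product stands in for the correlation $\star$ (all weights and shapes here are illustrative):

```python
import numpy as np

def distractor_aware_score(z_feats, d_feats_per_frame, betas, alphas,
                           alpha_hat, candidates):
    """Score candidate features phi(p_k) against the template

        sum_t beta_t phi(z_t) / sum_t beta_t
        - alpha_hat * sum_t beta_t sum_i alpha_i phi(d_{i,t})
          / (sum_t beta_t * sum_i alpha_i)

    using a dot product in place of the correlation operator.
    """
    betas = np.asarray(betas, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Time-weighted exemplar average over frames 1..T.
    exemplar = (betas[:, None] * np.asarray(z_feats)).sum(0) / betas.sum()
    # Per-frame alpha-weighted sum over the n distractors.
    d_sum = np.stack([
        (alphas[:, None] * np.asarray(d)).sum(0) for d in d_feats_per_frame
    ])
    distract = (alpha_hat * (betas[:, None] * d_sum).sum(0)
                / (betas.sum() * alphas.sum()))
    template = exemplar - distract
    return np.asarray(candidates) @ template  # one score per candidate

scores = distractor_aware_score(
    z_feats=[[1.0, 0.0], [1.0, 0.0]],                # phi(z_t), T = 2 frames
    d_feats_per_frame=[[[0.0, 1.0]], [[0.0, 1.0]]],  # one distractor per frame
    betas=[1.0, 1.0], alphas=[1.0], alpha_hat=0.5,
    candidates=[[1.0, 0.0], [0.0, 1.0]],
)
best = int(np.argmax(scores))  # the candidate matching the exemplar wins
```

The candidate resembling the accumulated exemplar scores high, while the one resembling past distractors is pushed down, which is exactly the suppression the formula encodes.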
DaSiamRPN for Long-term Tracking
To handle target loss in long-term tracking, the model enlarges the search region by a constant step once tracking fails, until the target is found again (the key issue is how to correctly detect when the tracking failure starts and ends, which the paper does not detail).
During failure cases, we gradually increase the search region by local-to-global strategy. Specifically, the size of search region is iteratively growing with a constant step when failed tracking is indicated.
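The local-to-global strategy can be sketched as a simple loop; the failure indicator, step size, and upper bound below are all assumptions for illustration, since the paper does not specify them:

```python
def expand_search_region(base_size, step, max_size, tracking_failed):
    """Grow the search window by a constant `step` while `tracking_failed`
    reports failure, up to `max_size` (a local-to-global strategy).

    `tracking_failed` is a callable taking the current size and returning
    True while the target is still lost; all names are illustrative.
    """
    size = base_size
    while tracking_failed(size) and size + step <= max_size:
        size += step
    return size

# Hypothetical indicator: the target becomes findable once the window
# reaches 250 pixels.
final = expand_search_region(100, 50, 400, lambda s: s < 250)
```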
Experimental Results
Summary
This paper identifies a weakness of Siamese trackers from the perspective of the training data; it is clearly written and well worth reading!