[Paper Reading, CVPR 2020, Object Tracking] SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking

lingqing97

Article link: https://blog.csdn.net/qq_39621037/article/details/115016600

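Introduction

Paper: SiamCAR: Siamese Fully Convolutional Classification and Regression for Visual Tracking

Code: ohhhyeahhh/SiamCAR

The motivation of this paper: trackers in the SiamRPN family rely on an RPN for classification and regression, and these RPN-based trackers only perform well when the anchor-box parameters are set carefully, which makes them demanding to tune. In response, the paper builds a simple and efficient tracking model on top of a simple network structure.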

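Main Content

[Figure: SiamCAR network architecture]

The figure above shows the SiamCAR network architecture. SiamCAR consists of two parts: a Siamese subnetwork that extracts features, and a classification-regression subnetwork that performs classification and regression. The paper's main contribution lies in the second part, the classification-regression subnetwork.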

Feature Extraction

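SiamCAR's feature extractor is the modified ResNet-50 from SiamRPN++. As in SiamRPN++, the features from the last three blocks of ResNet-50 are concatenated to produce the template-patch feature map $\varphi(Z)$ and the search-region feature map $\varphi(X)$.

The correlation is then computed with the depthwise cross-correlation from SiamRPN++: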

$R=\varphi(X) \star \varphi(Z)$
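To make the depthwise cross-correlation concrete, here is a minimal pure-Python sketch on nested lists. It is not the authors' implementation (a real tracker would run this on GPU tensors, typically as a grouped convolution); the function name `depthwise_xcorr` and the toy feature maps are assumptions made for illustration. Each template channel slides over the search channel with the same index, so the response keeps one channel per input channel.

```python
def depthwise_xcorr(search, template):
    """Depthwise cross-correlation on nested lists shaped [channel][row][col].

    Each template channel is correlated only with the matching search
    channel, so the output has the same number of channels as the inputs.
    """
    assert len(search) == len(template), "channel counts must match"
    out = []
    for x_c, z_c in zip(search, template):
        xh, xw = len(x_c), len(x_c[0])
        zh, zw = len(z_c), len(z_c[0])
        resp = []
        for i in range(xh - zh + 1):          # valid positions only, stride 1
            row = []
            for j in range(xw - zw + 1):
                acc = 0.0
                for u in range(zh):
                    for v in range(zw):
                        acc += x_c[i + u][j + v] * z_c[u][v]
                row.append(acc)
            resp.append(row)
        out.append(resp)
    return out


# Toy example (hypothetical values): 2 channels, 3x3 search, 2x2 template.
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
template = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
R = depthwise_xcorr(search, template)
print(R)  # 2 channels, each a 2x2 response map
```

With a 3x3 search map and a 2x2 template, each channel of the response is 2x2, which mirrors how the response map in the paper is smaller than the search-region feature map.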

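Because the resulting response map $R$ has too many channels ($256 \times 3$), the authors apply a $1 \times 1$ convolution to reduce its dimensionality, which cuts the number of model parameters and speeds up inference.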
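A $1 \times 1$ convolution is just a per-location linear mix of channels, which the following sketch illustrates. The function name `conv1x1`, the weights, and the tiny channel counts are assumptions for illustration; the sketch does not reproduce the channel sizes or learned weights used in SiamCAR.

```python
def conv1x1(feature, weight, bias=None):
    """Apply a 1x1 convolution to a [C_in][H][W] nested list.

    weight is [C_out][C_in]: every output channel is a weighted sum of the
    input channels at the same spatial location, so H and W are unchanged.
    """
    c_in = len(feature)
    h, w = len(feature[0]), len(feature[0][0])
    out = []
    for o, w_row in enumerate(weight):
        assert len(w_row) == c_in, "weight row must have one entry per input channel"
        b = bias[o] if bias is not None else 0.0
        out.append([
            [b + sum(w_row[c] * feature[c][i][j] for c in range(c_in))
             for j in range(w)]
            for i in range(h)
        ])
    return out


# Toy example (hypothetical values): reduce 3 channels to 1 on a 1x2 map.
feature = [[[1, 2]], [[3, 4]], [[5, 6]]]
weight = [[1, 0, -1]]            # C_out = 1, C_in = 3
reduced = conv1x1(feature, weight)
print(reduced)                   # one channel, same 1x2 spatial size
```

The layer itself needs only `C_out * C_in` weights, and every layer that follows sees fewer input channels, which is where the parameter and speed savings come from.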

Bounding Box Prediction

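The classification-regression subnetwork is split into a classification branch and a regression branch. The regression branch outputs $(l, t, r, b)$, the distances from the corresponding location in the search region to the four sides of the bounding box. Besides the classification output, the classification branch also has an added centerness branch that outputs a centerness score. The motivation is the authors' observation that locations farther from the target center produce lower-quality bounding boxes: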

An observation is that the locations far away from the center of an target tend to produce low-quality predicted bounding boxes, which reduces the performance of the tracking system.
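The $(l, t, r, b)$ output of the regression branch can be sketched as follows. This is an illustrative encoding and decoding under the assumption that a box is given as `(x0, y0, x1, y1)` (top-left and bottom-right corners) in search-region coordinates; the function names are hypothetical, not taken from the SiamCAR code.

```python
def regression_target(x, y, box):
    """Distances from point (x, y) to the left, top, right, bottom sides of box."""
    x0, y0, x1, y1 = box
    return (x - x0, y - y0, x1 - x, y1 - y)


def is_inside(target):
    """A location counts as positive only if all four distances are positive."""
    return all(d > 0 for d in target)


def decode_box(x, y, ltrb):
    """Inverse of regression_target: recover (x0, y0, x1, y1) from (l, t, r, b)."""
    l, t, r, b = ltrb
    return (x - l, y - t, x + r, y + b)


# Toy example (hypothetical values).
box = (10, 20, 50, 80)
inside_pt = regression_target(30, 40, box)    # point within the box
outside_pt = regression_target(5, 40, box)    # point left of the box
print(inside_pt, is_inside(inside_pt))
print(outside_pt, is_inside(outside_pt))
print(decode_box(30, 40, inside_pt))
```

Note that `l + r` and `t + b` give the box width and height regardless of which inside location made the prediction.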

For training, the regression branch uses an IoU loss and the classification branch uses a cross-entropy loss. The centerness branch uses the centerness score $C(i, j)$ and the loss defined below:

$C(i, j)=\mathbb{I}\left(\tilde{t}_{(i, j)}\right) * \sqrt{\frac{\min (\tilde{l}, \tilde{r})}{\max (\tilde{l}, \tilde{r})} \times \frac{\min (\tilde{t}, \tilde{b})}{\max (\tilde{t}, \tilde{b})}}$

$\begin{aligned} \mathcal{L}_{c e n} &=\frac{-1}{\sum \mathbb{I}\left(\tilde{t}_{(i, j)}\right)} \sum_{\mathbb{I}\left(\tilde{t}_{(i, j)}\right)==1} C(i, j) * \log A_{w \times h \times 1}^{c e n}(i, j) +(1-C(i, j)) * \log \left(1-A_{w \times h \times 1}^{c e n}(i, j)\right) \end{aligned}$
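The centerness score and its loss can be sketched in a few lines. This is a plain-Python illustration of the two formulas above, not the authors' code: the function names, the clamping constant `eps`, and the choice to return 0 when no location lies inside the box are assumptions added to keep the sketch runnable.

```python
import math


def centerness(l, t, r, b):
    """Centerness target C(i, j) from the regression targets of one location."""
    # Indicator term: a location outside the box gets a score of 0.
    if min(l, t, r, b) <= 0:
        return 0.0
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def centerness_loss(pred, targets, eps=1e-12):
    """Binary cross-entropy between predicted centerness and C(i, j),
    averaged over the locations that lie inside the box.

    pred:    predicted centerness per location, each in (0, 1)
    targets: (l, t, r, b) regression targets per location
    """
    total, count = 0.0, 0
    for a, (l, t, r, b) in zip(pred, targets):
        if min(l, t, r, b) <= 0:
            continue  # indicator is 0, so the location is skipped
        c = centerness(l, t, r, b)
        a = min(max(a, eps), 1.0 - eps)  # clamp so log() stays finite (assumption)
        total += c * math.log(a) + (1.0 - c) * math.log(1.0 - a)
        count += 1
    # Returning 0 when no location is inside the box is a convention chosen here.
    return -total / count if count else 0.0


# Toy example (hypothetical values): box center, outside point, off-center point.
targets = [(20, 20, 20, 20), (-5, 20, 45, 40), (10, 20, 30, 20)]
pred = [0.9, 0.5, 0.5]
print([centerness(*t) for t in targets])
print(centerness_loss(pred, targets))
```

The score is 1 at the exact box center and decays toward 0 as the location approaches any side, which is what lets it down-weight predictions made far from the target center.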

The total loss is:

$\mathcal{L}=\mathcal{L}_{c l s}+\lambda_{1} \mathcal{L}_{c e n}+\lambda_{2} \mathcal{L}_{r e g}$
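The sketch below shows how the three terms combine, together with one common form of IoU loss for boxes expressed as $(l, t, r, b)$ around a shared location. The $-\ln(\mathrm{IoU})$ form is an assumption here (the text above only says "IoU loss"), the function names are hypothetical, and the $\lambda$ values in the example are illustrative rather than quoted from the paper.

```python
import math


def iou_ltrb(pred, target):
    """IoU of two boxes that share the same anchor point, each given as (l, t, r, b).

    Assumes all distances are positive, i.e. the point is inside both boxes.
    """
    pl, pt, pr, pb = pred
    gl, gt, gr, gb = target
    inter_w = min(pl, gl) + min(pr, gr)
    inter_h = min(pt, gt) + min(pb, gb)
    inter = inter_w * inter_h
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt + gb) - inter
    return inter / union


def iou_loss(pred, target):
    """Negative log IoU: 0 for a perfect match, larger as overlap shrinks."""
    return -math.log(iou_ltrb(pred, target))


def total_loss(l_cls, l_cen, l_reg, lambda1, lambda2):
    """L = L_cls + lambda1 * L_cen + lambda2 * L_reg."""
    return l_cls + lambda1 * l_cen + lambda2 * l_reg


# Toy example (hypothetical values).
gt = (10, 10, 10, 10)
pred = (5, 10, 10, 10)                 # left side predicted too close
print(iou_ltrb(pred, gt))              # overlap ratio of the two boxes
print(iou_loss(pred, gt))
print(total_loss(0.5, 0.2, iou_loss(pred, gt), lambda1=1.0, lambda2=3.0))
```

Because both boxes are measured from the same point, the intersection width and height reduce to sums of element-wise minima, which is why no corner coordinates are needed.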

The Tracking Phase

At tracking time, for each location $(i, j)$ the network outputs a 6-dimensional vector $(cls, cen, l, t, r, b)$:

For a location $(i, j)$, the proposed method produces a 6D vector $T_{ij} = (cls, cen, l, t, r, b)$, where $cls$ represents the foreground score of classification, $cen$ represents the centerness score, and $l + r$ and $t + b$ represent the predicted width and height of the target in current frame.

The location $q$ with the highest score is therefore computed with the following formula:

$q=\arg \max _{i, j}\left\{\left(1-\lambda_{d}\right) c l s_{i j} \times p_{i j}+\lambda_{d} H_{i j}\right\}$

where $H$ is the cosine window and $\lambda_d$ is the balance weight. The output $q$ is a queried location with the highest score being a target pixel.
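The selection of $q$ can be sketched as a weighted sum followed by an argmax. The text above does not define $p_{ij}$, so the sketch simply takes it as a given per-location weight map (an assumption), builds $H$ as the outer product of two Hann windows (a common way to form a cosine window, also an assumption), and uses hypothetical function names and toy scores.

```python
import math


def hann_window(n):
    """1D Hann (cosine) window of length n, 0 at the ends and 1 in the middle."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def cosine_window_2d(h, w):
    """2D cosine window as the outer product of two 1D Hann windows."""
    wy, wx = hann_window(h), hann_window(w)
    return [[a * b for b in wx] for a in wy]


def best_location(cls, penalty, lambda_d):
    """Return (q, score) maximizing (1 - lambda_d) * cls * p + lambda_d * H."""
    h, w = len(cls), len(cls[0])
    H = cosine_window_2d(h, w)
    best, best_score = None, -math.inf
    for i in range(h):
        for j in range(w):
            s = (1 - lambda_d) * cls[i][j] * penalty[i][j] + lambda_d * H[i][j]
            if s > best_score:
                best, best_score = (i, j), s
    return best, best_score


# Toy example (hypothetical values): a strong response in the corner
# competes with a slightly weaker one at the center.
cls = [[0.9, 0.1, 0.1],
       [0.1, 0.8, 0.1],
       [0.1, 0.1, 0.1]]
penalty = [[1.0] * 3 for _ in range(3)]
q_raw, _ = best_location(cls, penalty, lambda_d=0.0)
q_win, s_win = best_location(cls, penalty, lambda_d=0.3)
print(q_raw, q_win)
```

In the toy example the corner wins without the window, while the center wins once the window is mixed in, which shows how $H$ suppresses large jumps away from the previous target position.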

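Because the pixels around $q$ are also likely to be target pixels, the authors additionally compute the $cls_{ij} \times p_{ij}$ scores of the $n$ neighbors of $q$, and take the weighted average of the top-$k$ regression boxes as the final target box.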

We observed that the pixels located around $q$ are more likely to be the target pixel. Hence we choose the top-k points from n neighborhoods of $q$ according to the value $cls_{ij} \times p_{ij}$. The final prediction is the weighted average of the selected $k$ regression boxes.
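The top-k refinement can be sketched as below. Several details are assumptions, since the text does not spell them out: the neighborhood is taken as a square window of radius `radius` around $q$ (including $q$ itself), the weights are the normalized $cls_{ij} \times p_{ij}$ scores, and each location's regression output is assumed to be already decoded into an absolute box `(x0, y0, x1, y1)` so that boxes from different locations can be averaged. The function name `refine_box` is hypothetical.

```python
def refine_box(q, score, boxes, radius=1, k=3):
    """Weighted average of the top-k boxes in a window around q.

    score[i][j]: cls * p at location (i, j)
    boxes[i][j]: decoded box (x0, y0, x1, y1) predicted at location (i, j)
    """
    h, w = len(score), len(score[0])
    qi, qj = q
    candidates = []
    for i in range(max(0, qi - radius), min(h, qi + radius + 1)):
        for j in range(max(0, qj - radius), min(w, qj + radius + 1)):
            candidates.append((score[i][j], boxes[i][j]))
    # Keep the k highest-scoring neighbors.
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = candidates[:k]
    total = sum(s for s, _ in top)
    if total <= 0:
        return boxes[qi][qj]  # fall back to the box at q (convention chosen here)
    return tuple(sum(s * b[d] for s, b in top) / total for d in range(4))


# Toy example (hypothetical values): two confident neighbors out of nine.
score = [[0.1, 0.1, 0.1],
         [0.1, 0.75, 0.25],
         [0.1, 0.1, 0.1]]
boxes = [[(0, 0, 0, 0)] * 3 for _ in range(3)]
boxes[1][1] = (0, 0, 10, 10)
boxes[1][2] = (4, 4, 14, 14)
final_box = refine_box((1, 1), score, boxes, radius=1, k=2)
print(final_box)
```

Averaging several nearby predictions instead of trusting a single location tends to smooth out frame-to-frame jitter in the predicted box.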