[Paper Reading, CVPR 2019, Object Tracking] Unsupervised Deep Tracking

lingqing97

Category: Paper Reading. Tags: object tracking, computer vision, deep learning, machine learning

Article link: https://blog.csdn.net/qq_39621037/article/details/115872574

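Introduction

paper: Unsupervised Deep Tracking

code: 594422814/UDT_pytorch

The highlight of this paper is that it trains an object tracker with unsupervised learning and still reaches accuracy comparable to supervised models.

The basic idea is as follows. First, the current frame is used as the template frame and the next frame as the search frame, which gives the response $R_S$. Then the roles are reversed: the next frame becomes the template frame with the predicted $R_S$ as its label, and the current frame becomes the search frame, which gives the response $R_T$ on the current frame. The model is trained by minimizing the error between $R_T$ and the true (Gaussian) label $Y_T$. This is like playing a video forward and then backward: the target should end up at the same position in the current frame. (See the figure below.)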

[Figure: forward and backward tracking between the current frame and the next frame]

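Main Content

[Figure: overall framework of the model]

The figure above shows the main framework of the model in this paper. The tracker is built on DCFNet, and the training process is split into two steps, Forward and Backward.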

Revisiting Correlation Tracking

The paper builds on the correlation filter framework. A correlation filter model looks for the optimal filter $W$ that minimizes the following objective:

$\min _{\mathbf{W}}\|\mathbf{W} * \mathbf{X}-\mathbf{Y}\|_{2}^{2}+\lambda\|\mathbf{W}\|_{2}^{2}$

Here, $X$ is the template patch, $Y$ is the ground-truth label, $\lambda$ is a regularization parameter, and $*$ denotes circular convolution.

Moving the problem to the Fourier domain gives the filter $W$ in closed form:

$\mathbf{W}=\mathscr{F}^{-1}\left(\frac{\mathscr{F}(\mathbf{X}) \odot \mathscr{F}^{\star}(\mathbf{Y})}{\mathscr{F}^{\star}(\mathbf{X}) \odot \mathscr{F}(\mathbf{X})+\lambda}\right)$

where $\odot$ is the element-wise product, $\mathscr{F}(\cdot)$ is the Discrete Fourier Transform (DFT), $\mathscr{F}^{-1}(\cdot)$ is the inverse DFT, and $\star$ denotes the complex-conjugate operation.

The response map $R$ for a search patch $Z$ is then obtained as:

$\mathbf{R}=\mathbf{W} * \mathbf{Z}=\mathscr{F}^{-1}\left(\mathscr{F}^{\star}(\mathbf{W}) \odot \mathscr{F}(\mathbf{Z})\right)$
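The two formulas above can be checked numerically. The following is a minimal NumPy sketch, not the code from the paper's repository: it assumes a single-channel patch, uses random values as a stand-in for real image features, and the helper names `gaussian_label`, `train_filter` and `response` are made up for illustration. The filter is kept in the Fourier domain, so `train_filter` returns $\mathscr{F}(\mathbf{W})$ directly.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Gaussian label Y with its peak at the patch center (h // 2, w // 2)."""
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    return np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2.0 * sigma ** 2))

def train_filter(x, y, lam=1e-4):
    """Return F(W) = F(X) . conj(F(Y)) / (conj(F(X)) . F(X) + lambda)."""
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    return x_hat * np.conj(y_hat) / (np.conj(x_hat) * x_hat + lam)

def response(w_hat, z):
    """Return R = F^-1(conj(F(W)) . F(Z)) as a real-valued map."""
    return np.real(np.fft.ifft2(np.conj(w_hat) * np.fft.fft2(z)))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))          # template patch (random stand-in)
y = gaussian_label(32, 32)                 # label peaked at (16, 16)
z = np.roll(x, shift=(3, 5), axis=(0, 1))  # search patch: template shifted by (3, 5)

r = response(train_filter(x, y), z)
peak = np.unravel_index(np.argmax(r), r.shape)
print(peak)  # the peak should follow the shift, i.e. move away from (16, 16) by (3, 5)
```

Because the convolution is circular, a circular shift of the template moves the response peak by the same offset, which is how the tracker reads off the target displacement.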

Unsupervised Learning Prototype

The unsupervised learning process consists of two parts: Forward Tracking and Backward Tracking.

Forward Tracking

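In forward tracking, a template patch $T$ is cropped from the first frame $P_1$ and a search patch $S$ is cropped from the next frame. With the Gaussian label $Y_T$ assigned to $T$, the DCF formula (the filter equation above) gives the filter $W_T$, where $\varphi_{\theta}(\cdot)$ denotes the feature extraction network with parameters $\theta$: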

$\mathbf{W}_{\mathbf{T}}=\mathscr{F}^{-1}\left(\frac{\mathscr{F}\left(\varphi_{\theta}(\mathbf{T})\right) \odot \mathscr{F}^{\star}\left(\mathbf{Y}_{\mathbf{T}}\right)}{\mathscr{F}^{\star}\left(\varphi_{\theta}(\mathbf{T})\right) \odot \mathscr{F}\left(\varphi_{\theta}(\mathbf{T})\right)+\lambda}\right)$

and the response on the search patch is:

$\mathbf{R}_{\mathbf{S}}=\mathscr{F}^{-1}\left(\mathscr{F}^{\star}\left(\mathbf{W}_{\mathbf{T}}\right) \odot \mathscr{F}\left(\varphi_{\theta}(\mathbf{S})\right)\right)$

Backward Tracking

In backward tracking, the response $R_S$ from forward tracking is used as the label $Y_S$, $S$ becomes the template patch, and $T$ becomes the search patch. The backward filter $W_S$ is:

$\mathbf{W}_{\mathbf{S}}=\mathscr{F}^{-1}\left(\frac{\mathscr{F}\left(\varphi_{\theta}(\mathbf{S})\right) \odot \mathscr{F}^{\star}\left(\mathbf{Y}_{\mathbf{S}}\right)}{\mathscr{F}^{\star}\left(\varphi_{\theta}(\mathbf{S})\right) \odot \mathscr{F}\left(\varphi_{\theta}(\mathbf{S})\right)+\lambda}\right)$

Likewise, the backward response $R_T$ is:

$\mathbf{R}_{\mathbf{T}}=\mathscr{F}^{-1}\left(\mathscr{F}^{\star}\left(\mathbf{W}_{\mathbf{S}}\right) \odot \mathscr{F}\left(\varphi_{\theta}(\mathbf{T})\right)\right)$

Consistency Loss Computation

If tracking is consistent, $R_T$ should be as similar as possible to the original label $Y_T$, so the error between them is minimized:

$\mathcal{L}_{\text {un }}=\left\|\mathbf{R}_{\mathbf{T}}-\mathbf{Y}_{\mathbf{T}}\right\|_{2}^{2}$
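The forward step, the backward step and the consistency loss can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the paper's implementation: random arrays of shape `(C, H, W)` stand in for the features $\varphi_{\theta}(\cdot)$, the filter uses the multi-channel form with a sum over channels (as in DCFNet, which the formulas above leave implicit), and all function names are hypothetical.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Gaussian label with its peak at the patch center."""
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    return np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2.0 * sigma ** 2))

def train_filter(feat, label, lam=1e-4):
    """Multi-channel DCF in the Fourier domain; feat has shape (C, H, W)."""
    f_hat = np.fft.fft2(feat)                    # FFT over the last two axes
    energy = np.sum(np.abs(f_hat) ** 2, axis=0)  # assumption: sum over channels
    return f_hat * np.conj(np.fft.fft2(label)) / (energy + lam)

def response(w_hat, feat):
    """Response map: sum the per-channel correlations, then inverse FFT."""
    r_hat = np.sum(np.conj(w_hat) * np.fft.fft2(feat), axis=0)
    return np.real(np.fft.ifft2(r_hat))

def forward_backward_loss(feat_t, feat_s, y_t, lam=1e-4):
    """L_un = ||R_T - Y_T||^2 after one forward and one backward step."""
    w_t = train_filter(feat_t, y_t, lam)   # forward: filter from T and Y_T
    r_s = response(w_t, feat_s)            # R_S on the search patch
    w_s = train_filter(feat_s, r_s, lam)   # backward: R_S acts as the label Y_S
    r_t = response(w_s, feat_t)            # R_T on the template patch
    return float(np.sum((r_t - y_t) ** 2))

rng = np.random.default_rng(0)
feat_t = rng.standard_normal((4, 32, 32))            # stand-in for phi(T)
feat_s = np.roll(feat_t, shift=(3, 5), axis=(1, 2))  # same content, shifted
feat_u = rng.standard_normal((4, 32, 32))            # unrelated content
y_t = gaussian_label(32, 32)

loss_consistent = forward_backward_loss(feat_t, feat_s, y_t)
loss_unrelated = forward_backward_loss(feat_t, feat_u, y_t)
print(loss_consistent, loss_unrelated)
```

In this toy, the loss is close to zero when the search patch really contains the shifted template and clearly larger when the two patches are unrelated, which is the signal the network is trained on.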

Finally, the network is updated by back-propagating the following partial derivatives:

$\begin{aligned} \frac{\partial \mathcal{L}_{\mathrm{un}}}{\partial \varphi_{\theta}(\mathbf{T})} &=\mathscr{F}^{-1}\left(\frac{\partial \mathcal{L}_{\mathrm{un}}}{\partial\left(\mathscr{F}\left(\varphi_{\theta}(\mathbf{T})\right)\right)^{\star}}+\left(\frac{\partial \mathcal{L}_{\mathrm{un}}}{\partial\left(\mathscr{F}\left(\varphi_{\theta}(\mathbf{T})\right)\right)}\right)^{\star}\right) \\ \frac{\partial \mathcal{L}_{\mathrm{un}}}{\partial \varphi_{\theta}(\mathbf{S})} &=\mathscr{F}^{-1}\left(\frac{\partial \mathcal{L}_{\mathrm{un}}}{\partial\left(\mathscr{F}\left(\varphi_{\theta}(\mathbf{S})\right)\right)^{\star}}\right) \end{aligned}$

Unsupervised Learning Improvements

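In practice, the authors found two problems. First, the tracker may drift away from the target during forward tracking and still return to the original position during backward tracking. Second, some training samples carry no semantic information, or the target in them is occluded.

The authors propose two new measures to deal with these cases.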

Multiple Frames Validation

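[Figure: two-frame validation compared with multiple-frame validation]

To address the first problem, the authors use three neighboring frames so that prediction errors accumulate and become easier to detect. As shown on the right side of the figure above, the prediction on the second frame is used for one more forward tracking step, which gives the prediction on the third frame. The third frame is then used as the template and the first frame as the search patch for backward tracking, and the new loss is defined as: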

$\mathcal{L}_{\mathrm{un}}=\left\|\widetilde{\mathbf{R}}_{\mathbf{T}}-\mathbf{Y}_{\mathbf{T}}\right\|_{2}^{2}$

where $\widetilde{\mathbf{R}}_{\mathbf{T}}$ is the response map generated by an additional frame during the backward tracking step.
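The three-frame scheme can be written as a small generalization of the two-frame cycle. The sketch below is a toy under the same kind of assumptions as any quick DCF experiment: random `(C, H, W)` arrays replace real features, the filter sums over channels as in DCFNet, each forward response is used directly as the pseudo-label of the next frame, and the name `cycle_loss` is hypothetical.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Gaussian label with its peak at the patch center."""
    ys = np.arange(h) - h // 2
    xs = np.arange(w) - w // 2
    return np.exp(-(ys[:, None] ** 2 + xs[None, :] ** 2) / (2.0 * sigma ** 2))

def train_filter(feat, label, lam=1e-4):
    """Multi-channel DCF in the Fourier domain; feat has shape (C, H, W)."""
    f_hat = np.fft.fft2(feat)
    energy = np.sum(np.abs(f_hat) ** 2, axis=0)
    return f_hat * np.conj(np.fft.fft2(label)) / (energy + lam)

def response(w_hat, feat):
    """Response map: sum the per-channel correlations, then inverse FFT."""
    return np.real(np.fft.ifft2(np.sum(np.conj(w_hat) * np.fft.fft2(feat), axis=0)))

def cycle_loss(feats, y_t, lam=1e-4):
    """feats = [T, S1, S2, ...]: track forward along the list, then track
    backward in one step from the last frame to T and compare with Y_T."""
    label = y_t
    for cur, nxt in zip(feats[:-1], feats[1:]):
        # forward step: the response on the next frame becomes its pseudo-label
        label = response(train_filter(cur, label, lam), nxt)
    r_t = response(train_filter(feats[-1], label, lam), feats[0])
    return float(np.sum((r_t - y_t) ** 2)), label

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 32, 32))
s1 = np.roll(t, shift=(2, 3), axis=(1, 2))    # second frame: target moved by (2, 3)
s2 = np.roll(t, shift=(5, -4), axis=(1, 2))   # third frame: target moved by (5, -4)
y_t = gaussian_label(32, 32)

loss, r_s2 = cycle_loss([t, s1, s2], y_t)
peak_s2 = np.unravel_index(np.argmax(r_s2), r_s2.shape)
print(loss, peak_s2)
```

Passing `[t, s1]` instead of `[t, s1, s2]` gives back the plain two-frame cycle, so the same helper covers both variants.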

Cost-sensitive Loss

To address the second problem, the authors introduce a motion weight vector $A_{motion}$. Its element $A^{i}_{motion}$ for the $i$-th training pair is computed as:

$\mathbf{A}_{\mathrm{motion}}^{i}=\left\|\mathbf{R}_{\mathbf{S}_{1}}^{i}-\mathbf{Y}_{\mathbf{T}}^{i}\right\|_{2}^{2}+\left\|\mathbf{R}_{\mathbf{S}_{2}}^{i}-\mathbf{Y}_{\mathbf{S}_{1}}^{i}\right\|_{2}^{2}$

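A larger $A^{i}_{motion}$ means the target moves more across these consecutive frames.

$A^{i}_{motion}$ is then normalized (this is a normalization, not a regularization). In the paper, $A_{drop}$ is a binary (0 or 1) weight vector that drops the training pairs with the highest loss: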

$\mathbf{A}_{\text {norm }}^{i}=\frac{\mathbf{A}_{\text {drop }}^{i} \cdot \mathbf{A}_{\text {motion }}^{i}}{\sum_{i=1}^{n} \mathbf{A}_{\text {drop }}^{i} \cdot \mathbf{A}_{\text {motion }}^{i}}$

The final loss function is:

$\mathcal{L}_{\mathrm{un}}=\frac{1}{n} \sum_{i=1}^{n} \mathbf{A}_{\mathrm{norm}}^{i} \cdot\left\|\widetilde{\mathbf{R}}_{\mathbf{T}}^{i}-\mathbf{Y}_{\mathbf{T}}^{i}\right\|_{2}^{2} .$
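The weighting can be illustrated on a mini-batch of response maps. This is a hedged sketch rather than the authors' code: responses are passed in as arrays of shape `(n, H, W)`, the rule used to build $A_{drop}$ (zeroing the `drop_ratio` fraction of pairs with the highest consistency error) is an assumption, and `cost_sensitive_loss` is a hypothetical name.

```python
import numpy as np

def cost_sensitive_loss(r_t, y_t, r_s1, r_s2, y_s1, drop_ratio=0.1):
    """Weighted consistency loss over n training pairs; inputs are (n, H, W)."""
    n = r_t.shape[0]
    # per-pair consistency error ||R_T^i - Y_T^i||^2
    per_pair = np.sum((r_t - y_t) ** 2, axis=(1, 2))
    # motion weight A_motion^i = ||R_S1^i - Y_T^i||^2 + ||R_S2^i - Y_S1^i||^2
    a_motion = (np.sum((r_s1 - y_t) ** 2, axis=(1, 2))
                + np.sum((r_s2 - y_s1) ** 2, axis=(1, 2)))
    # A_drop: assumption, zero out the pairs with the highest per-pair error
    a_drop = np.ones(n)
    n_drop = int(round(drop_ratio * n))
    if n_drop > 0:
        a_drop[np.argsort(per_pair)[-n_drop:]] = 0.0
    # A_norm^i = A_drop^i * A_motion^i / sum_i(A_drop^i * A_motion^i)
    weighted = a_drop * a_motion
    a_norm = weighted / max(weighted.sum(), 1e-12)  # guard against an all-zero batch
    return float(np.sum(a_norm * per_pair) / n), a_norm

# Tiny hand-made batch: 10 pairs with 1x1 "maps" so the numbers are easy to follow.
n = 10
y_t = np.zeros((n, 1, 1))
r_t = np.arange(1, n + 1, dtype=float).reshape(n, 1, 1)  # errors 1, 4, ..., 100
r_s1 = np.ones((n, 1, 1))                                # equal motion for all pairs
r_s2 = np.zeros((n, 1, 1))
y_s1 = np.zeros((n, 1, 1))

loss, a_norm = cost_sensitive_loss(r_t, y_t, r_s1, r_s2, y_s1)
print(loss, a_norm)
```

In this batch the pair with the largest error gets weight 0 and the remaining nine pairs share the weight equally, since their motion terms are identical.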