Paper Notes: Weakly Supervised Deep Detection Networks (WSDDN)

kinredon

Link to this post: https://blog.csdn.net/djh123456021/article/details/84393098

Paper: Weakly Supervised Deep Detection Networks (WSDDN)

Link: https://arxiv.org/abs/1511.02853

1. Introduction

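This paper proposes a weakly supervised deep detection network to tackle weakly supervised object detection. In weakly supervised object detection the training data carries only image-level annotations: there are no bounding boxes, only the set of classes present in each image.

The authors propose an end-to-end weakly supervised deep detection network whose architecture consists mainly of the following parts:

A CNN is pre-trained on ImageNet for image classification and used to initialize the feature extractor.
Region proposals are applied to the feature map of the last convolutional layer, followed by an SPP layer.
The network then splits into a recognition stream and a detection stream, which are finally combined to produce the predicted image-level labels.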

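2. Method

The first few steps of WSDDN (pre-training, the SPP layer, region proposals) are all discussed in Fast R-CNN, so they are not covered here. These notes focus on the proposed recognition stream and detection stream and on how the two are finally combined. Figure 2 gives a more intuitive depiction of WSDDN.

After the fully connected layer fc7, the features pass through $\phi_{\mathrm{fc8c}}$ to give a matrix $\mathbf{x}^{c} \in \mathbb{R}^{C \times |\mathcal{R}|}$, where $C$ is the number of classes and $|\mathcal{R}|$ the number of regions. It is the concatenation of the per-region class scores, one column per region. Likewise, passing through $\phi_{\mathrm{fc8d}}$ gives $\mathbf{x}^{d} \in \mathbb{R}^{C \times |\mathcal{R}|}$.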

Classification data stream

The matrix $\mathbf { x } ^ { c }$ is passed through a softmax layer defined as:

$\left[ \sigma _ { class } \left( \mathbf { x } ^ { c } \right) \right] _ { i j } = \frac { e ^ { x _ { i j } ^ c } } { \sum _ { k = 1 } ^ { C } e ^ { x _ { k j } ^ { c } } }$

In effect this applies a softmax to each column of $\mathbf { x } ^ { c }$, which amounts to ranking the class probabilities within a given region.

Detection data stream

Similarly, the matrix $\mathbf { x } ^ { d }$ is passed through a softmax layer, this time normalizing over regions instead of classes:

$\left[ \sigma _ { \mathrm { det } } \left( \mathbf { x } ^ { d } \right) \right] _ { i j } = \frac { e ^ { x _ { i j } ^ { d } } } { \sum _ { k = 1 } ^ { | \mathcal { R } | } e ^ { x _ { i k } ^ { d } } }$

In effect this applies a softmax to each row of $\mathbf { x } ^ { d }$, which amounts to ranking, for a given class, the scores of all regions.

Image-level classification scores

The two resulting matrices are multiplied element-wise to give the final region-level score of every region:

$\mathbf { x } ^ { \mathcal { R } } = \sigma _ { \text { class } } \left( \mathbf { x } ^ { c } \right) \odot \sigma _ { \mathrm { det } } \left( \mathbf { x } ^ { d } \right)$

Summing over the regions gives the image-level prediction score for class $c$:

$y _ { c } = \sum _ { r = 1 } ^ { | \mathcal { R } | } x _ { c r } ^ { \mathcal { R } }$
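The two softmaxes and their combination can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `softmax` and `wsddn_scores`, the random inputs, and the sizes `C = 20` and `R = 100` are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def wsddn_scores(x_c, x_d):
    """Combine the classification and detection streams.

    x_c, x_d: arrays of shape (C, R), standing in for the fc8c / fc8d outputs.
    Returns (region_scores of shape (C, R), image_scores of shape (C,)).
    """
    sigma_class = softmax(x_c, axis=0)   # normalize over classes: each column sums to 1
    sigma_det = softmax(x_d, axis=1)     # normalize over regions: each row sums to 1
    region_scores = sigma_class * sigma_det    # element-wise product x^R
    image_scores = region_scores.sum(axis=1)   # sum over regions, one score per class
    return region_scores, image_scores

# Hypothetical sizes: 20 classes, 100 region proposals.
C, R = 20, 100
rng = np.random.default_rng(0)
x_c = rng.normal(size=(C, R))
x_d = rng.normal(size=(C, R))
region_scores, image_scores = wsddn_scores(x_c, x_d)
```

Because each row of the detection softmax sums to 1, every image-level score is a weighted average of class probabilities and therefore lies strictly between 0 and 1, which is what lets it be treated as a per-class probability in the loss.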

Energy function

The image-level class scores obtained above are compared with the ground-truth labels to give the energy function, which is really just the loss:

$E ( \mathbf { w } ) = \frac { \lambda } { 2 } \| \mathbf { w } \| ^ { 2 } + \sum _ { i = 1 } ^ { n } \sum _ { k = 1 } ^ { C } \log \left( y _ { k i } \left( \phi _ { k } ^ { \mathbf { y } } \left( \mathbf { x } _ { i } | \mathbf { w } \right) - \frac { 1 } { 2 } \right) + \frac { 1 } { 2 } \right)$

Here $y_{ki} \in \{-1, 1\}$ is the ground-truth label and $\phi _ { k } ^ { \mathbf { y } } \left( \mathbf { x } _ { i } | \mathbf { w } \right)$ is the prediction. When $y_{ki} = 1$ the argument of the logarithm is $\phi _ { k } ^ { \mathbf { y } } \left( \mathbf { x } _ { i } | \mathbf { w } \right)$; otherwise it is $1 - \phi _ { k } ^ { \mathbf { y } } \left( \mathbf { x } _ { i } | \mathbf { w } \right)$.

This raises a question: should the objective be maximized or minimized? Judging by the regularization term it should be minimized, but the second term has to be maximized for the optimization to go in the right direction. For example, when $y_{ki} = 1$ we want $\phi _ { k } ^ { \mathbf { y } } \left( \mathbf { x } _ { i } | \mathbf { w } \right)$ to be as close to 1 as possible, and that makes the loss larger. I am not sure whether I have misunderstood or the paper has an error :)
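A quick numerical check makes the sign question concrete. The sketch below only evaluates the per-class log term as written in the formula above; the helper names `log_term` and `binary_cross_entropy` and the sample predictions are assumptions for illustration, and nothing here is taken from the authors' code.

```python
import numpy as np

def log_term(y, phi):
    """Per-class data term log(y * (phi - 1/2) + 1/2), with y in {-1, +1}."""
    return np.log(y * (phi - 0.5) + 0.5)

def binary_cross_entropy(y, phi):
    """Standard binary cross-entropy, with labels mapped from {-1, +1} to {0, 1}."""
    t = (y + 1) / 2
    return -(t * np.log(phi) + (1 - t) * np.log(1 - phi))

# Hypothetical predictions for a single class, from poor to good for a positive label.
phi = np.array([0.1, 0.5, 0.9])

pos = log_term(1, phi)    # equals log(phi)
neg = log_term(-1, phi)   # equals log(1 - phi)

# The data term is exactly the negative of the binary cross-entropy.
same_as_neg_bce = (
    np.allclose(pos, -binary_cross_entropy(1, phi))
    and np.allclose(neg, -binary_cross_entropy(-1, phi))
)
```

As written, the data term is a log-likelihood: it is never positive, and it rises toward 0 as the predictions improve. One consistent reading is therefore that the data term enters with a negative sign (that is, as a binary cross-entropy) when the whole objective is minimized together with the weight-decay term. This is an interpretation of the formula, not a statement about what the authors' implementation does.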

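Besides this term, the loss has one more term, called the Spatial Regulariser. Its purpose is to obtain more precise localization. Detectors such as Fast R-CNN have a bounding-box regression for localization, but here only image-level labels are available, so the authors use a soft regularization strategy that penalizes differences between the features of the highest-scoring region and those of the regions that overlap it heavily: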

$\frac { 1 } { n C } \sum _ { k = 1 } ^ { C } \sum _ { i = 1 } ^ { N _ { k } ^ { + } } \sum _ { r = 1 } ^ { | \overline { R } | } \frac { 1 } { 2 } \left( \phi _ { k * i } ^ { \mathbf { y } } \right) ^ { 2 } \left( \phi _ { k * i } ^ { \mathrm { fc } 7 } - \phi _ { k r i } ^ { \mathrm { fc } 7 } \right) ^ { \mathrm { T } } \left( \phi _ { k * i } ^ { \mathrm { fc } 7 } - \phi _ { k r i } ^ { \mathrm { fc } 7 } \right)$

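Here $N_k^+$ is the number of positive images for class $k$, $\overline { R }$ is the set of regions that have at least 60% IoU with the highest-scoring region, and $*$ denotes the highest-scoring region for class $k$ in image $i$.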
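To make the regulariser concrete, here is a minimal NumPy sketch of the inner sum for a single positive (class, image) pair. It rests on several assumptions: boxes are given as `(x1, y1, x2, y2)`, the region scores are taken to be one row of the combined score matrix, and the helper names `box_iou` and `spatial_regulariser` as well as the toy data are hypothetical.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def spatial_regulariser(scores, feats, boxes, iou_thresh=0.6):
    """Regulariser contribution of one positive (class k, image i) pair.

    scores: (R,)   region scores for class k (assumed: a row of x^R)
    feats:  (R, D) fc7 features of the regions
    boxes:  (R, 4) region coordinates
    """
    top = int(np.argmax(scores))                  # highest-scoring region (*)
    neighbours = box_iou(boxes[top], boxes) >= iou_thresh
    diff = feats[top] - feats[neighbours]         # (M, D); the top region itself adds 0
    sq_dist = (diff ** 2).sum(axis=1)             # squared fc7 distance per neighbour
    return 0.5 * scores[top] ** 2 * sq_dist.sum()

# Toy data: region 1 overlaps region 0 with IoU 0.81, region 2 does not overlap it.
boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.5, 0.3, 0.2])
feats = np.array([[1.0, 0.0], [0.0, 0.0], [5.0, 5.0]])
reg = spatial_regulariser(scores, feats, boxes)   # 0.5 * 0.5**2 * 1.0 = 0.125
```

The full term in the formula would additionally sum this quantity over all classes and their positive images and scale it by $1/(nC)$. The far-away region 2 has very different features but does not contribute, because only regions above the IoU threshold are pulled toward the top region.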

3. Experiments

The experimental results surpassed the state of the art at the time :)
