READING NOTE: Object Detection by Labeling Superpixels

Joshua_Li_

Category: Computer Vision. Tags: Computer Vision

Article link: https://blog.csdn.net/joshua_1988/article/details/50187351

TITLE: Object Detection by Labeling Superpixels

AUTHOR: Yan, Junjie and Yu, Yinan and Zhu, Xiangyu and Lei, Zhen and Li, Stan Z.

FROM: CVPR2015

CONTRIBUTIONS

Convert the object detection problem into a super-pixel labelling problem, which avoids false negatives caused by proposals and takes advantage of global context.
Construct an energy function that accounts for appearance, spatial context and the number of labels.

METHOD

The image is partitioned into a set of super-pixels, denoted as $\mathcal{P}=\lbrace p_{1},p_{2},...,p_{N}\rbrace$.
An energy function $E(\mathcal{L})$ is calculated to score a label configuration $\mathcal{L}=\lbrace l_{1},l_{2},...,l_{N}\rbrace$, which assigns one label to each super-pixel.
The problem is thus transformed into selecting the $\mathcal{L}$ that minimises $E(\mathcal{L})$.
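
As a minimal sketch of this formulation, the snippet below enumerates every label configuration for a handful of super-pixels and keeps the one with the lowest energy. The function name `brute_force_labeling` and the toy energy are illustrative assumptions, not the paper's inference procedure; exhaustive search is only feasible for a very small $N$.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def brute_force_labeling(
    num_superpixels: int,
    labels: Sequence[int],
    energy: Callable[[Tuple[int, ...]], float],
) -> Tuple[Tuple[int, ...], float]:
    """Return the label configuration with the lowest energy (exhaustive search)."""
    best_config, best_energy = None, float("inf")
    for config in product(labels, repeat=num_superpixels):
        e = energy(config)
        if e < best_energy:
            best_config, best_energy = config, e
    return best_config, best_energy


# Hypothetical toy energy (not the paper's): each super-pixel has a preferred
# label, and every pair of chain neighbours with different labels adds 0.6.
PREFERRED = (0, 1, 1, 0)


def toy_energy(config: Tuple[int, ...]) -> float:
    unary = sum(1.0 for l, p in zip(config, PREFERRED) if l != p)
    pairwise = sum(0.6 for a, b in zip(config, config[1:]) if a != b)
    return unary + pairwise


best_config, best_energy = brute_force_labeling(4, labels=(0, 1), energy=toy_energy)
```

With $N$ super-pixels and $K+1$ labels the search space has $(K+1)^{N}$ configurations, so a real system needs an approximate minimiser instead of enumeration.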

SOME DETAILS

The energy function is defined as

$E(\mathcal{L})=\sum_{p_{i}\in\mathcal{P}}D(l_{i},p_{i})+\sum_{(p_{i},p_{j})\in\mathcal{N}}V(l_{i},l_{j},p_{i},p_{j})+C(\mathcal{L})$

where $D(l_{i},p_{i})$ is the data cost, which captures the appearance of $p_{i}$ and measures the cost of assigning it label $l_{i}$; $V(l_{i},l_{j},p_{i},p_{j})$ is the pairwise smooth cost over the local neighbourhood $\mathcal{N}$; and $C(\mathcal{L})$ is the label cost, which encourages compact detections by penalising the number of labels.
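
To make the structure of $E(\mathcal{L})$ concrete, here is a small sketch that sums the three terms for a given configuration. The function `total_energy`, the lookup tables and the numeric values are hypothetical placeholders chosen for illustration; they are not the costs used in the paper.

```python
from typing import Callable, Iterable, Sequence, Tuple


def total_energy(
    labels: Sequence[int],
    data_cost: Callable[[int, int], float],
    neighbors: Iterable[Tuple[int, int]],
    smooth_cost: Callable[[int, int, int, int], float],
    label_cost: Callable[[Sequence[int]], float],
) -> float:
    """E(L) = sum_i D(l_i, p_i) + sum_{(i,j) in N} V(l_i, l_j, p_i, p_j) + C(L)."""
    unary = sum(data_cost(label, i) for i, label in enumerate(labels))
    pairwise = sum(smooth_cost(labels[i], labels[j], i, j) for i, j in neighbors)
    return unary + pairwise + label_cost(labels)


# Hypothetical toy inputs: 3 super-pixels in a chain, labels 0 (background) and 1.
D_TABLE = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3]]  # D_TABLE[i][label]
V_TABLE = {(0, 0): 0.0, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.1}  # arbitrary values

energy = total_energy(
    labels=[0, 1, 1],
    data_cost=lambda label, i: D_TABLE[i][label],
    neighbors=[(0, 1), (1, 2)],
    smooth_cost=lambda li, lj, i, j: V_TABLE[(li, lj)],
    label_cost=lambda ls: 0.5 * len({l for l in ls if l > 0}),
)
# 0.6 (data) + 0.35 (smooth) + 0.5 (label) = 1.45
```

Passing the three costs as callables keeps the sum independent of how each term is computed, which mirrors how the note defines them one by one.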

Data Cost

Super-pixels usually do not carry enough semantic information on their own, so the corresponding regions are classified and their costs are propagated to the super-pixels. In this work, RCNN is used to generate and classify semantic regions. The region set of $T$ elements is denoted as $\mathcal{R}=\lbrace r_{1},..,r_{T}\rbrace$ and the classifier score of region $r_{t}$ is $s_{t}$; the scores are mapped into $(0,1)$ by

$D(l_{t},r_{t})= \begin{cases} \frac{1}{1+\exp(-\alpha\cdot s_{t})} & \text{if } l_{t}>0 \\ \frac{\exp(-\alpha\cdot s_{t})}{1+\exp(-\alpha\cdot s_{t})} & \text{if } l_{t}=0 \end{cases}$

where $\alpha$ is set to 1.5 empirically. For each super-pixel, the data cost is the weighted sum of the $T$ smallest region costs,

$D(l_{i},p_{i})= \sum_{t=1}^{T}\omega_{d_{t}}\cdot D(l_{t}, R(p_{i})_{t})$

where $R(p_{i})_{t}$ is the region containing $p_{i}$ that has the $t$-th smallest cost.
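
The two data-cost formulas above can be sketched as follows. This follows the equations exactly as written in this note; the function names, the single score per region, and the assumption that each region takes the label of the super-pixel being scored are simplifications introduced here for illustration.

```python
import math
from typing import Sequence

ALPHA = 1.5  # value reported in the note


def region_cost(score: float, label: int, alpha: float = ALPHA) -> float:
    """D(l_t, r_t) as written in the note: a logistic map of the score into (0, 1)."""
    object_term = 1.0 / (1.0 + math.exp(-alpha * score))
    # exp(-a*s) / (1 + exp(-a*s)) equals 1 - 1 / (1 + exp(-a*s)).
    return object_term if label > 0 else 1.0 - object_term


def superpixel_cost(
    label: int,
    region_scores: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted sum of the len(weights) smallest region costs for one super-pixel.

    region_scores: classifier scores of the regions that contain the super-pixel.
    weights: the omega_{d_t} values, ordered from smallest cost to largest.
    """
    costs = sorted(region_cost(s, label) for s in region_scores)
    # zip stops at the shorter sequence, so only the smallest costs are used.
    return sum(w * c for w, c in zip(weights, costs))


cost = superpixel_cost(label=1, region_scores=[2.0, -2.0, 0.0], weights=[0.7, 0.3])
```

Sorting before weighting is what implements "the $t$-th smallest cost": the largest weight is paired with the cheapest region.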

Smooth Cost

The smooth cost is motivated by two observations: 1) adjacent super-pixels often have the same label and 2) super-pixels with the same label should have similar appearance. It is defined as

$V(l_{i},l_{j},p_{i},p_{j})=\omega_{s_{l}}V_{l}(l_{i}, l_{j})+V_{a}(l_{i},l_{j},p_{i},p_{j})$

where $V_{l}$ is a binary indicator that is set to $1$ when $l_{i}=l_{j}$ and $(p_{i},p_{j})\in \mathcal{N}$. $V_{a}$ is defined as

$V_{a}(l_{i},l_{j},p_{i},p_{j})=\omega_{s_{c}}\left(1-\sum_{q}\min(c_{i}^{q}, c_{j}^{q})\right)+\omega_{s_{t}}\left(1-\sum_{q}\min(t_{i}^{q}, t_{j}^{q})\right)$

where $c_{i}^{q}$ and $t_{i}^{q}$ are the values in the $q$-th bin of the color and texture histograms of super-pixel $p_{i}$. In this work, a color histogram and a SIFT histogram are used to describe color and texture information.
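
The appearance term $V_{a}$ is a histogram-intersection distance, which can be sketched as below. The function names and default weights are hypothetical, and the sketch assumes each histogram is L1-normalised (bins sum to 1) so that the intersection lies in $[0,1]$. The labels are left out of the signature because the formula as written does not use them.

```python
from typing import Sequence


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum over bins of the bin-wise minimum of two histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def appearance_cost(
    color_i: Sequence[float],
    color_j: Sequence[float],
    texture_i: Sequence[float],
    texture_j: Sequence[float],
    w_color: float = 1.0,    # omega_{s_c}, placeholder value
    w_texture: float = 1.0,  # omega_{s_t}, placeholder value
) -> float:
    """V_a: weighted sum of (1 - intersection) for color and texture histograms."""
    color_term = 1.0 - histogram_intersection(color_i, color_j)
    texture_term = 1.0 - histogram_intersection(texture_i, texture_j)
    return w_color * color_term + w_texture * texture_term


# Identical color histograms, disjoint texture histograms.
v_a = appearance_cost([0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0])
```

Identical histograms give a cost of 0 and disjoint ones give the full weight, so the term grows as two neighbouring super-pixels look less alike.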

Label Cost

The label cost is used to encourage a smaller number of labels and its definition is

$C(\mathcal{L})=\sum_{i=1}^{K}\omega_{l_{i}}\cdot \delta(i, \mathcal{L})$

where $\delta(\cdot)$ is defined as

$\delta(i, \mathcal{L})=\begin{cases} 1 & \text{if } i\in \mathcal{L} \\ 0 & \text{otherwise} \end{cases}$
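
A minimal sketch of the label cost: each label in $1..K$ that appears at least once in the configuration contributes its weight once, no matter how many super-pixels carry it. The function name and the example weights are assumptions made for illustration.

```python
from typing import Dict, Sequence


def label_cost(labels: Sequence[int], weights: Dict[int, float]) -> float:
    """C(L): sum of omega_{l_i} over the labels i in 1..K that occur in L.

    weights maps each label i in 1..K to its weight; label 0 has no entry,
    so it never contributes to the sum.
    """
    present = set(labels)  # delta(i, L) = 1 exactly when i is in this set
    return sum(w for label, w in weights.items() if label in present)


# Labels 1 and 3 are used, label 2 is not: 0.5 + 0.2 = 0.7.
c = label_cost([0, 1, 1, 3], weights={1: 0.5, 2: 0.7, 3: 0.2})
```

Because the cost depends on the set of labels rather than on their counts, adding a super-pixel to an existing label is free while introducing a new label is penalised.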

ADVANTAGES

Super-pixels are compact and perceptually meaningful atomic regions of an image.
The method avoids false negatives caused by inappropriate proposals generated by algorithms such as Selective Search and BING.
The super-pixel based method is a trade-off between pixel-based and proposal-based algorithms, leading to accurate and fast results.

DISADVANTAGES

The CNN used in RCNN and the parameters of the energy function are learned separately.
The generated regions might not cover all the super-pixels.
Time consumption is high. The speed is 1 fps per 128 proposals on an NVIDIA Tesla K40 GPU, and 128 proposals might not be enough.