Paper Notes: Selective Search for Object Recognition - IJCV 2013

CiLin-Yan

Last modified 2022-03-10 07:52:12


Category: Object Recognition | Tags: deep learning, computer vision, machine learning

First published 2022-03-06 10:35:06

Article link: https://blog.csdn.net/weixin_43791477/article/details/123306669


`2012` Selective search for object recognition

J. R. R. Uijlings et al., IJCV, 2013 | Citations: 6116

Fig. 1 There is a high variety of reasons that an image region forms an object. In (b) the cats can be distinguished by colour, not texture. In (c) the chameleon can be distinguished from the surrounding leaves by texture, not colour. In (d) the wheels can be part of the car because they are enclosed, not because they are similar in texture or colour. Therefore, to find objects in a structured way it is necessary to use a variety of diverse strategies. Furthermore, an image is intrinsically hierarchical as there is no single scale for which the complete table, salad bowl, and salad spoon can be found in (a).

Region Proposals: Selective Search

Find “blobby” image regions that are likely to contain objects
Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU

Algorithm 1: Hierarchical Grouping Algorithm

Input: (colour) image

Output: Set of object location hypotheses $L$

Obtain initial regions $R = \{r_1, \dots , r_n\}$ using Felzenszwalb and Huttenlocher (2004) ;

Initialise similarity set $S=\emptyset$ ;

foreach neighbouring region pair $(r_i , r_j )$ do

Calculate similarity $s(r_i , r_j )$ ;

$S=S\cup s(r_i , r_j )$ ;

while $S\neq\emptyset$ do

Get highest similarity $s(r_i , r_j )=\max(S)$ ;

Merge corresponding regions $r_t=r_i\cup r_j$ ;

Remove similarities regarding $r_i$ : $S=S\backslash s(r_i,r_*)$ ;

Remove similarities regarding $r_j$ : $S=S\backslash s(r_*,r_j)$ ;

Calculate similarity set $S_t$ between $r_t$ and its neighbours;

$S=S\cup S_t$ ;

$R=R\cup r_t$ ;

Extract object location boxes $L$ from all regions in $R$ ;
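The grouping loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes regions are given as sets of `(x, y)` pixels, that adjacency is supplied as a list of id pairs, and that the similarity function is passed in by the caller. The names `hierarchical_grouping`, `bounding_box` and `toy_similarity` are hypothetical, and the initial Felzenszwalb and Huttenlocher segmentation is not included.

```python
from itertools import count


def hierarchical_grouping(regions, neighbours, similarity):
    """Greedy bottom-up grouping in the spirit of Algorithm 1.

    regions    : dict mapping int id -> frozenset of (x, y) pixels (assumed format)
    neighbours : iterable of (i, j) id pairs for adjacent initial regions
    similarity : callable(pixels_a, pixels_b) -> float
    Returns a dict holding the initial regions plus every merged region.
    """
    neighbours = list(neighbours)
    R = dict(regions)
    adj = {i: set() for i in regions}
    for i, j in neighbours:
        adj[i].add(j)
        adj[j].add(i)

    # Similarity set S, keyed by the unordered pair of region ids.
    S = {frozenset((i, j)): similarity(R[i], R[j]) for i, j in neighbours}
    new_id = count(max(regions) + 1)

    while S:
        i, j = tuple(max(S, key=S.get))      # most similar pair
        t = next(new_id)
        R[t] = R[i] | R[j]                   # r_t = r_i U r_j
        # Remove every similarity that involves r_i or r_j.
        S = {p: v for p, v in S.items() if i not in p and j not in p}
        # r_t inherits the neighbours of both parents.
        adj[t] = (adj.pop(i) | adj.pop(j)) - {i, j}
        for n in adj[t]:
            adj[n] -= {i, j}
            adj[n].add(t)
            S[frozenset((t, n))] = similarity(R[t], R[n])
    return R


def bounding_box(pixels):
    """Tight box (x_min, y_min, x_max, y_max) around a pixel set."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


# Toy example: three regions on a 4x1 strip, size-based similarity only.
regions = {
    0: frozenset({(0, 0)}),
    1: frozenset({(1, 0)}),
    2: frozenset({(2, 0), (3, 0)}),
}
toy_similarity = lambda a, b: 1.0 - (len(a) + len(b)) / 4
all_regions = hierarchical_grouping(regions, [(0, 1), (1, 2)], toy_similarity)
boxes = {bounding_box(p) for p in all_regions.values()}
```

In this toy run the two single-pixel regions merge first, then the result merges with the remaining region, so the output contains boxes at every scale of the hierarchy.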

Complementary Similarity Measures. We define four complementary, fast-to-compute similarity measures. These measures are all in range [0, 1] which facilitates combinations of these measures.

$s_{colour}(r_i , r_j )$ measures colour similarity. Specifically, for each region we obtain one-dimensional colour histograms for each colour channel using $25$ bins, which we found to work well. This leads to a colour histogram $C_i = \{c^1_i , \dots , c^n_i \}$ for each region $r_i$ with dimensionality $n = 75$ when three colour channels are used. The colour histograms are normalised using the $L_1$ norm. Similarity is measured using the histogram intersection:
$s_{colour}\left(r_{i}, r_{j}\right)=\sum_{k=1}^{n} \min \left(c_{i}^{k}, c_{j}^{k}\right)$
The colour histograms can be efficiently propagated through the hierarchy by
$C_{t}=\frac{\operatorname{size}\left(r_{i}\right) \times C_{i}+\operatorname{size}\left(r_{j}\right) \times C_{j}}{\operatorname{size}\left(r_{i}\right)+\operatorname{size}\left(r_{j}\right)}$
The size of a resulting region is simply the sum of its constituents: ${\rm size}(r_t ) = {\rm size}(r_i ) + {\rm size}(r_j )$ .
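As a rough sketch of the colour measure, the snippet below builds an $L_1$-normalised 75-bin histogram, computes the histogram intersection, and checks that the propagation formula gives the same histogram as recomputing from the merged pixels. It assumes pixels are tuples of three integer channel values in 0..255; the function names are hypothetical and the colour space is left open.

```python
def colour_histogram(pixels, bins=25, channels=3):
    """L1-normalised concatenation of one histogram per channel.

    pixels: list of tuples with integer channel values in 0..255 (assumption).
    """
    hist = [0.0] * (bins * channels)
    for px in pixels:
        for ch in range(channels):
            b = min(px[ch] * bins // 256, bins - 1)
            hist[ch * bins + b] += 1
    total = sum(hist)
    return [h / total for h in hist]


def histogram_intersection(h_i, h_j):
    """Sum of bin-wise minima; lies in [0, 1] for L1-normalised inputs."""
    return sum(min(a, b) for a, b in zip(h_i, h_j))


def merge_histograms(h_i, size_i, h_j, size_j):
    """Size-weighted average, i.e. the propagation rule for C_t."""
    return [(size_i * a + size_j * b) / (size_i + size_j)
            for a, b in zip(h_i, h_j)]


region_i = [(0, 0, 0), (255, 255, 255)]
region_j = [(0, 0, 0)]
C_i = colour_histogram(region_i)
C_j = colour_histogram(region_j)
s_colour = histogram_intersection(C_i, C_j)

# Propagated histogram versus histogram recomputed from all pixels.
C_t = merge_histograms(C_i, len(region_i), C_j, len(region_j))
C_t_direct = colour_histogram(region_i + region_j)
```

The point of the propagation rule is that a merged region never needs to revisit its pixels: the parent histograms and sizes are enough.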

$s_{texture}(r_i , r_j )$ measures texture similarity. We represent texture using fast SIFT-like measurements as SIFT itself works well for material recognition (Liu et al. 2010). We take Gaussian derivatives in eight orientations using $\sigma = 1$ for each colour channel. For each orientation for each colour channel we extract a histogram using a bin size of $10$ . This leads to a texture histogram $T_i = \{t^1_i , \dots , t^n_i \}$ for each region $r_i$ with dimensionality $n = 240$ when three colour channels are used. Texture histograms are normalised using the $L_1$ norm. Similarity is measured using histogram intersection:
$s_{\text {texture }}\left(r_{i}, r_{j}\right)=\sum_{k=1}^{n} \min \left(t_{i}^{k}, t_{j}^{k}\right)$
$s_{size}(r_i , r_j )$ encourages small regions to merge early. This forces regions in $S$ , i.e. regions which have not yet been merged, to be of similar sizes throughout the algorithm. This is desirable because it ensures that object locations at all scales are created at all parts of the image. For example, it prevents a single region from gobbling up all other regions one by one, yielding all scales only at the location of this growing region and nowhere else. $s_{size}(r_i , r_j )$ is defined as the fraction of the image that $r_i$ and $r_j $ jointly occupy:
$s_{size}\left(r_{i}, r_{j}\right)=1-\frac{\operatorname{size}\left(r_{i}\right)+\operatorname{size}\left(r_{j}\right)}{\operatorname{size}(im)}$
where ${\rm size}(im)$ denotes the size of the image in pixels.
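A small numeric sketch of the size measure, assuming region sizes are plain pixel counts (the function name is hypothetical). It shows why a pair of small regions is preferred over a pair of large ones:

```python
def size_similarity(size_i, size_j, image_size):
    """1 minus the fraction of the image that the two regions jointly occupy."""
    return 1.0 - (size_i + size_j) / image_size


image_size = 100                                   # e.g. a 10x10 image
small_pair = size_similarity(5, 5, image_size)     # two small regions
large_pair = size_similarity(40, 50, image_size)   # two large regions
```

Here `small_pair` is far higher than `large_pair`, so with everything else equal the small regions merge first and no single region grows unchecked.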

$s_{fill} (r_i , r_j )$ measures how well region $r_i$ and $r_j $ fit into each other. The idea is to fill gaps: if $r_i$ is contained in $r_j $ it is logical to merge these first in order to avoid any holes. On the other hand, if $r_i$ and $r_j $ are hardly touching each other they will likely form a strange region and should not be merged. To keep the measure fast, we use only the size of the regions and of the containing boxes. Specifically, we define $BB_{ij}$ to be the tight bounding box around $r_i$ and $r_j $ . Now $s_{fill} (r_i , r_j )$ is the fraction of the image contained in $BB_{ij}$ which is not covered by the regions of $r_i$ and $r_j $ :
$s_{fill}\left(r_{i}, r_{j}\right)=1-\frac{\operatorname{size}\left(BB_{ij}\right)-\operatorname{size}\left(r_{i}\right)-\operatorname{size}\left(r_{j}\right)}{\operatorname{size}(im)}$
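The fill measure can be illustrated with pixel sets, assuming each region is a set of `(x, y)` coordinates and the two regions do not overlap (helper names are hypothetical). The example contrasts a region that plugs a hole with two regions in opposite corners of a 10x10 image:

```python
def bbox_area(pixels):
    """Area in pixels of the tight axis-aligned box around a pixel set."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)


def fill_similarity(region_i, region_j, image_size):
    """1 minus the uncovered part of the joint bounding box, relative to the image."""
    bb = bbox_area(region_i | region_j)
    return 1.0 - (bb - len(region_i) - len(region_j)) / image_size


image_size = 10 * 10

# A 3x3 ring and the pixel that fills its hole: nothing in the box is uncovered.
ring = {(x, y) for x in range(3) for y in range(3)} - {(1, 1)}
centre = {(1, 1)}
snug = fill_similarity(ring, centre, image_size)

# Two single pixels in opposite corners: almost the whole box is empty.
far_apart = fill_similarity({(0, 0)}, {(9, 9)}, image_size)
```

The snug pair scores the maximum of 1, while the far-apart pair scores close to 0, which matches the intent of merging gap-filling regions first.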
In this paper, our final similarity measure is a combination of the above four:
$s\left(r_{i}, r_{j}\right)= a_{1} s_{colour}\left(r_{i}, r_{j}\right)+a_{2} s_{texture}\left(r_{i}, r_{j}\right) +a_{3} s_{size}\left(r_{i}, r_{j}\right)+a_{4} s_{fill}\left(r_{i}, r_{j}\right)$
where $a_i \in \{0, 1\}$ denotes whether the corresponding similarity measure is used.
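A minimal sketch of the combination, treating each $a_i$ as an on/off switch (0 or 1), which is how the paper uses them. The dictionary layout, the function name and the score values are illustrative assumptions, not results from the paper:

```python
def combined_similarity(scores, weights):
    """Weighted sum of the individual measures.

    scores  : dict name -> similarity in [0, 1]
    weights : dict name -> 0 or 1 (measure switched off or on)
    """
    return sum(weights[name] * scores[name] for name in scores)


# Made-up scores for one region pair.
scores = {"colour": 0.5, "texture": 0.25, "size": 0.9, "fill": 1.0}

all_measures = {"colour": 1, "texture": 1, "size": 1, "fill": 1}
colour_and_size = {"colour": 1, "texture": 0, "size": 1, "fill": 0}

s_all = combined_similarity(scores, all_measures)
s_subset = combined_similarity(scores, colour_and_size)
```

Because the result is only used to rank candidate merges, it does not need to be rescaled to [0, 1]. Switching measures on and off gives different grouping strategies from the same code.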
