Paper notes: Selective Search for Object Recognition - IJCV 2013

2012 Selective search for object recognition

J. R. R. Uijlings et al., IJCV, 2013 | Citations: 6116

Fig. 1 There is a high variety of reasons that an image region forms an object. In (b) the cats can be distinguished by colour, not texture. In (c) the chameleon can be distinguished from the surrounding leaves by texture, not colour. In (d) the wheels can be part of the car because they are enclosed, not because they are similar in texture or colour. Therefore, to find objects in a structured way it is necessary to use a variety of diverse strategies. Furthermore, an image is intrinsically hierarchical as there is no single scale for which the complete table, salad bowl, and salad spoon can be found in (a).

Region Proposals: Selective Search
  • Find “blobby” image regions that are likely to contain objects
  • Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU

Algorithm 1: Hierarchical Grouping Algorithm


Input: (colour) image

Output: Set of object location hypotheses $L$

Obtain initial regions $R = \{r_1, \dots, r_n\}$ using Felzenszwalb and Huttenlocher (2004);

Initialise similarity set $S = \emptyset$;

foreach neighbouring region pair $(r_i, r_j)$ do

  Calculate similarity $s(r_i, r_j)$;

  $S = S \cup s(r_i, r_j)$;

while $S \neq \emptyset$ do

  Get highest similarity $s(r_i, r_j) = \max(S)$;

  Merge corresponding regions $r_t = r_i \cup r_j$;

  Remove similarities regarding $r_i$: $S = S \setminus s(r_i, r_*)$;

  Remove similarities regarding $r_j$: $S = S \setminus s(r_*, r_j)$;

  Calculate similarity set $S_t$ between $r_t$ and its neighbours;

  $S = S \cup S_t$;

  $R = R \cup r_t$;

Extract object location boxes $L$ from all regions in $R$;
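The grouping loop of Algorithm 1 can be sketched in plain Python. This is only an illustrative sketch under some assumptions: regions are stored in a dict keyed by integer ids, and the `similarity` and `merge` callables, the `neighbour_pairs` argument, and the function name `hierarchical_grouping` are all hypothetical names introduced here, not part of the paper. The initial over-segmentation and the final box extraction are left to the caller.

```python
from itertools import count


def hierarchical_grouping(regions, neighbour_pairs, similarity, merge):
    """Greedy bottom-up grouping in the spirit of Algorithm 1.

    regions:         dict mapping an integer id to arbitrary region data
    neighbour_pairs: iterable of (i, j) id pairs of touching regions (i != j)
    similarity:      function(region_a, region_b) -> float
    merge:           function(region_a, region_b) -> merged region data
    Returns a dict with the initial regions plus every merged region (the set R).
    """
    regions = dict(regions)  # R; copied so the caller's dict is not modified
    adjacency = {rid: set() for rid in regions}
    S = {}  # similarity set: frozenset({i, j}) -> similarity value
    for i, j in neighbour_pairs:
        adjacency[i].add(j)
        adjacency[j].add(i)
        S[frozenset((i, j))] = similarity(regions[i], regions[j])

    fresh_id = count(max(regions, default=-1) + 1)
    while S:
        i, j = max(S, key=S.get)  # most similar neighbouring pair
        t = next(fresh_id)
        regions[t] = merge(regions[i], regions[j])
        # Remove every similarity that involves r_i or r_j.
        S = {pair: s for pair, s in S.items() if i not in pair and j not in pair}
        # The neighbours of r_t are the former neighbours of r_i and r_j.
        adjacency[t] = (adjacency.pop(i) | adjacency.pop(j)) - {i, j}
        for n in adjacency[t]:
            adjacency[n] -= {i, j}
            adjacency[n].add(t)
            S[frozenset((t, n))] = similarity(regions[t], regions[n])
    return regions
```

Rebuilding `S` on every merge keeps the sketch short but is not efficient; a priority queue with lazy deletion would be a natural alternative for real images.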


Complementary Similarity Measures. We define four complementary, fast-to-compute similarity measures. These measures are all in the range [0, 1], which makes them easy to combine.

$s_{colour}(r_i, r_j)$ measures colour similarity. Specifically, for each region we obtain one-dimensional colour histograms for each colour channel using 25 bins, which we found to work well. This leads to a colour histogram $C_i = \{c_i^1, \dots, c_i^n\}$ for each region $r_i$ with dimensionality $n = 75$ when three colour channels are used. The colour histograms are normalised using the $L_1$ norm. Similarity is measured using the histogram intersection:

$$s_{colour}(r_i, r_j) = \sum_{k=1}^{n} \min(c_i^k, c_j^k)$$

The colour histograms can be efficiently propagated through the hierarchy by

$$C_t = \frac{\operatorname{size}(r_i) \times C_i + \operatorname{size}(r_j) \times C_j}{\operatorname{size}(r_i) + \operatorname{size}(r_j)}$$

The size of a resulting region is simply the sum of its constituents: $\operatorname{size}(r_t) = \operatorname{size}(r_i) + \operatorname{size}(r_j)$.
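The colour measure and its propagation rule can be illustrated with a few lines of plain Python. This is a minimal sketch, assuming a region is given as a list of `(r, g, b)` pixels with 8-bit values; the function names are hypothetical and chosen only for this example.

```python
def colour_histogram(pixels, bins=25):
    """L1-normalised histogram: `bins` bins per channel, concatenated (75-D for RGB)."""
    channels = len(pixels[0])
    counts = [0] * (bins * channels)
    for pixel in pixels:
        for c, value in enumerate(pixel):
            idx = min(value * bins // 256, bins - 1)  # assumes values in 0..255
            counts[c * bins + idx] += 1
    total = sum(counts)
    return [n / total for n in counts]


def histogram_intersection(a, b):
    """Sum of element-wise minima; lies in [0, 1] for L1-normalised inputs."""
    return sum(min(x, y) for x, y in zip(a, b))


def merge_histograms(hist_i, size_i, hist_j, size_j):
    """Size-weighted average, so the merged histogram needs no pixel access."""
    total = size_i + size_j
    return [(size_i * x + size_j * y) / total for x, y in zip(hist_i, hist_j)]
```

The point of the weighted average is that the histogram of a merged region equals the histogram computed directly from the union of its pixels, which is why the merge step never has to revisit the image.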

$s_{texture}(r_i, r_j)$ measures texture similarity. We represent texture using fast SIFT-like measurements as SIFT itself works well for material recognition (Liu et al. 2010). We take Gaussian derivatives in eight orientations using $\sigma = 1$ for each colour channel. For each orientation of each colour channel we extract a histogram with 10 bins. This leads to a texture histogram $T_i = \{t_i^1, \dots, t_i^n\}$ for each region $r_i$ with dimensionality $n = 240$ when three colour channels are used (8 orientations × 10 bins × 3 channels). Texture histograms are normalised using the $L_1$ norm. Similarity is measured using histogram intersection:

$$s_{texture}(r_i, r_j) = \sum_{k=1}^{n} \min(t_i^k, t_j^k)$$

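To make the 240-dimensional layout concrete, here is a simplified pure-Python sketch. It is not the paper's implementation: central differences are swapped in for the Gaussian derivatives with $\sigma = 1$, the eight orientations are assumed to be spread evenly over $2\pi$, and channel intensities are assumed to lie in [0, 1]. All names are hypothetical.

```python
import math


def texture_histogram(channels, orientations=8, bins=10):
    """channels: list of 2-D lists (one per colour channel), values in [0, 1].

    Returns an L1-normalised histogram with
    len(channels) * orientations * bins entries (240 for three channels).
    """
    hist = []
    for img in channels:
        height, width = len(img), len(img[0])
        for k in range(orientations):
            theta = 2.0 * math.pi * k / orientations
            counts = [0] * bins
            for y in range(1, height - 1):
                for x in range(1, width - 1):
                    # Central differences stand in for Gaussian derivatives.
                    dx = (img[y][x + 1] - img[y][x - 1]) / 2.0
                    dy = (img[y + 1][x] - img[y - 1][x]) / 2.0
                    # Directional derivative along orientation theta, within [-1, 1].
                    response = dx * math.cos(theta) + dy * math.sin(theta)
                    idx = min(int((response + 1.0) / 2.0 * bins), bins - 1)
                    counts[idx] += 1
            hist.extend(counts)
    total = sum(hist)
    return [c / total for c in hist] if total else hist


def histogram_intersection(a, b):
    """Sum of element-wise minima of two L1-normalised histograms."""
    return sum(min(x, y) for x, y in zip(a, b))
```

A flat patch and a patch containing a vertical edge give different histograms, so their intersection drops below 1, while any patch compared with itself scores 1.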
$s_{size}(r_i, r_j)$ encourages small regions to merge early. This forces regions in $S$, i.e. regions which have not yet been merged, to be of similar sizes throughout the algorithm. This is desirable because it ensures that object locations at all scales are created at all parts of the image. For example, it prevents a single region from gobbling up all other regions one by one, yielding all scales only at the location of this growing region and nowhere else. $s_{size}(r_i, r_j)$ is defined as one minus the fraction of the image that $r_i$ and $r_j$ jointly occupy:

$$s_{size}(r_i, r_j) = 1 - \frac{\operatorname{size}(r_i) + \operatorname{size}(r_j)}{\operatorname{size}(im)}$$

where $\operatorname{size}(im)$ denotes the size of the image in pixels.
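A tiny numeric sketch shows why this measure favours merging small regions first. The function name and the pixel counts are made up for illustration.

```python
def size_similarity(size_i, size_j, image_size):
    """1 minus the fraction of the image jointly occupied by the two regions."""
    return 1.0 - (size_i + size_j) / image_size


image_size = 100 * 100  # hypothetical 100 x 100 image
small_pair = size_similarity(50, 80, image_size)      # two small regions
large_pair = size_similarity(4000, 3000, image_size)  # two large regions
```

Here `small_pair` is close to 1 and `large_pair` is about 0.3, so the small pair would be merged before the large one if size were the only measure in use.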

$s_{fill}(r_i, r_j)$ measures how well regions $r_i$ and $r_j$ fit into each other. The idea is to fill gaps: if $r_i$ is contained in $r_j$ it is logical to merge these first in order to avoid any holes. On the other hand, if $r_i$ and $r_j$ are hardly touching each other they will likely form a strange region and should not be merged. To keep the measure fast, we use only the size of the regions and of the containing boxes. Specifically, we define $BB_{ij}$ to be the tight bounding box around $r_i$ and $r_j$. Now $s_{fill}(r_i, r_j)$ is one minus the fraction of the image contained in $BB_{ij}$ which is not covered by the regions $r_i$ and $r_j$:

$$s_{fill}(r_i, r_j) = 1 - \frac{\operatorname{size}(BB_{ij}) - \operatorname{size}(r_i) - \operatorname{size}(r_j)}{\operatorname{size}(im)}$$

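The fill measure only needs region sizes and bounding boxes, which the following sketch makes explicit. It assumes boxes are `(x0, y0, x1, y1)` tuples with exclusive upper bounds; that convention and the function names are assumptions made for this example.

```python
def box_union(a, b):
    """Tight bounding box around two boxes given as (x0, y0, x1, y1)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def box_area(box):
    """Pixel count of a box with exclusive upper bounds."""
    return (box[2] - box[0]) * (box[3] - box[1])


def fill_similarity(box_i, size_i, box_j, size_j, image_size):
    """1 minus the share of the image inside BB_ij not covered by either region."""
    bb = box_union(box_i, box_j)
    return 1.0 - (box_area(bb) - size_i - size_j) / image_size


# Two adjacent 2 x 2 squares fill their joint box completely ...
adjacent = fill_similarity((0, 0, 2, 2), 4, (2, 0, 4, 2), 4, image_size=100)
# ... while two squares in opposite corners of a 10 x 10 image leave a large gap.
distant = fill_similarity((0, 0, 2, 2), 4, (8, 8, 10, 10), 4, image_size=100)
```

The adjacent pair scores 1.0 because nothing inside the joint box is left uncovered, whereas the distant pair scores close to 0.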
In this paper, our final similarity measure is a combination of the above four:

$$s(r_i, r_j) = a_1 s_{colour}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)$$
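The combination can be sketched as a weighted sum over the four scores. The notes above do not define the coefficients $a_1, \dots, a_4$; the sketch below assumes they act as 0/1 switches that turn individual measures on or off, which is an assumption made here for illustration. The dictionary keys and the function name are likewise hypothetical.

```python
MEASURES = ("colour", "texture", "size", "fill")


def combined_similarity(scores, switches):
    """Weighted sum of the four measures.

    scores:   dict mapping each measure name to its value in [0, 1]
    switches: dict mapping each measure name to its coefficient a_k
              (assumed here to be 0 or 1)
    """
    return sum(switches[name] * scores[name] for name in MEASURES)


scores = {"colour": 0.5, "texture": 0.25, "size": 0.7, "fill": 1.0}
all_on = combined_similarity(scores, {"colour": 1, "texture": 1, "size": 1, "fill": 1})
no_texture = combined_similarity(scores, {"colour": 1, "texture": 0, "size": 1, "fill": 1})
```

Because every measure lies in [0, 1], the combined value is bounded by the number of measures switched on, so different switch settings stay comparable in scale.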
