2012
Selective search for object recognition
J. R. R. Uijlings
IJCV, 2013 | Citations 6116
- Find “blobby” image regions that are likely to contain objects
- Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU
Algorithm 1: Hierarchical Grouping Algorithm
Input: (colour) image
Output: Set of object location hypotheses $L$
Obtain initial regions $R = \{r_1, \dots, r_n\}$ using Felzenszwalb and Huttenlocher (2004);
Initialise similarity set $S = \emptyset$;
foreach neighbouring region pair $(r_i, r_j)$ do
  Calculate similarity $s(r_i, r_j)$;
  $S = S \cup s(r_i, r_j)$;
while $S \neq \emptyset$ do
  Get highest similarity $s(r_i, r_j) = \max(S)$;
  Merge corresponding regions $r_t = r_i \cup r_j$;
  Remove similarities regarding $r_i$: $S = S \setminus s(r_i, r_*)$;
  Remove similarities regarding $r_j$: $S = S \setminus s(r_*, r_j)$;
  Calculate similarity set $S_t$ between $r_t$ and its neighbours;
  $S = S \cup S_t$;
  $R = R \cup r_t$;
Extract object location boxes $L$ from all regions in $R$;
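
The grouping loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the region representation, the `neighbours` set of id pairs, and the `similarity` and `merge` callables are all hypothetical placeholders, and the initial segmentation step is assumed to have been done already.

```python
def hierarchical_grouping(regions, neighbours, similarity, merge):
    """Greedy bottom-up grouping following the steps of Algorithm 1.

    regions    : dict of integer region id -> region data (hypothetical format)
    neighbours : set of frozenset({i, j}), one per adjacent pair of regions
    similarity : callable(region_a, region_b) -> float
    merge      : callable(region_a, region_b) -> merged region data
    Returns all regions ever created (initial regions plus every merge).
    """
    regions = dict(regions)
    # Similarity set S, keyed by the pair of region ids.
    S = {}
    for pair in neighbours:
        i, j = sorted(pair)
        S[pair] = similarity(regions[i], regions[j])
    next_id = max(regions) + 1

    while S:
        best = max(S, key=S.get)          # most similar neighbouring pair
        i, j = sorted(best)
        regions[next_id] = merge(regions[i], regions[j])
        # Neighbours of the merged region: anything adjacent to r_i or r_j.
        touched = {k for pair in S if pair & best for k in pair} - best
        # Remove every similarity that involves r_i or r_j.
        S = {pair: s for pair, s in S.items() if not (pair & best)}
        # Add similarities between the merged region and its neighbours.
        for k in touched:
            S[frozenset((k, next_id))] = similarity(regions[k], regions[next_id])
        next_id += 1
    return regions


# Toy run: a "region" is just its size, and smaller pairs are merged first.
toy_regions = {0: 1, 1: 2, 2: 5}
toy_neighbours = {frozenset((0, 1)), frozenset((1, 2))}
result = hierarchical_grouping(
    toy_regions, toy_neighbours,
    similarity=lambda a, b: -(a + b),
    merge=lambda a, b: a + b,
)
```

Starting from `n` connected initial regions, the loop performs `n - 1` merges, so the returned hierarchy holds `2n - 1` regions, each of which would contribute one bounding box to $L$.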
Complementary Similarity Measures. We define four complementary, fast-to-compute similarity measures. These measures are all in range [0, 1] which facilitates combinations of these measures.
$s_{colour}(r_i, r_j)$ measures colour similarity. Specifically, for each region we obtain one-dimensional colour histograms for each colour channel using 25 bins, which we found to work well. This leads to a colour histogram $C_i = \{c_i^1, \dots, c_i^n\}$ for each region $r_i$ with dimensionality $n = 75$ when three colour channels are used. The colour histograms are normalised using the $L_1$ norm. Similarity is measured using the histogram intersection:
$$s_{colour}(r_i, r_j) = \sum_{k=1}^{n} \min(c_i^k, c_j^k)$$
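
As a rough illustration of the colour measure, the sketch below builds a 75-dimensional, $L_1$-normalised histogram with NumPy and compares two regions by histogram intersection. The function names and the assumption that pixel values lie in [0, 255] are choices made for this example only.

```python
import numpy as np


def colour_histogram(pixels, bins=25):
    """L1-normalised colour histogram of one region.

    pixels : (N, 3) array with the colour values (assumed in [0, 255])
             of the N pixels belonging to the region.
    Returns a vector of length 3 * bins (75 with the default 25 bins).
    """
    pixels = np.asarray(pixels, dtype=float)
    per_channel = [
        np.histogram(pixels[:, c], bins=bins, range=(0, 255))[0]
        for c in range(pixels.shape[1])
    ]
    hist = np.concatenate(per_channel).astype(float)
    return hist / hist.sum()  # L1 normalisation


def histogram_intersection(h_i, h_j):
    """Sum of the element-wise minima of two histograms."""
    return float(np.minimum(h_i, h_j).sum())


rng = np.random.default_rng(0)
mixed = colour_histogram(rng.integers(0, 256, size=(500, 3)))
dark = colour_histogram(np.zeros((100, 3)))
bright = colour_histogram(np.full((100, 3), 255))
```

Because both histograms sum to one, the intersection lies in [0, 1]: a region compared with itself scores 1, while the all-dark and all-bright regions above share no bins and score 0.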
The colour histograms can be efficiently propagated through the hierarchy by
$$C_t = \frac{\operatorname{size}(r_i) \times C_i + \operatorname{size}(r_j) \times C_j}{\operatorname{size}(r_i) + \operatorname{size}(r_j)}$$
The size of a resulting region is simply the sum of its constituents: $\operatorname{size}(r_t) = \operatorname{size}(r_i) + \operatorname{size}(r_j)$.
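
The propagation rule means a merged region's histogram never has to be recomputed from pixels. A minimal sketch, with a hypothetical `merge_histograms` helper, checks this on two tiny two-bin histograms:

```python
import numpy as np


def merge_histograms(hist_i, size_i, hist_j, size_j):
    """Size-weighted average of two L1-normalised histograms.

    Returns the histogram and the size of the merged region r_t.
    """
    size_t = size_i + size_j
    hist_t = (size_i * np.asarray(hist_i) + size_j * np.asarray(hist_j)) / size_t
    return hist_t, size_t


# Region i: bin counts [3, 1] (4 pixels); region j: bin counts [1, 5] (6 pixels).
hist_i, size_i = np.array([3, 1]) / 4, 4
hist_j, size_j = np.array([1, 5]) / 6, 6
hist_t, size_t = merge_histograms(hist_i, size_i, hist_j, size_j)

# Recomputing from the pooled counts [4, 6] gives the same histogram.
pooled = np.array([4, 6]) / 10
```

The weighted average is exact here because normalised histograms multiplied by region size recover the raw bin counts, which simply add when regions are merged.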
$s_{texture}(r_i, r_j)$ measures texture similarity. We represent texture using fast SIFT-like measurements as SIFT itself works well for material recognition (Liu et al. 2010). We take Gaussian derivatives in eight orientations using $\sigma = 1$ for each colour channel. For each orientation for each colour channel we extract a histogram with 10 bins. This leads to a texture histogram $T_i = \{t_i^1, \dots, t_i^n\}$ for each region $r_i$ with dimensionality $n = 240$ when three colour channels are used (8 orientations × 3 channels × 10 bins). Texture histograms are normalised using the $L_1$ norm. Similarity is measured using histogram intersection:
$$s_{texture}(r_i, r_j) = \sum_{k=1}^{n} \min(t_i^k, t_j^k)$$
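
The texture descriptor can be approximated with plain NumPy. The sketch below is an assumption-laden illustration rather than the paper's exact feature: it computes first-order Gaussian derivatives with $\sigma = 1$, steers them to eight orientations as $\cos\theta \, G_x + \sin\theta \, G_y$, and histograms the responses into 10 bins over a fixed range of [-1, 1] (chosen here for images scaled to [0, 1]). All function names are hypothetical.

```python
import numpy as np


def gaussian_kernels(sigma=1.0):
    """1-D Gaussian and its first derivative, truncated at about 3 sigma."""
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-(x ** 2) / (2 * sigma ** 2))
    g /= g.sum()
    dg = -x / sigma ** 2 * g
    return g, dg


def filter_axis(channel, kernel, axis):
    """Convolve every 1-D slice of `channel` along `axis` with `kernel`."""
    return np.apply_along_axis(np.convolve, axis, channel, kernel, mode="same")


def texture_histogram(image, mask, bins=10, sigma=1.0, n_orient=8):
    """L1-normalised texture histogram of the region selected by `mask`.

    image : (H, W, C) float array, assumed scaled to [0, 1]
    mask  : (H, W) boolean array marking the region's pixels
    Returns a vector of length n_orient * C * bins (240 for the defaults, C = 3).
    """
    g, dg = gaussian_kernels(sigma)
    hists = []
    for c in range(image.shape[2]):
        chan = image[:, :, c].astype(float)
        dx = filter_axis(filter_axis(chan, g, 0), dg, 1)  # derivative along x
        dy = filter_axis(filter_axis(chan, g, 1), dg, 0)  # derivative along y
        for k in range(n_orient):
            theta = 2 * np.pi * k / n_orient
            resp = np.cos(theta) * dx + np.sin(theta) * dy
            values = np.clip(resp[mask], -1.0, 1.0)  # keep all pixels in range
            hists.append(np.histogram(values, bins=bins, range=(-1.0, 1.0))[0])
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()


rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
mask = np.ones((16, 16), dtype=bool)
t = texture_histogram(image, mask)
```

The resulting vector can be compared with the same histogram intersection used for colour, and propagated through merges with the same size-weighted average.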
$s_{size}(r_i, r_j)$ encourages small regions to merge early. This forces regions in $S$, i.e. regions which have not yet been merged, to be of similar sizes throughout the algorithm. This is desirable because it ensures that object locations at all scales are created at all parts of the image. For example, it prevents a single region from gobbling up all other regions one by one, yielding all scales only at the location of this growing region and nowhere else. $s_{size}(r_i, r_j)$ is defined as one minus the fraction of the image that $r_i$ and $r_j$ jointly occupy:
$$s_{size}(r_i, r_j) = 1 - \frac{\operatorname{size}(r_i) + \operatorname{size}(r_j)}{\operatorname{size}(im)}$$
where $\operatorname{size}(im)$ denotes the size of the image in pixels.
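
A short numeric check makes the effect of the size measure concrete. The helper name and the example sizes below are illustrative assumptions.

```python
def size_similarity(size_i, size_j, image_size):
    """One minus the fraction of the image jointly occupied by two regions."""
    return 1.0 - (size_i + size_j) / image_size


image_size = 100 * 100          # a 100 x 100 pixel image
small_pair = size_similarity(100, 200, image_size)
large_pair = size_similarity(3000, 4000, image_size)
```

The two small regions score about 0.97 and the two large ones about 0.30, so the small pair is preferred for merging when everything else is equal.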
$s_{fill}(r_i, r_j)$ measures how well regions $r_i$ and $r_j$ fit into each other. The idea is to fill gaps: if $r_i$ is contained in $r_j$ it is logical to merge these first in order to avoid any holes. On the other hand, if $r_i$ and $r_j$ are hardly touching each other they will likely form a strange region and should not be merged. To keep the measure fast, we use only the size of the regions and of the containing boxes. Specifically, we define $BB_{ij}$ to be the tight bounding box around $r_i$ and $r_j$. Now $s_{fill}(r_i, r_j)$ is one minus the fraction of the image contained in $BB_{ij}$ which is not covered by the regions $r_i$ and $r_j$:
$$s_{fill}(r_i, r_j) = 1 - \frac{\operatorname{size}(BB_{ij}) - \operatorname{size}(r_i) - \operatorname{size}(r_j)}{\operatorname{size}(im)}$$
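
The fill measure needs only region sizes and bounding boxes. The sketch below assumes boxes are stored as `(x_min, y_min, x_max, y_max)` with inclusive pixel coordinates; that layout and the function names are choices made for this example.

```python
def joint_box_size(box_i, box_j):
    """Pixel area of the tight bounding box BB_ij around two boxes."""
    x0, y0 = min(box_i[0], box_j[0]), min(box_i[1], box_j[1])
    x1, y1 = max(box_i[2], box_j[2]), max(box_i[3], box_j[3])
    return (x1 - x0 + 1) * (y1 - y0 + 1)


def fill_similarity(size_i, box_i, size_j, box_j, image_size):
    """One minus the fraction of the image inside BB_ij not covered by r_i or r_j."""
    gap = joint_box_size(box_i, box_j) - size_i - size_j
    return 1.0 - gap / image_size


image_size = 100 * 100
square = 10 * 10                               # each region is a solid 10 x 10 square
left = (0, 0, 9, 9)
side_by_side = fill_similarity(square, left, square, (10, 0, 19, 9), image_size)
diagonal = fill_similarity(square, left, square, (10, 10, 19, 19), image_size)
```

Two squares placed side by side fill their joint box completely and score 1.0, while two squares touching only at a corner leave half of a 20 x 20 box empty and score about 0.98.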
In this paper, our final similarity measure is a combination of the above four:
$$s(r_i, r_j) = a_1 s_{colour}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)$$
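
Combining the measures is a plain weighted sum. The sketch below assumes the weights $a_1, \dots, a_4$ act as 0/1 switches that turn each measure on or off; the dictionary keys and the example scores are hypothetical.

```python
def combined_similarity(scores, weights):
    """Weighted sum of the individual similarity measures.

    scores  : dict with keys 'colour', 'texture', 'size', 'fill', values in [0, 1]
    weights : dict with the same keys holding a_1 ... a_4 (assumed 0 or 1 here)
    """
    return sum(weights[name] * scores[name] for name in weights)


scores = {"colour": 0.5, "texture": 0.25, "size": 1.0, "fill": 0.75}
all_measures = {"colour": 1, "texture": 1, "size": 1, "fill": 1}
size_and_fill = {"colour": 0, "texture": 0, "size": 1, "fill": 1}

s_all = combined_similarity(scores, all_measures)
s_geometry = combined_similarity(scores, size_and_fill)
```

Since every measure lies in [0, 1], the combined score lies between 0 and the number of enabled measures, so different weight settings yield different grouping strategies from the same initial regions.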