Notes for SIFT algorithm

Declare:

  1. The SIFT method was proposed by David G. Lowe. One can refer to his two papers "Object recognition from local scale-invariant features" and "Distinctive Image Features from Scale-Invariant Keypoints".
  2. All pictures in this blog are from the internet. The links from which these figures were copied are given at the end of this note; they lead to good interpretations of the SIFT algorithm.
  3. Thanks to the authors. You may send an email to me1 if you think the content infringes your rights.
  4. I am new to image processing. Any discussion or comments are welcome.

SIFT (scale-invariant feature transform), first proposed in 1999 by David G. Lowe, was later refined and published in the paper titled "Distinctive Image Features from Scale-Invariant Keypoints" (2004), which has been cited 49482 times to date.

The goal of the SIFT method is to find stable keypoints among the image pixels, and to generate descriptors of these keypoints that are as invariant as possible to remaining variations, such as changes in illumination or 3D viewpoint. The major steps are as follows:

  1. Scale-space extrema detection: Using the DoG function to identify potential interest points that are invariant to scale and orientation.
  2. Keypoint localization: At each candidate location, determine the location and scale, and select keypoints based on measures of their stability.
  3. Orientation assignment: Assign one or more orientations to each keypoint location based on local image gradient directions.
  4. Keypoint descriptor: Measure the local image gradients at the selected scale in the region around each keypoint, and then transform the representation into a descriptor vector that allows for significant levels of local shape distortion and change in illumination.

Step 1 Detection of scale-space extrema

We want to find locations and scales that can be repeatably assigned under differing views of the same object.

1.1 Scale space and DoG

Detecting locations that are invariant to scale change can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space. The scale space of an image $I(x,y)$ is
$$L(x,y,\sigma) = G(x,y,\sigma) * I(x,y),$$
where $*$ is the convolution operation in $x$ and $y$, and
$$G(x,y,\sigma)=\frac{1}{2\pi \sigma^2}e^{-\frac{x^2+y^2}{2\sigma^2}}.$$
It has been shown that the maxima and minima of the scale-normalized Laplacian of Gaussian, $\sigma^2 \nabla^2 G$, produce the most stable image features. We use the difference-of-Gaussian (DoG) function $D(x,y,\sigma)$ to approximate $\sigma^2 \nabla^2 G$, with
$$D(x,y,\sigma)=L(x,y,k\sigma)-L(x,y,\sigma).$$
In Lowe (2004), it is shown that
$$G(x,y,k\sigma)-G(x,y,\sigma)\approx (k-1)\,\sigma^2 \nabla^2 G.$$
The construction of the DoG pyramid is illustrated in the following figure. [Figure: construction of the Gaussian and difference-of-Gaussian pyramids]
For details one can refer to Lowe (2004).
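As a concrete sketch of this step, one octave of the DoG pyramid can be built with plain NumPy. The helper names and the test image below are illustrative, not from Lowe's implementation; the defaults $\sigma = 1.6$ and $s = 3$ follow the parameter choices discussed later in this note.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at about 3 sigma and normalized."""
    radius = int(3.0 * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Compute G(x, y, sigma) * I using the separability of the 2-D
    Gaussian: convolve every row, then every column, with the 1-D kernel."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(np.convolve, 1, image, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, tmp, k, mode="same")

def build_dog_octave(image, sigma=1.6, s=3):
    """One octave of the pyramid: s + 3 Gaussian images at scales
    sigma * k^i with k = 2^(1/s), and their s + 2 adjacent differences."""
    k = 2.0 ** (1.0 / s)
    gaussians = [gaussian_blur(image, sigma * k**i) for i in range(s + 3)]
    dogs = [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
    return gaussians, dogs
```

With $s = 3$ this produces 6 Gaussian images and 5 DoG images per octave; the next octave would start from the Gaussian image with twice the initial sigma, downsampled by 2.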

1.2 Local extrema detection

In order to detect the local maxima and minima of $D(x,y,\sigma)$, each sample point in the DoG images is compared to its eight neighbors in the current image and to its nine neighbors in each of the scales above and below (26 neighbors in total). See the following figure.
[Figure: a sample point (marked with X) compared against its 26 neighbors in 3×3 regions at the current and adjacent scales]
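The 26-neighbor comparison can be sketched as follows, assuming `dogs` is the list of DoG images from one octave; strict inequalities are used, so plateau ties are rejected.

```python
import numpy as np

def is_local_extremum(dogs, i, y, x):
    """True if dogs[i][y, x] is strictly greater (or strictly smaller)
    than all 26 neighbors: 8 in the same image, 9 in each adjacent scale."""
    cube = np.stack([d[y-1:y+2, x-1:x+2] for d in dogs[i-1:i+2]])
    v = dogs[i][y, x]
    neighbors = np.delete(cube.ravel(), 13)  # drop the center sample itself
    return bool((v > neighbors).all() or (v < neighbors).all())
```

Note that this test is only meaningful for $1 \le i \le$ `len(dogs) - 2`, since the top and bottom DoG images have no scale above or below.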

1.3 Frequency of sampling in scale (parameter s=3)

It is shown that the highest repeatability is obtained when sampling 3 scales per octave. The reason repeatability does not continue to improve as more scales are sampled is that sampling more scales results in many more local extrema being detected; these extrema are on average less stable and therefore less likely to be detected in the transformed image.

1.4 Frequency of sampling in the spatial domain (parameter $\sigma = 1.6$)

Experiments show that $\sigma = 1.6$ provides close to optimal repeatability. To make full use of the input, the image can be expanded to create more sample points than were present in the original. Therefore, the size of the original image is doubled using linear interpolation prior to building the first level of the pyramid. This image doubling increases the number of stable keypoints by almost a factor of 4.

Step 2 Accurate keypoint localization

Once a keypoint candidate has been found by comparing a pixel to its neighbours, the next step is to perform a detailed fit to the nearby data for location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge.

2.1 Reject low contrast keypoints

The Taylor expansion (up to the quadratic terms) of the scale-space function $D(x,y,\sigma)$ at a sample point is
$$D(\mathbf{x})=D+\frac{\partial D^{\top}}{\partial \mathbf{x}}\mathbf{x}+\frac{1}{2}\mathbf{x}^{\top}\frac{\partial^2 D}{\partial \mathbf{x}^2}\mathbf{x},$$
where $D$ and its derivatives are evaluated at the sample point and $\mathbf{x}=(x,y,\sigma)^{\top}$ is the offset from this point. The location of the extremum, $\hat{\mathbf{x}}$, is determined by taking the derivative of this function with respect to $\mathbf{x}$ and setting it to zero, giving
$$\hat{\mathbf{x}}=-\left(\frac{\partial^2 D}{\partial \mathbf{x}^2}\right)^{-1}\frac{\partial D}{\partial \mathbf{x}}.$$
If the offset $\hat{\mathbf{x}}$ is larger than 0.5 in any dimension ($x$, $y$ or $\sigma$), the extremum lies closer to a different sample point; in that case the sample point is changed and the interpolation is performed about that point instead. The final offset $\hat{\mathbf{x}}$ is added to the location of its sample point to obtain the interpolated estimate for the location of the extremum. The function value at the extremum, $D(\hat{\mathbf{x}})$, is useful for rejecting unstable extrema with low contrast: all extrema with $|D(\hat{\mathbf{x}})|$ less than 0.03 are discarded (assuming pixel values in the range $[0,1]$).
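The refinement above can be sketched with finite differences over three adjacent DoG images. The function name and the $(\sigma, y, x)$ ordering are my own choices; the 0.03 contrast threshold is the one from the text.

```python
import numpy as np

def refine_and_check_contrast(dogs, i, y, x, thresh=0.03):
    """Fit a 3-D quadratic to D around sample (x, y, sigma) using finite
    differences, solve for the offset x_hat, and apply the contrast test."""
    d = np.stack(dogs[i-1:i+2]).astype(float)  # (3, H, W); center layer d[1]
    # gradient of D by central differences, ordered (sigma, y, x)
    grad = 0.5 * np.array([
        d[2, y, x] - d[0, y, x],
        d[1, y+1, x] - d[1, y-1, x],
        d[1, y, x+1] - d[1, y, x-1],
    ])
    # Hessian entries by central differences
    dss = d[2, y, x] - 2.0*d[1, y, x] + d[0, y, x]
    dyy = d[1, y+1, x] - 2.0*d[1, y, x] + d[1, y-1, x]
    dxx = d[1, y, x+1] - 2.0*d[1, y, x] + d[1, y, x-1]
    dsy = 0.25 * (d[2, y+1, x] - d[2, y-1, x] - d[0, y+1, x] + d[0, y-1, x])
    dsx = 0.25 * (d[2, y, x+1] - d[2, y, x-1] - d[0, y, x+1] + d[0, y, x-1])
    dyx = 0.25 * (d[1, y+1, x+1] - d[1, y+1, x-1]
                  - d[1, y-1, x+1] + d[1, y-1, x-1])
    H = np.array([[dss, dsy, dsx],
                  [dsy, dyy, dyx],
                  [dsx, dyx, dxx]])
    x_hat = -np.linalg.solve(H, grad)            # offset from the sample point
    contrast = d[1, y, x] + 0.5 * grad @ x_hat   # D(x_hat)
    return x_hat, abs(contrast) >= thresh
```

A full implementation would also re-run the fit at the neighboring sample when any component of `x_hat` exceeds 0.5, as described above.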

2.2 Eliminating edge responses

The DoG function has a strong response along edges, even when the location along the edge is poorly determined and therefore unstable to small amounts of noise. A poorly defined peak in the DoG function will have a large principal curvature across the edge but a small one in the perpendicular direction.

[Figure: a DoG edge response, poorly localized along the edge]

The principal curvatures can be computed from the $2\times 2$ Hessian matrix
$$\mathbf{H}=\begin{bmatrix} D_{xx} & D_{xy}\\ D_{xy} & D_{yy} \end{bmatrix}.$$
Let $r$ be the ratio between the largest-magnitude eigenvalue $\alpha$ and the smaller one $\beta$, i.e. $\alpha=r\beta$. Then
$$\frac{\mathrm{Tr}(\mathbf{H})^2}{\mathrm{Det}(\mathbf{H})}=\frac{(\alpha+\beta)^2}{\alpha\beta}=\frac{(r+1)^2}{r}.$$
To check that the ratio of the principal curvatures is below some threshold $r$, we only need to check whether
$$\frac{\mathrm{Tr}(\mathbf{H})^2}{\mathrm{Det}(\mathbf{H})}<\frac{(r+1)^2}{r}.$$
Lowe (2004) eliminates keypoints that have a ratio between the principal curvatures greater than 10.
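The test is cheap, since only the trace and determinant of $\mathbf{H}$ are needed and no eigenvalues have to be computed. A finite-difference sketch, with $r = 10$ as in the text:

```python
import numpy as np

def passes_edge_test(dog, y, x, r=10.0):
    """Reject edge-like keypoints via the trace/determinant test on the
    2x2 Hessian of the DoG image at (y, x)."""
    dxx = dog[y, x+1] - 2.0*dog[y, x] + dog[y, x-1]
    dyy = dog[y+1, x] - 2.0*dog[y, x] + dog[y-1, x]
    dxy = 0.25 * (dog[y+1, x+1] - dog[y+1, x-1]
                  - dog[y-1, x+1] + dog[y-1, x-1])
    tr, det = dxx + dyy, dxx*dyy - dxy*dxy
    if det <= 0:  # curvatures of opposite sign (or degenerate): reject
        return False
    return tr*tr / det < (r + 1.0)**2 / r
```

An isotropic blob (similar curvature in both directions) passes, while a ridge (large curvature across the edge only) is rejected.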

Step 3 Orientation assignment

By assigning a consistent orientation to each keypoint based on local image properties, the keypoint descriptor can be represented relative to this orientation and thereby achieve invariance to image rotation. The scale of the keypoint is used to select the Gaussian-smoothed image $L$ with the closest scale, so that all computations are performed in a scale-invariant manner. For each image sample $L(x,y)$ at this scale, the gradient magnitude $m(x,y)$ and orientation $\theta(x,y)$ are precomputed using pixel differences:
$$m(x,y)=\sqrt{(L(x+1,y)-L(x-1,y))^2+(L(x,y+1)-L(x,y-1))^2},$$
$$\theta(x,y)=\tan^{-1}\frac{L(x,y+1)-L(x,y-1)}{L(x+1,y)-L(x-1,y)}.$$
An orientation histogram is formed from the gradient orientations of sample points within a region around the keypoint. The orientation histogram has 36 bins covering the 360-degree range of orientations. Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a $\sigma$ that is 1.5 times the scale of the keypoint. Peaks in the orientation histogram correspond to dominant directions of the local gradients. Any local peak that is within $80\%$ of the highest peak is also used to create a keypoint with that orientation. Therefore, for locations with multiple peaks of similar magnitude, there will be multiple keypoints created at the same location and scale but with different orientations. Only about $15\%$ of points are assigned multiple orientations, but these contribute significantly to the stability of matching. Each keypoint can then be described by $(x, y, \sigma, \theta)$, i.e. image location, scale and orientation. Details are given in the following figure.
[Figure: gradient magnitudes and orientations around a keypoint, and the resulting 36-bin orientation histogram]
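A minimal sketch of the histogram construction follows. The fixed square window of `radius` samples is a simplification (Lowe sizes the region by the keypoint scale), and the helper names are my own; the 1.5×-scale Gaussian window and the 80% peak rule are from the text.

```python
import numpy as np

def orientation_histogram(L, cy, cx, scale, radius=8, nbins=36):
    """36-bin gradient-orientation histogram around a keypoint at
    (cy, cx) in the Gaussian-smoothed image L."""
    hist = np.zeros(nbins)
    sigma = 1.5 * scale  # Gaussian window: 1.5 times the keypoint scale
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            if not (0 < y < L.shape[0] - 1 and 0 < x < L.shape[1] - 1):
                continue
            dx = L[y, x+1] - L[y, x-1]
            dy = L[y+1, x] - L[y-1, x]
            m = np.hypot(dx, dy)
            theta = np.degrees(np.arctan2(dy, dx)) % 360.0
            w = np.exp(-((y - cy)**2 + (x - cx)**2) / (2.0 * sigma**2))
            hist[int(theta // (360 // nbins)) % nbins] += w * m
    return hist

def dominant_orientations(hist, rel_thresh=0.8):
    """Bins that are local maxima within 80% of the global peak."""
    n, peaks = len(hist), []
    for i in range(n):
        if hist[i] >= rel_thresh * hist.max() and \
           hist[i] > hist[(i - 1) % n] and hist[i] > hist[(i + 1) % n]:
            peaks.append(i)
    return peaks
```

Each bin index returned by `dominant_orientations` yields one keypoint orientation; the sub-bin refinement is described next.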
Up to now, the orientation assigned to a keypoint is in fact an interval (bin) value associated with a peak or local peak of the histogram. Next, a parabola is fit to the 3 histogram values closest to each peak to interpolate the peak position for better accuracy.

In OpenCV's implementation, to guard against sudden changes of gradient orientation caused by noise, the histogram is smoothed as follows: [Figure: histogram smoothing formula], where $h(i)$ and $H(i)$ denote the $i$th bin of the original and the smoothed histogram respectively.
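Since the formula itself only survives as a figure above, here is a sketch of the circular smoothing as I understand it from OpenCV's SIFT source: a [1, 4, 6, 4, 1]/16 binomial kernel applied with wrap-around over the 36 bins.

```python
import numpy as np

def smooth_histogram(h):
    """Circularly smooth an orientation histogram with the binomial
    kernel [1, 4, 6, 4, 1] / 16 (indices wrap around the 360-degree range)."""
    n = len(h)
    H = np.empty(n)
    for i in range(n):
        H[i] = (h[(i - 2) % n] + h[(i + 2) % n]) / 16.0 \
             + 4.0 * (h[(i - 1) % n] + h[(i + 1) % n]) / 16.0 \
             + 6.0 * h[i] / 16.0
    return H
```

Because the kernel sums to one and wraps circularly, the total histogram mass is preserved; only noise spikes are spread into neighboring bins.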

The coordinates of three points can determine a parabola. See the following figure.
[Figure: parabola fit through the three histogram values around a peak]
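With the three bin values $h_{-1}$, $h_0$, $h_{+1}$ around a peak placed at positions $-1$, $0$, $+1$, the vertex of the parabola through them gives the sub-bin offset of the true peak:

```python
def interpolate_peak(h_left, h_center, h_right):
    """Sub-bin offset of the parabola vertex through the three histogram
    values at positions -1, 0, +1; the result lies in [-0.5, 0.5] when the
    center bin is a strict local maximum."""
    denom = h_left - 2.0 * h_center + h_right
    if denom == 0.0:  # degenerate (flat) case: keep the bin center
        return 0.0
    return 0.5 * (h_left - h_right) / denom
```

The refined orientation is then (bin index + offset) times the bin width of 10 degrees.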

Step 4 The local image descriptor

The next step is to compute a descriptor for the local image region that is highly distinctive yet is as invariant as possible to remaining variations.

4.1 Rotation invariance

[Figure: keypoint E with neighboring samples A, B, C, D]
Take the keypoint E, together with the points [A, B, C, D] in its neighborhood in the figure above, as an example. If the original image is rotated 90 degrees clockwise, we get [D, A, B, C] after rotation, which is totally different from the original arrangement of samples around the keypoint E. Therefore, we rotate the coordinates such that the orientation of the keypoint is parallel to the x-axis, which achieves rotation invariance. The samples around the keypoint after rotation are illustrated as follows.
[Figure: samples around keypoint E after rotating the coordinates]
The coordinates of the points in the rotated image can be obtained by
$$x' = x\cos\theta + y\sin\theta, \qquad y' = -x\sin\theta + y\cos\theta,$$
where $\theta$ is the angle between the orientation of the keypoint and the x-axis.
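As a sketch, the same rotation applied to sample offsets relative to the keypoint (rotating by $-\theta$ so that the keypoint orientation lands on the x-axis; `rotate_offsets` is an illustrative helper, not from the paper):

```python
import numpy as np

def rotate_offsets(points, theta):
    """Rotate (x, y) sample offsets by -theta so the keypoint orientation
    becomes parallel to the x-axis. theta in radians; points is (N, 2)."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c,  s],
                  [-s, c]])  # rotation by -theta
    return points @ R.T
```

For example, a sample lying in the direction of a keypoint orientation of 90 degrees maps onto the positive x-axis.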

4.2 Gaussian weighting and trilinear interpolation

To avoid sudden changes in the descriptor with small changes in the position of the window, and to give less emphasis to gradients that are far from the center of the descriptor (which are most affected by misregistration errors), we use a Gaussian weighting function with $\sigma$ equal to one half the width of the descriptor window to assign a weight to the magnitude of each sample point, so that the weight falls off smoothly.
The descriptor may also change abruptly when a sample point shifts smoothly from one histogram to another, or from one orientation bin to another.
[Figure: the 4×4 array of orientation histograms forming the descriptor]
For example, suppose we use $4\times 4$ subregions to describe the gradient orientations of the samples, with an 8-bin orientation histogram in each subregion. Each subregion then contributes 8 numbers, one per orientation interval ($[45(n-1), 45n)$ degrees for $n=1,\dots,8$), so $4\times4\times8=128$ numbers describe the gradient information of the samples around the keypoint. If a sample lies near the boundary of its subregion, or its gradient orientation lies near the boundary of a bin ($44^{\circ}$, for example), the value of the descriptor may change abruptly. Hence, we smooth the histogram in the above cube by trilinear interpolation, which distributes the value of each gradient sample into the adjacent bins of the cube. Take the red point in the square as an example. If the distance of this sample from the x coordinate of the center of its subregion is $d_r$, the distance from the y coordinate of the center of its subregion is $d_v$, and the distance of its orientation from the center of its orientation bin is $d_{\theta}$, then its weight is
$$w=m\cdot e^{-\frac{(x')^2+(y')^2}{2\sigma^2}}\cdot(1-d_r)(1-d_v)(1-d_{\theta}).$$
Here $m$ is the original magnitude of the gradient. The distances $d_r$, $d_v$ and $d_{\theta}$ are measured in units of the histogram bin spacing.
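The trilinear distribution can be sketched as follows. Continuous coordinates `r`, `c` (row/column in subregion units) and `o` (orientation in bin units) are assumed to already be relative to the descriptor grid, and `m` is the Gaussian-weighted magnitude; each sample spreads into its 8 surrounding bins with `(1 - d)` factors, and the orientation axis wraps around.

```python
import numpy as np

def distribute_trilinear(hist, r, c, o, m):
    """Spread a weighted gradient magnitude m into the 4x4x8 descriptor
    histogram by trilinear interpolation over row, column and orientation."""
    r0, c0, o0 = int(np.floor(r)), int(np.floor(c)), int(np.floor(o))
    dr, dc, do = r - r0, c - c0, o - o0
    for ir, wr in ((r0, 1.0 - dr), (r0 + 1, dr)):
        if not 0 <= ir < hist.shape[0]:
            continue  # spatial bins outside the 4x4 grid are dropped
        for ic, wc in ((c0, 1.0 - dc), (c0 + 1, dc)):
            if not 0 <= ic < hist.shape[1]:
                continue
            for io, wo in ((o0 % 8, 1.0 - do), ((o0 + 1) % 8, do)):
                hist[ir, ic, io] += m * wr * wc * wo  # orientation wraps
    return hist
```

Because the eight interpolation weights sum to one, a sample that falls fully inside the grid contributes exactly its magnitude `m` in total.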

4.3 Normalization and thresholding

We normalize the descriptor to unit length to reduce the effects of illumination change. A change in image contrast multiplies each pixel value, and hence each gradient, by the same constant, while a brightness change adds a constant to each pixel and does not affect the gradient values at all. Hence normalization makes the descriptor vector invariant to affine changes in illumination.

Non-linear illumination changes can cause large changes in the relative magnitudes of some gradients, but are less likely to affect the gradient orientations. Therefore, we threshold the values in the feature vector so that each is no larger than 0.2, and then renormalize the vector to unit length.
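The whole normalize-clip-renormalize step is one line per stage (the 0.2 threshold is from the text):

```python
import numpy as np

def normalize_descriptor(vec, clip=0.2):
    """Normalize to unit length, clip each component at 0.2 to damp
    non-linear illumination effects, then renormalize to unit length."""
    v = vec / np.linalg.norm(vec)
    v = np.minimum(v, clip)
    return v / np.linalg.norm(v)
```

After clipping, the emphasis shifts from matching large gradient magnitudes to matching the distribution of orientations.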

SIFT detailed explanation (in Chinese), link 1
SIFT detailed explanation (in Chinese), link 2
SIFT detailed explanation (in Chinese), link 3
SIFT detailed explanation (in Chinese), link 4
SIFT and SURF detailed explanation (in Chinese)
SIFT detailed explanation (in Chinese), link 5

SIFT Matlab code link


  1. cxx625188@163.com ↩︎
