Distinctive Image Features from Scale-Invariant Keypoints: Difference between revisions

Created page with "Distinctive Image Features from Scale-Invariant Keypoints Also known as Scale-invariant feature transform (SIFT) or SIFT features. Author: David G. Lowe Affiliation: Univer..."
 
 
(2 intermediate revisions by the same user not shown)
Line 9: Line 9:
==Algorithm==
==Algorithm==
===Scale-space extrema detection===
===Scale-space extrema detection===
\(L(x,y,\sigma)\) is the scale space of an image. This is generated by the convolution of a Gaussian with an input image:
<math display="block">
\begin{align}
L(x,y,\sigma) = G(x, y, \sigma) * I(x,y)
\end{align}
</math>
where <math display="inline">G(x,y,\sigma) = \frac{1}{2 \pi \sigma^2}\exp{-(x^2+y^2)/(2\sigma^2)}</math>.


This is used to compute the difference-of-Gaussian (DoG) for two scales separated by \(k\):
<math display="block">
\begin{align}
D(x,y,\sigma) &= (G(x,y,k\sigma) - G(x,y,\sigma)) * I(x,y)\nonumber \\
&= L(x,y,k\sigma) - L(x,y,\sigma)
\end{align}
</math>
For your image, you create multiple octaves by down-sampling the image by 2.
Within each octave, you have multiple scales with varying \(\sigma\). 
E.g. if you have 5 scales per octave, then you will have 4 DoG images.
To find local min/max, compare each pixel in each DoG image to its 8 neighboring pixels within the image and 9 neighboring pixels in each neighboring scale. In total, it will be the min/max among (8+9+9=26) neighboring pixels. This is only done on the intermediate DoG images (i.e. excluding the first and last).
In his paper, Lowe uses \(k=\sqrt{2}=2^{1/s}\) with \(s=2\). This means generating \(s+3=5\) Gaussian blurred images for each octave.


===Keypoint localization===
===Keypoint localization===
===Orientation assignment===
===Orientation assignment===
* For every keypoint, take a patch around the keypoint and compute a gradient (norm, angle) for each pixel in the patch.
* Build a histogram of gradients, binned along angles. Smooth this histogram by fitting a parabola to 3 values closest to each peak.
* The peak of the histogram is the ''dominant orientation''
===Keypoint descriptor===
===Keypoint descriptor===
# First we do normalization:
#* Rotate the window to standard orientation (orientation up)
#* Scaled window size
# For an 16x16 patch around the keypoint, compute gradients (direction, norm) and weight by gaussian distance to the center.
# Split the 16x16 patch to 4x4 patches (each of 4x4 pixels) and generate a histogram (8-dim vector) for each patch. In total, you will have 16 histograms, each a 8-dim vector for a total of 128. Concatenate these to a 128-dim feature vector.
# Normalize the feature vector to unit length.
;Notes
* SIFT is hard to implement due to all the details. Always use a library.
* The smoothness from Gaussian weighting is important so that the descriptor doesn't change too much if the location of the keypoint isn't exactly the same between images.
* Gradient magnitudes are clipped so the histogram doesn't overweigh a single gradient.
* Normalization to 1 is important for being robust to illumination change.