Distinctive Image Features from Scale-Invariant Keypoints


Also known as the scale-invariant feature transform (SIFT); the extracted features are called SIFT features.

Author: David G. Lowe
Affiliation: University of British Columbia

Algorithm

Scale-space extrema detection

\(L(x,y,\sigma)\) is the scale space of an image, generated by convolving a Gaussian with the input image: \[ \begin{align} L(x,y,\sigma) = G(x, y, \sigma) * I(x,y) \end{align} \] where \(G(x,y,\sigma) = \frac{1}{2 \pi \sigma^2} e^{-(x^2+y^2)/(2\sigma^2)}\).

This is used to compute the difference-of-Gaussian (DoG) for two scales separated by \(k\): \[ \begin{align} D(x,y,\sigma) &= (G(x,y,k\sigma) - G(x,y,\sigma)) * I(x,y)\nonumber \\ &= L(x,y,k\sigma) - L(x,y,\sigma) \end{align} \]
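
A minimal sketch of one DoG computation in Python, assuming scipy is available; `sigma` and `k` are free parameters here, not the paper's values:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog(image, sigma, k):
    """Difference-of-Gaussian for one pair of adjacent scales:
    D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)."""
    img = image.astype(np.float64)
    L1 = gaussian_filter(img, sigma)       # L(x, y, sigma)
    L2 = gaussian_filter(img, k * sigma)   # L(x, y, k*sigma)
    return L2 - L1
```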

For your image, you create multiple octaves by repeatedly down-sampling the image by a factor of 2. Within each octave, you have multiple scales with varying \(\sigma\).
E.g. if you have 5 Gaussian-blurred images per octave, then you will have 4 DoG images.

To find local minima/maxima, compare each pixel in each DoG image to its 8 neighboring pixels within the same image and the 9 nearest pixels in each of the two adjacent scales, i.e. against 8 + 9 + 9 = 26 neighbors in total. This is only done on the intermediate DoG images (i.e. excluding the first and last), since only those have a scale both above and below.
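
A brute-force sketch of this test on one octave's stack of DoG images (`dogs` is a list of 2D arrays at increasing scale); real implementations vectorize this:

```python
import numpy as np

def local_extrema(dogs):
    """Return (scale, y, x) of pixels that are the unique min or max
    among their 26 neighbors (8 in the same image, 9 in each of the
    two adjacent scales). First and last DoG images are skipped."""
    stack = np.stack(dogs)                        # shape (scales, H, W)
    extrema = []
    for s in range(1, stack.shape[0] - 1):
        for y in range(1, stack.shape[1] - 1):
            for x in range(1, stack.shape[2] - 1):
                cube = stack[s-1:s+2, y-1:y+2, x-1:x+2]
                v = stack[s, y, x]
                if (v == cube.max() or v == cube.min()) and np.sum(cube == v) == 1:
                    extrema.append((s, y, x))
    return extrema
```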

In his paper, Lowe uses \(k = 2^{1/s}\) with \(s = 2\), i.e. \(k = \sqrt{2}\). This means generating \(s+3 = 5\) Gaussian-blurred images per octave, which yields \(s+2 = 4\) DoG images and leaves \(s = 2\) intermediate DoG images, so that extrema detection covers the full octave.
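
A sketch of this schedule, assuming a raw input image with no pre-existing blur (the paper additionally assumes the camera image has \(\sigma = 0.5\) and first doubles its size, both omitted here); `sigma0 = 1.6` is the paper's value:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, n_octaves=4, s=2, sigma0=1.6):
    """Build s + 3 Gaussian images per octave with total blur
    sigma_i = sigma0 * k**i, k = 2**(1/s); adjacent levels are
    subtracted to get the s + 2 DoG images of each octave."""
    k = 2.0 ** (1.0 / s)
    base = gaussian_filter(image.astype(np.float64), sigma0)
    octaves = []
    for _ in range(n_octaves):
        levels = [base]
        for i in range(1, s + 3):
            # incremental blur so level i has total sigma = sigma0 * k**i
            extra = sigma0 * k**(i - 1) * np.sqrt(k * k - 1.0)
            levels.append(gaussian_filter(levels[-1], extra))
        octaves.append(levels)
        # next octave: down-sample the level with sigma = 2 * sigma0 by 2;
        # in the new pixel units its blur is sigma0 again
        base = levels[s][::2, ::2]
    return octaves
```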

Keypoint localization
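
  • Fit a 3D quadratic (a Taylor expansion of \(D\)) around each candidate extremum to interpolate its position and scale to sub-pixel accuracy.
  • Reject low-contrast candidates: the paper discards points with \(|D(\hat{\mathbf{x}})| < 0.03\), assuming image values in \([0, 1]\).
  • Reject edge-like candidates whose ratio of principal curvatures is too large, computed from the 2x2 spatial Hessian of \(D\) (the paper uses \(r = 10\)).

A sketch of the two rejection tests; for simplicity the contrast test below uses the raw DoG value rather than the interpolated \(D(\hat{\mathbf{x}})\):

```python
import numpy as np

def passes_tests(D, s, y, x, contrast_thresh=0.03, r=10.0):
    """Reject low-contrast and edge-like candidates.
    D is one octave's stack of DoG images, shape (scales, H, W)."""
    if abs(D[s, y, x]) < contrast_thresh:
        return False
    # 2x2 spatial Hessian via finite differences
    dxx = D[s, y, x+1] - 2 * D[s, y, x] + D[s, y, x-1]
    dyy = D[s, y+1, x] - 2 * D[s, y, x] + D[s, y-1, x]
    dxy = (D[s, y+1, x+1] - D[s, y+1, x-1]
           - D[s, y-1, x+1] + D[s, y-1, x-1]) / 4.0
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                 # principal curvatures differ in sign
        return False
    return tr * tr / det < (r + 1) ** 2 / r
```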

Orientation assignment

  • For every keypoint, take a patch around the keypoint and compute a gradient (norm, angle) for each pixel in the patch.
  • Build a histogram of the gradient angles, weighted by gradient magnitude. To localize each peak more precisely, fit a parabola to the 3 histogram values closest to the peak and interpolate its position.
  • The highest peak of the histogram is the dominant orientation; any other local peak within 80% of the highest creates an additional keypoint with that orientation (see the sketch after this list).
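
A sketch of this, using the paper's 36 bins and a Gaussian weight with \(\sigma\) equal to 1.5 times the keypoint scale; `patch` is assumed to be cut from the Gaussian image \(L\) at the keypoint's scale, centered on the keypoint:

```python
import numpy as np

def dominant_orientation(patch, keypoint_sigma, n_bins=36):
    """Magnitude- and Gaussian-weighted histogram of gradient angles;
    the highest bin is refined by parabolic interpolation."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)

    h, w = patch.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    weight = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2)
                    / (2 * (1.5 * keypoint_sigma) ** 2))

    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi),
                           weights=mag * weight)

    # fit a parabola to the peak bin and its two neighbors
    p = int(np.argmax(hist))
    l, r = hist[(p - 1) % n_bins], hist[(p + 1) % n_bins]
    offset = 0.5 * (l - r) / (l - 2 * hist[p] + r + 1e-12)
    return (p + offset) * (2 * np.pi / n_bins)   # angle in radians
```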

Keypoint descriptor

  1. First we normalize the patch:
    • Rotate the window to the keypoint's dominant orientation, so the descriptor is rotation-invariant
    • Scale the window according to the keypoint's scale, so the descriptor is scale-invariant
  2. For a 16x16 patch around the keypoint, compute gradients (direction, norm) and weight each norm by a Gaussian falloff with distance from the patch center.
  3. Split the 16x16 patch into a 4x4 grid of cells (each 4x4 pixels) and build an orientation histogram (an 8-dim vector) for each cell. In total you get 16 histograms of 8 dimensions each; concatenate them into a 128-dim feature vector.
  4. Normalize the feature vector to unit length (a sketch of steps 2-4 follows this list).
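
A sketch of steps 2-4, assuming `mag` and `ang` are 16x16 arrays of gradient magnitudes and angles already rotated to the keypoint's orientation; the trilinear interpolation between neighboring cells and bins used by real implementations is omitted:

```python
import numpy as np

def sift_descriptor(mag, ang):
    """128-dim descriptor: 4x4 grid of cells x 8 orientation bins.
    `ang` is relative to the keypoint orientation, in [0, 2*pi)."""
    # step 2: Gaussian weighting centered on the 16x16 patch
    yy, xx = np.mgrid[0:16, 0:16]
    w = np.exp(-((xx - 7.5) ** 2 + (yy - 7.5) ** 2) / (2 * 8.0 ** 2))
    wmag = mag * w

    # step 3: one 8-bin histogram per 4x4-pixel cell
    desc = []
    for cy in range(4):
        for cx in range(4):
            cell_m = wmag[4*cy:4*cy+4, 4*cx:4*cx+4]
            cell_a = ang[4*cy:4*cy+4, 4*cx:4*cx+4]
            hist, _ = np.histogram(cell_a, bins=8, range=(0, 2 * np.pi),
                                   weights=cell_m)
            desc.extend(hist)
    desc = np.asarray(desc)                      # 16 * 8 = 128 dims

    # step 4: normalize, clip at 0.2 (paper's value), renormalize
    desc /= np.linalg.norm(desc) + 1e-12
    desc = np.minimum(desc, 0.2)
    desc /= np.linalg.norm(desc) + 1e-12
    return desc
```
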
Notes
  • SIFT is hard to implement due to all the details. Always use a library.
  • The smoothness from the Gaussian weighting is important so that the descriptor doesn't change too much when the keypoint's location shifts slightly between images.
  • Descriptor values are clipped after normalization (at 0.2 in the paper) and the vector is renormalized, so that no single large gradient dominates the histogram.
  • Normalizing to unit length is important for robustness to illumination change: a uniform contrast change scales all gradients by the same factor, which the normalization cancels.