Distinctive Image Features from Scale-Invariant Keypoints

From David's Wiki
\( \newcommand{\P}[]{\unicode{xB6}} \newcommand{\AA}[]{\unicode{x212B}} \newcommand{\empty}[]{\emptyset} \newcommand{\O}[]{\emptyset} \newcommand{\Alpha}[]{Α} \newcommand{\Beta}[]{Β} \newcommand{\Epsilon}[]{Ε} \newcommand{\Iota}[]{Ι} \newcommand{\Kappa}[]{Κ} \newcommand{\Rho}[]{Ρ} \newcommand{\Tau}[]{Τ} \newcommand{\Zeta}[]{Ζ} \newcommand{\Mu}[]{\unicode{x039C}} \newcommand{\Chi}[]{Χ} \newcommand{\Eta}[]{\unicode{x0397}} \newcommand{\Nu}[]{\unicode{x039D}} \newcommand{\Omicron}[]{\unicode{x039F}} \DeclareMathOperator{\sgn}{sgn} \def\oiint{\mathop{\vcenter{\mathchoice{\huge\unicode{x222F}\,}{\unicode{x222F}}{\unicode{x222F}}{\unicode{x222F}}}\,}\nolimits} \def\oiiint{\mathop{\vcenter{\mathchoice{\huge\unicode{x2230}\,}{\unicode{x2230}}{\unicode{x2230}}{\unicode{x2230}}}\,}\nolimits} \)

Distinctive Image Features from Scale-Invariant Keypoints Also known as Scale-invariant feature transform (SIFT) or SIFT features.

Author: David G. Lowe
Affiliation: University of British Columbia

Algorithm

Scale-space extrema detection

\(L(x,y,\sigma)\) is the scale space of an image. This is generated by the convolution of a Gaussian with an input image: \[ \begin{align} L(x,y,\sigma) = G(x, y, \sigma) * I(x,y) \end{align} \] where \(G(x,y,\sigma) = \frac{1}{2 \pi \sigma^2}\exp{-(x^2+y^2)/(2\sigma^2)}\).

This is used to compute the difference-of-Gaussian (DoG) for two scales separated by \(k\): \[ \begin{align} D(x,y,\sigma) &= (G(x,y,k\sigma) - G(x,y,\sigma)) * I(x,y)\nonumber \\ &= L(x,y,k\sigma) - L(x,y,\sigma) \end{align} \]

For your image, you create multiple octaves by down-sampling the image by 2. Within each octave, you have multiple scales with varying \(\sigma\).
E.g. if you have 5 scales per octave, then you will have 4 DoG images.

To find local min/max, compare each pixel in each DoG image to its 8 neighboring pixels within the image and 9 neighboring pixels in each neighboring scale. In total, it will be the min/max among (8+9+9=26) neighboring pixels. This is only done on the intermediate DoG images (i.e. excluding the first and last).

In his paper, Lowe uses \(k=\sqrt{2}=2^{1/s}\) with \(s=2\). This means generating \(s+3=5\) Gaussian blurred images for each octave.

Keypoint localization

Orientation assignment

Keypoint descriptor