Optical Flow Estimation using a Spatial Pyramid Network

Venue: CVPR 2017

Authors: Anurag Ranjan, Michael J. Black
Affiliation: Max Planck Institute for Intelligent Systems

The goal of this paper is to calculate optical flow for both large and small/slow motions using deep learning.

Method

The main idea is to use a coarse-to-fine pyramid structure.
The two images are resized into a pyramid, and optical flow is estimated from the coarsest level downwards, with the flow from the level above used to warp the second image at each level. Each level has its own deep network. Large motions are handled at the coarser levels of the pyramid, so each network only has to estimate small motions, i.e. corrections (residuals) to the upsampled flow from the previous level.

Define the following:

  • \(d\) is a downsampling operator that reduces an \(m \times n\) image to \(m/2 \times n/2\)
  • \(u\) is the corresponding upsampling operator
  • \(w(I, V)\) warps image \(I\) according to flow \(V\)
  • \(\{G_0, \dots, G_K\}\) are the convolutional networks which compute residual flows

At each level \(k\), the network \(G_k\) computes a residual flow \[ \begin{equation} v_k = G_k(I_k^1, w(I_k^2, u(V_{k-1})), u(V_{k-1})) \end{equation} \] This residual flow is added to the upsampled flow from the previous level: \[ \begin{equation} V_k = u(V_{k-1}) + v_k \end{equation} \]
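Below is a minimal PyTorch sketch of this coarse-to-fine loop. The helper names (`downsample`, `upsample`, `warp`, `spynet_inference`) and the bilinear warping/interpolation choices are illustrative assumptions standing in for \(d\), \(u\), \(w\), and \(\{G_0, \dots, G_K\}\); this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def downsample(img):
    """d: halve the spatial resolution."""
    return F.avg_pool2d(img, kernel_size=2, stride=2)

def upsample(flow):
    """u: double the spatial resolution and rescale the flow vectors accordingly."""
    return 2.0 * F.interpolate(flow, scale_factor=2, mode="bilinear", align_corners=False)

def warp(img, flow):
    """w(I, V): bilinearly warp image I according to flow V (in pixels)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)   # (2, H, W), channel 0 = x
    coords = grid.unsqueeze(0) + flow                            # absolute sampling coordinates
    # normalize to [-1, 1] for grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((coords_x, coords_y), dim=-1), align_corners=True)

def spynet_inference(I1, I2, networks):
    """Run the pyramid coarse-to-fine; networks[0] is the coarsest-level G_0."""
    K = len(networks)
    # Build image pyramids with index 0 = coarsest level
    # (assumes H and W are divisible by 2**(K-1)).
    pyr1, pyr2 = [I1], [I2]
    for _ in range(K - 1):
        pyr1.insert(0, downsample(pyr1[0]))
        pyr2.insert(0, downsample(pyr2[0]))

    # zero initial flow at the coarsest level
    V_up = torch.zeros(I1.shape[0], 2, *pyr1[0].shape[-2:], device=I1.device)
    for k in range(K):
        I2_warped = warp(pyr2[k], V_up)                                  # w(I_k^2, u(V_{k-1}))
        v_k = networks[k](torch.cat([pyr1[k], I2_warped, V_up], dim=1))  # residual flow
        V = V_up + v_k                                                   # V_k = u(V_{k-1}) + v_k
        if k < K - 1:
            V_up = upsample(V)                                           # u(V_k) for the next level
    return V
```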

Architecture

They train 5 convolutional networks: \(\{G_0, ..., G_4\}\).
Each network has 5 \(7\times 7\) convolutional layers, each followed by a ReLU except the last. The numbers of output channels are \(\{32, 64, 32, 16, 2\}\). The input to each network has 8 channels: 3 for each of the two images and 2 for the upsampled flow from the previous level.
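A minimal PyTorch sketch of a single level network \(G_k\) following this description; the class name and the `padding=3` choice (to keep the output at the input resolution) are assumptions.

```python
import torch.nn as nn

class FlowLevelNet(nn.Module):
    """One pyramid-level network G_k: five 7x7 conv layers with output channels
    {32, 64, 32, 16, 2}; ReLU after every layer except the last.
    Input: 8 channels (two RGB frames + 2-channel upsampled flow).
    Output: 2-channel residual flow v_k."""
    def __init__(self):
        super().__init__()
        channels = [8, 32, 64, 32, 16, 2]
        layers = []
        for i in range(5):
            layers.append(nn.Conv2d(channels[i], channels[i + 1], kernel_size=7, padding=3))
            if i < 4:                      # no ReLU after the final layer
                layers.append(nn.ReLU(inplace=True))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                  # x: (B, 8, H, W)
        return self.net(x)
```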

The networks are trained independently and sequentially.
They use Adam with \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\).
The batch size is 32, with 4000 iterations per epoch.
The learning rate is \(10^{-4}\) for the first 60 epochs, then \(10^{-5}\) until convergence.
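A hedged sketch of training one level with these hyperparameters; `train_level`, `train_loader`, and the average end-point-error loss are assumptions for illustration, and the loader is assumed to already yield the 8-channel inputs and corresponding target flows for this level.

```python
import torch

def epe_loss(pred_flow, gt_flow):
    """Average end-point error: mean Euclidean distance between flow vectors."""
    return torch.norm(pred_flow - gt_flow, p=2, dim=1).mean()

def train_level(model, train_loader, num_epochs=120):
    # Adam with beta1 = 0.9, beta2 = 0.999; lr = 1e-4, dropped to 1e-5 after 60 epochs
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    for epoch in range(num_epochs):        # num_epochs is a stand-in for "until convergence"
        if epoch == 60:
            for group in optimizer.param_groups:
                group["lr"] = 1e-5
        for inputs, target_flow in train_loader:   # batch size 32, ~4000 iterations per epoch
            optimizer.zero_grad()
            pred_flow = model(inputs)              # 8-channel input -> residual flow
            loss = epe_loss(pred_flow, target_flow)
            loss.backward()
            optimizer.step()
```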

They also apply the following data augmentations (see the sketch after this list):

  • random scaling by a factor in \([1, 2]\)
  • random rotations in \([-17^\circ, 17^\circ]\)
  • random crops
  • additive white Gaussian noise
  • color jitter
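A partial sketch of the photometric augmentations (additive Gaussian noise and color jitter); the jitter and noise magnitudes are assumptions, and the geometric augmentations (scaling, rotation, cropping) are omitted here because they must also be applied consistently to the ground-truth flow, with the flow vectors rescaled and rotated to match.

```python
import torch
import torchvision.transforms as T

# jitter strengths are assumed; the paper does not list exact magnitudes
color_jitter = T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2)

def photometric_augment(frame1, frame2, noise_std=0.02):
    """frame1, frame2: float tensors of shape (3, H, W) with values in [0, 1]."""
    pair = torch.stack([frame1, frame2])               # (2, 3, H, W): same jitter for both frames
    pair = color_jitter(pair)
    pair = pair + noise_std * torch.randn_like(pair)   # additive white Gaussian noise
    return pair.clamp(0.0, 1.0).unbind(0)              # returns (frame1_aug, frame2_aug)
```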

Evaluation

They train on the Flying Chairs dataset and report performance on Flying Chairs, Sintel, Middlebury, and KITTI.