SfM-Net: Learning of Structure and Motion from Video


SfM-Net: Learning of Structure and Motion from Video (2017)

Authors: Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, Katerina Fragkiadaki. Affiliations: Google, Inria, CMU.

SfM-Net is a geometry-aware neural network for motion estimation in videos.
From two consecutive video frames, the network regresses scene depth, camera rotation and translation, a set of motion masks, and per-object 3D rigid rotations and translations.
These predictions can be converted into 3D scene flow, which projects to 2D optical flow that is used to warp one frame into the other.

Method

They first build two networks to estimate the following for time \(t\):

  • Image depth \(d_t \in [0,\infty)^{w \times h} \).
  • Camera rotation and translation: \(\{R_t^c, t_t^c \}\)
  • K motion masks: \(m_t^k\) for \(k=1,...,K\)
  • K object motions: \(\{R_t^k, t_t^k \}\)

The data above is used as follows:

  • Use the depth to generate a point cloud
  • Transform the point cloud based on object transformations
  • Transform the point cloud based on camera transformations
  • Compute optical flow and do warping
  • Repeat with \(I_{t+1}\) and \(I_{t}\) for backward consistency
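The geometric core of the steps above can be sketched with a pinhole camera model: lift the depth map to a point cloud, apply a rigid motion, and reproject to get the optical flow used for warping. The intrinsics and the toy motion below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (h, w) to a 3D point cloud with a pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    X = (u - cx) / fx * depth
    Y = (v - cy) / fy * depth
    return np.stack([X, Y, depth], axis=-1)  # shape (h, w, 3)

def transform(points, R, t):
    """Apply a rigid motion to every point: p' = R p + t."""
    return points @ R.T + t

def project(points, fx, fy, cx, cy):
    """Project 3D points back to pixel coordinates."""
    u = fx * points[..., 0] / points[..., 2] + cx
    v = fy * points[..., 1] / points[..., 2] + cy
    return np.stack([u, v], axis=-1)

# Optical flow induced by a small sideways camera translation (toy values).
depth = np.ones((4, 5))                         # flat toy depth map
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])     # identity rotation, x-shift
P = backproject(depth, fx=1.0, fy=1.0, cx=2.0, cy=1.5)
P2 = transform(P, R, t)
u0, v0 = np.meshgrid(np.arange(5), np.arange(4))
flow = project(P2, 1.0, 1.0, 2.0, 1.5) - np.stack([u0, v0], axis=-1)
```

With unit depth and a pure x-translation of 0.1, every pixel shifts by 0.1 horizontally, which is the 2D flow the warping step consumes.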

Supervision/Loss Function

They apply several forms of supervision:

  • Self-supervision: Minimize distance between the reference and the warped image
  • Spatial smoothness priors: Penalize the \(\ell_1\) norm of the spatial gradients of the optical flow field, the depth map, and the motion masks
  • Forward-backward consistency constraints: Run the model backwards in time (from \(I_{t+1}\) to \(I_t\)) and require the depth \(d_{t+1}\) to be consistent with \(d_t\)
  • Supervising depth: Minimize the difference between estimated and ground-truth depth
  • Supervising camera motion: Minimize the difference between estimated and ground-truth camera motion
  • Supervising optical flow and object motion: Minimize the difference between estimated and ground-truth optical flow and object motion on synthetic datasets
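The two fully self-supervised terms above are simple to state in code: a photometric reconstruction error between the reference frame and the warped frame, plus an \(\ell_1\) penalty on spatial gradients. This is a minimal sketch; the loss weights are illustrative, not the paper's.

```python
import numpy as np

def photometric_loss(ref, warped):
    """Self-supervision: mean absolute difference between the reference
    frame and the frame warped by the predicted flow."""
    return np.mean(np.abs(ref - warped))

def smoothness_loss(field):
    """L1 penalty on spatial gradients of a predicted field
    (applied to optical flow, depth, and motion masks)."""
    dx = np.abs(field[:, 1:] - field[:, :-1])
    dy = np.abs(field[1:, :] - field[:-1, :])
    return dx.mean() + dy.mean()

ref = np.zeros((4, 4))
warped = np.full((4, 4), 0.5)   # toy warped frame, constant error of 0.5
total = photometric_loss(ref, warped) + 0.1 * smoothness_loss(warped)
```

Because the toy warped frame is constant, the smoothness term is zero and the total loss equals the photometric term, 0.5.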

Architecture

SfM-Net consists of two neural networks:

  • The motion network estimates camera motion, object motion, and object masks.
  • The structure network estimates depth, which can be used to make a point cloud.

Both networks follow a Conv-Deconv (U-Net) structure with skip connections.
See the figure in the paper for more details.
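The shared Conv-Deconv pattern can be shown at the level of shapes: an encoder halves resolution while saving activations, and a decoder doubles it back while fusing those skips. Here average pooling and nearest-neighbour upsampling stand in for the learned strided-conv and deconv layers, so this is only a structural sketch.

```python
import numpy as np

def downsample(x):
    """Stride-2 average pooling stands in for a strided conv layer."""
    return 0.25 * (x[::2, ::2] + x[1::2, ::2] + x[::2, 1::2] + x[1::2, 1::2])

def upsample(x):
    """Nearest-neighbour upsampling stands in for a deconv layer."""
    return np.kron(x, np.ones((2, 2)))

x = np.random.rand(16, 16)
skips = []
for _ in range(3):             # encoder: 16 -> 8 -> 4 -> 2, saving skips
    skips.append(x)
    x = downsample(x)
for s in reversed(skips):      # decoder: 2 -> 4 -> 8 -> 16
    x = upsample(x) + s        # skip connection restores fine detail
```

The skip connections are what let the decoder recover pixel-level detail (sharp depth and mask boundaries) that the bottleneck alone would lose.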

Motion Network

The inputs to the motion network are a pair of video frames \(I_t\) and \(I_{t+1}\), concatenated into a tensor of shape \(380 \times 128 \times 6\).
From this, the motion network predicts the following:

  • Camera rotation and translation: \(\{R_t^c, t_t^c \}\)
  • K motion masks: \(m_t^k\) for \(k=1,...,K\)
  • K object motions: \(\{R_t^k, t_t^k \}\)
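The K motion masks act as soft per-pixel memberships that blend the K rigid object motions before the camera motion is applied. A minimal sketch of that mask-weighted blending (toy shapes and motions, not the paper's parameterization):

```python
import numpy as np

def apply_object_motions(points, masks, Rs, ts):
    """Move each 3D point by a mask-weighted sum of K rigid motions.
    masks: soft memberships in [0, 1], shape (h, w, K).
    Rs: (K, 3, 3) rotations; ts: (K, 3) translations."""
    moved = points.copy()
    for k in range(masks.shape[-1]):
        # displacement of each point under motion k alone
        delta = points @ Rs[k].T + ts[k] - points
        moved += masks[..., k:k + 1] * delta
    return moved

h, w, K = 2, 3, 2
points = np.ones((h, w, 3))
masks = np.zeros((h, w, K))
masks[..., 0] = 1.0                         # every pixel follows object 0
Rs = np.stack([np.eye(3)] * K)
ts = np.array([[0.0, 0.0, 1.0],             # object 0: move 1 unit in z
               [5.0, 0.0, 0.0]])            # object 1: unused here
out = apply_object_motions(points, masks, Rs, ts)
```

Since every pixel is assigned to object 0, all points shift by one unit in depth and object 1's motion has no effect.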


Structure Network

The goal of the structure network is to estimate depth \(d_t \in [0,\infty)^{w \times h} \).
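Since depth must lie in \([0, \infty)\), the network's final layer needs a nonnegative activation. One common choice is a softplus with a small floor to avoid degenerate zero-depth points when backprojecting; the activation and floor here are illustrative assumptions, not necessarily the paper's exact head.

```python
import numpy as np

def depth_head(logits, min_depth=0.01):
    """Map unconstrained network outputs to valid depths in [min_depth, inf).
    Softplus log(1 + e^x) is positive and smooth; min_depth keeps points
    away from the camera center. (Illustrative choice, not the paper's.)"""
    return min_depth + np.log1p(np.exp(logits))

d = depth_head(np.array([-10.0, 0.0, 10.0]))
```

Large negative logits saturate near the floor, large positive logits behave linearly, so gradients stay usable across the whole range.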


Evaluation

They evaluate on the KITTI 2012 and 2015 datasets, the MoSeg dataset, and RGB-D SLAM datasets.
They only provide comparisons against ground truth, not against other approaches.
