Learning Independent Object Motion from Unlabelled Stereoscopic Videos

Learning Independent Object Motion from Unlabelled Stereoscopic Videos (CVPR 2019)

Authors: Zhe Cao, Abhishek Kar, Christian Haene, Jitendra Malik
Affiliations: UC Berkeley, Fyusion Inc, Google

Method

Key Contributions
  • Learning with limited supervision
  • Factoring the scene into independently moving objects (main idea of the paper)
  • Designing a network architecture using plane sweep volumes
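
These notes do not describe how the plane sweep volumes are built, so the following is only a generic sketch for a rectified stereo pair, where each fronto-parallel depth plane corresponds to one constant disparity. It compares raw intensities at integer disparities; a learned network would typically sweep feature maps instead. The function name `plane_sweep_volume` and the toy data are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def plane_sweep_volume(left, right, disparities):
    """Build a (D, H, W) cost volume for a rectified stereo pair.

    left, right: (H, W) float arrays.
    disparities: sequence of non-negative integer disparities, one per plane.
    Pixels whose match would fall outside the right image get cost +inf.
    """
    disparities = list(disparities)
    h, w = left.shape
    volume = np.full((len(disparities), h, w), np.inf)
    for i, d in enumerate(disparities):
        # A left pixel at column x corresponds to right column x - d.
        if d == 0:
            volume[i] = np.abs(left - right)
        else:
            volume[i, :, d:] = np.abs(left[:, d:] - right[:, :-d])
    return volume

# Toy pair: the left view is the right view shifted by 3 pixels.
rng = np.random.default_rng(0)
right_img = rng.random((4, 16))
left_img = np.roll(right_img, 3, axis=1)

volume = plane_sweep_volume(left_img, right_img, range(6))
best_disparity = np.argmin(volume, axis=0)  # winner-take-all plane per pixel
```

Taking the arg-min over planes recovers the 3-pixel shift wherever the match is in view; columns near the left border have no valid match at that disparity.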
Inputs
  • Image pairs \(\{(I_1^l, I_1^r),..., (I_n^l, I_n^r)\}\) from unlabelled stereo videos
  • Object bounding boxes \(B = \{B^1,..., B^j\}\) on the left image \(I_t^l\) from off-the-shelf object detectors
Goal/Outputs
  • Dense depth map \(D\)
  • 3D flow fields \(F = \{F^1,..., F^j\}\)
  • Instance masks \(M=\{M^1,..., M^j\}\)
  • For each region of interest (RoI), predict a per-object flow map using an RCNN
    • Also predict an object mask for each RoI
  • Construct a full 3D scene flow map using the per-object flow maps.
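
The last step above, assembling one scene flow map from per-object predictions, can be sketched as a mask-weighted paste of each RoI's flow into the full image. This is a minimal illustration under stated assumptions: the RoI outputs are already resized to their box size, boxes are integer pixel coordinates with exclusive upper bounds, and a background flow map (for example from camera motion) is given. The notes do not say how the paper handles any of these, and `compose_scene_flow` is a hypothetical helper.

```python
import numpy as np

def compose_scene_flow(background_flow, boxes, roi_flows, roi_masks):
    """Paste per-object 3D flow into a full-image flow map.

    background_flow: (H, W, 3) flow for pixels not covered by any object.
    boxes: list of (x0, y0, x1, y1) integer boxes, upper bounds exclusive.
    roi_flows: list of (h, w, 3) arrays, one per box, matching the box size.
    roi_masks: list of (h, w) soft masks in [0, 1], one per box.
    """
    full = background_flow.copy()
    for (x0, y0, x1, y1), flow, mask in zip(boxes, roi_flows, roi_masks):
        m = mask[..., None]  # broadcast the mask over the 3 flow channels
        region = full[y0:y1, x0:x1]
        # Inside the mask use the object's flow, elsewhere keep what was there.
        full[y0:y1, x0:x1] = m * flow + (1.0 - m) * region
    return full

# Toy example: one object in a 4x6 image with a static background.
background = np.zeros((4, 6, 3))
boxes = [(1, 1, 4, 3)]                       # 3 pixels wide, 2 pixels tall
roi_flows = [np.full((2, 3, 3), 2.0)]
roi_masks = [np.array([[1.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0]])]
scene_flow = compose_scene_flow(background, boxes, roi_flows, roi_masks)
```

With overlapping boxes, later objects overwrite earlier ones in this sketch; a real system would need an explicit rule for that case.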

Self Supervision and Loss Functions

  • View Synthesis
  • Geometric consistency: The depth values of the warped image and the reference image should match
  • Left-right consistency \(L^{lr}\)
  • RoI loss \(L^{roi}\)
  • Full-image loss \(L^{t}\)
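
The notes give no formulas for these terms. As an illustration of the view synthesis idea only, the sketch below warps the right image into the left view using a disparity map and measures the mean absolute photometric error over pixels that stay in view. This is a common formulation in self-supervised stereo and is not claimed to be the paper's exact loss; `warp_right_to_left` and `view_synthesis_loss` are hypothetical names.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Resample the right image at column x - disparity (linear interpolation).

    right: (H, W) image; disparity: (H, W) non-negative disparities in pixels.
    Returns the warped image and a boolean mask of in-view pixels.
    """
    h, w = right.shape
    xs = np.arange(w)[None, :] - disparity       # source column per pixel
    valid = (xs >= 0) & (xs <= w - 1)
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    warped = (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]
    return warped, valid

def view_synthesis_loss(left, right, disparity):
    """Mean absolute difference between the left image and the warped right."""
    warped, valid = warp_right_to_left(right, disparity)
    return np.abs(left - warped)[valid].mean()

# Toy pair with a known 3-pixel disparity everywhere.
rng = np.random.default_rng(0)
right_view = rng.random((4, 16))
left_view = np.roll(right_view, 3, axis=1)

loss_correct = view_synthesis_loss(left_view, right_view, np.full((4, 16), 3.0))
loss_wrong = view_synthesis_loss(left_view, right_view, np.full((4, 16), 1.0))
```

The correct disparity reproduces the left image exactly on in-view pixels, so its loss is zero, while a wrong disparity leaves a positive residual. That residual is the training signal a view synthesis loss provides without depth labels.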

Architecture

Evaluation
