StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction


StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction (ECCV 2018)

Authors: Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, Shahram Izadi
Affiliations: Google

The goal is real-time stereo matching at 60 fps.
They claim high-quality, edge-preserved, quantization-free disparity maps.

Method

Their algorithm is as follows:

  • Extract image features at a lower resolution using a Siamese network.
  • Create a cost volume matching features along scanlines.
  • Do hierarchical refinement to recover details and structures.

Differentiable arg min

They experimented with soft arg min and probabilistic arg min. They ended up going with soft arg min because it converges faster and is easier to optimize.

For a pixel \(i\), the optimal disparity is \(d_i = \arg \min_{d} C_i(d)\).

Applying soft arg min: \(d_i = \sum_{d=1}^{D} d \cdot \frac{\exp(-C_i(d))}{\sum_{d'} \exp(-C_i(d'))}\).
Here \(d_i\) is a sum of the candidate disparities weighted by the softmax of the negated costs.

For probabilistic arg min, they sample the disparity proportionally to the same softmax distribution over the costs:
\(d_i = d\) with probability \(P(d)=\frac{\exp(-C_i(d))}{\sum_{d'} \exp(-C_i(d'))}\).
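Both selection rules can be sketched in a few lines of NumPy (function names here are illustrative, not from the paper). The soft arg min yields a continuous, sub-pixel disparity, which is what makes the output quantization-free:

```python
import numpy as np

def soft_argmin(costs):
    """Differentiable disparity from a per-pixel cost curve C_i(d).

    costs: shape (D,), matching cost for each candidate disparity.
    Returns a float (sub-pixel disparity), not a quantized integer.
    """
    # Softmax over negated costs: low cost -> high weight.
    # Subtracting the minimum cost keeps the exponentials numerically stable.
    w = np.exp(-(costs - costs.min()))
    w /= w.sum()
    return float(np.sum(np.arange(len(costs)) * w))

def probabilistic_argmin(costs, rng):
    """Sample a disparity d with probability proportional to exp(-C_i(d))."""
    w = np.exp(-(costs - costs.min()))
    w /= w.sum()
    return int(rng.choice(len(costs), p=w))
```

With a symmetric cost curve dipping at \(d = 2\), `soft_argmin` returns 2.0; the probabilistic variant instead draws a random integer disparity from the same distribution.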

Architecture

Feature Network

They use a Siamese (i.e., shared weights between the two input views) CNN for feature extraction.
Their CNN consists of \(K = 3\) or \(4\) \(5 \times 5\) convolutions with stride \(2\) and \(32\) channels.
Then they apply \(6\) residual blocks with \(3 \times 3\) Conv2d, batch norm, and leaky ReLU (\(\alpha = 0.2\)).
Finally, they have a \(3 \times 3\) conv layer.
Their final representation has \(32\) channels.

Cost Volume

They create a cost volume by taking the difference between features of the images.

The cost volume is filtered with \(3\) Conv3d blocks, each consisting of a \(3 \times 3 \times 3\) Conv3d, batch norm, and leaky ReLU.
They have a final \(3 \times 3 \times 3\) Conv3d layer. Each filtering layer outputs \(1\) channel.
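The difference cost volume can be sketched as follows (a minimal NumPy version; the name `cost_volume` and the zero padding for out-of-view pixels are assumptions, not details from the paper):

```python
import numpy as np

def cost_volume(left, right, max_disp):
    """Difference cost volume from Siamese feature maps.

    left, right: (C, H, W) feature maps at the coarse resolution.
    Returns (C, max_disp, H, W) with
      vol[:, d, y, x] = left[:, y, x] - right[:, y, x - d];
    pixels shifted out of view are left at zero (an assumption here).
    """
    C, H, W = left.shape
    vol = np.zeros((C, max_disp, H, W), dtype=left.dtype)
    for d in range(max_disp):
        # Shift the right features by d along the scanline (x axis).
        vol[:, d, :, d:] = left[:, :, d:] - right[:, :, :W - d]
    return vol
```

Because matching happens on the downsampled features, `max_disp` is the full disparity range divided by the downsampling factor, which keeps the volume small and the method real-time.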

Hierarchical Refinement

They use a refinement network to upsample the coarse disparity.

First they concatenate the bilinearly upsampled disparity and the color image.
Then they pass it through a \(3 \times 3\) convolution to get a \(32\)-channel representation.
Then they pass it through \(6\) residual blocks of \(3 \times 3\) dilated convolutions, batch norm, and leaky ReLU (\(\alpha = 0.2\)).
The dilation factors are \(1, 2, 4, 8, 1, 1\).
Then finally a \(1 \times 1\) convolution outputs \(1\) channel of disparity.
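The dilation schedule grows the receptive field quickly without losing resolution. Assuming one \(3 \times 3\) convolution per residual block (the per-block layout is not spelled out above), the receptive field of the stack can be computed directly:

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.

    layers: list of (kernel_size, dilation) pairs.
    Each layer grows the field by (kernel_size - 1) * dilation.
    """
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# Initial 3x3 conv, six dilated 3x3 convs (dilations 1, 2, 4, 8, 1, 1),
# and the final 1x1 output conv.
stack = [(3, 1)] + [(3, d) for d in (1, 2, 4, 8, 1, 1)] + [(1, 1)]
```

Under the one-conv-per-block assumption this gives a \(37 \times 37\) receptive field at the refinement resolution, large enough to pull edge information from the guide image around each pixel.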

Loss function

They supervise the network at every refinement level: \[L = \sum_{k} \rho\left(d_i^k - \hat{d}_i\right)\] where:

  • \(d_i^k\) is the predicted disparity at pixel \(i\) after refinement level \(k\) and \(\hat{d}_i\) is the ground-truth disparity;
  • \(\rho\) approximates a smoothed L1 loss.
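A minimal sketch of this loss, substituting the standard smoothed L1 (Huber) penalty for \(\rho\) and averaging over pixels (both choices are assumptions here, not the paper's exact robust function):

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smoothed L1 penalty: quadratic near zero, linear in the tails."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def hierarchical_loss(preds_per_level, gt):
    """Sum the mean per-pixel penalty over every refinement level k."""
    return float(sum(smooth_l1(pred - gt).mean() for pred in preds_per_level))
```

Each level's prediction is compared against the ground-truth disparity at the matching resolution (the resizing step is omitted here for brevity).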

Evaluation

They evaluated on Scene Flow, KITTI 2012, and KITTI 2015.
