StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction (ECCV 2018)
Authors: Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, Shahram Izadi
Affiliations: Google
The goal is real-time stereo matching at 60 fps.
They claim high-quality, edge-preserving, quantization-free disparity maps.
Method
Their algorithm is as follows:
- Extract image features with a Siamese network at a lower resolution.
- Create a cost volume matching features along scanlines.
- Do hierarchical refinement to recover details and structures.
Differentiable arg min
They experimented with a soft arg min and a probabilistic arg min, and went with the soft arg min because it converges faster and is easier to optimize.
For a pixel \(i\), the optimal disparity is \(d_i = \arg \min_{d} C_i(d)\).
Applying the soft arg min, \(d_i = \sum_{d=1}^{D} d \cdot \frac{\exp(-C_i(d))}{\sum_{d'} \exp(-C_i(d'))}\).
Here \(d_i\) is a sum of the candidate disparities weighted by the softmax of the negated costs.
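A minimal PyTorch sketch of the soft arg min, assuming a cost volume of shape \((B, D, H, W)\) with disparities indexed from \(0\) (the equation above indexes from \(1\)); the function name is mine:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost):
    """Differentiable disparity from a cost volume of shape (B, D, H, W).

    Softmax over the negated costs gives per-pixel weights over the D
    disparity candidates; the output is the expected disparity per pixel.
    """
    weights = F.softmax(-cost, dim=1)                        # (B, D, H, W)
    disp = torch.arange(cost.size(1), dtype=cost.dtype,
                        device=cost.device).view(1, -1, 1, 1)
    return (weights * disp).sum(dim=1)                       # (B, H, W)
```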
For the probabilistic arg min, they sample the disparity in proportion to the softmax of the negated costs:
\(d_i = d\) with probability \(P(d)=\frac{\exp(-C_i(d))}{\sum_{d'} \exp(-C_i(d'))}\).
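And a sketch of the probabilistic variant under the same shape conventions, sampling one disparity per pixel from the softmax distribution:

```python
import torch
import torch.nn.functional as F

def probabilistic_argmin(cost):
    """Sample one disparity per pixel with probability softmax(-cost)."""
    b, d, h, w = cost.shape
    probs = F.softmax(-cost, dim=1)                   # (B, D, H, W)
    flat = probs.permute(0, 2, 3, 1).reshape(-1, d)   # one distribution per pixel
    samples = torch.multinomial(flat, num_samples=1)  # (B*H*W, 1)
    return samples.view(b, h, w).to(cost.dtype)
```

Note that the sampling step itself is not differentiable, which is consistent with the soft version being the easier one to optimize.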
Architecture
Feature Network
They use a Siamese CNN (i.e., weights are shared between the two inputs) for feature extraction.
The CNN starts with \(K = 3\) or \(4\) \(5 \times 5\) convolutions with stride \(2\) and \(32\) channels, downsampling the input by \(2^K\).
Then they apply \(6\) residual blocks with \(3 \times 3\) Conv2d, batch norm, and leaky ReLU (\(\alpha = 0.2\)).
Finally, they have a \(3 \times 3\) conv layer.
Their final representation has \(32\) channels.
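A sketch of the feature extractor, under my own assumptions about padding and activations between the strided convolutions; both input images pass through this same module, which is what makes it Siamese:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """3x3 conv + batch norm + leaky ReLU with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(x + self.bn(self.conv(x)))

class FeatureNetwork(nn.Module):
    """K strided 5x5 convs (downsampling by 2**K), 6 residual blocks,
    then a final 3x3 conv; the output has 32 channels."""
    def __init__(self, k=3, channels=32):
        super().__init__()
        downsample, in_ch = [], 3
        for _ in range(k):
            downsample.append(nn.Conv2d(in_ch, channels, 5, stride=2, padding=2))
            in_ch = channels
        self.net = nn.Sequential(
            *downsample,
            *[ResBlock(channels) for _ in range(6)],
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)
```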
Cost Volume
They create the cost volume by taking the difference between the left features and the right features shifted by each candidate disparity.
The cost volume is filtered with 3 Conv3d blocks, each consisting of a \(3\times3\times3\) Conv3d, batch norm, and leaky ReLU.
They have a final \(3 \times 3 \times 3\) Conv3d layer. Each filtering layer outputs 1 channel.
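A sketch of both steps, assuming a shifted-difference construction (out-of-range pixels left at zero) and following the notes above on the 1-channel filtering layers:

```python
import torch
import torch.nn as nn

def build_cost_volume(left_feat, right_feat, max_disp):
    """Difference cost volume of shape (B, C, D, H, W).

    For candidate disparity d, pixel x on a left scanline is compared
    with pixel x - d on the matching right scanline.
    """
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = left_feat - right_feat
        else:
            volume[:, :, d, :, d:] = left_feat[..., d:] - right_feat[..., :-d]
    return volume

class CostFiltering(nn.Module):
    """3 blocks of (3x3x3 Conv3d, batch norm, leaky ReLU), then a final
    3x3x3 Conv3d; each layer outputs 1 channel, per the notes above."""
    def __init__(self, in_ch=32):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(3):
            layers += [nn.Conv3d(ch, 1, 3, padding=1),
                       nn.BatchNorm3d(1),
                       nn.LeakyReLU(0.2)]
            ch = 1
        layers.append(nn.Conv3d(ch, 1, 3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, volume):              # (B, C, D, H, W)
        return self.net(volume).squeeze(1)  # (B, D, H, W), fed to the arg min
```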
Hierarchical Refinement
They use a refinement network to upsample the coarse disparity.
First they concatenate the bilinearly upsampled disparity with the color image.
Then they pass it through a \(3\times3\) convolution producing a \(32\)-channel feature map.
Then they pass it through \(6\) residual blocks of \(3\times3\) dilated convolutions with batch norm and leaky ReLU (\(\alpha = 0.2\)).
Dilations are \(1, 2, 4, 8, 1, 1\).
Finally, a \(1 \times 1\) convolution produces the 1-channel disparity.
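A sketch of one refinement stage with the dilation schedule above; the padding choices and returning the disparity directly (rather than, say, a residual added to the input) follow my reading of these notes:

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """3x3 dilated conv + batch norm + leaky ReLU with a skip connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(x + self.bn(self.conv(x)))

class RefinementNetwork(nn.Module):
    """Refine a bilinearly upsampled disparity, guided by the color image."""
    def __init__(self, channels=32):
        super().__init__()
        self.head = nn.Conv2d(1 + 3, channels, 3, padding=1)  # disparity + RGB
        self.blocks = nn.Sequential(
            *[DilatedResBlock(channels, d) for d in (1, 2, 4, 8, 1, 1)])
        self.out = nn.Conv2d(channels, 1, 1)

    def forward(self, disparity, image):
        # disparity: (B, 1, H, W) upsampled prediction; image: (B, 3, H, W)
        x = torch.cat([disparity, image], dim=1)
        return self.out(self.blocks(self.head(x)))
```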
Loss function
They supervise the network at every refinement level \(k\): \[L = \sum_{k} \sum_{i} \rho\left(d_i^k - \hat{d}_i\right)\] where:
- \(d_i^k\) is the predicted disparity at pixel \(i\) and level \(k\), \(\hat{d}_i\) is the ground truth, and \(\rho\) approximates a smoothed L1 loss.
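A sketch of the multi-level supervision, with a Charbonnier penalty standing in for \(\rho\) (the exact robust function and the handling of invalid ground-truth pixels are assumptions here):

```python
import torch

def robust_loss(pred, target, eps=1e-2):
    """Charbonnier penalty: a smooth approximation to the L1 loss."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def hierarchical_loss(predictions, target):
    """Sum the robust loss over every refinement level k.

    `predictions` holds one disparity map per level, each already brought
    to the resolution of the ground-truth `target`.
    """
    return sum(robust_loss(p, target) for p in predictions)
```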
Evaluation
They evaluated on Scene Flow, KITTI 2012, and KITTI 2015.