StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction (ECCV 2018)
Authors: Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, Shahram Izadi
Affiliations: Google
The goal is real-time stereo matching at 60 fps.
They claim high-quality, edge-preserving, quantization-free disparity maps.
Method
Their algorithm is as follows:
- Extract image features with a Siamese network at a lower resolution.
- Create a cost volume matching features along scanlines.
- Do hierarchical refinement to recover details and structures.
Differentiable arg min
They experimented with a soft arg min and a probabilistic arg min, and went with the soft arg min because it converges faster and is easier to optimize.
For a pixel \(i\), the optimal disparity is \(d_i = \arg \min_{d} C_i(d)\).
Applying the soft arg min, \(d_i = \sum_{d=1}^{D} d \cdot \frac{\exp(-C_i(d))}{\sum_{d'} \exp(-C_i(d'))}\).
Here \(d_i\) is a sum of the candidate disparities weighted by the softmax of the negated costs.
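A minimal PyTorch sketch of the soft arg min, assuming a cost volume of shape \((B, D, H, W)\) with disparities indexed from \(0\) (the equation above indexes from \(1\)); the function name is mine:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost):
    """Differentiable disparity from a cost volume of shape (B, D, H, W).

    Softmax over the negated costs gives per-pixel weights over the D
    disparity candidates; the output is the expected disparity per pixel.
    """
    weights = F.softmax(-cost, dim=1)                        # (B, D, H, W)
    disp = torch.arange(cost.size(1), dtype=cost.dtype,
                        device=cost.device).view(1, -1, 1, 1)
    return (weights * disp).sum(dim=1)                       # (B, H, W)
```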
For the probabilistic arg min, they sample the disparity in proportion to the softmax of the negated costs:
\(d_i = d\) with probability \(P(d)=\frac{\exp(-C_i(d))}{\sum_{d'} \exp(-C_i(d'))}\).
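And a sketch of the probabilistic variant under the same shape conventions, sampling one disparity per pixel from the softmax distribution:

```python
import torch
import torch.nn.functional as F

def probabilistic_argmin(cost):
    """Sample one disparity per pixel with probability softmax(-cost)."""
    b, d, h, w = cost.shape
    probs = F.softmax(-cost, dim=1)                   # (B, D, H, W)
    flat = probs.permute(0, 2, 3, 1).reshape(-1, d)   # one distribution per pixel
    samples = torch.multinomial(flat, num_samples=1)  # (B*H*W, 1)
    return samples.view(b, h, w).to(cost.dtype)
```

Note that the sampling step itself is not differentiable, which is consistent with the soft version being the easier one to optimize.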
Architecture
Feature Network
They use a Siamese CNN (i.e., weights are shared between the two inputs) for feature extraction.
The CNN starts with \(K = 3\) or \(4\) \(5 \times 5\) convolutions with stride \(2\) and \(32\) channels, downsampling the input by \(2^K\).
Then they apply \(6\) residual blocks with \(3 \times 3\) Conv2d, batch norm, and leaky ReLU (\(\alpha = 0.2\)).
Finally, they have a \(3 \times 3\) conv layer.
Their final representation has \(32\) channels.
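A sketch of the feature extractor, under my own assumptions about padding and activations between the strided convolutions; both input images pass through this same module, which is what makes it Siamese:

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """3x3 conv + batch norm + leaky ReLU with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(x + self.bn(self.conv(x)))

class FeatureNetwork(nn.Module):
    """K strided 5x5 convs (downsampling by 2**K), 6 residual blocks,
    then a final 3x3 conv; the output has 32 channels."""
    def __init__(self, k=3, channels=32):
        super().__init__()
        downsample, in_ch = [], 3
        for _ in range(k):
            downsample.append(nn.Conv2d(in_ch, channels, 5, stride=2, padding=2))
            in_ch = channels
        self.net = nn.Sequential(
            *downsample,
            *[ResBlock(channels) for _ in range(6)],
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)
```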
Cost Volume
They create the cost volume by taking the difference between the left features and the right features shifted by each candidate disparity.
The cost volume is filtered with 3 Conv3d blocks, each consisting of a \(3\times3\times3\) Conv3d, batch norm, and leaky ReLU.
They have a final \(3 \times 3 \times 3\) Conv3d layer. Each filtering layer outputs 1 channel.
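A sketch of both steps, assuming a shifted-difference construction (out-of-range pixels left at zero) and following the notes above on the 1-channel filtering layers:

```python
import torch
import torch.nn as nn

def build_cost_volume(left_feat, right_feat, max_disp):
    """Difference cost volume of shape (B, C, D, H, W).

    For candidate disparity d, pixel x on a left scanline is compared
    with pixel x - d on the matching right scanline.
    """
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = left_feat - right_feat
        else:
            volume[:, :, d, :, d:] = left_feat[..., d:] - right_feat[..., :-d]
    return volume

class CostFiltering(nn.Module):
    """3 blocks of (3x3x3 Conv3d, batch norm, leaky ReLU), then a final
    3x3x3 Conv3d; each layer outputs 1 channel, per the notes above."""
    def __init__(self, in_ch=32):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(3):
            layers += [nn.Conv3d(ch, 1, 3, padding=1),
                       nn.BatchNorm3d(1),
                       nn.LeakyReLU(0.2)]
            ch = 1
        layers.append(nn.Conv3d(ch, 1, 3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, volume):              # (B, C, D, H, W)
        return self.net(volume).squeeze(1)  # (B, D, H, W), fed to the arg min
```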
Hierarchical Refinement
They use a refinement network to upsample the coarse disparity.
First they concatenate the bilinearly upsampled disparity with the color image.
Then they pass it through a \(3\times3\) convolution producing a \(32\)-channel feature map.
Then they pass it through \(6\) residual blocks of \(3\times3\) dilated convolutions with batch norm and leaky ReLU (\(\alpha = 0.2\)).
Dilations are \(1, 2, 4, 8, 1, 1\).
Finally, a \(1 \times 1\) convolution produces the 1-channel disparity.
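A sketch of one refinement stage with the dilation schedule above; the padding choices and returning the disparity directly (rather than, say, a residual added to the input) follow my reading of these notes:

```python
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """3x3 dilated conv + batch norm + leaky ReLU with a skip connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(x + self.bn(self.conv(x)))

class RefinementNetwork(nn.Module):
    """Refine a bilinearly upsampled disparity, guided by the color image."""
    def __init__(self, channels=32):
        super().__init__()
        self.head = nn.Conv2d(1 + 3, channels, 3, padding=1)  # disparity + RGB
        self.blocks = nn.Sequential(
            *[DilatedResBlock(channels, d) for d in (1, 2, 4, 8, 1, 1)])
        self.out = nn.Conv2d(channels, 1, 1)

    def forward(self, disparity, image):
        # disparity: (B, 1, H, W) upsampled prediction; image: (B, 3, H, W)
        x = torch.cat([disparity, image], dim=1)
        return self.out(self.blocks(self.head(x)))
```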
Loss function
They supervise the network at every refinement level \(k\): \[L = \sum_{k} \sum_{i} \rho\left(d_i^k - \hat{d}_i\right)\] where:
- \(d_i^k\) is the predicted disparity at pixel \(i\) and level \(k\), \(\hat{d}_i\) is the ground truth, and \(\rho\) approximates a smoothed L1 loss.
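A sketch of the multi-level supervision, with a Charbonnier penalty standing in for \(\rho\) (the exact robust function and the handling of invalid ground-truth pixels are assumptions here):

```python
import torch

def robust_loss(pred, target, eps=1e-2):
    """Charbonnier penalty: a smooth approximation to the L1 loss."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def hierarchical_loss(predictions, target):
    """Sum the robust loss over every refinement level k.

    `predictions` holds one disparity map per level, each already brought
    to the resolution of the ground-truth `target`.
    """
    return sum(robust_loss(p, target) for p in predictions)
```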
Evaluation
They evaluated on Scene Flow, KITTI 2012, and KITTI 2015.