Monocular Neural Image Based Rendering with Continuous View Control


Authors: Xu Chen, Jie Song, Otmar Hilliges
Affiliations: AIT Lab, ETH Zurich


Method

The main idea is to create a transforming autoencoder.
The goal of the transforming autoencoder is to create a point cloud of latent features from a 2D source image.

  1. Encode the image \(I_s\) into a latent representation \(z = E_{\theta_{e}}(I_s)\).
  2. Rotate and translate the latent representation to get \(z_{T} = T_{s \to t}(z)\).
  3. Decode the latent representation into a depth map for the target view \(D_t\).
  4. Compute correspondences between the source and target views by projecting the depth map.
    • Uses camera intrinsics \(K\) and extrinsics \(T_{s\to t}\) to yield a dense backward flow map \(C_{t \to s}\).
  5. Do warping using correspondences to get the target image \(\hat{I}_{t}\).
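The geometric core of step 4 can be sketched as follows: for each target pixel, backproject it to 3D using the decoded depth map, transform it into the source camera frame, and reproject it to get the corresponding source-image coordinates. This is a minimal numpy sketch; the function name and exact conventions (pixel-coordinate origin, pose direction) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def backward_flow(depth_t, K, T_t_to_s):
    """Dense backward flow C_{t->s}: for each target pixel, the (x, y)
    location of its correspondence in the source image.

    depth_t:  (H, W) depth map decoded for the target view
    K:        (3, 3) camera intrinsics
    T_t_to_s: (4, 4) relative pose from target to source
    """
    H, W = depth_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Homogeneous pixel coordinates, shape (3, H*W).
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)
    # Backproject target pixels to 3D camera coordinates using the depth map.
    cam = np.linalg.inv(K) @ pix * depth_t.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Transform into the source camera frame and project with K.
    src = K @ (T_t_to_s @ cam_h)[:3]
    # Perspective divide gives source-image (x, y) per target pixel.
    return (src[:2] / src[2:3]).reshape(2, H, W)
```

With the identity transform, every pixel maps to itself, which is a useful sanity check for the coordinate conventions.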

In total, the mapping is: \[ \begin{equation} M(I_s) = B(P_{t \to s}(D_{\theta_{d}}(T_{s \to t}(E_{\theta_{e}}(I_s)))), I_s) = \hat{I}_{t} \end{equation} \]

where:

  • \(B(F, I)\) is a bilinear warp of image \(I\) using backwards flow \(F\)
  • \(P_{t \to s}(D)\) is the projection of the target-view depth map \(D\) from view \(t\) to view \(s\), yielding the backward flow \(C_{t \to s}\)
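The bilinear warp \(B(F, I)\) samples the source image at the real-valued flow coordinates by blending the four nearest pixels. A minimal grayscale numpy sketch (names and the clamping-at-the-border choice are illustrative assumptions):

```python
import numpy as np

def bilinear_warp(flow, image):
    """B(F, I): sample image I at the (x, y) locations in backward flow F.

    flow:  (2, H, W) source-image coordinates for each target pixel
    image: (H, W) grayscale source image
    """
    H, W = image.shape
    x = np.clip(flow[0], 0, W - 1)
    y = np.clip(flow[1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Blend the four neighbouring pixels bilinearly.
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bot = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bot
```

Because the sampling weights are piecewise-linear in the flow, the warp is differentiable, which is what lets the whole pipeline be trained end-to-end.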

Transforming Auto-encoder

The latent code \(z_s\) is represented as a set of 3D points: \(z_s \in \mathbb{R}^{n \times 3}\).
Converting \(z_s\) to homogeneous coordinates \(\tilde{z}_s\) allows it to be multiplied by the transformation matrix \(T_{s \to t} = [R | t]_{s \to t}\): \[ \begin{equation} z_t = [R|t]_{s\to t} \tilde{z}_s \end{equation} \]
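The transformation step above is a single matrix product against the \(n \times 3\) point set. A minimal numpy sketch (the helper name is illustrative):

```python
import numpy as np

def transform_latent_points(z_s, R, t):
    """Apply z_t = [R|t] * homogeneous(z_s) to an (n, 3) latent point set."""
    z_h = np.hstack([z_s, np.ones((z_s.shape[0], 1))])  # (n, 4) homogeneous
    Rt = np.hstack([R, t.reshape(3, 1)])                # (3, 4) matrix [R|t]
    return z_h @ Rt.T                                   # (n, 3) transformed points
```

This is equivalent to `z_s @ R.T + t`; the homogeneous form simply mirrors the equation as written.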

Depth Guided Appearance Mapping

Architecture

The only neural network they use is the transforming autoencoder.
Details about the network are provided in the supplementary material as well as in the code.
Their implementation is based on Zhou et al.'s View Synthesis by Appearance Flow[1].

The encoder converts images into latent points.
It consists of 7 convolutional blocks which each downsample the feature map.
Each block is: Conv-BatchNorm-LeakyReLU.
The output of the convolutional blocks is passed through a fully connected layer and reshaped into a \(200 \times 3\) matrix.

The decoder renders the latent points into a depth map from the target view.
It consists of 7 blocks of: Upsample-Conv-BatchNorm-LeakyReLU.
They use bilinear upsampling.
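The encoder and decoder described above can be sketched in PyTorch. This is an illustrative reconstruction under stated assumptions: the channel widths, a \(128 \times 128\) input (so seven stride-2 blocks reach \(1 \times 1\)), and the LeakyReLU slope are my choices, not the paper's exact hyperparameters; only the block structure (7 × Conv-BatchNorm-LeakyReLU down, 7 × bilinear-Upsample-Conv up, 200 × 3 latent points) follows the text.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Encoder block: Conv-BatchNorm-LeakyReLU, downsampling by stride 2.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2),
    )

def up_block(c_in, c_out):
    # Decoder block: bilinear Upsample-Conv-BatchNorm-LeakyReLU.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2),
    )

class Encoder(nn.Module):
    """Image -> (200, 3) latent point set via 7 downsampling blocks + FC."""
    def __init__(self, n_points=200):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256, 256, 256]  # illustrative widths
        self.blocks = nn.Sequential(*[conv_block(a, b) for a, b in zip(chans, chans[1:])])
        self.fc = nn.Linear(chans[-1], n_points * 3)  # 128 / 2^7 = 1x1 spatial
        self.n_points = n_points

    def forward(self, x):
        h = self.blocks(x).flatten(1)
        return self.fc(h).view(-1, self.n_points, 3)

class Decoder(nn.Module):
    """Latent points -> depth map from the target view via 7 upsampling steps."""
    def __init__(self, n_points=200):
        super().__init__()
        chans = [256, 256, 256, 256, 128, 64, 32]
        self.fc = nn.Linear(n_points * 3, chans[0])
        self.blocks = nn.Sequential(*[up_block(a, b) for a, b in zip(chans, chans[1:])])
        # Final (7th) upsample followed by a conv producing a 1-channel depth map.
        self.out = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(chans[-1], 1, kernel_size=3, padding=1),
        )

    def forward(self, z):
        h = self.fc(z.flatten(1)).view(-1, 256, 1, 1)
        return self.out(self.blocks(h))  # (B, 1, 128, 128)
```

With this layout, `Encoder` maps a `(B, 3, 128, 128)` image to a `(B, 200, 3)` point set, and `Decoder` maps the (transformed) points back to a `(B, 1, 128, 128)` depth map.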

Evaluation

Their evaluation is performed on ShapeNet and KITTI.

References

  1. Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View Synthesis by Appearance Flow. ECCV 2016. DOI: 10.1007/978-3-319-46493-0_18. arXiv: https://arxiv.org/abs/1605.03557