Monocular Neural Image Based Rendering with Continuous View Control


Monocular Neural Image Based Rendering with Continuous View Control (ICCV 2019)

Authors: Xu Chen, Jie Song, Otmar Hilliges
Affiliations: AIT Lab, ETH Zurich


Method

The main idea is to build a transforming autoencoder.
The goal of the transforming autoencoder is to create a point cloud of latent features from a single 2D source image.

  1. Encode the image \(I_s\) into a latent representation \(z = E_{\theta_{e}}(I_s)\).
  2. Rotate and translate the latent representation to get \(z_{T} = T_{s \to t}(z)\).
  3. Decode the latent representation into a depth map \(D_t\) for the target view.
  4. Compute correspondences between the source and target views by projecting the depth map.
    • Uses the camera intrinsics \(K\) and the relative transformation \(T_{s\to t}\) to yield a dense backward flow map \(C_{t \to s}\).
  5. Warp the source image using these correspondences to obtain the target image \(\hat{I}_{t}\).

In total, the mapping is (a code sketch of this composition follows the definitions below): \[ \begin{equation} M(I_s) = B(P_{t \to s}(D_{\theta_{d}}(T_{s \to t}(E_{\theta_{e}}(I_s)))), I_s) = \hat{I}_{t} \end{equation} \]

where:

  • \(B(F, I)\) is a bilinear warp of image \(I\) using the backward flow \(F\)
  • \(P_{t \to s}(D)\) is the projection of the depth map \(D\) from the target view \(t\) into the source view \(s\), which yields the backward flow \(C_{t \to s}\)
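
Putting the pieces together, a minimal PyTorch-style sketch of this composition could look as follows. All names here are illustrative, not taken from the authors' code; transform_points and warp_source_to_target are sketched in the subsections below, and T_t_to_s denotes the inverse of \(T_{s \to t}\).

```python
def synthesize_target_view(I_s, T_s_to_t, T_t_to_s, K, encoder, decoder):
    """Hypothetical end-to-end composition of M(I_s)."""
    z_s = encoder(I_s)                              # E: image -> latent point cloud
    z_t = transform_points(z_s, T_s_to_t)           # T_{s->t}: rotate/translate the points
    D_t = decoder(z_t)                              # D: points -> target-view depth map D_t
    return warp_source_to_target(I_s, D_t, T_t_to_s, K)  # P then B: project, then warp
```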

Transforming Auto-encoder

The latent code \(z_s\) is represented as a set of 3D points: \(z_s \in \mathbb{R}^{n \times 3}\).
This allows it, in homogeneous coordinates \(\tilde{z}_s\), to be multiplied directly by the transformation matrix \(T_{s \to t} = [R \mid t]_{s \to t}\). \[ \begin{equation} z_t = T_{s \to t} \tilde{z}_s \end{equation} \]
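
A minimal sketch of this step, assuming the latent points are a tensor of shape \((B, n, 3)\) and \(T_{s \to t}\) is supplied as a \((B, 3, 4)\) matrix \([R \mid t]\):

```python
import torch

def transform_points(z_s, T_s_to_t):
    """Rigidly transform latent points. Assumed shapes: z_s (B, n, 3),
    T_s_to_t (B, 3, 4) holding [R | t]."""
    B, n, _ = z_s.shape
    ones = torch.ones(B, n, 1, device=z_s.device, dtype=z_s.dtype)
    z_h = torch.cat([z_s, ones], dim=-1)      # homogeneous coordinates, (B, n, 4)
    return z_h @ T_s_to_t.transpose(1, 2)     # z_t = T_{s->t} z~_s, shape (B, n, 3)
```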

Depth Guided Appearance Mapping

This is just projection and warping; it involves no learned parameters. A PyTorch sketch follows the list below.

  1. For each pixel in the target image, extract an \((x,y,z)\) coordinate using the depth map.
  2. Project the \((x,y,z)\) coordinate to the source image to get UV coordinates.
  3. Do bilinear interpolation to get the pixel colors from the UV coordinates.
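
A sketch of these steps, assuming the relative pose is supplied as \(T_{t \to s}\) (the inverse of \(T_{s \to t}\)) in \((B, 3, 4)\) form and the intrinsics \(K\) as \((B, 3, 3)\). This is an illustrative implementation of inverse warping, not the authors' code.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(I_s, D_t, T_t_to_s, K):
    """Illustrative inverse warping. Assumed shapes: I_s (B,3,H,W), D_t (B,1,H,W),
    T_t_to_s (B,3,4) rigid transform from target to source, K (B,3,3) intrinsics."""
    B, _, H, W = D_t.shape
    # Target pixel grid in homogeneous coordinates, shape (B, 3, H*W)
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1)
    pix = pix.expand(B, -1, -1).to(I_s.device)
    # 1. Back-project each target pixel to a 3D point using the predicted depth
    pts_t = (torch.inverse(K) @ pix) * D_t.reshape(B, 1, -1)          # (B, 3, H*W)
    # 2. Transform the points into the source frame and project with the intrinsics
    pts_t_h = torch.cat([pts_t, torch.ones(B, 1, H * W, device=pts_t.device)], dim=1)
    pts_s = T_t_to_s @ pts_t_h                                        # (B, 3, H*W)
    uv = K @ pts_s
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                       # source UV coords
    # 3. Normalise UVs to [-1, 1] and bilinearly sample I_s (the B(F, I) step)
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_s, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```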

In the end, they train using an \(\ell_1\) reconstruction loss: \[ \begin{equation} L_{recon} = \Vert I_t - \hat{I}_t \Vert_1 \end{equation} \]

Architecture

The only neural network they use is a transforming autoencoder.
Details about their network are provided in the supplementary material as well as in the code.
Their implementation is based on Zhou et al., View Synthesis by Appearance Flow [1].

The encoder converts images into latent points.
It consists of 8 convolutional blocks which each downsample the feature map. (Note that the supplementary material says 7 but their code actually uses 8).
Each block is: conv-BatchNorm-LeakyReLU.
Each convolutional layer uses a 4x4 kernel with stride 2 and padding 1, which halves the resolution: \((x - 4 + 2)/2 + 1 = x/2\).
The final output of the convolution blocks has spatial size \(1 \times 1\) with \(2^8 = 256\) channels.
The output of the convolutional blocks is put through a fully connected layer and reshaped into a \(200 \times 3\) matrix.
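
A sketch of the encoder following these details for a 256x256 input; the channel widths are guesses, while the block count, kernel size, stride, padding, and the \(200 \times 3\) output follow the description above.

```python
import torch.nn as nn

class PointEncoder(nn.Module):
    """Encoder sketch: 8 conv blocks, then an FC layer to an n x 3 point cloud."""
    def __init__(self, n_points=200, in_ch=3):
        super().__init__()
        self.n_points = n_points
        chans = [in_ch, 32, 64, 128, 256, 256, 256, 256, 256]   # assumed channel widths
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.conv = nn.Sequential(*blocks)            # 256x256 -> 1x1 after 8 halvings
        self.fc = nn.Linear(chans[-1], n_points * 3)

    def forward(self, x):
        h = self.conv(x).flatten(1)                   # (B, 256)
        return self.fc(h).view(-1, self.n_points, 3)  # latent point cloud z_s
```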

The decoder renders the latent points into a depth map from the target view.
It consists of 8 blocks of: Upsample-ReflectionPad-Conv-BatchNorm-LeakyReLU.
The upsample layer doubles the width and height using bilinear interpolation.
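
A matching sketch of the decoder. How the \(200 \times 3\) points are injected, the inner conv kernel size, the channel widths, and the final depth head are all assumptions; only the Upsample-ReflectionPad-Conv-BatchNorm-LeakyReLU block structure follows the description.

```python
import torch.nn as nn

class DepthDecoder(nn.Module):
    """Decoder sketch: latent points -> 256x256 depth map for the target view."""
    def __init__(self, n_points=200):
        super().__init__()
        self.fc = nn.Linear(n_points * 3, 256)                 # assumed entry point
        chans = [256, 256, 256, 256, 256, 128, 64, 32, 16]     # assumed channel widths
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                       nn.ReflectionPad2d(1),
                       nn.Conv2d(c_in, c_out, kernel_size=3, stride=1),
                       nn.BatchNorm2d(c_out),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.blocks = nn.Sequential(*blocks)                   # 1x1 -> 256x256
        self.to_depth = nn.Conv2d(chans[-1], 1, kernel_size=1) # assumed depth head

    def forward(self, z_t):
        h = self.fc(z_t.flatten(1)).view(-1, 256, 1, 1)
        return self.to_depth(self.blocks(h))                   # depth map D_t
```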

  • Optimizer: Adam (a PyTorch equivalent is sketched below)
    • learning_rate=0.00006, beta_1=0.5, beta_2=0.999
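
For reference, the same settings in PyTorch; the model here simply groups the two modules sketched above for illustration.

```python
import torch

model = torch.nn.ModuleList([PointEncoder(), DepthDecoder()])   # modules sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=6e-5, betas=(0.5, 0.999))
```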

Evaluation

Their evaluation is performed on ShapeNet and KITTI.

References

  1. Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros (2016). View Synthesis by Appearance Flow. ECCV 2016. DOI: 10.1007/978-3-319-46493-0_18. arXiv: https://arxiv.org/abs/1605.03557