Monocular Neural Image Based Rendering with Continuous View Control
Authors: Xu Chen, Jie Song, Otmar Hilliges
Affiliations: AIT Lab, ETH Zurich
Method
The main idea is to create a transforming autoencoder.
The goal of the transforming autoencoder is to create a point cloud of latent features from a 2D source image.
- Encode the image \(I_s\) into a latent representation \(z = E_{\theta_{e}}(I_s)\).
- Rotate and translate the latent representation to get \(z_{T} = T_{s \to t}(z)\).
- Decode the latent representation into a depth map for the target view \(D_t\).
- Compute correspondences between target and source by projecting through the depth map.
- This uses the camera intrinsics \(K\) and extrinsics \(T_{s\to t}\) to yield a dense backward flow map \(C_{t \to s}\).
- Warp the source image using these correspondences to get the target image \(\hat{I}_{t}\).
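The projection step above can be sketched as follows; this is a minimal NumPy version of the standard pinhole back-project/transform/re-project pipeline, not the paper's code, and the function and argument names are illustrative:

```python
import numpy as np

def backward_flow(depth_t, K, T_t_to_s):
    """Given the target-view depth map depth_t (H, W), intrinsics K (3, 3),
    and target-to-source extrinsics T_t_to_s (4, 4), compute the dense
    backward flow C_{t->s}: for each target pixel, the (x, y) location in
    the source image that corresponds to it."""
    H, W = depth_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                        # back-projected viewing rays
    X_t = rays * depth_t.reshape(1, -1)                  # 3D points in the target frame
    X_t_h = np.vstack([X_t, np.ones((1, X_t.shape[1]))])
    X_s = (T_t_to_s @ X_t_h)[:3]                         # same points in the source frame
    p_s = K @ X_s
    p_s = p_s[:2] / p_s[2:3]                             # perspective divide
    return p_s.T.reshape(H, W, 2)                        # (H, W, 2) backward flow map
```

With the identity transform, each target pixel maps back to itself, which is a quick sanity check.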
In total, the mapping is: \[ \begin{equation} M(I_s) = B(P_{t \to s}(D_{\theta_{d}}(T_{s \to t}(E_{\theta_{e}}(I_s)))), I_s) = \hat{I}_{t} \end{equation} \]
where:
- \(B(F, I)\) is a bilinear warp of image \(I\) using backwards flow \(F\)
- \(P_{t \to s}(D)\) is the projection from \(t\) to \(s\) using the depth map \(D\), yielding the backward flow \(C_{t \to s}\)
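The warping operator \(B(F, I)\) can be sketched in NumPy as plain bilinear sampling; the function name and border-clamping choice are assumptions, not taken from the paper's implementation:

```python
import numpy as np

def bilinear_warp(image, flow):
    """Backward-warp image (H, W, C) with a dense backward flow (H, W, 2)
    giving, for each target pixel, the (x, y) source coordinate to sample.
    Bilinear interpolation with coordinates clamped to the image border."""
    H, W, _ = image.shape
    x = np.clip(flow[..., 0], 0, W - 1)
    y = np.clip(flow[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.clip(x0 + 1, 0, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.clip(y0 + 1, 0, H - 1)
    wx = (x - x0)[..., None]; wy = (y - y0)[..., None]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bot = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bot
```

Because sampling is bilinear, the whole mapping stays differentiable, which is what lets the autoencoder train end-to-end from image pairs.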
Transforming Auto-encoder
The latent code \(z_s\) is represented as a set of 3D points: \(z_s \in \mathbb{R}^{n \times 3}\).
In homogeneous coordinates \(\tilde{z}_s\), it can be multiplied directly by the transformation matrix \(T_{s \to t} = [R|t]_{s \to t}\):
\[
\begin{equation}
z_t = [R|t]_{s\to t} \tilde{z}_s
\end{equation}
\]
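This transformation is a single matrix product; a small NumPy sketch under the paper's notation (the function name is illustrative):

```python
import numpy as np

def transform_points(z_s, R, t):
    """Apply the rigid transform [R|t] to a latent point set z_s of
    shape (n, 3): z_t = [R|t] z~_s, where z~_s = [z_s; 1] are the
    points in homogeneous coordinates."""
    n = z_s.shape[0]
    z_tilde = np.hstack([z_s, np.ones((n, 1))])   # (n, 4) homogeneous points
    Rt = np.hstack([R, t.reshape(3, 1)])          # (3, 4) matrix [R|t]
    return (Rt @ z_tilde.T).T                     # (n, 3) transformed points
```

For example, a 90° rotation about the z-axis sends \((1, 0, 0)\) to \((0, 1, 0)\) before the translation is added.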
Depth Guided Appearance Mapping
Architecture
The only neural network they use is a transforming autoencoder.
Details about their network are provided in the supplementary material as well as in the code.
Their encoder converts images into latent points.
It consists of 7 convolutional blocks which each downsample the feature map.
Each block is: Conv-BatchNorm-LeakyReLU.
The output of the convolutional blocks is passed through a fully connected layer and reshaped into a \(200 \times 3\) matrix.
Their decoder renders the latent points into a depth map from the target view.
It consists of 7 blocks of: Upsample-Conv-BatchNorm-LeakyReLU.
They use bilinear upsampling.
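The encoder/decoder pair described above can be sketched in PyTorch. This is a minimal reconstruction from the notes, not the authors' code: the channel widths, 128×128 input resolution, kernel sizes, and LeakyReLU slope are all assumptions (the real hyperparameters are in the paper's supplementary and code).

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Encoder block (Conv-BatchNorm-LeakyReLU); stride-2 conv halves resolution.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2),
    )

def up_block(c_in, c_out):
    # Decoder block (Upsample-Conv-BatchNorm-LeakyReLU); bilinear upsample doubles resolution.
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2),
    )

class PointEncoder(nn.Module):
    """Image -> latent point set: 7 downsampling blocks, then an FC layer
    reshaped into an (n_points, 3) point cloud."""
    def __init__(self, n_points=200):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256, 256, 256]   # assumed widths
        self.blocks = nn.Sequential(*[conv_block(chans[i], chans[i + 1]) for i in range(7)])
        self.fc = nn.Linear(256, n_points * 3)         # 128x128 input -> 1x1 feature map
        self.n_points = n_points

    def forward(self, img):                            # (B, 3, 128, 128)
        h = self.blocks(img).flatten(1)                # (B, 256)
        return self.fc(h).view(-1, self.n_points, 3)   # (B, 200, 3)

class DepthDecoder(nn.Module):
    """Transformed latent points -> target-view depth map via 7 upsampling blocks."""
    def __init__(self, n_points=200):
        super().__init__()
        self.fc = nn.Linear(n_points * 3, 256)         # points -> 1x1 feature map
        chans = [256, 256, 256, 256, 128, 64, 32, 16]
        self.blocks = nn.Sequential(*[up_block(chans[i], chans[i + 1]) for i in range(7)])
        self.out = nn.Conv2d(16, 1, 3, padding=1)      # one-channel depth map

    def forward(self, pts):                            # (B, 200, 3)
        h = self.fc(pts.flatten(1)).view(-1, 256, 1, 1)
        return self.out(self.blocks(h))                # (B, 1, 128, 128)
```

Note how the shapes line up: seven stride-2 convolutions take a 128×128 image to a 1×1 map (\(128 / 2^7 = 1\)), and seven upsampling blocks invert that on the decoder side.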