Monocular Neural Image Based Rendering with Continuous View Control

Authors: Xu Chen, Jie Song, Otmar Hilliges
Affiliations: AIT Lab, ETH Zurich

* [https://arxiv.org/abs/1901.01880 Arxiv mirror] [http://openaccess.thecvf.com/content_ICCV_2019/html/Chen_Monocular_Neural_Image_Based_Rendering_With_Continuous_View_Control_ICCV_2019_paper.html CVF Mirror] [https://ieeexplore.ieee.org/document/9008541 IEEE Xplore]
* [http://openaccess.thecvf.com/content_ICCV_2019/supplemental/Chen_Monocular_Neural_Image_ICCV_2019_supplemental.pdf Supp]
* [https://github.com/xuchen-ethz/continuous_view_synthesis Github]

Method

The main idea is to build a transforming autoencoder.
The transforming autoencoder encodes a 2D source image into a point cloud of latent features, which can be explicitly rotated and translated before decoding.

  1. Encode the image \(I_s\) into a latent representation \(z = E_{\theta_{e}}(I_s)\).
  2. Rotate and translate the latent representation to get \(z_{T} = T_{s \to t}(z)\).
  3. Decode the latent representation into a depth map for the target view \(D_t\).
  4. Compute correspondences between the source and target views by projecting the depth map.
    • Uses the camera intrinsics \(K\) and the relative pose \(T_{s\to t}\) to yield a dense backward flow map \(C_{t \to s}\).
  5. Bilinearly warp the source image using these correspondences to get the target image \(\hat{I}_{t}\).

In total, the mapping is: \[ M(I_s) = B(P_{t \to s}(D_{\theta_{d}}(T_{s \to t}(E_{\theta_{e}}(I_s)))), I_s) = \hat{I}_{t} \]

where:

  • \(B(F, I)\) is a bilinear warp of image \(I\) using backward flow \(F\)
  • \(P_{t \to s}(D)\) is the projection that converts the target-view depth map \(D\) into the backward flow \(C_{t \to s}\), using the intrinsics \(K\) and the relative pose between \(t\) and \(s\)
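
Below is a minimal PyTorch sketch of steps 4–5 (the projection \(P_{t \to s}\) and the bilinear warp \(B\)). It is not the repository's code: the function names, tensor shapes, and the use of the target-to-source pose (the inverse of \(T_{s \to t}\)) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def backward_flow(depth_t, K, T_t_to_s):
    """P_{t->s}: turn the decoded target-view depth map into a dense
    backward flow / sampling grid C_{t->s}.

    depth_t:  (B, 1, H, W) predicted depth of the target view
    K:        (3, 3) camera intrinsics
    T_t_to_s: (B, 3, 4) relative pose [R|t] from target to source
              (the inverse of T_{s->t})
    Returns:  (B, H, W, 2) sampling grid in normalized [-1, 1] coordinates
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device

    # Target pixel grid in homogeneous coordinates, shape (3, H*W)
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32, device=device),
        torch.arange(W, dtype=torch.float32, device=device),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # Back-project target pixels to 3D camera coordinates: X = D_t * K^{-1} p
    cam = torch.linalg.inv(K) @ pix                       # (3, H*W)
    cam = cam.unsqueeze(0) * depth_t.reshape(B, 1, -1)    # (B, 3, H*W)

    # Transform the points into the source frame and project with K
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    src = K @ (T_t_to_s @ cam_h)                          # (B, 3, H*W)
    src_xy = src[:, :2] / src[:, 2:3].clamp(min=1e-6)     # perspective divide

    # Normalize pixel coordinates to [-1, 1] for grid_sample
    gx = 2.0 * src_xy[:, 0] / (W - 1) - 1.0
    gy = 2.0 * src_xy[:, 1] / (H - 1) - 1.0
    return torch.stack([gx, gy], dim=-1).reshape(B, H, W, 2)

def bilinear_warp(I_s, grid):
    """B(F, I): bilinearly sample the source image at the projected coordinates."""
    return F.grid_sample(I_s, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

Because grid_sample is differentiable with respect to both the image and the sampling grid, gradients from a reconstruction loss on \(\hat{I}_t\) can flow back through the projection into the depth decoder and the encoder.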

Transforming Auto-encoder

The latent code \(z_s\) is represented as a set of 3D points: \(z_s \in \mathbb{R}^{n \times 3}\).
Converting the points to homogeneous coordinates \(\tilde{z}_s\) lets the transformation matrix \(T_{s \to t} = [R|t]_{s \to t}\) be applied directly: \[ z_t = [R|t]_{s\to t} \tilde{z}_s \]
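
A small sketch of this latent transformation, assuming the latent code is stored as a \((B, n, 3)\) tensor and the pose as a \((B, 3, 4)\) matrix; the names and shapes are illustrative, not taken from the repository.

```python
import torch

def transform_latent(z_s, T_s_to_t):
    """Apply the rigid-body transform [R|t]_{s->t} to the latent point cloud.

    z_s:      (B, n, 3) latent 3D points encoded from the source view
    T_s_to_t: (B, 3, 4) transform [R|t] from source to target
    Returns:  (B, n, 3) transformed points z_t
    """
    B, n, _ = z_s.shape
    ones = torch.ones(B, n, 1, device=z_s.device, dtype=z_s.dtype)
    z_tilde = torch.cat([z_s, ones], dim=-1)      # homogeneous coordinates (B, n, 4)
    return z_tilde @ T_s_to_t.transpose(1, 2)     # (B, n, 4) @ (B, 4, 3) -> (B, n, 3)
```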

Depth Guided Appearance Mapping

Architecture

The only neural network is the transforming autoencoder (the encoder \(E_{\theta_e}\) and the depth decoder \(D_{\theta_d}\)); the projection and bilinear warping steps are differentiable but contain no learned parameters.
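
Putting the pieces together, a rough sketch of the full mapping \(M\), reusing the helpers sketched above. The encoder and depth decoder are placeholders here (any networks mapping an image to \(n \times 3\) points and points to a depth map), not the architecture from the paper.

```python
def render_novel_view(I_s, K, T_s_to_t, T_t_to_s, encoder, depth_decoder):
    """M(I_s) = B(P_{t->s}(D(T_{s->t}(E(I_s)))), I_s), following the mapping above.

    encoder:       learned, image (B, 3, H, W) -> latent points (B, n, 3)
    depth_decoder: learned, latent points (B, n, 3) -> depth map (B, 1, H, W)
    """
    z_s = encoder(I_s)                         # 1. encode the source image
    z_t = transform_latent(z_s, T_s_to_t)      # 2. rotate/translate the latent points
    D_t = depth_decoder(z_t)                   # 3. decode a target-view depth map
    grid = backward_flow(D_t, K, T_t_to_s)     # 4. project depth to correspondences
    return bilinear_warp(I_s, grid)            # 5. warp the source image
```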


Evaluation

References