Monocular Neural Image Based Rendering with Continuous View Control (ICCV 2019)
Authors: Xu Chen, Jie Song, Otmar Hilliges
Affiliations: AIT Lab, ETH Zurich
- Arxiv mirror: https://arxiv.org/abs/1901.01880
- CVF mirror: http://openaccess.thecvf.com/content_ICCV_2019/html/Chen_Monocular_Neural_Image_Based_Rendering_With_Continuous_View_Control_ICCV_2019_paper.html
- IEEE Xplore: https://ieeexplore.ieee.org/document/9008541
- Supplementary: http://openaccess.thecvf.com/content_ICCV_2019/supplemental/Chen_Monocular_Neural_Image_ICCV_2019_supplemental.pdf
- GitHub: https://github.com/xuchen-ethz/continuous_view_synthesis
Method
The main idea is to create a transforming autoencoder.
The goal of the transforming autoencoder is to create a point cloud of latent features from a 2D source image.
1. Encode the image \(I_s\) into a latent representation \(z_s = E_{\theta_{e}}(I_s)\).
2. Rotate and translate the latent representation to get \(z_{t} = T_{s \to t}(z_s)\).
3. Decode the latent representation into a depth map for the target view, \(D_t\).
4. Compute correspondences between source and target by projecting the depth map:
   - this uses the camera intrinsics \(K\) and the extrinsics \(T_{s\to t}\), and yields a dense backward flow map \(C_{t \to s}\).
5. Warp the source image using these correspondences to get the target image \(\hat{I}_{t}\).
In total, the mapping is (see the code sketch below):
\[
\begin{equation}
M(I_s) = B(P_{t \to s}(D_{\theta_{d}}(T_{s \to t}(E_{\theta_{e}}(I_s)))), I_s) = \hat{I}_{t}
\end{equation}
\]
where:
- \(B(F, I)\) is a bilinear warp of image \(I\) using the backward flow \(F\)
- \(P_{t \to s}(I)\) is the projection of \(I\) from the target view \(t\) to the source view \(s\)
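To make the composition concrete, here is a minimal, framework-agnostic sketch of the mapping \(M\), assuming each of the five stages is available as a callable; the function and argument names are illustrative, not the authors' API.

# Composition of the five stages above into the mapping M(I_s) -> I_t_hat.
# Each stage is passed in as a callable; shapes in the comments assume batched
# image tensors, but the composition itself is framework-agnostic.
def render_target_view(I_s, T_s_to_t, K, encode, transform, decode, project, warp):
    z_s = encode(I_s)                      # latent 3D point cloud z_s = E(I_s), e.g. (B, n, 3)
    z_t = transform(z_s, T_s_to_t)         # rotate/translate the latent points
    D_t = decode(z_t)                      # depth map in the target view, e.g. (B, 1, H, W)
    C_t_to_s = project(D_t, K, T_s_to_t)   # dense backward flow / correspondences C_{t->s}
    return warp(C_t_to_s, I_s)             # bilinear warp B(F, I) -> estimated target image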
Transforming Auto-encoder
The latent code \(z_s\) is represented as a set of 3D points: \(z_s \in \mathbb{R}^{n \times 3}\).
Writing the points in homogeneous coordinates as \(\tilde{z}_s\), they can be multiplied directly by the transformation matrix \(T_{s \to t} = [R \mid t]_{s \to t}\):
\[
\begin{equation}
z_t = T_{s \to t} \tilde{z}_s
\end{equation}
\]
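A small PyTorch sketch of this transformation, assuming the latent code is stored as a batched \((B, n, 3)\) tensor and the relative pose as a \(4 \times 4\) rigid transform; the names are illustrative rather than taken from the released code.

import torch

def transform_latent_points(z_s: torch.Tensor, T_s_to_t: torch.Tensor) -> torch.Tensor:
    """z_s: (B, n, 3) latent 3D points; T_s_to_t: (B, 4, 4) rigid transform [R | t]."""
    B, n, _ = z_s.shape
    ones = torch.ones(B, n, 1, dtype=z_s.dtype, device=z_s.device)
    z_h = torch.cat([z_s, ones], dim=-1)                # homogeneous coordinates, (B, n, 4)
    z_t = torch.einsum('bij,bnj->bni', T_s_to_t, z_h)   # apply [R | t] to every point
    return z_t[..., :3]                                 # back to Euclidean coordinates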
Depth Guided Appearance Mapping
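A minimal PyTorch sketch of steps 4 and 5 above: back-project the decoded target-view depth with the intrinsics \(K\), move the points into the source frame with the inverse pose \(T_{t \to s}\), re-project them to obtain the dense backward flow \(C_{t \to s}\), and bilinearly sample the source image. It assumes a pinhole camera model and is an illustration of the idea, not the authors' implementation.

import torch
import torch.nn.functional as F

def depth_guided_warp(I_s, D_t, K, T_t_to_s):
    """I_s: (B, 3, H, W) source image; D_t: (B, 1, H, W) target-view depth;
    K: (B, 3, 3) intrinsics; T_t_to_s: (B, 4, 4) target-to-source pose."""
    B, _, H, W = D_t.shape
    dev, dt = D_t.device, D_t.dtype
    # pixel grid of the target view in homogeneous coordinates
    v, u = torch.meshgrid(torch.arange(H, device=dev, dtype=dt),
                          torch.arange(W, device=dev, dtype=dt), indexing='ij')
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1)    # (1, 3, H*W)
    # back-project target pixels to 3D points in the target camera frame
    cam_t = torch.inverse(K) @ pix.expand(B, -1, -1) * D_t.reshape(B, 1, -1)  # (B, 3, H*W)
    # move the points into the source camera frame (homogeneous coordinates)
    cam_t_h = torch.cat([cam_t, torch.ones(B, 1, H * W, device=dev, dtype=dt)], dim=1)
    cam_s = (T_t_to_s @ cam_t_h)[:, :3]
    # project into the source image plane -> dense backward flow C_{t->s}
    proj = K @ cam_s
    xy = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # normalise to [-1, 1] and bilinearly warp the source image (B(F, I) above)
    grid = torch.stack([2 * xy[:, 0] / (W - 1) - 1,
                        2 * xy[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(I_s, grid, mode='bilinear', align_corners=True)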
Architecture
The only neural network they use is a transforming autoencoder.
Details about the network are provided in the supplementary material as well as in the code.
Their implementation is based on View Synthesis by Appearance Flow by Zhou et al.[1]
The encoder converts images into latent points.
It consists of 7 convolutional blocks which each downsample the feature map.
Each block is: Conv-BatchNorm-LeakyReLU.
The output of the convolutional blocks is passed through a fully connected layer and reshaped into a \(200 \times 3\) matrix of latent points.
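For concreteness, an illustrative PyTorch encoder matching this description (7 strided Conv-BatchNorm-LeakyReLU blocks, then a fully connected layer reshaped to \(200 \times 3\)); the channel widths and the \(128 \times 128\) input resolution are assumptions, not values from the paper.

import torch.nn as nn

class PointEncoder(nn.Module):
    def __init__(self, in_ch=3, base=32, n_points=200):
        super().__init__()
        blocks, ch = [], in_ch
        for i in range(7):  # each block halves the spatial resolution
            out = min(base * 2 ** i, 512)
            blocks += [nn.Conv2d(ch, out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(out),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out
        self.conv = nn.Sequential(*blocks)
        self.fc = nn.Linear(ch, n_points * 3)  # assumes a 1x1 map after 7 downsamplings of a 128x128 input
        self.n_points = n_points

    def forward(self, x):                                  # x: (B, 3, 128, 128)
        h = self.conv(x).flatten(1)                        # (B, 512)
        return self.fc(h).view(-1, self.n_points, 3)       # latent 3D points, (B, 200, 3)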
The decoder renders the latent points into a depth map from the target view.
It consists of 7 blocks of: Upsample-Conv-BatchNorm-LeakyReLU.
They use bilinear upsampling.
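A matching illustrative decoder (a fully connected layer followed by 7 Upsample-Conv-BatchNorm-LeakyReLU blocks with bilinear upsampling, ending in a single-channel depth map); again, the channel widths and output resolution are assumptions rather than the paper's exact values.

import torch.nn as nn

class DepthDecoder(nn.Module):
    def __init__(self, n_points=200, base=512):
        super().__init__()
        self.fc = nn.Linear(n_points * 3, base)  # lift the point set to a 1x1 feature map
        blocks, ch = [], base
        for _ in range(7):  # each block doubles the spatial resolution (1 -> 128)
            out = max(ch // 2, 16)
            blocks += [nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                       nn.Conv2d(ch, out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out
        blocks.append(nn.Conv2d(ch, 1, kernel_size=3, padding=1))  # single-channel depth map
        self.net = nn.Sequential(*blocks)

    def forward(self, z_t):                                   # z_t: transformed latent points, (B, 200, 3)
        h = self.fc(z_t.flatten(1)).view(z_t.size(0), -1, 1, 1)  # (B, base, 1, 1)
        return self.net(h)                                    # depth map D_t, (B, 1, 128, 128)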
Evaluation
Their evaluation is performed on ShapeNet and KITTI.
References
- ↑ Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alyosha Efros (2016). View Synthesis by Appearance Flow. ECCV 2016. DOI: 10.1007/978-3-319-46493-0_18. arXiv: https://arxiv.org/abs/1605.03557