Monocular Neural Image Based Rendering with Continuous View Control (ICCV 2019)
Authors: Xu Chen, Jie Song, Otmar Hilliges
Affiliations: AIT Lab, ETH Zurich
- Arxiv mirror: https://arxiv.org/abs/1901.01880
- CVF mirror: http://openaccess.thecvf.com/content_ICCV_2019/html/Chen_Monocular_Neural_Image_Based_Rendering_With_Continuous_View_Control_ICCV_2019_paper.html
- IEEE Xplore: https://ieeexplore.ieee.org/document/9008541
- Supplementary: http://openaccess.thecvf.com/content_ICCV_2019/supplemental/Chen_Monocular_Neural_Image_ICCV_2019_supplemental.pdf
- GitHub: https://github.com/xuchen-ethz/continuous_view_synthesis
Method
The main idea is to create a transforming autoencoder.
The goal of the transforming autoencoder is to create a point cloud of latent features from a 2D source image.
1. Encode the image \(I_s\) into a latent representation \(z_s = E_{\theta_{e}}(I_s)\).
2. Rotate and translate the latent representation to get \(z_{t} = T_{s \to t}(z_s)\).
3. Decode the latent representation into a depth map for the target view, \(D_t\).
4. Compute correspondences between source and target by projecting the depth map:
   - this uses the camera intrinsics \(K\) and the extrinsics \(T_{s\to t}\), and yields a dense backward flow map \(C_{t \to s}\).
5. Warp the source image using these correspondences to get the target image \(\hat{I}_{t}\).
In total, the mapping is (see the code sketch below):
\[
\begin{equation}
M(I_s) = B(P_{t \to s}(D_{\theta_{d}}(T_{s \to t}(E_{\theta_{e}}(I_s)))), I_s) = \hat{I}_{t}
\end{equation}
\]
where:
- \(B(F, I)\) is a bilinear warp of image \(I\) using the backward flow \(F\)
- \(P_{t \to s}(I)\) is the projection of \(I\) from the target view \(t\) to the source view \(s\)
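To make the composition concrete, here is a minimal, framework-agnostic sketch of the mapping \(M\), assuming each of the five stages is available as a callable; the function and argument names are illustrative, not the authors' API.

# Composition of the five stages above into the mapping M(I_s) -> I_t_hat.
# Each stage is passed in as a callable; shapes in the comments assume batched
# image tensors, but the composition itself is framework-agnostic.
def render_target_view(I_s, T_s_to_t, K, encode, transform, decode, project, warp):
    z_s = encode(I_s)                      # latent 3D point cloud z_s = E(I_s), e.g. (B, n, 3)
    z_t = transform(z_s, T_s_to_t)         # rotate/translate the latent points
    D_t = decode(z_t)                      # depth map in the target view, e.g. (B, 1, H, W)
    C_t_to_s = project(D_t, K, T_s_to_t)   # dense backward flow / correspondences C_{t->s}
    return warp(C_t_to_s, I_s)             # bilinear warp B(F, I) -> estimated target image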
Transforming Auto-encoder
The latent code \(z_s\) is represented as a set of 3D points: \(z_s \in \mathbb{R}^{n \times 3}\).
Writing the points in homogeneous coordinates as \(\tilde{z}_s\), they can be multiplied directly by the transformation matrix \(T_{s \to t} = [R \mid t]_{s \to t}\):
\[
\begin{equation}
z_t = T_{s \to t} \tilde{z}_s
\end{equation}
\]
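A small PyTorch sketch of this transformation, assuming the latent code is stored as a batched \((B, n, 3)\) tensor and the relative pose as a \(4 \times 4\) rigid transform; the names are illustrative rather than taken from the released code.

import torch

def transform_latent_points(z_s: torch.Tensor, T_s_to_t: torch.Tensor) -> torch.Tensor:
    """z_s: (B, n, 3) latent 3D points; T_s_to_t: (B, 4, 4) rigid transform [R | t]."""
    B, n, _ = z_s.shape
    ones = torch.ones(B, n, 1, dtype=z_s.dtype, device=z_s.device)
    z_h = torch.cat([z_s, ones], dim=-1)                # homogeneous coordinates, (B, n, 4)
    z_t = torch.einsum('bij,bnj->bni', T_s_to_t, z_h)   # apply [R | t] to every point
    return z_t[..., :3]                                 # back to Euclidean coordinates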
Depth Guided Appearance Mapping
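A minimal PyTorch sketch of steps 4 and 5 above: back-project the decoded target-view depth with the intrinsics \(K\), move the points into the source frame with the inverse pose \(T_{t \to s}\), re-project them to obtain the dense backward flow \(C_{t \to s}\), and bilinearly sample the source image. It assumes a pinhole camera model and is an illustration of the idea, not the authors' implementation.

import torch
import torch.nn.functional as F

def depth_guided_warp(I_s, D_t, K, T_t_to_s):
    """I_s: (B, 3, H, W) source image; D_t: (B, 1, H, W) target-view depth;
    K: (B, 3, 3) intrinsics; T_t_to_s: (B, 4, 4) target-to-source pose."""
    B, _, H, W = D_t.shape
    dev, dt = D_t.device, D_t.dtype
    # pixel grid of the target view in homogeneous coordinates
    v, u = torch.meshgrid(torch.arange(H, device=dev, dtype=dt),
                          torch.arange(W, device=dev, dtype=dt), indexing='ij')
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(1, 3, -1)    # (1, 3, H*W)
    # back-project target pixels to 3D points in the target camera frame
    cam_t = torch.inverse(K) @ pix.expand(B, -1, -1) * D_t.reshape(B, 1, -1)  # (B, 3, H*W)
    # move the points into the source camera frame (homogeneous coordinates)
    cam_t_h = torch.cat([cam_t, torch.ones(B, 1, H * W, device=dev, dtype=dt)], dim=1)
    cam_s = (T_t_to_s @ cam_t_h)[:, :3]
    # project into the source image plane -> dense backward flow C_{t->s}
    proj = K @ cam_s
    xy = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # normalise to [-1, 1] and bilinearly warp the source image (B(F, I) above)
    grid = torch.stack([2 * xy[:, 0] / (W - 1) - 1,
                        2 * xy[:, 1] / (H - 1) - 1], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(I_s, grid, mode='bilinear', align_corners=True)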
Architecture
The only neural network they use is a transforming autoencoder.
Details about the network are provided in the supplementary material as well as in the code.
Their implementation is based on View Synthesis by Appearance Flow by Zhou et al.[1]
The encoder converts images into latent points.
It consists of 7 convolutional blocks which each downsample the feature map.
Each block is: Conv-BatchNorm-LeakyReLU.
The output of the convolutional blocks is passed through a fully connected layer and reshaped into a \(200 \times 3\) matrix of latent points.
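For concreteness, an illustrative PyTorch encoder matching this description (7 strided Conv-BatchNorm-LeakyReLU blocks, then a fully connected layer reshaped to \(200 \times 3\)); the channel widths and the \(128 \times 128\) input resolution are assumptions, not values from the paper.

import torch.nn as nn

class PointEncoder(nn.Module):
    def __init__(self, in_ch=3, base=32, n_points=200):
        super().__init__()
        blocks, ch = [], in_ch
        for i in range(7):  # each block halves the spatial resolution
            out = min(base * 2 ** i, 512)
            blocks += [nn.Conv2d(ch, out, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(out),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out
        self.conv = nn.Sequential(*blocks)
        self.fc = nn.Linear(ch, n_points * 3)  # assumes a 1x1 map after 7 downsamplings of a 128x128 input
        self.n_points = n_points

    def forward(self, x):                                  # x: (B, 3, 128, 128)
        h = self.conv(x).flatten(1)                        # (B, 512)
        return self.fc(h).view(-1, self.n_points, 3)       # latent 3D points, (B, 200, 3)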
The decoder renders the latent points into a depth map from the target view.
It consists of 7 blocks of: Upsample-Conv-BatchNorm-LeakyReLU.
They use bilinear upsampling.
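A matching illustrative decoder (a fully connected layer followed by 7 Upsample-Conv-BatchNorm-LeakyReLU blocks with bilinear upsampling, ending in a single-channel depth map); again, the channel widths and output resolution are assumptions rather than the paper's exact values.

import torch.nn as nn

class DepthDecoder(nn.Module):
    def __init__(self, n_points=200, base=512):
        super().__init__()
        self.fc = nn.Linear(n_points * 3, base)  # lift the point set to a 1x1 feature map
        blocks, ch = [], base
        for _ in range(7):  # each block doubles the spatial resolution (1 -> 128)
            out = max(ch // 2, 16)
            blocks += [nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                       nn.Conv2d(ch, out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out
        blocks.append(nn.Conv2d(ch, 1, kernel_size=3, padding=1))  # single-channel depth map
        self.net = nn.Sequential(*blocks)

    def forward(self, z_t):                                   # z_t: transformed latent points, (B, 200, 3)
        h = self.fc(z_t.flatten(1)).view(z_t.size(0), -1, 1, 1)  # (B, base, 1, 1)
        return self.net(h)                                    # depth map D_t, (B, 1, 128, 128)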
Evaluation
Their evaluation is performed on ShapeNet and KITTI.
References
- ↑ Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alyosha Efros (2016). View Synthesis by Appearance Flow. ECCV 2016. DOI: 10.1007/978-3-319-46493-0_18. arXiv: https://arxiv.org/abs/1605.03557