Space-time Neural Irradiance Fields for Free-Viewpoint Video
Authors: Wenqi Xian, Jia-Bin Huang, Johannes Kopf, Changil Kim. Affiliations: Cornell, Virginia Tech, Facebook.
The main contribution of this paper is using depth (either ground truth or estimated) as an additional source of supervision, which resolves the motion-appearance ambiguity that arises when representing a video with a NeRF.
Method
- Inputs: Stream of RGBD images (i.e. an RGBD video) with camera matrices.
- Model: \(F\) maps a space-time location \(\displaystyle (\mathbf{x}, t)\) to a color \(\displaystyle \mathbf{c}\) and density \(\displaystyle \sigma\). \(\displaystyle F: (\mathbf{x}, t) \to (\mathbf{c}, \sigma)\)
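A minimal PyTorch sketch of such a field, assuming NeRF-like layer sizes; the class and argument names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SpaceTimeField(nn.Module):
    """Maps an encoded space-time point (x, t) to a color c and a density sigma."""
    def __init__(self, in_dim, hidden=256, num_layers=8):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
            d = hidden
        self.trunk = nn.Sequential(*layers)
        self.rgb_head = nn.Linear(hidden, 3)     # color head
        self.sigma_head = nn.Linear(hidden, 1)   # density head

    def forward(self, xt_enc):
        h = self.trunk(xt_enc)                   # xt_enc: positionally encoded (x, t)
        c = torch.sigmoid(self.rgb_head(h))      # colors constrained to [0, 1]
        sigma = torch.relu(self.sigma_head(h))   # non-negative density
        return c, sigma.squeeze(-1)
```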
Losses
- L2 RGB Loss
\(\displaystyle \sum \Vert \hat{C}(\mathbf{r}, t) - C(\mathbf{r}, t) \Vert_2^2\)
- Depth reconstruction loss
\(\displaystyle \sum \Vert \frac{1}{\hat{D}(\mathbf{r}, t)} - \frac{1}{D(\mathbf{r}, t)} \Vert_2^2\)
To estimate depth from volume rendering, they use \(\displaystyle \hat{D}(\mathbf{r}, t) = \int_{s_n}^{s_f} T(s, t)\, \sigma(\mathbf{r}(s), t)\, s \, ds\)
where \(T\) is the accumulated transmittance up to the point \(s\): \(\displaystyle T(s, t) = \exp \left( - \int_{s_n}^{s} \sigma(\mathbf{r}(p), t)\, dp \right)\)
- Empty space loss
\(\displaystyle L_{empty} = \sum \int_{s_n}^{d_t(u) - \epsilon} \sigma(\mathbf{r}(s), t) ds\).
This loss penalizes any density along the ray before it reaches the observed depth \(d_t(u)\) of pixel \(u\) at time \(t\), minus a small margin \(\epsilon\).
It is needed because the depth loss constrains only the expected depth along the ray (the mean, not the spread), so without it density could be smeared along the ray rather than concentrated at the surface.
- Static scene loss
The goal is to ensure that occluded regions do not produce artifacts when the scene is viewed from a new perspective.
To achieve this, they force the occluded regions, i.e. the points beyond the input depth, to stay static over time (a combined code sketch of the losses follows this list).
\(\displaystyle L_{static} = \sum \Vert F(\mathbf{x}, t) - F(\mathbf{x}, t') \Vert_2^2\)
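A compact PyTorch sketch of how these losses could be computed from per-ray samples. The tensor names, the discrete approximation of the integrals, and the `encode` helper (a positional encoder over \((\mathbf{x}, t)\)) are assumptions, not the paper's code; `model` is the space-time field sketched above.

```python
import torch

def render_and_losses(sigma, rgb, s_vals, gt_rgb, gt_depth, eps=0.05):
    """Rendered color/depth plus the RGB, depth, and empty-space losses.

    sigma:    (R, S) densities at the samples along each ray
    rgb:      (R, S, 3) colors at the samples
    s_vals:   (R, S) sample depths along each ray, sorted near to far
    gt_rgb:   (R, 3) observed pixel colors
    gt_depth: (R,) observed (or estimated) pixel depths
    """
    deltas = s_vals[:, 1:] - s_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                      # transmittance T(s, t)
    weights = trans * alpha                                  # volume-rendering weights

    pred_rgb = (weights[..., None] * rgb).sum(dim=1)         # \hat{C}(r, t)
    pred_depth = (weights * s_vals).sum(dim=1)               # \hat{D}(r, t)

    loss_rgb = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()
    loss_depth = ((1.0 / pred_depth.clamp(min=1e-6)
                   - 1.0 / gt_depth.clamp(min=1e-6)) ** 2).mean()

    # Empty-space loss: penalize density at samples in front of the observed
    # surface (a discrete stand-in for the integral up to d_t(u) - eps).
    before_surface = (s_vals < (gt_depth[:, None] - eps)).float()
    loss_empty = (sigma * before_surface).mean()

    return loss_rgb, loss_depth, loss_empty


def static_loss(model, encode, x_occluded, t, t_prime):
    """Penalize changes over time for points hidden behind the observed depth."""
    def query(time):
        time_col = torch.full_like(x_occluded[:, :1], time)
        return model(encode(torch.cat([x_occluded, time_col], dim=-1)))
    c1, s1 = query(t)
    c2, s2 = query(t_prime)
    return ((c1 - c2) ** 2).sum(dim=-1).mean() + ((s1 - s2) ** 2).mean()
```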
Architecture
The architecture is the same as in the original NeRF, except that the input additionally includes the time \(t\).
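A sketch of how NeRF's positional encoding could be extended to the time coordinate; the frequency count and input normalization are assumptions:

```python
import math
import torch

def positional_encoding(xt, num_freqs=10):
    """Sinusoidal encoding applied to all four coordinates (x, y, z, t)."""
    # xt: (..., 4) space-time points with coordinates roughly normalized to [-1, 1]
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xt.dtype, device=xt.device)
    angles = xt[..., None] * freqs * math.pi            # (..., 4, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return torch.cat([xt, enc.flatten(-2)], dim=-1)     # (..., 4 + 4*2*num_freqs)
```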
Training
During training, they draw \(\displaystyle N_r=1024\) rays at random without replacement in each iteration.
Points along each ray are sampled uniformly in inverse depth (a sampling sketch follows at the end of this section).
Training takes about 50 hours for 100 frames at \(\displaystyle 960 \times 540\) on two V100 GPUs.
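A small sketch of stratified sampling spaced uniformly in inverse depth (disparity), assuming near/far bounds \(s_n\) and \(s_f\); the function name and signature are illustrative:

```python
import torch

def sample_inverse_depth(s_near, s_far, num_rays, num_samples):
    """Stratified per-ray samples spaced uniformly in inverse depth."""
    # u in [0, 1): one jittered sample per stratum, per ray
    u = (torch.arange(num_samples) + torch.rand(num_rays, num_samples)) / num_samples
    inv = (1.0 - u) / s_near + u / s_far   # linear interpolation in disparity
    return 1.0 / inv                       # (num_rays, num_samples) depths in [s_near, s_far]
```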
Evaluation