Space-time Neural Irradiance Fields for Free-Viewpoint Video
Authors: Wenqi Xian, Jia-Bin Huang, Johannes Kopf, Changil Kim. Affiliations: Cornell, Virginia Tech, Facebook.
The main contribution of this paper is using depth (either ground truth or estimated) as an additional source of supervision, which resolves the motion-appearance ambiguity that arises when representing a video with a NeRF.
Method
- Inputs: Stream of RGBD images (i.e. an RGBD video) with camera matrices.
- Model: \(F\) maps a space-time location \(\displaystyle (\mathbf{x}, t)\) to a color \(\displaystyle \mathbf{c}\) and density \(\displaystyle \sigma\). \(\displaystyle F: (\mathbf{x}, t) \to (\mathbf{c}, \sigma)\)
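A minimal PyTorch sketch of such a field, assuming NeRF-like layer sizes; the class and argument names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SpaceTimeField(nn.Module):
    """Maps an encoded space-time point (x, t) to a color c and a density sigma."""
    def __init__(self, in_dim, hidden=256, num_layers=8):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(num_layers):
            layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
            d = hidden
        self.trunk = nn.Sequential(*layers)
        self.rgb_head = nn.Linear(hidden, 3)     # color head
        self.sigma_head = nn.Linear(hidden, 1)   # density head

    def forward(self, xt_enc):
        h = self.trunk(xt_enc)                   # xt_enc: positionally encoded (x, t)
        c = torch.sigmoid(self.rgb_head(h))      # colors constrained to [0, 1]
        sigma = torch.relu(self.sigma_head(h))   # non-negative density
        return c, sigma.squeeze(-1)
```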
Losses
- L2 RGB Loss
\(\displaystyle \sum \Vert \hat{C}(\mathbf{r}, t) - C(\mathbf{r}, t) \Vert_2^2\)
- Depth reconstruction loss
\(\displaystyle \sum \Vert \frac{1}{\hat{D}(\mathbf{r}, t)} - \frac{1}{D(\mathbf{r}, t)} \Vert_2^2\)
To estimate depth from volume rendering, they use \(\displaystyle \hat{D}(\mathbf{r}, t) = \int_{s_n}^{s_f} T(s, t)\, \sigma(\mathbf{r}(s), t)\, s \, ds\)
where \(T\) is the accumulated transmittance up to the point \(s\): \(\displaystyle T(s, t) = \exp \left( - \int_{s_n}^{s} \sigma(\mathbf{r}(p), t)\, dp \right)\)
- Empty space loss
\(\displaystyle L_{empty} = \sum \int_{s_n}^{d_t(u) - \epsilon} \sigma(\mathbf{r}(s), t) ds\).
This loss penalizes any density along the ray before it reaches the observed depth \(d_t(u)\) of pixel \(u\) at time \(t\), minus a small margin \(\epsilon\).
It is needed because the depth loss constrains only the expected depth along the ray (the mean, not the spread), so without it density could be smeared along the ray rather than concentrated at the surface.
- Static scene loss
The goal is to ensure that occluded regions do not produce artifacts when the scene is viewed from a new perspective.
To achieve this, they force the occluded regions, i.e. the points beyond the input depth, to stay static over time (a combined code sketch of the losses follows this list).
\(\displaystyle L_{static} = \sum \Vert F(\mathbf{x}, t) - F(\mathbf{x}, t') \Vert_2^2\)
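A compact PyTorch sketch of how these losses could be computed from per-ray samples. The tensor names, the discrete approximation of the integrals, and the `encode` helper (a positional encoder over \((\mathbf{x}, t)\)) are assumptions, not the paper's code; `model` is the space-time field sketched above.

```python
import torch

def render_and_losses(sigma, rgb, s_vals, gt_rgb, gt_depth, eps=0.05):
    """Rendered color/depth plus the RGB, depth, and empty-space losses.

    sigma:    (R, S) densities at the samples along each ray
    rgb:      (R, S, 3) colors at the samples
    s_vals:   (R, S) sample depths along each ray, sorted near to far
    gt_rgb:   (R, 3) observed pixel colors
    gt_depth: (R,) observed (or estimated) pixel depths
    """
    deltas = s_vals[:, 1:] - s_vals[:, :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                      # transmittance T(s, t)
    weights = trans * alpha                                  # volume-rendering weights

    pred_rgb = (weights[..., None] * rgb).sum(dim=1)         # \hat{C}(r, t)
    pred_depth = (weights * s_vals).sum(dim=1)               # \hat{D}(r, t)

    loss_rgb = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()
    loss_depth = ((1.0 / pred_depth.clamp(min=1e-6)
                   - 1.0 / gt_depth.clamp(min=1e-6)) ** 2).mean()

    # Empty-space loss: penalize density at samples in front of the observed
    # surface (a discrete stand-in for the integral up to d_t(u) - eps).
    before_surface = (s_vals < (gt_depth[:, None] - eps)).float()
    loss_empty = (sigma * before_surface).mean()

    return loss_rgb, loss_depth, loss_empty


def static_loss(model, encode, x_occluded, t, t_prime):
    """Penalize changes over time for points hidden behind the observed depth."""
    def query(time):
        time_col = torch.full_like(x_occluded[:, :1], time)
        return model(encode(torch.cat([x_occluded, time_col], dim=-1)))
    c1, s1 = query(t)
    c2, s2 = query(t_prime)
    return ((c1 - c2) ** 2).sum(dim=-1).mean() + ((s1 - s2) ** 2).mean()
```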
Architecture
The architecture is the same as in the original NeRF, except that the input additionally includes the time \(t\).
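A sketch of how NeRF's positional encoding could be extended to the time coordinate; the frequency count and input normalization are assumptions:

```python
import math
import torch

def positional_encoding(xt, num_freqs=10):
    """Sinusoidal encoding applied to all four coordinates (x, y, z, t)."""
    # xt: (..., 4) space-time points with coordinates roughly normalized to [-1, 1]
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xt.dtype, device=xt.device)
    angles = xt[..., None] * freqs * math.pi            # (..., 4, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return torch.cat([xt, enc.flatten(-2)], dim=-1)     # (..., 4 + 4*2*num_freqs)
```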
Training
During training, they draw \(\displaystyle N_r=1024\) rays at random without replacement in each iteration.
Points along each ray are sampled uniformly in inverse depth (a sampling sketch follows at the end of this section).
Training takes about 50 hours for 100 frames at \(\displaystyle 960 \times 540\) on two V100 GPUs.
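A small sketch of stratified sampling spaced uniformly in inverse depth (disparity), assuming near/far bounds \(s_n\) and \(s_f\); the function name and signature are illustrative:

```python
import torch

def sample_inverse_depth(s_near, s_far, num_rays, num_samples):
    """Stratified per-ray samples spaced uniformly in inverse depth."""
    # u in [0, 1): one jittered sample per stratum, per ray
    u = (torch.arange(num_samples) + torch.rand(num_rays, num_samples)) / num_samples
    inv = (1.0 - u) / s_near + u / s_far   # linear interpolation in disparity
    return 1.0 / inv                       # (num_rays, num_samples) depths in [s_near, s_far]
```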
Evaluation