
Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation
Authors: Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, Jan Kautz
UMass Amherst, NVIDIA, UC Merced

Method

We're given two images, \(I_0\) and \(I_1\).
The goal is to predict an intermediate image \(I_t\).

First, estimate the bidirectional optical flows \(F_{0\to 1}\) and \(F_{1 \to 0}\) between the two inputs using an optical flow neural network.
Given these two flows, we can approximate the optical flow from the intermediate frame to each input frame as follows:

  • \(\displaystyle \hat{F}_{t\to 0} = -(1-t)t F_{0 \to 1} + t^2 F_{1 \to 0}\)
  • \(\displaystyle \hat{F}_{t \to 1} = (1-t)^2 F_{0 \to 1} - t(1-t)F_{1 \to 0}\)
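As a minimal sketch of the two formulas above, the approximation is a per-pixel linear combination of the input flows. The function name and the `(H, W, 2)` array layout are assumptions made for illustration, not something taken from the paper's code.

```python
import numpy as np

def approximate_intermediate_flows(flow_01, flow_10, t):
    """Approximate (F_{t->0}, F_{t->1}) from F_{0->1} and F_{1->0}.

    flow_01, flow_10: arrays of shape (H, W, 2) holding per-pixel (dx, dy)
                      (layout is an assumption for this sketch).
    t: time of the intermediate frame, in [0, 1].
    """
    flow_t0 = -(1.0 - t) * t * flow_01 + t * t * flow_10
    flow_t1 = (1.0 - t) ** 2 * flow_01 - t * (1.0 - t) * flow_10
    return flow_t0, flow_t1

# Toy example: every pixel moves 4 px to the right, so F_{1->0} = -F_{0->1}.
flow_01 = np.zeros((2, 3, 2))
flow_01[..., 0] = 4.0
flow_10 = -flow_01
flow_t0, flow_t1 = approximate_intermediate_flows(flow_01, flow_10, t=0.25)
```

For this constant motion the result matches intuition: at \(t = 0.25\) the frame is 1 px away from \(I_0\) (backwards) and 3 px away from \(I_1\) (forwards).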

Derivation

We consider estimating \(F_{t \to 1}(p)\).
The time distance from \(t\) to \(1\) is \(1-t\).
Thus if we estimate using \(F_{0 \to 1}\), our estimate is \(\displaystyle \frac{1-t}{1}F_{0 \to 1}(p)\).
On the other hand, if we estimate using \(F_{1 \to 0}\), the flow is backwards so our estimate is \(\displaystyle -\frac{1-t}{1}F_{1 \to 0}(p)\).
When \(t=0\), we want our estimate to rely entirely on \(F_{0 \to 1}\). When \(t=1\), we want it to rely entirely on \(F_{1 \to 0}\). Weighting the two estimates by \((1-t)\) and \(t\) achieves this.
Thus our final estimate is \(\hat{F}_{t \to 1} = (1-t)\cdot(1-t)F_{0 \to 1} + t\cdot\left(-(1-t)\right)F_{1 \to 0} = (1-t)^2 F_{0 \to 1} - t(1-t)F_{1 \to 0}\).

The derivation for \(\hat{F}_{t \to 0}\) is the same, except that the signed time distance from \(t\) to \(0\) is \(-t\).
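
Spelled out, the same steps for \(\hat{F}_{t \to 0}\) can be checked against the first formula in the Method section:

  • Estimate from \(F_{0 \to 1}\): \(\displaystyle -t\,F_{0 \to 1}(p)\)
  • Estimate from \(F_{1 \to 0}\) (backwards flow, so the sign flips): \(\displaystyle t\,F_{1 \to 0}(p)\)
  • Blend with weights \((1-t)\) and \(t\): \(\displaystyle \hat{F}_{t \to 0} = (1-t)\cdot(-t)F_{0 \to 1} + t \cdot t\,F_{1 \to 0} = -(1-t)t F_{0 \to 1} + t^2 F_{1 \to 0}\)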

The estimate of the intermediate frame now is:

  • \(\displaystyle \hat{I}_t = \alpha_0 \odot g(I_0, F_{t \to 0}) + (1 - \alpha_0) \odot g(I_1, F_{t \to 1})\)

where \(g\) is a differentiable backward warping function (bilinear interpolation) and \(\alpha_0\) controls the pixelwise contribution from each image.
A naive estimate would use \(\alpha_0 = (1-t)\). However, to address occlusions, it is necessary to find the visibility maps.
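
The warp-and-blend step can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the paper's code: the function names, the `(H, W, 2)` flow layout with `(dx, dy)` ordering, and the border clamping for out-of-range samples are all choices made here.

```python
import numpy as np

def backward_warp(image, flow):
    """Sample `image` at p + flow(p) using bilinear interpolation.

    image: (H, W) or (H, W, C) float array.
    flow:  (H, W, 2) array, flow[y, x] = (dx, dy) pointing from the
           target pixel into `image` (layout is an assumption).
    Sample positions outside the image are clamped to the border
    (an assumption; other padding modes are possible).
    """
    h, w = flow.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    if image.ndim == 3:  # broadcast weights over the channel axis
        wx = wx[..., None]
        wy = wy[..., None]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

def blend_naive(i0, i1, flow_t0, flow_t1, t):
    """Blend the two warped inputs with the naive weight alpha_0 = 1 - t."""
    alpha0 = 1.0 - t
    return alpha0 * backward_warp(i0, flow_t0) + (1 - alpha0) * backward_warp(i1, flow_t1)

# Toy example: a horizontal ramp that shifts 2 px to the right between frames.
h, w = 3, 6
xs = np.tile(np.arange(w, dtype=float), (h, 1))
i0 = xs            # I_0(x) = x
i1 = xs - 2.0      # I_1(x) = I_0(x - 2)
flow_t0 = np.zeros((h, w, 2)); flow_t0[..., 0] = -1.0  # t = 0.5: 1 px back to I_0
flow_t1 = np.zeros((h, w, 2)); flow_t1[..., 0] = 1.0   # t = 0.5: 1 px forward to I_1
frame = blend_naive(i0, i1, flow_t0, flow_t1, t=0.5)
```

Away from the clamped border columns, both warped images agree on the ramp shifted by 1 px, so the blend reproduces it exactly. The naive weight gives no way to discount a pixel that is occluded in one of the inputs.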

Visibility maps \(V_{t \leftarrow 0}\) and \(V_{t \leftarrow 1}\) tell us whether each pixel in \(I_t\) is visible in \(I_0\) and \(I_1\) respectively.
These visibility maps are estimated using a flow-interpolation neural network.
During training, the authors enforce \(V_{t \leftarrow 0} + V_{t \leftarrow 1} = 1\).

The final image estimate is:

  • \(\displaystyle \hat{I}_t = \frac{1}{Z} \odot \left( (1-t)V_{t \leftarrow 0} \odot g(I_0, F_{t \to 0}) + tV_{t \leftarrow 1} \odot g(I_1, F_{t \to 1}) \right)\)

where \(Z = (1-t)V_{t \leftarrow 0} + tV_{t \leftarrow 1}\) is a normalization factor.
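
A minimal NumPy sketch of this visibility-weighted fusion is given below. The function name, the argument layout, and the small `eps` guard against division by zero are assumptions for illustration. The inputs `warped_0` and `warped_1` stand for \(g(I_0, F_{t \to 0})\) and \(g(I_1, F_{t \to 1})\), and \(V_{t \leftarrow 1}\) is taken as \(1 - V_{t \leftarrow 0}\), following the constraint stated earlier.

```python
import numpy as np

def fuse_with_visibility(warped_0, warped_1, vis_0, t, eps=1e-8):
    """Fuse two warped frames using time- and visibility-based weights.

    warped_0, warped_1: warped inputs, same shape.
    vis_0: visibility map V_{t<-0} with values in [0, 1]; its shape must
           broadcast against the warped images.
    eps:   guard against a zero normalizer (an assumption of this sketch).
    """
    vis_1 = 1.0 - vis_0                # V_{t<-0} + V_{t<-1} = 1
    w0 = (1.0 - t) * vis_0             # weight of the frame warped from I_0
    w1 = t * vis_1                     # weight of the frame warped from I_1
    z = np.maximum(w0 + w1, eps)       # normalizer so the weights sum to 1
    return (w0 * warped_0 + w1 * warped_1) / z

# Toy example: three pixels, visible only in I_0, in both, and only in I_1.
warped_0 = np.full((1, 3), 10.0)
warped_1 = np.full((1, 3), 20.0)
vis_0 = np.array([[1.0, 0.5, 0.0]])
fused = fuse_with_visibility(warped_0, warped_1, vis_0, t=0.25)
```

When a pixel is fully visible in only one input, the output copies that input regardless of \(t\). When both visibilities are 0.5, the result reduces to the naive blend with \(\alpha_0 = 1-t\).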

Architecture