Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation
Authors: Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, Jan Kautz
Affiliations: UMass Amherst, NVIDIA, UC Merced
Links:
- CVPR 2018 Paper: http://openaccess.thecvf.com/content_cvpr_2018/html/Jiang_Super_SloMo_High_CVPR_2018_paper.html
- Arxiv Mirror: https://arxiv.org/abs/1712.00080
- Unofficial implementation by Avinash Paliwal: https://github.com/avinashpaliwal/Super-SloMo
Method
We're given two images, \(I_0\) and \(I_1\).
The goal is to predict an intermediate image \(I_t\).
First, estimate the bidirectional optical flows \(F_{0\to 1}\) and \(F_{1 \to 0}\) using an optical flow neural network.
Given these two flows, we can approximate the optical flow from the intermediate frame as follows:
- \(\displaystyle \hat{F}_{t\to 0} = -(1-t)t F_{0 \to 1} + t^2 F_{1 \to 0}\)
- \(\displaystyle \hat{F}_{t \to 1} = (1-t)^2 F_{0 \to 1} - t(1-t)F_{1 \to 0}\)
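A minimal sketch of this flow interpolation in NumPy (the function name `interp_flow` and the \((2, H, W)\) array layout are my assumptions, not from the paper):

```python
import numpy as np

def interp_flow(f01, f10, t):
    """Approximate the flows from the intermediate frame at time t.

    f01, f10: forward and backward flow fields, NumPy arrays of shape (2, H, W).
    Returns (F_t->0, F_t->1) under the paper's linear-motion approximation.
    """
    f_t0 = -(1 - t) * t * f01 + t ** 2 * f10       # \hat{F}_{t->0}
    f_t1 = (1 - t) ** 2 * f01 - t * (1 - t) * f10  # \hat{F}_{t->1}
    return f_t0, f_t1
```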
Derivation: we consider estimating \(F_{t \to 1}(p)\).
The time distance from \(t\) to \(1\) is \((1-t)\).
Thus if we estimate using \(F_{0 \to 1}\), our estimate is \(\displaystyle \frac{1-t}{1}F_{0 \to 1}(p)\).
On the other hand, if we estimate using \(F_{1 \to 0}\), the flow points backwards in time, so our estimate is \(\displaystyle -\frac{1-t}{1}F_{1 \to 0}(p)\).
We combine these two one-sided estimates, weighting each by its temporal proximity: when \(t=0\) the estimate should rely entirely on \(F_{0 \to 1}\), and when \(t=1\) entirely on \(F_{1 \to 0}\), giving weights \((1-t)\) and \(t\) respectively.
Thus our final estimate is \(\hat{F}_{t \to 1} = (1-t)(1-t)F_{0 \to 1} + t\left(-(1-t)\right)F_{1 \to 0} = (1-t)^2 F_{0 \to 1} - t(1-t)F_{1 \to 0}\).
The derivation for \(\hat{F}_{t \to 0}\) is the same, except the signed time offset from \(t\) to \(0\) is \(-t\).
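As a quick check, at the midpoint \(t = \frac{1}{2}\) both one-sided estimates receive equal weight, giving \(\hat{F}_{1/2 \to 1} = \frac{1}{4}F_{0 \to 1} - \frac{1}{4}F_{1 \to 0}\); when the motion is locally linear, so that \(F_{1 \to 0} \approx -F_{0 \to 1}\), this reduces to \(\frac{1}{2}F_{0 \to 1}\), as expected.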
The estimate of the intermediate frame now is:
- \(\displaystyle \hat{I}_t = \alpha_0 \odot g(I_0, F_{t \to 0}) + (1 - \alpha_0) \odot g(I_1, F_{t \to 1})\)
where \(g\) is a differentiable backward warping function (bilinear interpolation) and \(\alpha_0\) controls the pixelwise contribution from each image.
A naive estimate would use \(\alpha_0 = (1-t)\). However, to handle occlusions, we also need to estimate visibility maps.
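Before turning to visibility, here is a PyTorch sketch of the backward warp \(g\) and the naive blend (the use of `grid_sample`, the pixel-unit flow convention, and the tensor shapes are my assumptions; the paper only specifies differentiable bilinear backward warping):

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """g(img, flow): bilinearly sample img at locations displaced by flow.

    img: (N, C, H, W); flow: (N, 2, H, W) in pixel units, (x, y) order.
    """
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device),
        torch.arange(w, device=img.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow
    # grid_sample expects sampling coordinates normalized to [-1, 1].
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

def naive_blend(i0, i1, f_t0, f_t1, t):
    """Occlusion-unaware estimate with alpha_0 = 1 - t."""
    return (1 - t) * backward_warp(i0, f_t0) + t * backward_warp(i1, f_t1)
```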
Visibility maps \(V_{t \leftarrow 0}\) and \(V_{t \leftarrow 1}\) tell us whether each pixel in \(I_t\) is visible in \(I_0\) and \(I_1\) respectively.
These visibility maps are estimated using a flow-interpolation neural network.
During training, they enforce \(V_{t \leftarrow 0} + V_{t \leftarrow 1} = 1\).
The final image estimate is:
- \(\displaystyle \hat{I}_t = \frac{1}{Z} \odot \left( (1-t)V_{t \leftarrow 0} \odot g(I_0, F_{t \to 0}) + tV_{t \leftarrow 1} \odot g(I_1, F_{t \to 1}) \right)\)
where \(Z = (1-t)V_{t \leftarrow 0} + tV_{t \leftarrow 1}\) is a pixelwise normalization factor.
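A sketch of this visibility-weighted fusion, reusing the hypothetical `backward_warp` above (`eps` is my addition to avoid division by zero where both weights vanish):

```python
def fuse(i0, i1, f_t0, f_t1, v_t0, t, eps=1e-8):
    """Blend the two warped images using visibility weighting.

    v_t0: (N, 1, H, W) visibility of each pixel of I_t in I_0;
    v_t1 = 1 - v_t0, matching the constraint enforced during training.
    """
    v_t1 = 1.0 - v_t0
    w0 = (1 - t) * v_t0
    w1 = t * v_t1
    z = w0 + w1 + eps  # pixelwise normalization factor Z
    return (w0 * backward_warp(i0, f_t0) + w1 * backward_warp(i1, f_t1)) / z
```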
Architecture
Their architecture consists of two similar CNNs: a flow computation CNN that computes the bidirectional flow between the two images, and a flow interpolation CNN that refines the intermediate flow estimates and predicts the visibility maps. Both networks are fully convolutional U-Nets with 6 hierarchies in the encoder and 5 hierarchies in the decoder.
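Both networks could be sketched as follows in PyTorch; the channel widths, kernel sizes, and average-pooling/bilinear-upsampling choices reflect my reading of the paper's description, and the exact hyperparameters are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def block(cin, cout, k):
    # One hierarchy: two convolutions, each followed by a leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2), nn.LeakyReLU(0.1),
        nn.Conv2d(cout, cout, k, padding=k // 2), nn.LeakyReLU(0.1),
    )

class UNet(nn.Module):
    """Fully convolutional U-Net: 6 encoder and 5 decoder hierarchies."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        widths = [32, 64, 128, 256, 512, 512]  # assumed channel widths
        kernels = [7, 5, 3, 3, 3, 3]           # larger kernels early for large motions
        self.enc = nn.ModuleList()
        c = in_ch
        for wdt, k in zip(widths, kernels):
            self.enc.append(block(c, wdt, k))
            c = wdt
        self.dec = nn.ModuleList(
            block(widths[i] + widths[i - 1], widths[i - 1], 3)
            for i in range(5, 0, -1)
        )
        self.head = nn.Conv2d(widths[0], out_ch, 3, padding=1)

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.enc):
            x = enc(x)
            if i < len(self.enc) - 1:
                skips.append(x)         # skip connection to the decoder
                x = F.avg_pool2d(x, 2)  # downsample between hierarchies
        for dec in self.dec:
            x = F.interpolate(x, scale_factor=2, mode="bilinear",
                              align_corners=False)
            x = dec(torch.cat((x, skips.pop()), dim=1))
        return self.head(x)
```

Roughly, the flow computation network would take the two images stacked as input and predict the bidirectional flows, while the flow interpolation network additionally takes the warped images and intermediate flow estimates and outputs flow refinements plus a visibility map; the exact input and output channel counts are left as parameters above.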