==Method==
The authors first build two networks to estimate the following at time \(t\):
* Image depth \(d_t \in [0,\infty)^{w \times h} \)
* Camera rotation and translation: \(\{R_t^c, t_t^c \}\)
* \(K\) motion masks: \(m_t^k\) for \(k=1,...,K\)
* \(K\) object motions: \(\{R_t^k, t_t^k \}\) for \(k=1,...,K\)
These estimates are then combined as follows:
* Use the depth to generate a point cloud
* Transform the point cloud based on the object transformations
* Transform the point cloud based on the camera transformation
* Project the transformed point cloud back onto the image plane to compute optical flow, then warp the image with that flow
* Repeat with the frame order reversed (\(I_{t+1}\), then \(I_{t}\)) for backward consistency
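
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics <code>f</code>, <code>cx</code>, <code>cy</code>, and it applies each object motion as a mask-weighted rigid displacement. The paper's formulation may differ in detail (for example, in the point each rotation is taken about). All function names are made up for this example.

```python
import numpy as np

def depth_to_point_cloud(depth, f, cx, cy):
    """Back-project a depth map (h, w) to 3D points (h, w, 3), assuming a pinhole camera."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    X = (xs - cx) / f * depth
    Y = (ys - cy) / f * depth
    return np.stack([X, Y, depth], axis=-1)

def apply_object_motions(points, masks, rotations, translations):
    """Add the mask-weighted displacement of each of the K rigid object motions.
    masks: (K, h, w), rotations: (K, 3, 3), translations: (K, 3)."""
    moved = points.copy()
    for m, R, t in zip(masks, rotations, translations):
        delta = points @ R.T + t - points   # displacement under motion k
        moved += m[..., None] * delta       # only where mask k is active
    return moved

def apply_camera_motion(points, R_c, t_c):
    """Apply a single rigid camera transformation to every point."""
    return points @ R_c.T + t_c

def project(points, f, cx, cy):
    """Project 3D points (h, w, 3) to pixel coordinates (h, w, 2)."""
    z = points[..., 2]
    u = f * points[..., 0] / z + cx
    v = f * points[..., 1] / z + cy
    return np.stack([u, v], axis=-1)

def optical_flow(depth, masks, rotations, translations, R_c, t_c, f, cx, cy):
    """Flow = projected position after all motions minus the original pixel grid."""
    h, w = depth.shape
    pts = depth_to_point_cloud(depth, f, cx, cy)
    pts = apply_object_motions(pts, masks, rotations, translations)
    pts = apply_camera_motion(pts, R_c, t_c)
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    return project(pts, f, cx, cy) - np.stack([xs, ys], axis=-1)
```

With identity rotations, zero translations, and all-zero masks, the resulting flow is zero everywhere, which is a quick sanity check for this sketch.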
===Supervision/Loss Function===
The authors apply several forms of supervision:
* Self-supervision: Minimize the distance between the reference image and the warped image
* Spatial smoothness priors: Penalize the \(\ell_1\) norm of the gradients of the optical flow field, depth, and motion maps
* Forward-backward consistency constraints: Run the model backwards in time and require the depth \(d_{t+1}\) to be consistent with \(d_t\)
* Supervising depth: Minimize the difference between the estimated and ground-truth depth
* Supervising camera motion: Minimize the difference between the estimated and ground-truth camera motion
* Supervising optical flow and object motion: Minimize the difference between the estimated and ground-truth optical flow and object motion on synthetic datasets
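
As a rough illustration of the first two terms, the sketch below computes a photometric loss and a smoothness penalty in NumPy. The choice of an \(\ell_1\) distance for the photometric term, the use of finite differences for gradients, and the mean reduction are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np

def photometric_loss(reference, warped):
    """Mean absolute difference between the reference and the warped image
    (L1 distance is an assumption; the text only says 'distance')."""
    return np.abs(reference - warped).mean()

def smoothness_loss(field):
    """L1 norm (mean-reduced) of finite-difference gradients of a 2D field,
    e.g. a depth map or one channel of the flow field."""
    dy = np.abs(np.diff(field, axis=0))  # vertical differences
    dx = np.abs(np.diff(field, axis=1))  # horizontal differences
    return dx.mean() + dy.mean()
```

A constant field has zero smoothness penalty, while a field that increases by one per pixel along each row gives a penalty of 1 under this definition.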
==Architecture==
SfM-Net consists of two neural networks:
* The motion network estimates camera motion, object motion, and object masks.
* The structure network estimates depth, which can be used to build a point cloud.
Both networks follow a Conv-Deconv (U-Net) structure with skip connections.
See the figure in the paper for more details.
===Motion Network===
The input to the motion network is a pair of video frames \(I_t\) and \(I_{t+1}\), concatenated along the channel dimension into a tensor of shape \(380 \times 128 \times 6\).
From this, the motion network predicts the following:
* Camera rotation and translation: \(\{R_t^c, t_t^c \}\)
* \(K\) motion masks: \(m_t^k\) for \(k=1,...,K\)
* \(K\) object motions: \(\{R_t^k, t_t^k \}\) for \(k=1,...,K\)
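
The input tensor can be formed by stacking the two RGB frames along the channel axis. The sketch below assumes each frame is stored with the axis order given in the text (\(380 \times 128 \times 3\)); the helper name is hypothetical.

```python
import numpy as np

def make_motion_input(frame_t, frame_t1):
    """Concatenate two RGB frames along the last (channel) axis."""
    if frame_t.shape != frame_t1.shape:
        raise ValueError("both frames must have the same shape")
    return np.concatenate([frame_t, frame_t1], axis=-1)

# Example with placeholder frames of the size stated in the text.
frame_t = np.zeros((380, 128, 3), dtype=np.float32)
frame_t1 = np.ones((380, 128, 3), dtype=np.float32)
pair = make_motion_input(frame_t, frame_t1)  # shape (380, 128, 6)
```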
===Structure Network===
The goal of the structure network is to estimate depth \(d_t \in [0,\infty)^{w \times h} \).
==References==