SfM-Net: Learning of Structure and Motion from Video: Difference between revisions

Revision as of 19:22, 2 June 2020

David

@@ Line 10: / Line 10: @@
 ==Method==
+The authors first build two networks to estimate the following at time \(t\):
+* Image depth \(d_t \in [0,\infty)^{w \times h} \).
+* Camera rotation and translation: \(\{R_t^c, t_t^c \}\)
+* K motion masks: \(m_t^k\) for \(k=1,...,K\)
+* K object motions: \(\{R_t^k, t_t^k \}\)
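+Given these estimates, the pipeline proceeds as follows:
+* Use the depth to generate a point cloud.
+* Transform the point cloud according to the object motions, weighted by the motion masks.
+* Transform the point cloud according to the camera motion.
+* Project the transformed points back onto the image plane to compute optical flow, then use the flow to warp the image.
+* Repeat with the roles of \(I_{t+1}\) and \(I_{t}\) swapped for backward consistency.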
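The steps above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with a single focal length `f` and principal point `(cx, cy)`, and it rotates objects about the camera origin for simplicity. All function names are hypothetical.

```python
import numpy as np

def pixel_grid(h, w):
    """Return pixel coordinates (u, v), each of shape (h, w)."""
    return np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))

def backproject(depth, f, cx, cy):
    """Lift a depth map (h, w) to a point cloud (h, w, 3)."""
    u, v = pixel_grid(*depth.shape)
    x = (u - cx) / f * depth
    y = (v - cy) / f * depth
    return np.stack([x, y, depth], axis=-1)

def apply_object_motions(points, masks, rotations, translations):
    """Add the mask-weighted displacement of each of the K rigid motions."""
    moved = points.copy()
    for m, R, t in zip(masks, rotations, translations):
        delta = points @ R.T + t - points   # displacement under motion k
        moved += m[..., None] * delta       # weight by mask m_t^k
    return moved

def apply_camera_motion(points, R_c, t_c):
    """Apply the camera rotation and translation to every point."""
    return points @ R_c.T + t_c

def project(points, f, cx, cy):
    """Project 3D points (h, w, 3) to pixel coordinates (h, w, 2)."""
    z = np.maximum(points[..., 2], 1e-6)    # guard against division by zero
    u = f * points[..., 0] / z + cx
    v = f * points[..., 1] / z + cy
    return np.stack([u, v], axis=-1)

def optical_flow(depth, masks, obj_R, obj_t, cam_R, cam_t, f, cx, cy):
    """Flow = projected position after motion minus original pixel position."""
    pts = backproject(depth, f, cx, cy)
    pts = apply_object_motions(pts, masks, obj_R, obj_t)
    pts = apply_camera_motion(pts, cam_R, cam_t)
    u, v = pixel_grid(*depth.shape)
    return project(pts, f, cx, cy) - np.stack([u, v], axis=-1)
```

The resulting flow field can then be used to warp one frame toward the other. The backward pass reuses the same functions with the frames swapped.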
+===Supervision/Loss Function===
+They apply several forms of supervision:
+* Self-supervision: Minimize distance between the reference and the warped image
+* Spatial smoothness priors: Penalize the \(L_1\) norm of the gradients of the optical flow field, depth, and motion maps
+* Forward-backward consistency constraints: Run the model backwards in time (from \(I_{t+1}\) to \(I_t\)) and penalize inconsistency between the depth \(d_{t+1}\) and \(d_t\)
+* Supervising depth: Minimize the error between the estimated and ground-truth depth
+* Supervising camera motion: Minimize the error between the estimated and ground-truth camera motion
+* Supervising optical flow and object motion: Minimize the error between the estimated and ground-truth optical flow and object motion on synthetic datasets
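Two of the terms above, the self-supervised photometric term and the smoothness prior, can be sketched as follows. This is a hedged illustration: the choice of an \(L_1\) distance for the photometric term, the function names, and the weights in `total_loss` are assumptions, not values taken from the paper.

```python
import numpy as np

def photometric_loss(reference, warped):
    """Mean absolute difference between the reference and the warped image."""
    return float(np.mean(np.abs(reference - warped)))

def smoothness_loss(field):
    """L1 penalty on the horizontal and vertical finite differences of a 2D map."""
    dx = field[:, 1:] - field[:, :-1]
    dy = field[1:, :] - field[:-1, :]
    return float(np.mean(np.abs(dx)) + np.mean(np.abs(dy)))

def supervised_loss(estimate, ground_truth):
    """Mean absolute error against ground truth (depth, camera motion, or flow)."""
    return float(np.mean(np.abs(estimate - ground_truth)))

def total_loss(reference, warped, depth, w_photo=1.0, w_smooth=0.1):
    """Weighted sum of the two unsupervised terms; the weights are hypothetical."""
    return w_photo * photometric_loss(reference, warped) + w_smooth * smoothness_loss(depth)
```

The supervised terms are only added when ground truth is available, which is why they are kept as a separate function here.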
 ==Architecture==
+SfM-Net consists of two neural networks:
+* The motion network estimates camera motion, object motion, and object masks.
+* The structure network estimates depth, which can be used to generate a point cloud.
+Both networks follow a Conv-Deconv (U-Net) structure with skip connections.
+See the architecture figure in the paper for more details.
+===Motion Network===
+The input to the motion network is a pair of video frames \(I_t\) and \(I_{t+1}\), concatenated along the channel dimension into a tensor of shape \(380 \times 128 \times 6\).
+From this, the motion network predicts the following:
+* Camera rotation and translation: \(\{R_t^c, t_t^c \}\)
+* K motion masks: \(m_t^k\) for \(k=1,...,K\)
+* K object motions: \(\{R_t^k, t_t^k \}\)
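The input and output shapes can be made concrete with a small sketch. It assumes RGB frames stored in the axis order used in the text (width, height, channels) and rotations represented as \(3 \times 3\) matrices; the container `MotionOutputs` and both function names are hypothetical and only illustrate the shapes involved.

```python
from dataclasses import dataclass
import numpy as np

def make_motion_input(frame_t, frame_t1):
    """Concatenate two RGB frames along the channel axis: (w, h, 3) x 2 -> (w, h, 6)."""
    if frame_t.shape != frame_t1.shape:
        raise ValueError("both frames must have the same shape")
    return np.concatenate([frame_t, frame_t1], axis=-1)

@dataclass
class MotionOutputs:
    cam_R: np.ndarray   # camera rotation, shape (3, 3)
    cam_t: np.ndarray   # camera translation, shape (3,)
    masks: np.ndarray   # K motion masks, shape (K, w, h)
    obj_R: np.ndarray   # K object rotations, shape (K, 3, 3)
    obj_t: np.ndarray   # K object translations, shape (K, 3)

    def has_valid_shapes(self, K, w, h):
        """Check that every output has the shape listed above."""
        return (
            self.cam_R.shape == (3, 3)
            and self.cam_t.shape == (3,)
            and self.masks.shape == (K, w, h)
            and self.obj_R.shape == (K, 3, 3)
            and self.obj_t.shape == (K, 3)
        )
```

For example, two frames of shape `(380, 128, 3)` give an input of shape `(380, 128, 6)`, matching the tensor described above.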
+===Structure Network===
+The goal of the structure network is to estimate depth \(d_t \in [0,\infty)^{w \times h} \).
 ==References==