Essential Matrix: Difference between revisions

Latest revision as of 18:59, 6 May 2020

An essential matrix, denoted \(\displaystyle \mathbf{E}\), is a \(\displaystyle 3 \times 3\) matrix relating camera parameters.
You can compute the essential matrix based on features matches between two images.
Using the essential matrix, you can extract the relative rotation and translation between two cameras.

Given feature points \(\displaystyle \mathbf{x}\) and \(\displaystyle \mathbf{x'}\) from two images, the essential matrix satisfies the equation \(\displaystyle \mathbf{x}'^T \mathbf{E} \mathbf{x} = 0\)

Much of this is from An Investigation of the Essential Matrix by Richard Hartley^[1].

Background and Derivation

A pinhole camera with \(\displaystyle 3 \times 4\) projection matrix \(\displaystyle P = K(R \mid -RT)\) takes points \(\displaystyle \mathbf{x} = (x, y, z)^T\) and projects them to \(\displaystyle \mathbf{u} = (u, v, w)^T = \mathbf{R}(\mathbf{x} - \mathbf{t})\). Here, the notation \(\displaystyle (R \mid -RT)\) represents a \(\displaystyle 3 \times 3\) matrix \(\displaystyle R\) concatenated with a \(\displaystyle 3 \times 1\) matrix \(\displaystyle -RT\) to form a \(\displaystyle 3 \times 4\) matrix.

We now consider two cameras: Camera 1 is at the origin of world space (or it's object space) \(\displaystyle P = (I | 0)\). Camera 2 is displaced with some rotation \(\displaystyle R\) and translation \(\displaystyle -RT\), \(\displaystyle P' = (R | -RT)\).
Any point \(\displaystyle \mathbf{u} = (u,v,w)^T\) in camera 1 is represented by an epipolar line in camera 2.
Under camera 2, the position of camera 1 is \(\displaystyle -RT\) and \(\displaystyle P' (u,v,w,0)^T = R\mathbf{u}\) is somewhere on this epipolar line. Thus the line can be calculated by taking the cross product between the camera origin and \(\displaystyle \mathbf{u}\). \[ (p,q,r)^T = RT \times R\mathbf{u} = R(T \times \mathbf{u}) = R[T]_{\times} \mathbf{u} \] Now the line is represented by \(\displaystyle \{(u',v',w') \mid pu' + qv' + rw' = 0\}\), i.e. all points orthogonal to \(\displaystyle (p,q,r)^T\).

Given a vector \(\displaystyle \mathbf{t}\), the matrix form of its cross product is:
\(\displaystyle [\mathbf{t}]_{\times} = \begin{bmatrix} 0 & -t_z & t_y\\ t_z & 0 & -t_x\\ -t_y & t_x & 0 \end{bmatrix} \)

\(\displaystyle [\mathbf{t}]_{\times} \mathbf{u} = \mathbf{t} \times \mathbf{u}\)
This matrix is skew-symmetric. I.e. \(\displaystyle [\mathbf{t}]^T_{\times} = -[\mathbf{t}]_{\times}\)

Now if \(\displaystyle \mathbf{u}'\) is a feature point from camera 2 matching point \(\displaystyle \mathbf{u}\) from camera 1, then it must lie on this epipolar line.
Thus \(\displaystyle \mathbf{u}' \in \{(u',v',w') \mid pu' + qv' + rw' = 0\} \implies \mathbf{u}'^T R[T]_{\times} \mathbf{u} = 0\). Now \(\displaystyle Q = R[T]_{\times}\) is the essential matrix.

Given 8 or more correspondence points between camera 1 and camera 2, you can solve for \(\displaystyle Q\) using the Wikipedia: Eight-point algorithm

Properties

A \(\displaystyle 3 \times 3\) matrix is an essential matrix iff two of its singular values are equal and the third value is \(\displaystyle 0\)

Proof

See Bartoli and Olsen^[2].

The essential matrix is defined only up to a scale. I.e. if \(\displaystyle \mathbf{u}'^T Q \mathbf{u} = 0\) then \(\displaystyle \mathbf{u}'^T (\lambda Q) \mathbf{u} = 0\)
- To extract scale, you need to have an object of known size or know the distance between the cameras.

Calculating the Essential Matrix from two images

Copied from section 5 of Hartley^[1].

Here we will focus on calculating the essential matrix given 8 or more points. It is possible to calculate the essential matrix using 7 points using a non-linear equation if your correspondences are very accurate and are not linearly dependent. For this, see section 5.1 of Hartley^[1].

We assume we have list of correspondences between the two images, \(\displaystyle \{\mathbf{u}_i\}\) and \(\displaystyle \{\mathbf{u}_i'\}\).
This can be build by extracting features (e.g. ORB, SIFT, SURF) and creating matches.
Each feature is of the form \(\displaystyle \mathbf{u}_i = (u, v, 1)\).
Here, \(\displaystyle u \in [0, W]\) and \(\displaystyle v \in [0, H]\) are pixel positions within the image. Ideally, each set of correspondences should be independently normalized such that the origin of each set is the centroid of all points and the mean distance is \(\displaystyle \sqrt{2}\). This can be done with a single matrix for each set of points.

For each correspondence \(\displaystyle \mathbf{u}_i\) and \(\displaystyle \mathbf{u}'_i\)., we get the equation \(\displaystyle \mathbf{u}_i'^T Q \mathbf{u}_i = 0\).

This system of equations is linear in the entries of \(\displaystyle Q\) and can be rewritten as \(\displaystyle A\mathbf{x} = 0\) where \(\displaystyle \mathbf{x}\) contains the entries of \(\displaystyle Q\).
Here, \(\displaystyle \mathbf{x} = \begin{pmatrix} q_{11} \\ q_{12} \\ q_{13} \\ q_{21} \\ q_{22} \\ q_{23} \\ q_{31} \\ q_{32} \\ q_{33} \\ \end{pmatrix}\) and each row of \(\displaystyle A\) is \(\displaystyle \mathbf{a_i} = \begin{pmatrix} u'_1 u_1 \\ u'_1 u_2 \\ u'_1 \\ u'_2 u_1 \\ u'_2 u_2 \\ u'_2 \\ u'_3 u_1 \\ u'_3 u_2 \\ 1 \\ \end{pmatrix} \)
Here \(\displaystyle A\) is an \(\displaystyle n \times 9\) matrix (where \(\displaystyle n=8\) if using 8 points).

The goal is to minimize \(\displaystyle \Vert A\mathbf{x} \Vert \) such that \(\displaystyle \Vert \mathbf{x} \Vert = 1\)

Solution

First take the SVD of A: \(\displaystyle A = UDV^T\)
- \(\displaystyle U\) is \(\displaystyle 8 \times 8\), \(\displaystyle D\) is \(\displaystyle 8 \times 9\) diagonal matrix, and \(\displaystyle V^T\) is a \(\displaystyle 9 \times 9\) matrix.
Now \(\displaystyle x = V_j\), the \(\displaystyle j\)-th column of \(\displaystyle V\). Reshape this to get \(\displaystyle Q_{est}\).
In practice, this may not be rank 2 so we take the another SVD \(\displaystyle Q_{est}=U diag(r,s,t) V^T\)
Zero out the third singular value to get a final estimate
\(\displaystyle Q' = U diag(r,s,0) V^T\)

Determining rotation \(\displaystyle \mathbf{R}\) and translation \(\displaystyle \mathbf{t}\)

Copied from section 3 or Hartley^[1].

Theorem

A \(\displaystyle 3 \times 3\) real matrix can be factored into a product of a rotation matrix \(\displaystyle R\) and a non-zero skew symmetric matrix \(\displaystyle S\) iff \(\displaystyle Q\) is two equal non-zero singular values and one zero singular value.

Let the singular value decomposition of our essential matrix \(\displaystyle Q\) be \(\displaystyle U D V^T\) where \(\displaystyle D = \operatorname{diag}(k, k, 0)\).
Let \(\displaystyle E = \begin{pmatrix} 0 & 1 & 0\\ -1 & 0 & 0\\ 0 & 0 & 1 \end{pmatrix}\) and \(\displaystyle Z = \begin{pmatrix} 0 & -1 & 0\\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}\)
Then we have the following:

\(\displaystyle S = V Z V^T\) or \(\displaystyle -V Z V^T\)
\(\displaystyle R = U E V^T\) or \(\displaystyle U E^T V^T\)
\(\displaystyle Q = RS\)

Notes

Here, \(\displaystyle R\) is your rotation and \(\displaystyle S = [T]_{\times}\)
\(\displaystyle T = V (0, 0, 1)^T\), the third column of \(\displaystyle V\) or third row of \(\displaystyle V^T\)
- Note that this only gives you the direction of the translation. The magnitude is not determined.
Some sources such as Wikipedia use \(\displaystyle [T]_{\times} = U Z U^T\) and \(\displaystyle T = U (0, 0, 1)^T\).
- This is equivalent to \(\displaystyle RT\) in our notation.
Since \(\displaystyle V\) is orthonormal, this gives \(\displaystyle \Vert T \Vert = 1\)
Note \(\displaystyle -RT = -UEV^T V(0,0,1)^T = -U(0,0,1)^T\)

We have two possibilities for \(\displaystyle R\) and \(\displaystyle T\):

\(\displaystyle R = U E V^T\) or \(\displaystyle U E^T V^T\)
\(\displaystyle T = V (0, 0, 1)^T\) or \(\displaystyle -V (0, 0, 1)^T\)

This gives us 4 possibilities for \(\displaystyle P'\)

\(\displaystyle P' = (UEV^T \mid -U(0,0,1)^T\)
\(\displaystyle P' = (UEV^T \mid U(0,0,1)^T\)
\(\displaystyle P' = (UE^TV^T \mid -U(0,0,1)^T\)
\(\displaystyle P' = (UE^TV^T \mid U(0,0,1)^T\)

For planar images, only one of these 4 options is feasible. You can determine which one is feasibly using triangulation with one of your points.

3D points

See Wikipedia: Essential_matrix

Fundamental Matrix

The fundamental matrix is a generalization of the essential matrix which also takes into account the calibration of the camera.

Resources

References

↑ ^1.0 ^1.1 ^1.2 ^1.3 An Investigation of the Essential Matrix by Richard Hartley
↑ 3D Computer Vision Lecture 16 by Bartoli and Olsen

[hartley-1] 1.0 ^1.1 ^1.2 ^1.3 An Investigation of the Essential Matrix by Richard Hartley

[lecture16-2] 3D Computer Vision Lecture 16 by Bartoli and Olsen

[1]

[2]

@@ Line 6: / Line 6: @@
 the essential matrix satisfies the equation <math>\mathbf{x}'^T \mathbf{E} \mathbf{x} = 0</math>
+Much of this is from [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.7518 An Investigation of the Essential Matrix] by Richard Hartley<ref name="hartley">[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.7518 An Investigation of the Essential Matrix] by Richard Hartley</ref>.
 ==Background and Derivation==
 [[File: Epipolar_geometry.svg | link=Wikipedia | thumb | 400px | [[Wikipedia: Epipolar Geometriy]] ]]
-Much of this section is from <ref name="hartley">[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.64.7518 An Investigation of the Essential Matrix] by Richard Hartley</ref>
+A pinhole camera with <math>3 \times 4</math> projection matrix <math>P = K(R \mid -RT)</math> takes points <math>\mathbf{x} = (x, y, z)^T</math> and projects them to <math>\mathbf{u} = (u, v, w)^T = \mathbf{R}(\mathbf{x} - \mathbf{t})</math>. Here, the notation <math>(R \mid -RT)</math> represents a <math>3 \times 3</math> matrix <math>R</math> concatenated with a <math>3 \times 1</math> matrix <math>-RT</math> to form a <math>3 \times 4</math> matrix.
-A pinhole camera with <math>3 \times 4</math> projection matrix <math>P = K(R | -RT)</math> takes points <math>\mathbf{x} = (x, y, z)^T</math> and projects them to <math>\mathbf{u} = (u, v, w)^T = \mathbf{R}(\mathbf{x} - \mathbf{t})</math>.
 We now consider two cameras:
 Camera 1 is at the origin of world space (or it's object space) <math>P = (I | 0)</math>.
-Camera 2 is displaced with some rotation <math>R</math> and translation <math>R</math>, <math>P' = (R | -RT)</math>.<br>
+Camera 2 is displaced with some rotation <math>R</math> and translation <math>-RT</math>, <math>P' = (R | -RT)</math>.<br>
 Any point <math>\mathbf{u} = (u,v,w)^T</math> in camera 1 is represented by an epipolar line in camera 2.<br>
 Under camera 2, the position of camera 1 is <math>-RT</math> and <math>P' (u,v,w,0)^T = R\mathbf{u}</math> is somewhere on this epipolar line.
@@ Line 37: / Line 36: @@
 * This matrix is skew-symmetric. I.e. <math>[\mathbf{t}]^T_{\times} = -[\mathbf{t}]_{\times}</math>
-Now if <math>\mathbf{u}'</math> is a feature point from camera 2 matching point <math>\mathbf{u}</math>, then it must lie on this epipolar line.
+Now if <math>\mathbf{u}'</math> is a feature point from camera 2 matching point <math>\mathbf{u}</math> from camera 1, then it must lie on this epipolar line.<br>
 Thus <math>\mathbf{u}' \in \{(u',v',w') \mid pu' + qv' + rw' = 0\} \implies \mathbf{u}'^T R[T]_{\times} \mathbf{u} = 0</math>.
 Now <math>Q = R[T]_{\times}</math> is the essential matrix.
@@ Line 48: / Line 47: @@
 See Bartoli and Olsen<ref name="lecture16">[http://igt.ip.uca.fr/~ab/Classes/DIKU-3DCV2/Handouts/Lecture16.pdf 3D Computer Vision Lecture 16 by Bartoli and Olsen]</ref>.
 }}
+* The essential matrix is defined only up to a scale. I.e. if <math>\mathbf{u}'^T Q \mathbf{u} = 0</math> then <math>\mathbf{u}'^T (\lambda Q) \mathbf{u} = 0</math>
+** To extract scale, you need to have an object of known size or know the distance between the cameras.
 ==Calculating the Essential Matrix from two images==
-===Planar Images===
+{{main | Wikipedia:Eight-point algorithm}}
+Copied from section 5 of Hartley<ref name="hartley"/>.
+Here we will focus on calculating the essential matrix given 8 or more points.
+It is possible to calculate the essential matrix using 7 points using a non-linear equation if
+your correspondences are very accurate and are not linearly dependent.
+For this, see section 5.1 of Hartley<ref name="hartley"/>.
+We assume we have list of correspondences between the two images, <math>\{\mathbf{u}_i\}</math> and <math>\{\mathbf{u}_i'\}</math>.<br>
+This can be build by extracting features (e.g. ORB, SIFT, SURF) and creating matches.<br>
+Each feature is of the form <math>\mathbf{u}_i = (u, v, 1)</math>.
+Here, <math>u \in [0, W]</math> and <math>v \in [0, H]</math> are pixel positions within the image.
+Ideally, each set of correspondences should be independently normalized such that the origin of each set is the centroid of all points and the mean distance is <math>\sqrt{2}</math>.
+This can be done with a single matrix for each set of points.
-===Spherical Images===
+For each correspondence <math>\mathbf{u}_i</math> and <math>\mathbf{u}'_i</math>.,
-Here we assume an equirectangular projection.
+we get the equation <math>\mathbf{u}_i'^T Q \mathbf{u}_i = 0</math>.
+This system of equations is linear in the entries of <math>Q</math> and can be
+rewritten as <math>A\mathbf{x} = 0</math> where <math>\mathbf{x}</math> contains the entries of <math>Q</math>.<br>
+Here,
+<math>\mathbf{x} = \begin{pmatrix}
+q_{11} \\ q_{12} \\ q_{13} \\
+q_{21} \\ q_{22} \\ q_{23} \\
+q_{31} \\ q_{32} \\ q_{33} \\
+\end{pmatrix}</math>
+and each row of <math>A</math> is
+<math>\mathbf{a_i} = \begin{pmatrix}
+u'_1 u_1 \\ u'_1 u_2 \\ u'_1 \\
+u'_2 u_1 \\ u'_2 u_2 \\ u'_2 \\
+u'_3 u_1 \\ u'_3 u_2 \\ 1 \\
+\end{pmatrix}
+</math><br>
+Here <math>A</math> is an <math>n \times 9</math> matrix (where <math>n=8</math> if using 8 points).
+The goal is to minimize <math>\Vert A\mathbf{x} \Vert </math> such that <math>\Vert \mathbf{x} \Vert = 1</math>
+;Solution
+* First take the SVD of A: <math>A = UDV^T</math>
+** <math>U</math> is <math>8 \times 8</math>, <math>D</math> is <math>8 \times 9</math> diagonal matrix, and <math>V^T</math> is a <math>9 \times 9</math> matrix.
+* Now <math>x = V_j</math>, the <math>j</math>-th column of <math>V</math>. Reshape this to get <math>Q_{est}</math>.
+* In practice, this may not be rank 2 so we take the another SVD <math>Q_{est}=U diag(r,s,t) V^T</math>
+* Zero out the third singular value to get a final estimate
+*: <math>Q' = U diag(r,s,0) V^T</math>
 ==Determining rotation <math>\mathbf{R}</math> and translation <math>\mathbf{t}</math>==
+Copied from section 3 or Hartley<ref name="hartley"/>.
+;Theorem
+A <math>3 \times 3</math> real matrix can be factored into a product of a rotation matrix <math>R</math> and a non-zero skew symmetric matrix <math>S</math> iff <math>Q</math> is two equal non-zero singular values and one zero singular value.
+Let the singular value decomposition of our essential matrix <math>Q</math> be <math>U D V^T</math> where <math>D = \operatorname{diag}(k, k, 0)</math>.<br>
+Let <math>E = \begin{pmatrix}
+& 1 & 0\\
+-1 & 0 & 0\\
+& 0 & 1
+\end{pmatrix}</math>
+and
+<math>Z = \begin{pmatrix}
+& -1 & 0\\
+& 0 & 0 \\
+& 0 & 0
+\end{pmatrix}</math><br>
+Then we have the following:
+* <math>S = V Z V^T</math> or <math>-V Z V^T</math>
+* <math>R = U E V^T</math> or <math>U E^T V^T</math>
+* <math>Q = RS</math>
+;Notes
+* Here, <math>R</math> is your rotation and <math>S = [T]_{\times}</math>
+* <math>T = V (0, 0, 1)^T</math>, the third column of <math>V</math> or third row of <math>V^T</math>
+** Note that this only gives you the direction of the translation. The magnitude is not determined.
+* Some sources such as Wikipedia use <math>[T]_{\times} = U Z U^T</math> and <math>T = U (0, 0, 1)^T</math>.
+** This is equivalent to <math>RT</math> in our notation.
+* Since <math>V</math> is orthonormal, this gives <math>\Vert T \Vert = 1</math>
+* Note <math>-RT = -UEV^T V(0,0,1)^T = -U(0,0,1)^T</math>
+We have two possibilities for <math>R</math> and <math>T</math>:
+* <math>R = U E V^T</math> or <math>U E^T V^T</math>
+* <math>T = V (0, 0, 1)^T</math> or <math>-V (0, 0, 1)^T</math>
+This gives us 4 possibilities for <math>P'</math>
+* <math>P' = (UEV^T \mid -U(0,0,1)^T</math>
+* <math>P' = (UEV^T \mid U(0,0,1)^T</math>
+* <math>P' = (UE^TV^T \mid -U(0,0,1)^T</math>
+* <math>P' = (UE^TV^T \mid U(0,0,1)^T</math>
+For planar images, only one of these 4 options is feasible.
+You can determine which one is feasibly using triangulation with one of your points.
 ==3D points==
 See [[Wikipedia: Essential_matrix]]
+==Fundamental Matrix==
+The fundamental matrix is a generalization of the essential matrix which also takes into account the calibration of the camera.
 ==Resources==
 * [[Wikipedia: Essential_matrix]]
+* [http://robotics.stanford.edu/~birch/projective/node20.html stanford essential and fundamental matricies]
+* [https://github.com/darknight1900/books/blob/master/Multiple%20View%20Geometry%20in%20Computer%20Vision%20(Second%20Edition).pdf Multiple View Geometry in Computer Vision by Hartley and Zisserman]
 ==References==