Geometric Computer Vision
Latest revision as of 13:51, 24 June 2021
Notes for CMSC733 Classical and Deep Learning Approaches for Geometric Computer Vision taught by Prof. Yiannis Aloimonos.
Convolution and Correlation
See Convolutional neural network.
Traditionally, fixed filters are used instead of learned filters.
Edge Detection
Two ways to detect edges:
- Difference operators
- Models
Image Gradients
- Angle is given by \(\displaystyle \theta = \operatorname{atan2}\left(\frac{\partial f}{\partial y}, \frac{\partial f}{\partial x}\right)\)
- Edge strength is given by \(\displaystyle \left\Vert (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}) \right\Vert\)
Sobel operator is another way to approximate derivatives:
\(\displaystyle
s_x =
\frac{1}{8}
\begin{bmatrix}
-1 & 0 & 1\\
-2 & 0 & 2\\
-1 & 0 & 1
\end{bmatrix}
\) and
\(\displaystyle
s_y =
\frac{1}{8}
\begin{bmatrix}
1 & 2 & 1\\
0 & 0 & 0\\
-1 & -2 & -1
\end{bmatrix}
\)
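A minimal sketch of how the two Sobel kernels above could be applied at a single pixel to obtain edge strength and angle. The function names and the tiny test image are made up for illustration, and the kernels are applied as a correlation (no kernel flip), which is an assumption about the convention.

```python
import math

# Sobel kernels as written above; the 1/8 factor is applied in correlate3x3.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def correlate3x3(image, kernel, row, col):
    """Correlate a 3x3 kernel with the image patch centred at (row, col)."""
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += kernel[dr + 1][dc + 1] * image[row + dr][col + dc]
    return total / 8.0


def sobel_gradient(image, row, col):
    """Return (edge strength, gradient angle in radians) at one pixel."""
    gx = correlate3x3(image, SOBEL_X, row, col)
    gy = correlate3x3(image, SOBEL_Y, row, col)
    return math.hypot(gx, gy), math.atan2(gy, gx)


# Vertical step edge: dark on the left, bright on the right.
step_edge = [[0, 0, 10, 10] for _ in range(4)]
strength, angle = sobel_gradient(step_edge, 1, 1)
```

On this step edge the gradient points along the x axis, so the angle is zero and only the x kernel responds.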
You can smooth a function by convolving with a Gaussian kernel.
- Laplacian of Gaussian
- Edges are zero crossings of the Laplacian of Gaussian convolved with the signal.
Effect of the Gaussian kernel size \(\displaystyle \sigma\):
- Large sigma detects large scale edges.
- Small sigma detects fine features.
- Scale Space
- With larger sigma, the first derivative peaks (i.e. zero crossings of the second derivative) can move.
- Close-by peaks can also merge as the scale increases.
- An edge will never split.
Subtraction
- Create a smoothed image by convolving with a Gaussian
- Subtract the smoothed image from the original image.
Finding lines in an image
Option 1: Search for lines everywhere.
Option 2: Use Hough transform voting.
Hough Transform
Duality between lines in image space and points in Hough space.
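Equation for a line in normal form: \(\displaystyle d = x \cos \theta + y \sin \theta\), where \(\displaystyle d\) is the distance of the line from the origin and \(\displaystyle \theta\) is the angle of its normal.
for all pixels (x, y) on an edge:
    for all (d, theta):
        if d == x*cos(theta) + y*sin(theta):
            H(d, theta) += 1
d, theta = argmax(H)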
- Hough transform handles noise better than least squares.
- Each edge pixel votes for a curve in the Hough space (a sinusoid under the \(\displaystyle (d, \theta)\) parametrization). A line in the image space corresponds to the point where these curves intersect.
- Extensions
- Use image gradient.
- Give more votes for stronger edges
- Change sampling to give more/less resolution
- Same procedure with circles, squares, or other shapes.
- Hough transform for curves
Works with any curve that can be written in a parametric form.
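The voting procedure above can be sketched in plain Python as follows. This is only an illustration: the function name, the number of angle bins and the distance bin width are assumptions, and instead of testing every (d, theta) pair it computes d directly for each sampled theta.

```python
import math
from collections import Counter


def hough_lines(edge_pixels, num_angles=180, d_step=1.0):
    """Vote in (d, theta) space and return the strongest line."""
    votes = Counter()
    for x, y in edge_pixels:
        for k in range(num_angles):
            theta = math.pi * k / num_angles
            d = x * math.cos(theta) + y * math.sin(theta)
            votes[(round(d / d_step), k)] += 1  # quantise d into a bin
    (d_bin, k), count = votes.most_common(1)[0]
    return d_bin * d_step, math.pi * k / num_angles, count


# Edge pixels on the horizontal line y = 3.
pixels = [(x, 3) for x in range(0, 100, 5)]
d, theta, count = hough_lines(pixels)
```

For these pixels the winning bin is the line with normal angle 90 degrees at distance 3, and it collects one vote per pixel.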
Finding corners
\(\displaystyle C = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y\\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \)
Consider \(\displaystyle C = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \)
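A small sketch of how the matrix \(C\) could be used, assuming the usual interpretation of its eigenvalues: two large eigenvalues indicate a corner, one large eigenvalue an edge, and two small eigenvalues a flat region. The function names and the threshold value are hypothetical.

```python
import math


def structure_tensor(gradients):
    """Sum I_x^2, I_x I_y and I_y^2 over a window of (I_x, I_y) pairs."""
    sxx = sum(ix * ix for ix, iy in gradients)
    sxy = sum(ix * iy for ix, iy in gradients)
    syy = sum(iy * iy for ix, iy in gradients)
    return sxx, sxy, syy


def eigenvalues_symmetric_2x2(a, b, c):
    """Eigenvalues of [[a, b], [b, c]], largest first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean + radius, mean - radius


def classify_window(gradients, threshold=1.0):
    """Label a window as 'corner', 'edge' or 'flat' (threshold is illustrative)."""
    lam1, lam2 = eigenvalues_symmetric_2x2(*structure_tensor(gradients))
    if lam2 > threshold:
        return "corner"
    if lam1 > threshold:
        return "edge"
    return "flat"


edge_label = classify_window([(1, 0)] * 4)
corner_label = classify_window([(1, 0), (0, 1), (1, 0), (0, 1)])
flat_label = classify_window([(0, 0)] * 4)
```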
Theoretical model of an eye
- Pick a point in space and the light rays passing through it.
- Pinhole cameras
- Abstractly, a box with a small hole in it.
Homography
Cross-ratio
Solving for homographies
Given 4 correspondences, you can solve for a homography.
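As a rough sketch, the homography can be found from four correspondences by fixing the last entry of the matrix to 1 and solving the resulting 8x8 linear system. Fixing that entry is an assumption (it fails if the true entry is zero), and all function names here are made up for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def homography_from_4_points(src, dst):
    """Return the 3x3 homography mapping each src point to its dst point."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs) + [1.0]  # last entry fixed to 1
    return [h[0:3], h[3:6], h[6:9]]


def apply_homography(h, point):
    """Map a 2D point through h and divide out the homogeneous coordinate."""
    x, y = point
    u, v, w = (row[0] * x + row[1] * y + row[2] for row in h)
    return u / w, v / w


# Unit square mapped to a square scaled by 2 and shifted by (1, 1).
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(1, 1), (3, 1), (3, 3), (1, 3)]
H = homography_from_4_points(src, dst)
centre = apply_homography(H, (0.5, 0.5))
```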
Point and line duality
Points on the image correspond to lines/rays in 3D space.
The cross product of two image points gives the line through them, which corresponds to a plane through the camera center in 3D.
Calibration
Central Projection
\(\displaystyle \begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0\\ 0 & f & 0 & 0\\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ z_s \\ 1 \end{bmatrix} \)
Properties of matrix P
\(\displaystyle P = K R [I_3 | -C]\)
- \(\displaystyle K\) is the upper-triangular calibration matrix which has 5 degrees of freedom.
- \(\displaystyle R\) is the rotation matrix with 3 degrees of freedom.
- \(\displaystyle C\) is the camera center with 3 degrees of freedom.
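A minimal sketch of projecting a 3D point with \(P = K R [I_3 | -C]\). The function names and the example values of \(K\), \(R\) and \(C\) are illustrative assumptions, not values from the course.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def project(K, R, C, X):
    """Project the 3D point X with P = K R [I | -C] and dehomogenise."""
    shifted = [x - c for x, c in zip(X, C)]  # [I | -C] applied to (X, 1)
    u, v, w = mat_vec(K, mat_vec(R, shifted))
    return u / w, v / w


K = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]  # focal length 2
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
pixel = project(K, R, (0.2, 0.0, 0.0), (1.0, 0.5, 4.0))
```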
Calibration
- Estimate matrix P using scene points and images.
- Estimate interior parameters and exterior parameters.
Zhang's Approach
Stereo
Parallel Cameras
Consider two cameras, where the right camera is shifted by baseline \(\displaystyle d\) along the x-axis compared to the left camera.
Then for a point \(\displaystyle (x,y,z)\),
\(\displaystyle x_l = \frac{x}{z}\)
\(\displaystyle y_l = \frac{y}{z}\)
\(\displaystyle x_r = \frac{x-d}{z}\)
\(\displaystyle y_r = \frac{y}{z}\).
Thus, the stereo disparity is the ratio of baseline over depth: \(\displaystyle x_l - x_r = \frac{d}{z}\).
With known baseline and correspondence, you can solve for depth \(\displaystyle z\).
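A short sketch of recovering depth from disparity for the parallel setup above, assuming normalized image coordinates (unit focal length) as in the equations. The function name is hypothetical.

```python
def depth_from_disparity(x_left, x_right, baseline):
    """Return depth z from x_l - x_r = baseline / z."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return baseline / disparity


# Point (1, 0.5, 4) seen with baseline 0.2: x_l = 0.25, x_r = 0.2.
z = depth_from_disparity(0.25, 0.2, 0.2)
```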
Epipolar Geometry
- Warp the two images such that the epipolar lines become horizontal.
- This is called rectification.
Each epipole is the image of the other camera's center, i.e. where one camera sees the other.
Rectification
- Consider the left camera to be the center of a coordinate system.
- Let \(\displaystyle e_1\) be the axis to the right camera, \(\displaystyle e_2\) be the up axis, and take \(\displaystyle e_3 = e_1 \times e_2\).
Random dot stereograms
Shows that recognition is not needed for stereo.
Similarity Construct
- Do matching by computing the sum of squared differences (SSD) of a patch along the epipolar lines.
- The ordering of pixels along an epipolar line may not be the same between left and right images.
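A minimal sketch of SSD matching along one scanline, assuming the images are already rectified so that the epipolar line is the same row in both images. The function names, window size and toy intensity rows are illustrative.

```python
def ssd(patch_a, patch_b):
    """Sum of squared differences between two equally sized patches."""
    return sum((a - b) ** 2 for a, b in zip(patch_a, patch_b))


def best_disparity(left_row, right_row, col, half_window, max_disparity):
    """Find the shift that best matches the left patch in the right row."""
    width = 2 * half_window + 1
    patch = left_row[col - half_window: col + half_window + 1]
    best, best_cost = 0, float("inf")
    for disparity in range(max_disparity + 1):
        start = col - disparity - half_window  # match lies to the left
        if start < 0:
            break
        cost = ssd(patch, right_row[start: start + width])
        if cost < best_cost:
            best, best_cost = disparity, cost
    return best


right_row = [0, 0, 5, 9, 5, 0, 0, 0, 0, 0]
left_row = [0, 0, 0, 0, 5, 9, 5, 0, 0, 0]  # same pattern shifted by 2
disparity = best_disparity(left_row, right_row, col=5, half_window=1, max_disparity=4)
```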
Correspondence + Segmentation
- Assumption: Similar pixels in a segmentation map will probably have the same disparity.
- For each shift, find the connected components.
- For each point p, pick the largest connected component.
Essential Matrix
The essential matrix satisfies \(\displaystyle \hat{p}'^T E \hat{p} = 0\) where \(\displaystyle \hat{p} = M^{-1}p\) and \(\displaystyle \hat{p}'=M'^{-1}p'\). Substituting gives \(\displaystyle p'^T F p = 0\) with the fundamental matrix \(\displaystyle F=M'^{-T} E M^{-1}\).
- Properties
- The matrix is 3x3.
- If \(\displaystyle F\) is the fundamental matrix of (P, P') then \(\displaystyle F^T\) is the fundamental matrix of (P', P).
- The fundamental matrix gives the equation of the epipolar line in the other image.
- \(\displaystyle l'=Fp\) and \(\displaystyle l=F^T p'\)
- For any p, the epipolar line \(\displaystyle l'=Fp\) contains the epipole \(\displaystyle e'\). This is because every epipolar line passes through the image of the other camera's center.
- \(\displaystyle e'^T F = 0\) and \(\displaystyle Fe=0\)
Fundamental matrix song (https://www.youtube.com/watch?v=DgGV3l82NTk)
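The epipolar line and epipole properties can be checked numerically with a small sketch. It assumes identity intrinsics and a pure sideways translation, so the fundamental and essential matrices coincide and equal a skew-symmetric matrix; all names and values are illustrative.

```python
def skew(t):
    """Cross-product matrix [t]_x of a 3-vector."""
    t1, t2, t3 = t
    return [[0.0, -t3, t2], [t3, 0.0, -t1], [-t2, t1, 0.0]]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


baseline = 0.2
F = skew((-baseline, 0.0, 0.0))  # right camera shifted along x, no rotation

p_left = (0.25, 0.125, 1.0)   # image of (1, 0.5, 4) in the left camera
p_right = (0.2, 0.125, 1.0)   # image of the same point in the right camera

line_right = mat_vec(F, p_left)           # l' = F p
residual = dot(p_right, line_right)       # p'^T F p, should vanish
epipole_check = mat_vec(F, (1.0, 0.0, 0.0))  # F e, with e at infinity along x
```

Here the epipolar line has the form (0, b, c), a horizontal line, which is what rectification aims for.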
Structure from Motion
Optical Flow
Only Translation
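\(\displaystyle u = \frac{-U + xW}{Z} = \frac{W}{Z}(-\frac{U}{W} + x) = \frac{W}{Z}(x - x_0)\)
\(\displaystyle v = \frac{-V + yW}{Z} = \frac{W}{Z}(-\frac{V}{W} + y) = \frac{W}{Z}(y - y_0)\)
Here \(\displaystyle (U, V, W)\) are the components of the translational velocity, \(\displaystyle Z\) is the depth, and \(\displaystyle (x_0, y_0) = (U/W, V/W)\).
The direction of the flow at \(\displaystyle (x, y)\) is:
\(\displaystyle \frac{v}{u} = \frac{y-y_0}{x-x_0}\)
The flow vectors all emanate from the focus of expansion \(\displaystyle (x_0, y_0)\).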
If you walk towards a point in the image, then all pixels will flow away from that point.
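A small numeric sketch of the translational flow, assuming \(u = (-U + xW)/Z\) and \(v = (-V + yW)/Z\) with \((U, V, W)\) the translational velocity and \(Z\) the depth. It checks that the flow vector is parallel to the direction from the focus of expansion to the pixel. Function names and values are hypothetical.

```python
def translational_flow(x, y, depth, velocity):
    """Flow (u, v) at image point (x, y) for a purely translating camera."""
    U, V, W = velocity
    return (-U + x * W) / depth, (-V + y * W) / depth


def focus_of_expansion(velocity):
    """Image point (U/W, V/W) from which the flow radiates (requires W != 0)."""
    U, V, W = velocity
    return U / W, V / W


velocity = (1.0, 2.0, 4.0)
u, v = translational_flow(1.0, 1.0, 8.0, velocity)
x0, y0 = focus_of_expansion(velocity)

# 2D cross product between the flow and the direction away from the FOE.
parallel_check = u * (1.0 - y0) - v * (1.0 - x0)
```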
Only Rotation
Flow component along the x axis for rotational velocity \(\displaystyle (\alpha, \beta, \gamma)\): \(\displaystyle u = \alpha x y - \beta (1 + x^2) - \gamma y\)
Rotation around the x or y axis leads to hyperbolic flow lines; rotation around the optical (z) axis leads to concentric circles. The rotational flow is independent of depth.
Both translation and rotation
The flow field will not resemble any of the above patterns.
The velocity of p
Moving plane
For a point p on a plane with unit normal vector n, the set of all points on the plane is \(\displaystyle \{x | (x \cdot n) = d\}\) where \(\displaystyle d=(p \cdot n)\) is the distance to the plane from the origin along the normal vector.
Scaling ambiguity
Depth can be recovered up to a scale factor.
Non-Linear Least Squares Approach
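Minimize the function: \(\displaystyle \sum_i [d^2 (p_i', Fp_i) + d^2 (p_i, F^T p_i')] \) where \(\displaystyle d(p, l)\) is the distance from the point \(\displaystyle p\) to the line \(\displaystyle l\).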
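A sketch of evaluating this cost for a given \(F\), read as the symmetric epipolar distance: each point is compared with the epipolar line induced by its match. The optimizer itself is not shown; function names and the example matrix are assumptions.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def point_line_distance(point, line):
    """Distance from homogeneous point (x, y, w) to line (a, b, c)."""
    a, b, c = line
    x, y, w = point
    return abs(a * x + b * y + c * w) / (abs(w) * math.hypot(a, b))


def symmetric_epipolar_cost(F, matches):
    """Sum of squared point-to-epipolar-line distances in both images."""
    Ft = transpose(F)
    total = 0.0
    for p, p_prime in matches:
        total += point_line_distance(p_prime, mat_vec(F, p)) ** 2
        total += point_line_distance(p, mat_vec(Ft, p_prime)) ** 2
    return total


F = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.2], [0.0, -0.2, 0.0]]
exact_cost = symmetric_epipolar_cost(F, [((0.25, 0.5, 1.0), (0.2, 0.5, 1.0))])
noisy_cost = symmetric_epipolar_cost(F, [((0.25, 0.5, 1.0), (0.2, 0.6, 1.0))])
```

In the noisy match the right point is 0.1 away from its epipolar line in each image, so the cost is 2 x 0.1^2.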
Locating the epipoles
3D Reconstruction
Triangulation
If the cameras are intrinsically and extrinsically calibrated, then the reconstructed point P is taken as the midpoint of the common perpendicular to the two back-projected rays.
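A minimal sketch of midpoint triangulation, assuming each camera provides a ray given by its center and a direction in a common world frame. The function names are hypothetical, and the rays must not be parallel.

```python
def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def midpoint_triangulation(c1, d1, c2, d2):
    """Midpoint of the common perpendicular of rays c1 + s d1 and c2 + t d2."""
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b  # zero only for parallel rays
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    on_ray1 = [p + s * q for p, q in zip(c1, d1)]
    on_ray2 = [p + t * q for p, q in zip(c2, d2)]
    return [(p + q) / 2.0 for p, q in zip(on_ray1, on_ray2)]


# Two rays that both pass through the point (1, 0.5, 4).
point = midpoint_triangulation(
    (0.0, 0.0, 0.0), (1.0, 0.5, 4.0),
    (0.2, 0.0, 0.0), (0.8, 0.5, 4.0),
)
```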
Point reconstruction
Given a point X in \(\displaystyle \mathbb{R}^3\)
- \(\displaystyle x=MX\) is the point in image 1
- \(\displaystyle x'=M'X\) is the point in image 2
\(\displaystyle M = \begin{bmatrix} m_1^T \\ m_2^T \\ m_3^T \end{bmatrix} \)
\(\displaystyle x \times MX = 0\)
\(\displaystyle x' \times M'X = 0\)
implies
\(\displaystyle AX=0\) where \(\displaystyle A = \begin{bmatrix}
x m_3^T - m_1^T\\
y m_3^T - m_2^T\\
x' m_3'^T - m_1'^T\\
y' m_3'^T - m_2'^T\\
\end{bmatrix}\)
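A sketch of building the matrix \(A\) above and solving for the point. As a simplification, this version fixes the last homogeneous coordinate of \(X\) to 1 and solves the resulting 4x3 system by least squares through the normal equations, rather than computing a null vector; that swap and all function names are assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def triangulate(M1, M2, x1, x2):
    """Recover (X, Y, Z) from two 3x4 cameras and two image points."""
    rows = []
    for M, (x, y) in ((M1, x1), (M2, x2)):
        rows.append([x * m3 - m1 for m1, m3 in zip(M[0], M[2])])  # x m_3^T - m_1^T
        rows.append([y * m3 - m2 for m2, m3 in zip(M[1], M[2])])  # y m_3^T - m_2^T
    lhs = [row[:3] for row in rows]
    rhs = [-row[3] for row in rows]  # homogeneous coordinate fixed to 1
    normal = [[sum(r[i] * r[j] for r in lhs) for j in range(3)] for i in range(3)]
    target = [sum(r[i] * b for r, b in zip(lhs, rhs)) for i in range(3)]
    return solve_linear(normal, target)


M1 = [[1, 0, 0, 0.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0]]
M2 = [[1, 0, 0, -0.2], [0, 1, 0, 0.0], [0, 0, 1, 0.0]]
X = triangulate(M1, M2, (0.25, 0.125), (0.2, 0.125))
```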
Reconstruction for intrinsically calibrated cameras
- Compute the essential matrix E using normalized points
- Select \(\displaystyle M=[I|0]\) and \(\displaystyle M'=[R|T]\); then \(\displaystyle E=[T]_\times R\).
- Find T and R using SVD of E.
Reconstruction ambiguity: projective
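\(\displaystyle x_i = MX_i = (MH_P^{-1})(H_P X_i)\)
- Replacing the cameras by \(\displaystyle MH_P^{-1}\) and the points by \(\displaystyle H_P X_i\) gives a different reconstruction with the same images. The 3D model is only determined up to some homography \(\displaystyle H_P\).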
- If you know 5 points in 3D, you can rectify the 3D model.
- Projective Reconstruction Theorem
- We can compute a projective reconstruction of a scene from 2 views.
- We don't have to know the calibration or poses.
Affine Reconstruction
Aperture Problem
When looking through a small viewport (locally) at a large object, you cannot tell which direction it is moving.
See the barber pole illusion (https://www.opticalillusion.net/optical-illusions/the-barber-pole-illusion/).
Brightness Constancy Equation
Brightness Constraint Equation
Let \(\displaystyle E(x,y,t)\) be the irradiance and \(\displaystyle u(x,y),v(x,y)\) the components of optical flow.
Then \(\displaystyle E(x + u \delta t, y + v \delta t, t + \delta t) = E(x,y,t)\).
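Assume \(\displaystyle E(x(t), y(t), t) = constant\). Differentiating with respect to \(\displaystyle t\) gives \(\displaystyle E_x u + E_y v + E_t = 0\).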
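Differentiating the constancy assumption gives one linear equation, \(E_x u + E_y v + E_t = 0\), in two unknowns, which is the aperture problem in algebraic form: only the flow component along the brightness gradient (the normal flow) is determined. A minimal sketch, with a hypothetical function name:

```python
def normal_flow(e_x, e_y, e_t):
    """Flow component along the gradient implied by E_x u + E_y v + E_t = 0."""
    grad_sq = e_x * e_x + e_y * e_y
    if grad_sq == 0:
        return None  # no gradient, so the constraint says nothing about the flow
    scale = -e_t / grad_sq
    return scale * e_x, scale * e_y


# Horizontal gradient of 2 with temporal change -4.
u_n, v_n = normal_flow(2.0, 0.0, -4.0)
constraint = 2.0 * u_n + 0.0 * v_n + (-4.0)
```

Any flow of the form (2, v) satisfies the same constraint here, so the vertical component cannot be recovered from this pixel alone.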
Structure from Motion Pipeline
Calibration
- Step 1: Feature Matching
Fundamental Matrix and Essential Matrix
- Step 2: Estimate Fundamental Matrix F
- \(\displaystyle x_i'^T F x_i = 0\)
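- Stacking one constraint per correspondence gives a homogeneous system \(\displaystyle Af=0\), where \(\displaystyle f\) holds the nine entries of \(\displaystyle F\). Use SVD to solve it: \(\displaystyle A=U \Sigma V^T\). The solution is the last column of \(\displaystyle V\), i.e. the right singular vector with the smallest singular value.
- Essential Matrix: \(\displaystyle E = K^T F K\), assuming both views share the calibration matrix \(\displaystyle K\).
- Fundamental matrix has 7 degrees of freedom, essential matrix has 5 degrees of freedom.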
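A short sketch of the conversion \(E = K^T F K\), assuming both views share the same calibration matrix \(K\). The helper names and the example matrices are illustrative only.

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def essential_from_fundamental(F, K):
    """Compute E = K^T F K for a shared calibration matrix K."""
    return mat_mul(transpose(K), mat_mul(F, K))


K = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
F = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.1], [0.0, -0.1, 0.0]]
E = essential_from_fundamental(F, K)
```

In this example the result is the skew-symmetric matrix of a sideways translation of 0.2, consistent with \(E=[T]_\times R\) for \(R = I\).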
Estimating Camera Pose
Estimating Camera Pose from E
Pose P has 6 DoF. Do SVD of the essential matrix to get 4 potential solutions.
You need to do triangulation to select from the 4 solutions.
Visual Filters
Have filters which detect humans, cars,...
Model-based Recognition
You have a model for each object to recognize.
The recognition system identifies objects from the model database.
Pose Clustering
Indexing
Texture
Synthesis
The goal is to generate additional texture samples from an existing texture sample.
Filters
- Difference of Gaussians (DoG)
- Gabor Filters
Lecture Schedule
- 02/23/2021 - Pinhole camera model
- 02/25/2021 - Camera calibration
- 03/09/2021 - Optical flow, motion fields
- 03/11/2021 - Structure from motion: epipolar constraints, essential matrix, triangulation
- 03/25/2021 - Multiple topics (image motion)
- 03/30/2021 - Independent object motion (flow fields)
- 04/01/2021 - Project 3 Discussion
- 04/15/2021 - Shape from shading, reflectance map
- 04/20/2021 - Shape from shading, normal map
- 04/22/2021 - Recognition, classification
- 04/27/2021 - Visual filters, classification
- 04/29/2021 - Midterm Exam clarifications
- 05/04/2021 - Model-based Recognition
- 05/06/2021 - Texture