Notes for CMSC733 Classical and Deep Learning Approaches for Geometric Computer Vision taught by [http://legacydirs.umiacs.umd.edu/~yiannis/ Prof. Yiannis Aloimonos].


* [http://prg.cs.umd.edu/cmsc733 Course webpage]
==Convolution and Correlation==
See [[Convolutional neural network]]. 
Traditionally, fixed filters are used instead of learned filters.
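The sketch below (an illustration with NumPy/SciPy, not course code; the image is a random placeholder) contrasts correlation and convolution with a fixed Sobel filter: convolution flips the kernel, so for the anti-symmetric Sobel kernel the two results differ by a sign.
<pre>
import numpy as np
from scipy.signal import convolve2d, correlate2d

# Fixed (hand-designed) Sobel filter for horizontal derivatives.
sx = np.array([[-1., 0., 1.],
               [-2., 0., 2.],
               [-1., 0., 1.]]) / 8.0

img = np.random.rand(64, 64)   # placeholder image (assumed input)

# Correlation slides the kernel as-is; convolution flips it in both axes.
corr = correlate2d(img, sx, mode='same', boundary='symm')
conv = convolve2d(img, sx, mode='same', boundary='symm')

# Flipping the Sobel kernel negates it, so here conv == -corr.
print(np.allclose(conv, -corr))
</pre>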
==Edge Detection==
Two ways to detect edges:
* Difference operators
* Models
===Image Gradients===
* Angle is given by <math>\theta = \operatorname{atan2}\left(\frac{\partial f}{\partial y}, \frac{\partial f}{\partial x}\right)</math>
* Edge strength is given by <math>\left\Vert (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}) \right\Vert</math>
The Sobel operator is another way to approximate the image derivatives:<br>
<math>
s_x =
\frac{1}{8}
\begin{bmatrix}
-1 & 0 & 1\\
-2 & 0 & 2\\
-1 & 0 & 1
\end{bmatrix}
</math> and
<math>
s_y =
\frac{1}{8}
\begin{bmatrix}
1 & 2 & 1\\
0 & 0 & 0\\
-1 & -2 & -1
\end{bmatrix}
</math>
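A minimal sketch (assumed, not from the notes; the image and threshold are placeholders) of computing edge strength and orientation with the Sobel kernels above:
<pre>
import numpy as np
from scipy.signal import correlate2d

sx = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]) / 8.0
sy = np.array([[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]]) / 8.0

img = np.random.rand(128, 128)                            # placeholder image

gx = correlate2d(img, sx, mode='same', boundary='symm')   # approximates df/dx
gy = correlate2d(img, sy, mode='same', boundary='symm')   # approximates df/dy

strength = np.hypot(gx, gy)    # edge strength ||(df/dx, df/dy)||
theta = np.arctan2(gy, gx)     # edge angle

edges = strength > 0.1         # arbitrary threshold (assumption)
</pre>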
You can smooth a function by convolving with a Gaussian kernel.
;Laplacian of Gaussian
* Edges are zero crossings of the Laplacian of Gaussian convolved with the signal.
Effect of the Gaussian kernel size <math>\sigma</math>:
* Large sigma detects large scale edges.
* Small sigma detects fine features.
;Scale Space
* With larger sigma, the peaks of the first derivative (equivalently, the zero crossings of the second derivative) can move.
* Close-by peaks can also merge as the scale increases.
* An edge will never split.
===Subtraction===
* Create a smoothed image by convolving with a Gaussian
* Subtract the smoothed image from the original image (see the sketch below).
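A possible sketch of this subtraction scheme, assuming SciPy's Gaussian filter (the image, sigma, and threshold are placeholders):
<pre>
import numpy as np
from scipy.ndimage import gaussian_filter

img = np.random.rand(128, 128)              # placeholder image

smoothed = gaussian_filter(img, sigma=2.0)  # convolve with a Gaussian
detail = img - smoothed                     # original minus smoothed

# Large |detail| responses indicate edges and fine structure.
edges = np.abs(detail) > 0.05               # arbitrary threshold (assumption)
</pre>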
===Finding lines in an image===
Option 1: Search for lines everywhere. 
Option 2: Use Hough transform voting.
===Hough Transform===
Duality between lines in image space and points in Hough space. 
Equation for a line in the <math>(d, \theta)</math> parameterization: <math>d = x \cos \theta + y \sin \theta</math>. 
<pre>
H = zeros over all (d, theta) bins
for all pixels (x, y) on an edge:
  for all theta bins:
    d = x*cos(theta) + y*sin(theta)
    H(round(d), theta) += 1
d, theta = argmax(H)
</pre>
* Hough transform handles noise better than least squares.
* Each edge pixel votes for a ''curve'' in Hough space (a sinusoid in the <math>(d, \theta)</math> parameterization). The line in image space corresponds to the point where these curves intersect in Hough space (see the sketch below).
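A minimal NumPy sketch of the voting procedure above; the bin sizes and the single-peak readout are assumptions, not from the notes:
<pre>
import numpy as np

def hough_lines(edge_mask, n_theta=180):
    """Accumulate votes in (d, theta) space for a binary edge image."""
    h, w = edge_mask.shape
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    d_max = int(np.ceil(np.hypot(h, w)))
    H = np.zeros((2 * d_max + 1, n_theta), dtype=np.int64)

    ys, xs = np.nonzero(edge_mask)
    for x, y in zip(xs, ys):
        # Each edge pixel votes along the curve d = x*cos(theta) + y*sin(theta).
        ds = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        H[ds + d_max, np.arange(n_theta)] += 1

    # Strongest line = accumulator peak.
    d_idx, t_idx = np.unravel_index(np.argmax(H), H.shape)
    return d_idx - d_max, thetas[t_idx]
</pre>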
;Extensions
* Use image gradient.
* Give more votes for stronger edges
* Change sampling to give more/less resolution
* Same procedure with circles, squares, or other shapes.
;Hough transform for curves
Works with any curve that can be written in a parametric form.
===Finding corners===
<math>
C = \begin{bmatrix}
\sum I_x^2 & \sum I_x I_y\\
\sum I_x I_y & \sum I_y^2
\end{bmatrix}
</math>
where the sums are taken over a local window around the pixel.
Consider the diagonalized form <math>
C = \begin{bmatrix}
\lambda_1 & 0 \\
0 & \lambda_2
\end{bmatrix}
</math>
* Both <math>\lambda_1</math> and <math>\lambda_2</math> large: corner.
* One eigenvalue large, one small: edge.
* Both small: flat region (see the corner-score sketch below).
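Below is a hedged sketch of a Harris-style corner score built from this matrix; the window size, the constant k, and the threshold are assumptions, not necessarily what the course uses:
<pre>
import numpy as np
from scipy.ndimage import sobel, uniform_filter

img = np.random.rand(128, 128)           # placeholder image

Ix = sobel(img, axis=1)                   # derivative along x (columns)
Iy = sobel(img, axis=0)                   # derivative along y (rows)

# Entries of C, summed over a local window around each pixel.
Sxx = uniform_filter(Ix * Ix, size=5)
Syy = uniform_filter(Iy * Iy, size=5)
Sxy = uniform_filter(Ix * Iy, size=5)

# Corner response: large when both eigenvalues of C are large
# (det(C) - k * trace(C)^2, the Harris formulation).
k = 0.04                                  # assumed constant
R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
corners = R > 0.01 * R.max()              # arbitrary threshold (assumption)
</pre>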
===Theoretical model of an eye===
* Pick a point in space and consider the light rays passing through it.
* Pinhole cameras
** Abstractly, a box with a small hole in it.
==Homography==
===Cross-ratio===
See [[Wikipedia: Cross-ratio]].
===Solving for homographies===
Given 4 point correspondences, you can solve for a homography: each correspondence gives 2 equations for the 8 degrees of freedom of <math>H</math> (see the sketch below). 
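A minimal DLT sketch for estimating <math>H</math> from 4 or more correspondences via the SVD; the function name and the (x, y) point format are assumptions:
<pre>
import numpy as np

def fit_homography(src, dst):
    """Estimate H with dst ~ H @ src (homogeneous), from >= 4 correspondences.

    src, dst: (N, 2) arrays of matching points (assumed inputs).
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(A)
    # The null vector of A (last right-singular vector) gives the entries of H.
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]
</pre>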
===Point and line duality===
Points in the image correspond to lines/rays in 3D space. 
The cross product of two image points (in homogeneous coordinates) gives the image line through them, which corresponds to the plane containing the two rays and the camera center.
==Calibration==
===Central Projection===
<math>
\begin{bmatrix}
u \\ v \\ w
\end{bmatrix}
=
\begin{bmatrix}
f & 0 & 0 & 0\\
0 & f & 0 & 0\\
0 & 0 & 1 & 0
\end{bmatrix}
\begin{bmatrix}
x_s \\ y _s \\ z_s \\ 1
\end{bmatrix}
</math>
where the image coordinates are recovered as <math>(u/w, v/w)</math>.
===Properties of matrix P===
<math>P = K R [I_3 | -C]</math>
* <math>K</math> is the upper-triangular calibration matrix which has 5 degrees of freedom.
* <math>R</math> is the rotation matrix with 3 degrees of freedom.
* <math>C</math> is the camera center with 3 degrees of freedom.
In total, <math>P</math> has 11 degrees of freedom (see the composition sketch below).
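A quick sketch of composing <math>P</math> from <math>K</math>, <math>R</math>, and <math>C</math>; all numeric values below are made up for illustration:
<pre>
import numpy as np

# Assumed example values, for illustration only.
f = 800.0
K = np.array([[f, 0.0, 320.0],
              [0.0, f, 240.0],
              [0.0, 0.0, 1.0]])           # upper-triangular calibration matrix
R = np.eye(3)                              # rotation (world to camera)
C = np.array([0.1, 0.0, -2.0])             # camera center in world coordinates

# P = K R [I | -C]
P = K @ R @ np.hstack([np.eye(3), -C.reshape(3, 1)])

X = np.array([0.0, 0.0, 5.0, 1.0])         # homogeneous world point
u, v, w = P @ X
print(u / w, v / w)                        # image coordinates
</pre>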
===Calibration===
# Estimate the matrix P using known scene points and their images.
# Decompose P into interior (intrinsic) and exterior (extrinsic) parameters.
===Zhang's Approach===
==Stereo==
===Parallel Cameras===
Consider two cameras, where the right camera is shifted by a baseline <math>d</math> along the x-axis relative to the left camera. 
Then, assuming unit focal length, for a point <math>(x,y,z)</math>,
<math>x_l = \frac{x}{z}</math> 
<math>y_l = \frac{y}{z}</math> 
<math>x_r = \frac{x-d}{z}</math> 
<math>y_r = \frac{y}{z}</math>. 
Thus, the stereo disparity is the ratio of baseline over depth: <math>x_l - x_r = \frac{d}{z}</math>. 
With known baseline and correspondence, you can solve for depth <math>z</math>.
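A tiny sketch of recovering depth from disparity under this parallel-camera model (normalized image coordinates, i.e. unit focal length; the baseline and point values are made up):
<pre>
import numpy as np

baseline = 0.1                      # baseline d between the cameras (assumed)
x_left = np.array([0.25, 0.10])     # x-coordinates in the left image (assumed)
x_right = np.array([0.20, 0.08])    # corresponding x-coordinates in the right image

disparity = x_left - x_right        # x_l - x_r = d / z
depth = baseline / disparity        # solve for z
print(depth)
</pre>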
===Epipolar Geometry===
# Warp the two images such that the epipolar lines become horizontal.
# This is called rectification.
The ''epipoles'' are the image points where each camera sees the other camera's center.
===Rectification===
# Consider the left camera to be the center of a coordinate system.
# Let <math>e_1</math> be the axis pointing toward the right camera, <math>e_2</math> be an up axis orthogonal to <math>e_1</math>, and take <math>e_3 = e_1 \times e_2</math>.
===Random dot stereograms===
Random dot stereograms show that object recognition is not needed for stereo.
===Similarity Construct===
* Do matching by computing the sum of squared differences (SSD) of a patch along the epipolar lines (see the sketch below).
* The ordering of pixels along an epipolar line may not be the same between left and right images.
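A rough sketch of SSD matching along a horizontal epipolar line in rectified images; the patch size, search range, and the assumption that (row, col) is away from the image border are mine, not the course's:
<pre>
import numpy as np

def ssd_match(left, right, row, col, patch=5, max_disp=64):
    """Find the disparity at (row, col) of the left image by SSD along the epipolar line."""
    r = patch // 2                        # assumes (row, col) is away from the border
    ref = left[row - r:row + r + 1, col - r:col + r + 1]
    best_disp, best_ssd = 0, np.inf
    for d in range(max_disp):
        c = col - d                       # candidate column in the right image
        if c - r < 0:
            break
        cand = right[row - r:row + r + 1, c - r:c + r + 1]
        ssd = np.sum((ref - cand) ** 2)   # sum of squared differences
        if ssd < best_ssd:
            best_ssd, best_disp = ssd, d
    return best_disp
</pre>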
===Correspondence + Segmentation===
* Assumption: Similar pixels in a segmentation map will probably have the same disparity.
# For each shift, find the connected components.
# For each point <math>p</math>, pick the shift whose connected component containing <math>p</math> is largest.
===Essential Matrix===
The essential matrix satisfies <math>\hat{p}'^T E \hat{p} = 0</math> where <math>\hat{p} = M^{-1}p</math> and <math>\hat{p}'=M'^{-1}p'</math>.
The fundamental matrix is <math>F=M'^{-T} E M^{-1}</math>.
;Properties
* The matrix is 3x3.
* If <math>F</math> is the fundamental matrix of the pair (P, P'), then <math>F^T</math> is the fundamental matrix of (P', P).
* The fundamental matrix gives the equation of the epipolar line in the other image:
** <math>l'=Fp</math> and <math>l=F^T p'</math>
* For any <math>p</math>, the epipolar line <math>l'=Fp</math> contains the epipole <math>e'</math>, because every epipolar line passes through the image of the other camera's center.
** <math>e'^T F = 0</math> and <math>Fe=0</math> (verified in the sketch below)
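A short sketch checking these properties numerically; the fundamental matrix here is a random rank-2 placeholder, not real data:
<pre>
import numpy as np

F = np.random.rand(3, 3)                 # placeholder, then forced to rank 2
U, S, Vt = np.linalg.svd(F)
F = U @ np.diag([S[0], S[1], 0.0]) @ Vt

p = np.array([100.0, 50.0, 1.0])         # a point in image 1 (homogeneous, assumed)
l_prime = F @ p                          # epipolar line in image 2: l' = F p

e = Vt[-1]                               # right null vector: F e = 0  (epipole in image 1)
e_prime = U[:, -1]                       # left null vector: e'^T F = 0 (epipole in image 2)
print(np.allclose(F @ e, 0), np.allclose(e_prime @ F, 0))
</pre>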
[https://www.youtube.com/watch?v=DgGV3l82NTk Fundamental matrix song]
==Structure from Motion==
The motion field (optical flow) induced by camera motion can be analyzed by separating its translational and rotational components.
===Only Translation===
<math>u = \frac{-U + xW}{Z} = \frac{W}{Z}\left(-\frac{U}{W} + x\right) = \frac{W}{Z}(x - x_0)</math> 
<math>v = \frac{-V + yW}{Z} = \frac{W}{Z}\left(-\frac{V}{W} + y\right) = \frac{W}{Z}(y - y_0)</math>
The direction of the flow at each pixel is: 
<math>\frac{v}{u} = \frac{y-y_0}{x-x_0}</math> 
The flow vectors all emanate from the focus of expansion <math>(x_0, y_0) = (U/W, V/W)</math>. 
If you walk towards a point in the image, then all pixels will flow away from that point.
===Only Rotation===
For a rotation <math>(\alpha, \beta, \gamma)</math>, the horizontal component of the rotational flow is:
<math>u = \alpha x y - \beta (1 + x^2) - \gamma y</math>
Rotation around y or z axis leads to hyperbolas.
The rotational flow is independent of depth.
===Both translation and rotation===
The flow field will not resemble any of the above patterns.
===The velocity of p===
===Moving plane===
For a point <math>p</math> on a plane with normal vector <math>n</math>, the set of all points on the plane is <math>\{x \mid x \cdot n = d\}</math>, where <math>d = p \cdot n</math> is the distance from the origin to the plane along the normal.
===Scaling ambiguity===
Depth can be recovered up to a scale factor.
===Non-Linear Least Squares Approach===
Minimize the sum of squared distances between each point and its corresponding epipolar line:
<math>
\sum_i \left[ d^2(p_i', F p_i) + d^2(p_i, F^T p_i') \right]
</math>
where <math>d(p, l)</math> denotes the distance from point <math>p</math> to line <math>l</math>.
===Locating the epipoles===
==3D Reconstruction==
===Triangulation===
If the cameras are intrinsically and extrinsically calibrated, the two back-projected rays generally do not intersect exactly, so the reconstructed point P is taken as the midpoint of the common perpendicular between the rays.
===Point reconstruction===
Given a point <math>X \in \mathbb{R}^3</math>:
* <math>x=MX</math> is the point in image 1
* <math>x'=M'X</math> is the point in image 2
<math>
M = \begin{bmatrix}
m_1^T \\ m_2^T \\ m_3^T
\end{bmatrix}
</math>
<math>x \times MX = 0</math> 
<math>x \times M'X = 0</math> 
implies 
<math>AX=0</math> where <math>A = \begin{bmatrix}
x m_3^T - m_1^T\\
y m_3^T - m_2^T\\
x' m_3'^T - m_1'^T\\
y' m_3'^T - m_2'^T\\
\end{bmatrix}</math>
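A sketch of linear triangulation built directly from the matrix <math>A</math> above; the camera matrices and image points are assumed inputs:
<pre>
import numpy as np

def triangulate(M1, M2, p1, p2):
    """Linear triangulation: recover X from x = M1 X and x' = M2 X.

    M1, M2: 3x4 camera matrices; p1, p2: (x, y) image points (assumed inputs).
    """
    x, y = p1
    xp, yp = p2
    A = np.vstack([
        x * M1[2] - M1[0],
        y * M1[2] - M1[1],
        xp * M2[2] - M2[0],
        yp * M2[2] - M2[1],
    ])
    # X is the null vector of A (last right-singular vector).
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
</pre>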
===Reconstruction for intrinsically calibrated cameras===
# Compute the essential matrix <math>E</math> using normalized points.
# Select <math>M = [I \mid 0]</math> and <math>M' = [R \mid T]</math>; then <math>E = [T]_\times R</math>.
# Find <math>T</math> and <math>R</math> using the SVD of <math>E</math> (see the sketch below).
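A sketch of step 3 using the standard SVD decomposition of <math>E</math> into four candidate (R, T) pairs; the exact formulation (W matrix, sign fixes) is the textbook recipe, which may differ in detail from the course's derivation:
<pre>
import numpy as np

def pose_candidates(E):
    """Return the four candidate (R, T) pairs from an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    # Ensure proper rotations (determinant +1).
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    T = U[:, 2]
    return [(R1, T), (R1, -T), (R2, T), (R2, -T)]
</pre>
Triangulating a point with each candidate and keeping the one that places it in front of both cameras resolves the ambiguity.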
===Reconstruction ambiguity: projective===
<math>x_i = MX_i = (MH_P^{-1})(H_P X_i)</math>
* Even with the same images, a different choice of camera matrices gives a different reconstruction; the 3D model is changed by some 3D homography <math>H_P</math>.
* If you know the 3D positions of 5 points (in general position), you can rectify the projective reconstruction.
;Projective Reconstruction Theorem
* We can compute a projective reconstruction of a scene from 2 views.
* We don't have to know the calibration or poses.
===Affine Reconstruction===
==Aperture Problem==
When looking through a small viewport (i.e. locally) at a large moving object, you cannot tell which direction it is moving; only the motion component perpendicular to the local edge can be recovered. 
See [https://www.opticalillusion.net/optical-illusions/the-barber-pole-illusion/ the barber pole illusion]
===Brightness Constancy Equation===
Let <math>E(x,y,t)</math> be the irradiance and <math>u(x,y),v(x,y)</math> the components of optical flow. 
Then <math>E(x + u \delta t, y + v \delta t, t + \delta t) = E(x,y,t)</math>.
Assume <math>E(x(t), y(t), t) = \text{constant}</math> along the image motion. 
A first-order Taylor expansion gives the optical flow constraint equation <math>E_x u + E_y v + E_t = 0</math>, used in the sketch below.
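Using the constraint, one flow vector can be estimated per small window by least squares; the window size, derivative filters, and least-squares formulation below are assumptions for illustration, not necessarily the course's method:
<pre>
import numpy as np

def flow_in_window(E0, E1, r0, c0, win=7):
    """Estimate (u, v) in a win x win window from two consecutive frames E0, E1."""
    Ex = np.gradient(E0, axis=1)     # spatial derivative E_x
    Ey = np.gradient(E0, axis=0)     # spatial derivative E_y
    Et = E1 - E0                     # temporal derivative E_t

    sl = (slice(r0, r0 + win), slice(c0, c0 + win))
    A = np.stack([Ex[sl].ravel(), Ey[sl].ravel()], axis=1)  # one constraint per pixel
    b = -Et[sl].ravel()

    # Least squares for (u, v); an ill-conditioned A reflects the aperture problem.
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
</pre>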
==Structure from Motion Pipeline==
===Calibration===
# Step 1: Feature Matching
===Fundamental Matrix and Essential Matrix===
# Step 2: Estimate Fundamental Matrix F
#* <math>x_i'^T F x_i = 0</math>
#* Use SVD to solve for <math>x</math> from <math>Ax=0</math>: <math>A=U \Sigma V^T</math>. The solution is the last column of <math>V</math> (the right-singular vector for the smallest singular value), as in the sketch after this list.
#* Essential Matrix: <math>E = K^T F K</math>
#* '''Fundamental matrix has 7 degrees of freedom, essential matrix has 5 degrees of freedom'''
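A condensed sketch of step 2 (an un-normalized eight-point style estimate; in practice the points are usually normalized first, which is omitted here):
<pre>
import numpy as np

def estimate_F(pts1, pts2):
    """Estimate F from >= 8 correspondences so that x2^T F x1 = 0.

    pts1, pts2: (N, 2) arrays of matching pixel coordinates (assumed inputs).
    """
    x1, y1 = pts1[:, 0], pts1[:, 1]
    x2, y2 = pts2[:, 0], pts2[:, 1]
    # Each correspondence gives one row of A in A f = 0.
    A = np.stack([x2 * x1, x2 * y1, x2,
                  y2 * x1, y2 * y1, y2,
                  x1, y1, np.ones_like(x1)], axis=1)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)            # last right-singular vector
    # Enforce rank 2 (a fundamental matrix is singular).
    U, S, Vt2 = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt2
    return F / F[2, 2]
</pre>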
===Estimating Camera Pose===
To estimate the camera pose from <math>E</math>: the pose has 6 DoF (3 for rotation, 3 for translation). 
The SVD of the essential matrix gives 4 candidate solutions. 
Triangulate points and keep the solution that places them in front of both cameras.
==Visual Filters==
Build filters that detect specific object categories: humans, cars, and so on.
==Model-based Recognition==
You have a model for each object to recognize.<br>
The recognition system identifies objects from the model database.
===Pose Clustering===
===Indexing===
==Texture==
===Synthesis===
The goal is to generate additional texture samples from an existing texture sample.
===Filters===
* Difference of Gaussians (DoG)
* Gabor Filters
==Lecture Schedule==
* 02/23/2021 - Pinhole camera model
* 02/25/2021 - Camera calibration
* 03/09/2021 - Optical flow, motion fields
* 03/11/2021 - Structure from motion: epipolar constraints, essential matrix, triangulation
* 03/25/2021 - Multiple topics (image motion)
* 03/30/2021 - Independent object motion (flow fields)
* 04/01/2021 - Project 3 Discussion
* 04/15/2021 - Shape from shading, reflectance map
* 04/20/2021 - Shape from shading, normal map
* 04/22/2021 - Recognition, classification
* 04/27/2021 - Visual filters, classification
* 04/29/2021 - Midterm Exam clarifications
* 05/04/2021 - Model-based Recognition
* 05/06/2021 - Texture
==Projects==