Numerical Optimization
==Convergence Rates==
[https://en.wikipedia.org/wiki/Rate_of_convergence Wikipedia]<br>
A sequence <math>\{x_k\}</math> converges to <math>x^*</math> with order <math>q</math> and rate <math>L</math> if <math>\lim_{k \rightarrow \infty} \frac{|x_{k+1}-x^*|}{|x_{k}-x^*|^q}=L</math>.<br>
Iterative methods have the following convergence rates:
* If <math>q=1</math> and <math>L=1</math> we have sublinear convergence.
* If <math>q=1</math> and <math>L\in(0,1)</math> we have linear convergence.
* If <math>q=1</math> and <math>L=0</math> we have superlinear convergence.
* If <math>q=2</math> and <math>L < \infty</math> we have quadratic convergence.
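As a rough numerical illustration (not from the original article; the test function <math>f(x)=x^2-2</math>, the damping factor 0.1, and the iteration count are assumptions chosen for demonstration), the snippet below compares a linearly convergent fixed-point iteration with quadratically convergent Newton steps by printing the error ratios that estimate <math>L</math> for <math>q=1</math> and <math>q=2</math>.
<pre>
% Sketch: estimate convergence rates empirically for f(x) = x^2 - 2, root x* = sqrt(2).
xstar = sqrt(2);
x_fp = 1; x_nw = 1;                       % starting guesses
e_fp = zeros(1,4); e_nw = zeros(1,4);
for k = 1:4
    x_fp = x_fp - 0.1*(x_fp^2 - 2);       % damped fixed-point step: q = 1
    x_nw = x_nw - (x_nw^2 - 2)/(2*x_nw);  % Newton step: q = 2
    e_fp(k) = abs(x_fp - xstar);
    e_nw(k) = abs(x_nw - xstar);
end
disp(e_fp(2:end) ./ e_fp(1:end-1))        % approaches a constant L in (0,1): linear
disp(e_nw(2:end) ./ e_nw(1:end-1).^2)     % approaches a finite constant L: quadratic
</pre>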
==Line Search Methods==
Basic idea:<br>
* For each iteration:
** Find a direction <math>p</math>.
** Then find a step length <math>\alpha</math> which decreases <math>f</math>.
** Take a step <math>\alpha p</math>.
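For illustration only (a minimal sketch: the function handles f and grad_f and the Armijo constants below are assumptions, not part of this article), one such step with a backtracking line search might look like:
<pre>
% Sketch: one step of gradient descent with a backtracking (Armijo) line search.
% f is a function handle for the objective, grad_f for its gradient.
function x = line_search_step(f, grad_f, x)
    g = grad_f(x);
    p = -g;                    % descent direction
    alpha = 1; rho = 0.5; c = 1e-4;   % typical Armijo parameters (assumed)
    % Shrink alpha until the sufficient-decrease condition holds.
    while f(x + alpha*p) > f(x) + c*alpha*(g'*p)
        alpha = rho * alpha;
    end
    x = x + alpha*p;           % take the step
end
</pre>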
==Trust Region Methods==
Basic idea:<br>
* For each iteration:
** Assume a quadratic model of your objective function near a point.
** Find a region where you trust your model accurately represents your objective function.
** Take a step.
Variables:
* <math>f</math> is your objective function.
* <math>m_k</math> is your quadratic model at iteration k.
* <math>x_k</math> is your point at iteration k.
Your model is <math>m_k(p) = f_k + g_k^T p + \frac{1}{2}p^T B_k p</math><br>
where <math>g_k = \nabla f(x_k)</math> and <math>B_k</math> is a symmetric matrix.<br>
At each iteration, you solve a constrained optimization subproblem to find the best step <math>p</math>:<br>
<math>\min_{p \in \mathbb{R}^n} m_k(p)</math> such that <math>\Vert p \Vert \leq \Delta_k </math>.
===Cauchy Point Algorithms===
The Cauchy point is <math>p_k^c = \tau_k p_k^s</math><br>
where <math>p_k^s</math> minimizes the linear model in the trust region<br>
<math> p_k^s = \operatorname{argmin}_{p \in \mathbb{R}^n} f_k + g_k^Tp </math> s.t. <math>\Vert p \Vert \leq \Delta_k </math><br>
and <math>\tau_k</math> minimizes our quadratic model along the line <math>p_k^s</math>:<br>
<math>\tau_k = \operatorname{argmin}_{\tau \geq 0} m_k(\tau p_k^s)</math> s.t. <math>\Vert \tau p_k^s \Vert \leq \Delta_k </math><br>
This can be written explicitly as <math>p_k^c = - \tau_k \frac{\Delta_k}{\Vert g_k \Vert} g_k</math> where <math>\tau_k = 
\begin{cases}
1 & \text{if }g_k^T B_k g_k \leq 0;\\
\min(\Vert g_k \Vert ^3/(\Delta_k g_k^T B_k  g_k), 1) & \text{otherwise}
\end{cases}
</math>
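As a rough sketch (the names g, B, and Delta are assumptions for illustration), the explicit formula above translates directly into code:
<pre>
% Sketch: compute the Cauchy point from g = grad f(x_k), symmetric B = B_k,
% and trust-region radius Delta, following the explicit formula above.
function pc = cauchy_point(g, B, Delta)
    gBg = g' * B * g;
    if gBg <= 0
        tau = 1;                               % model decreases along -g to the boundary
    else
        tau = min(norm(g)^3 / (Delta * gBg), 1);
    end
    pc = -tau * (Delta / norm(g)) * g;         % step of length tau*Delta along -g
end
</pre>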
==Conjugate Gradient Methods==
[https://www.cs.cmu.edu/~15859n/RelatedWork/painless-conjugate-gradient.pdf Painless Conjugate Gradient]<br>
The goal is to solve <math>Ax=b</math> for symmetric positive definite (SPD) <math>A</math>, or equivalently <math>\min \phi(x)</math> where <math>\phi(x)=(1/2)x^T A x - b^Tx</math>.<br>
In exact arithmetic, the practical CG algorithm converges in at most <math>n</math> steps.
===Definitions===
Nonzero vectors <math>\{p_i\}</math> are conjugate w.r.t. an SPD matrix <math>A</math> if <math>p_i^T A p_j = 0</math> for <math>i \neq j</math>.
;Notes
* A set of nonzero mutually conjugate vectors <math>\{p_i\}</math> is linearly independent.
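As a quick numerical check (the particular 3x3 matrix below is only an assumed example), the eigenvectors of an SPD matrix are mutually conjugate with respect to it:
<pre>
% Check: eigenvectors of an SPD matrix A are A-conjugate.
A = [4 1 0; 1 3 1; 0 1 2];     % an arbitrary SPD example
[V, D] = eig(A);               % columns of V are eigenvectors
disp(V' * A * V)               % off-diagonal entries are (numerically) zero
</pre>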
===Algorithms===
Basic idea:<br>
* Find a conjugate direction <math>p_k</math>.
* Take the step size <math>\alpha_k</math> which minimizes <math>\phi</math> along <math>p_k</math>.
Practical CG method:<br>
{{ hidden | Code | 
Below is MATLAB code for the practical CG method.<br>
<pre>
tol = 1e-10;                 % stopping tolerance on the residual norm
x = x0;
r = A*x - b;                 % initial residual
p = -r;                      % first search direction: steepest descent
k = 1;
while (norm(r) > tol && k <= length(x))
    Ap = A*p;
    alpha = (r'*r)/(p'*Ap);  % exact minimizer of phi along p
    x = x + alpha*p;
    rn = r + alpha*Ap;       % updated residual
    beta = (rn'*rn)/(r'*r);  % coefficient for the next conjugate direction
    p = -rn + beta*p;
    r = rn;
    k = k + 1;
end
</pre>
}}
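As a usage sketch (the random SPD construction below is only for illustration), the snippet can be exercised as follows:
<pre>
% Build a small SPD system, then run the CG loop above on it.
n = 50;
M = randn(n);
A = M'*M + n*eye(n);      % SPD by construction
b = randn(n, 1);
x0 = zeros(n, 1);
% ... run the CG loop above; afterwards norm(A*x - b) should be below tol.
</pre>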
===Theorems===
Given a set of conjugate directions <math>\{p_0,...,p_{n-1}\}</math> we can generate a sequence of iterates <math>x_k</math> with<br>
<math> x_{k+1}=x_k + \alpha_k p_k</math> where <math>\alpha_k = -\frac{r_k^T p_k}{p_k^T A p_k}</math>, with <math>r_k = Ax_k - b</math>, minimizes <math>\phi</math> along the line <math>x_k + \alpha p_k</math>.<br>
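To see where <math>\alpha_k</math> comes from, set the directional derivative of <math>\phi</math> along <math>p_k</math> to zero:<br>
<math>
\frac{d}{d\alpha}\phi(x_k + \alpha p_k)
= p_k^T\big(A(x_k + \alpha p_k) - b\big)
= p_k^T r_k + \alpha\, p_k^T A p_k = 0
\quad\Longrightarrow\quad
\alpha_k = -\frac{r_k^T p_k}{p_k^T A p_k}.
</math>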
; Theorem: For any <math>x_0</math>, the sequence <math>x_k</math> converges to the solution <math>x^*</math> in at most <math>n</math> steps.
; Theorem: At step <math>k</math>, the residual <math>r_k</math> is orthogonal to <math>\{p_0,...,p_{k-1}\}</math>, and the current iterate <math>x_{k}</math> minimizes <math>\phi(x)</math> over the set <math>\{x_0 + \operatorname{span}(p_0,...,p_{k-1})\}</math>.
===Convergence Rate===
If <math>A</math> has eigenvalues <math>\lambda_1,...,\lambda_n</math>, the error after <math>k+1</math> steps can be estimated by<br>
<math>\Vert x_{k+1} - x^* \Vert_A^2 \leq \min_{P_k} \max_{1 \leq i \leq n} [1 + \lambda_i P_k(\lambda_i)]^2 \, \Vert x_0 - x^* \Vert_A^2</math><br>
where the minimum is taken over polynomials <math>P_k</math> of degree <math>k</math>. In particular, if <math>A</math> has only <math>r</math> distinct eigenvalues, CG converges in at most <math>r</math> iterations.
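As a numerical illustration of the distinct-eigenvalue consequence (the construction below is an assumption for demonstration, not from the article), CG applied to a matrix with only three distinct eigenvalues drives the residual down to roughly machine precision within three iterations:
<pre>
% Demo: CG needs at most r iterations when A has r distinct eigenvalues.
n = 30;
[Q, R] = qr(randn(n));                          % random orthogonal Q
lambda = [ones(10,1); 2*ones(10,1); 5*ones(10,1)];
A = Q * diag(lambda) * Q';                      % SPD with 3 distinct eigenvalues
b = randn(n, 1);
x = zeros(n, 1); r = A*x - b; p = -r;
for k = 1:3                                     % three CG iterations suffice
    Ap = A*p;
    alpha = (r'*r)/(p'*Ap);
    x = x + alpha*p;
    rn = r + alpha*Ap;
    p = -rn + ((rn'*rn)/(r'*r))*p;
    r = rn;
    fprintf('iter %d: residual = %.2e\n', k, norm(r));
end
</pre>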
==Resources==
* [https://link.springer.com/book/10.1007%2F978-0-387-40065-5 Numerical Optimization by Nocedal and Wright (2006)]