5,337
edits
Line 80: | Line 80: | ||
Instead of convexity, we use PL-condition (Polyak-Lojasiewicz, 1963): | Instead of convexity, we use PL-condition (Polyak-Lojasiewicz, 1963): | ||
For <math>w \in B</math>, <math>\Vert \nabla L(w) \Vert^2 \leq \mu L(w)</math> which implies exponential (linear) convergence of GD. | For <math>w \in B</math>, <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \leq \mu L(w)</math> which implies exponential (linear) convergence of GD. | ||
Tangent Kernels: | Tangent Kernels: | ||
Line 88: | Line 88: | ||
where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) | where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) | ||
; | ;Theorem 4.1 (Uniform conditioning implies PL* condition) | ||
If \(\lambda_{\min} (K(w)) \geq \mu \ | If \(\lambda_{\min} (K(w)) \geq \mu\) then \(\mu\text{-PL*}\) on \(B\). | ||
{{ hidden | Proof | | {{ hidden | Proof | | ||
<math> | <math> | ||
Line 96: | Line 96: | ||
&=\frac{1}{2}(F(w)=y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\ | &=\frac{1}{2}(F(w)=y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\ | ||
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\ | &\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\ | ||
&= \lambda_{\min}(K(w)) L(w) | &= \lambda_{\min}(K(w)) L(w)\\ | ||
&\geq \mu L(w) | |||
\end{aligned} | \end{aligned} | ||
</math> | </math> | ||
Line 122: | Line 123: | ||
}} | }} | ||
;Theorem 4.1 (Local PL* implies existence of a solution + fast convergence) | |||
Local PL implies existence of a solution + fast convergence | Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies <math display="inline">\mu</math>-PL condition around a ball <math>B(w_0, R)</math> with <math>R = \frac{2\sqrt{w\beta L(w_0)}{\mu}</math>. | ||
Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies <math display="inline">\mu</math>-PL condition | Then: | ||
* Existence of a solution: There exists a global minimum of <math>L</math>, <math>\mathbf{w}^* \in B(\mathbf{w}_0, R)</math> s.t. <math>F(\mathbf{w}^*)=\mathbf{y}</math>. | |||
* Convergence of GD: GD with step size <math>\eta = 1/\sup_{\mathbf{w}}\lambda_{\max}(H)</math> converges with exponential convergence rate: | |||
*: <math display="block">L(w_t) \leq \left(1-\kappa^{-1}(B(\mathbf{w}, R))\right)^t L(\mathbf{w}_0)</math> | |||
{{hidden | Convergence proof: | | |||
We want to prove: | We want to prove: |