Deep Learning



Instead of convexity, we use the PL condition (Polyak-Łojasiewicz, 1963):
For <math>w \in B</math>, <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>, which implies exponential (linear) convergence of GD.
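A quick sketch of why (assuming additionally that <math>L</math> is <math display="inline">\beta</math>-smooth, as in Theorem 4.2 below): one GD step with step size <math display="inline">\eta = 1/\beta</math> satisfies
<math display="block">
\begin{aligned}
L(w_{t+1}) &\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\beta \eta^2}{2} \Vert \nabla L(w_t) \Vert^2\\
&= L(w_t) - \frac{1}{2\beta} \Vert \nabla L(w_t) \Vert^2\\
&\leq \left(1 - \frac{\mu}{\beta}\right) L(w_t),
\end{aligned}
</math>
so <math>L(w_t) \leq (1 - \mu/\beta)^t L(w_0)</math>.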


Tangent Kernels:
<math display="block">K(w) := \nabla F(w) \nabla F(w)^T</math>
where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) is the Jacobian of the model map <math>F: \mathbb{R}^m \to \mathbb{R}^n</math>.
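As a concrete illustration (a minimal sketch, not from the source; the model, sizes, and seeds below are made up), the tangent kernel of a tiny network can be computed directly from a numerical Jacobian:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 3, 2, 4                  # data points, input dim, hidden width
X = rng.normal(size=(n, d))
m = d * h + h                      # number of parameters

def F(w):
    """Toy model map F: R^m -> R^n (one-hidden-layer tanh network)."""
    W1 = w[:d * h].reshape(d, h)
    v = w[d * h:]
    return np.tanh(X @ W1) @ v

def jacobian(F, w, eps=1e-6):
    """Central-difference Jacobian dF/dw, shape (n, m)."""
    J = np.zeros((len(F(w)), len(w)))
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return J

w = rng.normal(size=m)
J = jacobian(F, w)                 # nabla F(w), an n x m matrix
K = J @ J.T                        # tangent kernel K(w), n x n
print("lambda_min(K(w)) =", np.linalg.eigvalsh(K).min())
</syntaxhighlight>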


;Theorem 4.1 (Uniform conditioning implies PL* condition)
If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) satisfies the \(\mu\text{-PL*}\) condition on \(B\).
{{ hidden | Proof |
<math>
\begin{aligned}
\frac{1}{2}\Vert \nabla L(w) \Vert^2 &= \frac{1}{2}\Vert \nabla F(w)^T (F(w)-y) \Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)\\
&\geq \mu L(w)
\end{aligned}
</math>
}}
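The inequality chain in the proof can be sanity-checked numerically. A self-contained sketch with a made-up model map <math>F(w) = A \tanh(w)</math> (any smooth overparameterized map would do):
<syntaxhighlight lang="python">
import numpy as np

# Check (1/2)||grad L(w)||^2 >= lambda_min(K(w)) * L(w)
# for the square loss L(w) = (1/2)||F(w) - y||^2, F(w) = A tanh(w).
rng = np.random.default_rng(1)
n, m = 3, 5                            # n outputs, m parameters (m > n)
A = rng.normal(size=(n, m))
w = rng.normal(size=m)
y = rng.normal(size=n)

J = A * (1 - np.tanh(w) ** 2)          # Jacobian nabla F(w), shape (n, m)
K = J @ J.T                            # tangent kernel K(w)
r = A @ np.tanh(w) - y                 # residual F(w) - y

lhs = 0.5 * np.linalg.norm(J.T @ r) ** 2           # (1/2)||grad L(w)||^2
rhs = np.linalg.eigvalsh(K).min() * 0.5 * (r @ r)  # lambda_min(K) * L(w)
print(lhs >= rhs)                      # True, up to floating-point error
</syntaxhighlight>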


;Theorem 4.2 (Local PL* implies existence of a solution + fast convergence)<ref name="liu2020towards"></ref>
Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies the <math display="inline">\mu</math>-PL* condition on a ball <math>B(w_0, R)</math> with <math>R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}</math>.
Then:
* Existence of a solution: There exists a global minimum of <math>L</math>, <math>\mathbf{w}^* \in B(\mathbf{w}_0, R)</math>, s.t. <math>F(\mathbf{w}^*)=\mathbf{y}</math>.
* Convergence of GD: GD with step size <math>\eta = 1/\sup_{\mathbf{w}}\lambda_{\max}(H)</math>, where <math>H</math> is the Hessian of <math>L</math>, converges with exponential convergence rate:
*: <math display="block">L(w_t) \leq \left(1-\kappa^{-1}(B(\mathbf{w}_0, R))\right)^t L(\mathbf{w}_0)</math>
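A minimal numerical illustration of the exponential rate (the model, step size, and constants below are made up; this is not the paper's experiment):
<syntaxhighlight lang="python">
import numpy as np

# GD on L(w) = (1/2)||F(w) - y||^2 for a mildly nonlinear overparameterized
# map F(w) = A(w + 0.1 tanh(w)), so lambda_min(K(w)) stays bounded away
# from 0 and the loss should contract by a constant factor per step.
rng = np.random.default_rng(2)
n, m = 5, 20                           # m > n: overparameterized
A = rng.normal(size=(n, m)) / np.sqrt(m)
y = rng.normal(size=n)

def F(w):
    return A @ (w + 0.1 * np.tanh(w))

def grad_L(w):
    J = A * (1 + 0.1 * (1 - np.tanh(w) ** 2))   # Jacobian nabla F(w)
    return J.T @ (F(w) - y)

w = np.zeros(m)
eta = 0.2                              # stand-in for 1/sup_w lambda_max(H)
losses = []
for t in range(30):
    losses.append(0.5 * np.linalg.norm(F(w) - y) ** 2)
    w = w - eta * grad_L(w)

# Per-step ratios L(w_{t+1})/L(w_t) settle near a constant < 1,
# i.e. the loss decays geometrically as the theorem predicts.
print(np.round([losses[t + 1] / losses[t] for t in range(10)], 3))
</syntaxhighlight>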
{{hidden | Convergence proof: |


We want to prove: