

;Lemma
If <math>\lambda_{\min}(K(w)) \geq \mu</math> for all <math>w \in B</math>, then <math>L</math> satisfies the <math>\mu</math>-PL condition on <math>B</math>.
{{ hidden | Proof |
<math>
\begin{aligned}
\frac{1}{2}\Vert \nabla L(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)
\end{aligned}
</math>
}}
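As a quick numerical sanity check of the lemma (my own sketch, not from the source), the snippet below uses a small hypothetical model <math>F(w) = A\tanh(w)</math>, computes the tangent kernel <math>K(w) = \nabla F(w) \nabla F(w)^T</math>, and verifies <math>\frac{1}{2}\Vert \nabla L(w)\Vert^2 \geq \lambda_{\min}(K(w))\, L(w)</math> at a random point.

<syntaxhighlight lang="python">
# Hypothetical example: verify 1/2 ||grad L(w)||^2 >= lambda_min(K(w)) * L(w).
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 20                        # n outputs, m parameters (over-parameterized)

A = rng.normal(size=(n, m))         # hypothetical fixed weight matrix
y = rng.normal(size=n)              # targets

def F(w):                           # hypothetical model F(w) = A tanh(w)
    return A @ np.tanh(w)

def jacobian(w):                    # Jacobian dF/dw, shape (n, m)
    return A * (1.0 - np.tanh(w) ** 2)

w = rng.normal(size=m)
J = jacobian(w)
K = J @ J.T                         # tangent kernel K(w), shape (n, n)

residual = F(w) - y
L = 0.5 * residual @ residual                  # L(w) = 1/2 ||F(w) - y||^2
grad_L = J.T @ residual                        # grad L(w) = grad F(w)^T (F(w) - y)
mu = np.linalg.eigvalsh(K).min()               # lambda_min(K(w))

# Lemma: 1/2 ||grad L(w)||^2 >= lambda_min(K(w)) * L(w)
print(0.5 * grad_L @ grad_L >= mu * L)         # expected: True
</syntaxhighlight>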
 
Informal convergence result: 
Gradient descent converges exponentially fast to a solution, with the rate controlled by the condition number:
\[ \kappa_{F} = \frac{\sup_{B} \lambda_{\max}(H)}{\inf_{B} \lambda_{\min}(K)}\]
Ideally we want this to be small.
 
{{ hidden | Example |
 
<math>F(w) = Aw</math> with target <math>y \in \mathbb{R}^n</math> 
<math>L(w) = \frac{1}{2} \Vert Aw-y\Vert^2</math> 
Our tangent kernel is:
<math>K(w)=AA^T</math>
and our Hessian is <math>H=A^TA</math>.
 
<math>\kappa_{F} = \frac{\lambda_{\max}(K)}{\lambda_{\min}(K)} \geq 1</math>.
 
Our intuition is:
For a random <math>A</math>, <math>K=AA^T \implies E[\log \kappa_{F}] \approx \log\left(\frac{m}{|m-n|+1}\right)</math>. 
So as <math>m \to \infty</math>, <math>\kappa_{F} \to 1</math>.
 
}}
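A minimal sketch of this intuition (my own illustration, assuming <math>A</math> has i.i.d. Gaussian entries): the condition number of <math>K = AA^T</math> shrinks toward 1 as the number of parameters <math>m</math> grows with <math>n</math> fixed.

<syntaxhighlight lang="python">
# Illustration: kappa_F = lambda_max(A A^T) / lambda_min(A A^T) -> 1 as m -> infinity.
import numpy as np

rng = np.random.default_rng(0)
n = 50

for m in [100, 1000, 10000, 100000]:
    A = rng.normal(size=(n, m)) / np.sqrt(m)   # random Gaussian A, scaled so eigenvalues stay O(1)
    eig = np.linalg.eigvalsh(A @ A.T)          # spectrum of the tangent kernel K = A A^T
    print(m, eig.max() / eig.min())            # condition number decreases toward 1
</syntaxhighlight>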
 
{{hidden | Convergence proof: (Theorem 4.2 <ref name="liu2020towards"></ref>) |
Local PL implies existence of a solution + fast convergence. 
Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies <math display="inline">\mu</math>-PL condition. 
Assume these hold on a ball <math>B(w_0, R)</math> with <math>R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}</math>.
 
We want to prove:
* A solution exists.
* <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>
 
Note that with step size <math>\eta = 1/\beta</math>, <math>\eta \mu = \mu/\beta</math> is the inverse of the condition number.
}}
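The following sketch (my own construction, not Theorem 4.2 itself) runs gradient descent on the linear least-squares loss <math>L(w)=\frac{1}{2}\Vert Aw-y\Vert^2</math>, where the PL constant is <math>\mu = \lambda_{\min}(AA^T)</math> and the smoothness constant is <math>\beta = \lambda_{\max}(AA^T)</math>, and checks the claimed rate <math>L(w_t) \leq (1-\eta\mu)^t L(w_0)</math> with <math>\eta = 1/\beta</math>.

<syntaxhighlight lang="python">
# Sketch: exponential convergence of GD under the mu-PL condition (over-parameterized linear model).
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 100                                  # over-parameterized: m > n

A = rng.normal(size=(n, m))
y = rng.normal(size=n)
w = np.zeros(m)

K = A @ A.T                                     # tangent kernel; constant for a linear model
eigs = np.linalg.eigvalsh(K)
mu, beta = eigs.min(), eigs.max()               # PL constant and smoothness constant
eta = 1.0 / beta                                # step size

def loss(w):
    r = A @ w - y
    return 0.5 * r @ r

L0 = loss(w)
for t in range(1, 101):
    w -= eta * A.T @ (A @ w - y)                # GD step: w <- w - eta * grad L(w)
    assert loss(w) <= (1.0 - eta * mu) ** t * L0 + 1e-12   # geometric rate holds
print("final loss:", loss(w))
</syntaxhighlight>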


==Misc==