

;Lemma
If <math>\lambda_{\min}(K(w)) \geq \mu</math> for all <math>w \in B</math>, then <math>L</math> satisfies the <math>\mu</math>-PL condition on <math>B</math>.
{{ hidden | Proof |
<math>
\begin{aligned}
\frac{1}{2}\Vert \nabla L(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)
\end{aligned}
</math>
}}
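As a quick numerical sanity check of the lemma (my own sketch, not from the source), the snippet below uses a small hypothetical model <math>F(w) = A\tanh(w)</math>, computes the tangent kernel <math>K(w) = \nabla F(w) \nabla F(w)^T</math>, and verifies <math>\frac{1}{2}\Vert \nabla L(w)\Vert^2 \geq \lambda_{\min}(K(w))\, L(w)</math> at a random point.

<syntaxhighlight lang="python">
# Hypothetical example: verify 1/2 ||grad L(w)||^2 >= lambda_min(K(w)) * L(w).
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 20                        # n outputs, m parameters (over-parameterized)

A = rng.normal(size=(n, m))         # hypothetical fixed weight matrix
y = rng.normal(size=n)              # targets

def F(w):                           # hypothetical model F(w) = A tanh(w)
    return A @ np.tanh(w)

def jacobian(w):                    # Jacobian dF/dw, shape (n, m)
    return A * (1.0 - np.tanh(w) ** 2)

w = rng.normal(size=m)
J = jacobian(w)
K = J @ J.T                         # tangent kernel K(w), shape (n, n)

residual = F(w) - y
L = 0.5 * residual @ residual                  # L(w) = 1/2 ||F(w) - y||^2
grad_L = J.T @ residual                        # grad L(w) = grad F(w)^T (F(w) - y)
mu = np.linalg.eigvalsh(K).min()               # lambda_min(K(w))

# Lemma: 1/2 ||grad L(w)||^2 >= lambda_min(K(w)) * L(w)
print(0.5 * grad_L @ grad_L >= mu * L)         # expected: True
</syntaxhighlight>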
 
Informal convergence result: 
Gradient descent converges exponentially fast to a solution, with the rate controlled by the condition number:
\[ \kappa_{F} = \frac{\sup_{B} \lambda_{\max}(H)}{\inf_{B} \lambda_{\min}(K)}\]
Ideally we want this to be small.
 
{{ hidden | Example |
 
<math>F(w) = Aw</math> with target <math>y \in \mathbb{R}^n</math> 
<math>L(w) = \frac{1}{2} \Vert Aw-y\Vert^2</math> 
Our tangent kernel is:
<math>K(w)=AA^T</math>
and our Hessian is <math>H=A^TA</math>.
 
<math>\kappa_{F} = \frac{\lambda_{\max}(K)}{\lambda_{\min}(K)} \geq 1</math>.
 
Our intuition is:
For a random <math>A</math>, <math>K=AA^T \implies E[\log \kappa_{F}] \approx \log\left(\frac{m}{|m-n|+1}\right)</math>. 
So as <math>m \to \infty</math>, <math>\kappa_{F} \to 1</math>.
 
}}
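A minimal sketch of this intuition (my own illustration, assuming <math>A</math> has i.i.d. Gaussian entries): the condition number of <math>K = AA^T</math> shrinks toward 1 as the number of parameters <math>m</math> grows with <math>n</math> fixed.

<syntaxhighlight lang="python">
# Illustration: kappa_F = lambda_max(A A^T) / lambda_min(A A^T) -> 1 as m -> infinity.
import numpy as np

rng = np.random.default_rng(0)
n = 50

for m in [100, 1000, 10000, 100000]:
    A = rng.normal(size=(n, m)) / np.sqrt(m)   # random Gaussian A, scaled so eigenvalues stay O(1)
    eig = np.linalg.eigvalsh(A @ A.T)          # spectrum of the tangent kernel K = A A^T
    print(m, eig.max() / eig.min())            # condition number decreases toward 1
</syntaxhighlight>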
 
{{hidden | Convergence proof: (Theorem 4.2 <ref name="liu2020towards"></ref>) |
Local PL implies existence of a solution + fast convergence. 
Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies <math display="inline">\mu</math>-PL condition. 
Assume these hold on a ball <math>B(w_0, R)</math> with <math>R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}</math>.
 
We want to prove:
* A solution exists.
* <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>
 
Note that with step size <math>\eta = 1/\beta</math>, <math>\eta \mu = \mu/\beta</math> is the inverse of the condition number.
}}
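The following sketch (my own construction, not Theorem 4.2 itself) runs gradient descent on the linear least-squares loss <math>L(w)=\frac{1}{2}\Vert Aw-y\Vert^2</math>, where the PL constant is <math>\mu = \lambda_{\min}(AA^T)</math> and the smoothness constant is <math>\beta = \lambda_{\max}(AA^T)</math>, and checks the claimed rate <math>L(w_t) \leq (1-\eta\mu)^t L(w_0)</math> with <math>\eta = 1/\beta</math>.

<syntaxhighlight lang="python">
# Sketch: exponential convergence of GD under the mu-PL condition (over-parameterized linear model).
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 100                                  # over-parameterized: m > n

A = rng.normal(size=(n, m))
y = rng.normal(size=n)
w = np.zeros(m)

K = A @ A.T                                     # tangent kernel; constant for a linear model
eigs = np.linalg.eigvalsh(K)
mu, beta = eigs.min(), eigs.max()               # PL constant and smoothness constant
eta = 1.0 / beta                                # step size

def loss(w):
    r = A @ w - y
    return 0.5 * r @ r

L0 = loss(w)
for t in range(1, 101):
    w -= eta * A.T @ (A @ w - y)                # GD step: w <- w - eta * grad L(w)
    assert loss(w) <= (1.0 - eta * mu) ** t * L0 + 1e-12   # geometric rate holds
print("final loss:", loss(w))
</syntaxhighlight>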


==Misc==