Deep Learning



Instead of convexity, we use the PL condition (Polyak-Łojasiewicz, 1963):
For <math>w \in B</math>, <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>, which implies exponential (linear) convergence of GD.
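A quick sketch of why (assuming additionally that <math>L</math> is <math display="inline">\beta</math>-smooth, as in Theorem 4.2 below): one GD step with step size <math display="inline">\eta = 1/\beta</math> satisfies
<math display="block">
\begin{aligned}
L(w_{t+1}) &\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\beta \eta^2}{2} \Vert \nabla L(w_t) \Vert^2\\
&= L(w_t) - \frac{1}{2\beta} \Vert \nabla L(w_t) \Vert^2\\
&\leq \left(1 - \frac{\mu}{\beta}\right) L(w_t),
\end{aligned}
</math>
so <math>L(w_t) \leq (1 - \mu/\beta)^t L(w_0)</math>.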


Tangent Kernels:
<math display="block">K(w) := \nabla F(w) \nabla F(w)^T</math>
where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) is the Jacobian of the model map <math>F: \mathbb{R}^m \to \mathbb{R}^n</math>.
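As a concrete illustration (a minimal sketch, not from the source; the model, sizes, and seeds below are made up), the tangent kernel of a tiny network can be computed directly from a numerical Jacobian:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 3, 2, 4                  # data points, input dim, hidden width
X = rng.normal(size=(n, d))
m = d * h + h                      # number of parameters

def F(w):
    """Toy model map F: R^m -> R^n (one-hidden-layer tanh network)."""
    W1 = w[:d * h].reshape(d, h)
    v = w[d * h:]
    return np.tanh(X @ W1) @ v

def jacobian(F, w, eps=1e-6):
    """Central-difference Jacobian dF/dw, shape (n, m)."""
    J = np.zeros((len(F(w)), len(w)))
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return J

w = rng.normal(size=m)
J = jacobian(F, w)                 # nabla F(w), an n x m matrix
K = J @ J.T                        # tangent kernel K(w), n x n
print("lambda_min(K(w)) =", np.linalg.eigvalsh(K).min())
</syntaxhighlight>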


;Theorem 4.1 (Uniform conditioning implies PL* condition)
If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) satisfies the \(\mu\text{-PL*}\) condition on \(B\).
{{ hidden | Proof |
<math>
\begin{aligned}
\frac{1}{2}\Vert \nabla L(w) \Vert^2 &= \frac{1}{2}\Vert \nabla F(w)^T (F(w)-y) \Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)\\
&\geq \mu L(w)
\end{aligned}
</math>
}}
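The inequality chain in the proof can be sanity-checked numerically. A self-contained sketch with a made-up model map <math>F(w) = A \tanh(w)</math> (any smooth overparameterized map would do):
<syntaxhighlight lang="python">
import numpy as np

# Check (1/2)||grad L(w)||^2 >= lambda_min(K(w)) * L(w)
# for the square loss L(w) = (1/2)||F(w) - y||^2, F(w) = A tanh(w).
rng = np.random.default_rng(1)
n, m = 3, 5                            # n outputs, m parameters (m > n)
A = rng.normal(size=(n, m))
w = rng.normal(size=m)
y = rng.normal(size=n)

J = A * (1 - np.tanh(w) ** 2)          # Jacobian nabla F(w), shape (n, m)
K = J @ J.T                            # tangent kernel K(w)
r = A @ np.tanh(w) - y                 # residual F(w) - y

lhs = 0.5 * np.linalg.norm(J.T @ r) ** 2           # (1/2)||grad L(w)||^2
rhs = np.linalg.eigvalsh(K).min() * 0.5 * (r @ r)  # lambda_min(K) * L(w)
print(lhs >= rhs)                      # True, up to floating-point error
</syntaxhighlight>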


;Theorem 4.2 (Local PL* implies existence of a solution + fast convergence)<ref name="liu2020towards"></ref>
Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies the <math display="inline">\mu</math>-PL* condition on a ball <math>B(w_0, R)</math> with <math>R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}</math>.
Then:
* Existence of a solution: There exists a global minimum of <math>L</math>, <math>\mathbf{w}^* \in B(\mathbf{w}_0, R)</math>, s.t. <math>F(\mathbf{w}^*)=\mathbf{y}</math>.
* Convergence of GD: GD with step size <math>\eta = 1/\sup_{\mathbf{w}}\lambda_{\max}(H)</math>, where <math>H</math> is the Hessian of <math>L</math>, converges with exponential convergence rate:
*: <math display="block">L(w_t) \leq \left(1-\kappa^{-1}(B(\mathbf{w}_0, R))\right)^t L(\mathbf{w}_0)</math>
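A minimal numerical illustration of the exponential rate (the model, step size, and constants below are made up; this is not the paper's experiment):
<syntaxhighlight lang="python">
import numpy as np

# GD on L(w) = (1/2)||F(w) - y||^2 for a mildly nonlinear overparameterized
# map F(w) = A(w + 0.1 tanh(w)), so lambda_min(K(w)) stays bounded away
# from 0 and the loss should contract by a constant factor per step.
rng = np.random.default_rng(2)
n, m = 5, 20                           # m > n: overparameterized
A = rng.normal(size=(n, m)) / np.sqrt(m)
y = rng.normal(size=n)

def F(w):
    return A @ (w + 0.1 * np.tanh(w))

def grad_L(w):
    J = A * (1 + 0.1 * (1 - np.tanh(w) ** 2))   # Jacobian nabla F(w)
    return J.T @ (F(w) - y)

w = np.zeros(m)
eta = 0.2                              # stand-in for 1/sup_w lambda_max(H)
losses = []
for t in range(30):
    losses.append(0.5 * np.linalg.norm(F(w) - y) ** 2)
    w = w - eta * grad_L(w)

# Per-step ratios L(w_{t+1})/L(w_t) settle near a constant < 1,
# i.e. the loss decays geometrically as the theorem predicts.
print(np.round([losses[t + 1] / losses[t] for t in range(10)], 3))
</syntaxhighlight>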
{{hidden | Convergence proof: |


We want to prove: