For <math>w \in B</math>, <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>, which implies exponential (linear) convergence of GD.
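To see why (a sketch, additionally assuming <math display="inline">L</math> is <math display="inline">\beta</math>-smooth as in Theorem 4.2 below, with GD step size <math display="inline">\eta = 1/\beta</math>): one GD step satisfies
<math display="block">
L(w_{t+1}) \leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\beta \eta^2}{2} \Vert \nabla L(w_t) \Vert^2 = L(w_t) - \frac{1}{2\beta} \Vert \nabla L(w_t) \Vert^2 \leq \left(1 - \frac{\mu}{\beta}\right) L(w_t),
</math>
so <math display="inline">L(w_t) \leq \left(1 - \frac{\mu}{\beta}\right)^t L(w_0)</math>.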


===Tangent Kernels===
Suppose our model is \(F(w)=y\) where \(w \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\).
Then our tangent kernel is:
<math display="block">K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}</math>
where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) is the Jacobian of \(F\).
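A minimal numeric sketch of this definition (the toy two-layer model and all names below are illustrative assumptions, not from the lecture): form the Jacobian <math display="inline">\nabla F(w) \in \mathbb{R}^{n \times m}</math> numerically, build <math display="inline">K(w) = \nabla F(w) \nabla F(w)^T</math>, and inspect its smallest eigenvalue, which is what Theorem 4.1 below conditions on.
<syntaxhighlight lang="python">
import numpy as np

# Illustrative toy model F: R^m -> R^n (a small two-layer net evaluated on n fixed inputs).
rng = np.random.default_rng(0)
n, d, h = 5, 3, 20            # n data points, input dimension d, hidden width h
X = rng.standard_normal((n, d))
m = d * h + h                 # total number of parameters

def F(w):
    """Model outputs on all n inputs, as a vector in R^n."""
    W1 = w[:d * h].reshape(h, d)
    v = w[d * h:]
    return np.tanh(X @ W1.T) @ v / np.sqrt(h)

def jacobian(F, w, eps=1e-6):
    """Numerical Jacobian dF(w) in R^{n x m}, by central differences."""
    J = np.zeros((F(w).size, w.size))
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return J

w0 = rng.standard_normal(m)
J = jacobian(F, w0)           # n x m
K = J @ J.T                   # tangent kernel K(w0), an n x n PSD matrix
print(K.shape, np.linalg.eigvalsh(K).min())   # lambda_min(K(w0)), cf. Theorem 4.1
</syntaxhighlight>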


===Theorem 4.1 (Uniform conditioning implies PL* condition)===
If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) is \(\mu\text{-PL*}\) on \(B\).
{{ hidden | Proof |
}}
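The key inequality behind the theorem (a sketch, assuming square loss <math display="inline">L(w) = \frac{1}{2}\Vert F(w) - y \Vert^2</math>, so that <math display="inline">\nabla L(w) = \nabla F(w)^T (F(w) - y)</math>):
<math display="block">
\frac{1}{2}\Vert \nabla L(w) \Vert^2 = \frac{1}{2} (F(w) - y)^T K(w) (F(w) - y) \geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w) - y \Vert^2 \geq \mu L(w),
</math>
which is exactly the <math display="inline">\mu</math>-PL* condition.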


===Theorem 4.2 (Local PL* implies existence of a solution + fast convergence)===
Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies the <math display="inline">\mu</math>-PL* condition on the ball <math>B(w_0, R)</math> with <math>R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}</math>.
Then:
}}
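A minimal numeric sketch of the convergence claim (the overparameterized linear model <math display="inline">F(w) = Aw</math> with square loss, and all names below, are illustrative assumptions): here <math display="inline">K(w) = AA^T</math> is constant, so <math display="inline">\mu = \lambda_{\min}(AA^T)</math> and <math display="inline">\beta = \lambda_{\max}(A^T A)</math>, and GD with step size <math display="inline">1/\beta</math> should both decrease the loss at a linear rate and stay inside <math display="inline">B(w_0, R)</math>.
<syntaxhighlight lang="python">
import numpy as np

# Illustrative setup: F(w) = A w with m >> n and square loss L(w) = 0.5 * ||A w - y||^2.
rng = np.random.default_rng(0)
n, m = 10, 200
A = rng.standard_normal((n, m))
y = rng.standard_normal(n)

mu = np.linalg.eigvalsh(A @ A.T).min()    # PL* constant (min eigenvalue of the tangent kernel)
beta = np.linalg.eigvalsh(A.T @ A).max()  # smoothness constant of L
eta = 1.0 / beta                          # step size used in the convergence analysis

loss = lambda w: 0.5 * np.sum((A @ w - y) ** 2)
w0 = np.zeros(m)
w = w0.copy()
T = 200
for t in range(T):
    w = w - eta * A.T @ (A @ w - y)       # gradient descent step

R = 2 * np.sqrt(2 * beta * loss(w0)) / mu
print(loss(w) <= (1 - mu / beta) ** T * loss(w0))   # linear (exponential) rate holds
print(np.linalg.norm(w - w0) <= R)                  # iterates stay in B(w0, R)
</syntaxhighlight>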


===Beginning of Lecture 3 (September 8, 2020)===


Last time (lecture 2), we saw that we expect zero training error due to overparameterization.
If our loss satisfies the PL condition (plus some other conditions, e.g. smoothness), then gradient descent will converge.


===Intuition of PL===
Assume:
# Loss <math>L(w)</math> is <math display="inline">\mu</math>-PL for <math>w \in B</math>