For <math>w \in B</math>, <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>, which implies exponential (linear) convergence of GD.
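To see why (a sketch, additionally assuming <math display="inline">L</math> is <math display="inline">\beta</math>-smooth as in Theorem 4.2 below, with GD step size <math display="inline">\eta = 1/\beta</math>): one GD step satisfies
<math display="block">
L(w_{t+1}) \leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\beta \eta^2}{2} \Vert \nabla L(w_t) \Vert^2 = L(w_t) - \frac{1}{2\beta} \Vert \nabla L(w_t) \Vert^2 \leq \left(1 - \frac{\mu}{\beta}\right) L(w_t),
</math>
so <math display="inline">L(w_t) \leq \left(1 - \frac{\mu}{\beta}\right)^t L(w_0)</math>.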


===Tangent Kernels===
Suppose our model is \(F(w)=y\) where \(w \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\).
Then our tangent kernel is:
<math display="block">K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}</math>
where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) is the Jacobian of \(F\).
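A minimal numeric sketch of this definition (the toy two-layer model and all names below are illustrative assumptions, not from the lecture): form the Jacobian <math display="inline">\nabla F(w) \in \mathbb{R}^{n \times m}</math> numerically, build <math display="inline">K(w) = \nabla F(w) \nabla F(w)^T</math>, and inspect its smallest eigenvalue, which is what Theorem 4.1 below conditions on.
<syntaxhighlight lang="python">
import numpy as np

# Illustrative toy model F: R^m -> R^n (a small two-layer net evaluated on n fixed inputs).
rng = np.random.default_rng(0)
n, d, h = 5, 3, 20            # n data points, input dimension d, hidden width h
X = rng.standard_normal((n, d))
m = d * h + h                 # total number of parameters

def F(w):
    """Model outputs on all n inputs, as a vector in R^n."""
    W1 = w[:d * h].reshape(h, d)
    v = w[d * h:]
    return np.tanh(X @ W1.T) @ v / np.sqrt(h)

def jacobian(F, w, eps=1e-6):
    """Numerical Jacobian dF(w) in R^{n x m}, by central differences."""
    J = np.zeros((F(w).size, w.size))
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        J[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return J

w0 = rng.standard_normal(m)
J = jacobian(F, w0)           # n x m
K = J @ J.T                   # tangent kernel K(w0), an n x n PSD matrix
print(K.shape, np.linalg.eigvalsh(K).min())   # lambda_min(K(w0)), cf. Theorem 4.1
</syntaxhighlight>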


===Theorem 4.1 (Uniform conditioning implies PL* condition)===
If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) is \(\mu\text{-PL*}\) on \(B\).
{{ hidden | Proof |
}}
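The key inequality behind the theorem (a sketch, assuming square loss <math display="inline">L(w) = \frac{1}{2}\Vert F(w) - y \Vert^2</math>, so that <math display="inline">\nabla L(w) = \nabla F(w)^T (F(w) - y)</math>):
<math display="block">
\frac{1}{2}\Vert \nabla L(w) \Vert^2 = \frac{1}{2} (F(w) - y)^T K(w) (F(w) - y) \geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w) - y \Vert^2 \geq \mu L(w),
</math>
which is exactly the <math display="inline">\mu</math>-PL* condition.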


===Theorem 4.2 (Local PL* implies existence of a solution + fast convergence)===
Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies the <math display="inline">\mu</math>-PL* condition on the ball <math>B(w_0, R)</math> with <math>R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}</math>.
Then:
}}
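A minimal numeric sketch of the convergence claim (the overparameterized linear model <math display="inline">F(w) = Aw</math> with square loss, and all names below, are illustrative assumptions): here <math display="inline">K(w) = AA^T</math> is constant, so <math display="inline">\mu = \lambda_{\min}(AA^T)</math> and <math display="inline">\beta = \lambda_{\max}(A^T A)</math>, and GD with step size <math display="inline">1/\beta</math> should both decrease the loss at a linear rate and stay inside <math display="inline">B(w_0, R)</math>.
<syntaxhighlight lang="python">
import numpy as np

# Illustrative setup: F(w) = A w with m >> n and square loss L(w) = 0.5 * ||A w - y||^2.
rng = np.random.default_rng(0)
n, m = 10, 200
A = rng.standard_normal((n, m))
y = rng.standard_normal(n)

mu = np.linalg.eigvalsh(A @ A.T).min()    # PL* constant (min eigenvalue of the tangent kernel)
beta = np.linalg.eigvalsh(A.T @ A).max()  # smoothness constant of L
eta = 1.0 / beta                          # step size used in the convergence analysis

loss = lambda w: 0.5 * np.sum((A @ w - y) ** 2)
w0 = np.zeros(m)
w = w0.copy()
T = 200
for t in range(T):
    w = w - eta * A.T @ (A @ w - y)       # gradient descent step

R = 2 * np.sqrt(2 * beta * loss(w0)) / mu
print(loss(w) <= (1 - mu / beta) ** T * loss(w0))   # linear (exponential) rate holds
print(np.linalg.norm(w - w0) <= R)                  # iterates stay in B(w0, R)
</syntaxhighlight>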


===Beginning of Lecture 3 (September 8, 2020)===


Last time (lecture 2), we saw that we expect zero training error due to overparameterization.
If our loss satisfies the PL condition (plus some other conditions, e.g. smoothness), then gradient descent will converge.


===Intuition of PL===
Assume:
# Loss <math>L(w)</math> is <math display="inline">\mu</math>-PL for <math>w \in B</math>