For <math>w \in B</math>, <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>, which implies exponential (linear) convergence of GD.
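To see why, here is the standard one-step sketch (assuming additionally that <math>L</math> is <math display="inline">\beta</math>-smooth and GD uses step size <math>\eta \leq 1/\beta</math>; these assumptions are made precise in Theorem 4.2 below):
<math display="block">L(w_{t+1}) \leq L(w_t) - \eta \left(1 - \tfrac{\beta\eta}{2}\right) \Vert \nabla L(w_t) \Vert^2 \leq L(w_t) - \tfrac{\eta}{2} \Vert \nabla L(w_t) \Vert^2 \leq (1 - \eta\mu) L(w_t),</math>
so <math>L(w_t) \leq (1 - \eta\mu)^t L(w_0)</math>, an exponential rate.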
===Tangent Kernels===
Suppose our model is \(F(w)=y\) where \(w \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\).
Then our tangent kernel is:
<math display="block">K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}</math>
where \(\nabla F(w) \in \mathbb{R}^{n \times m}\).
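As a concrete illustration (not from the lecture; the toy model, dimensions, and seed are all hypothetical), the tangent kernel can be computed numerically in JAX via the Jacobian:
<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

# Hypothetical overparameterized toy model F: R^m -> R^n with m = 4 > n = 3
X = jax.random.normal(jax.random.PRNGKey(0), (3, 4))

def F(w):
    return jnp.tanh(X @ w)

w = jnp.zeros(4)
J = jax.jacobian(F)(w)            # \nabla F(w), shape (n, m)
K = J @ J.T                       # tangent kernel K(w), shape (n, n)
print(jnp.linalg.eigvalsh(K)[0])  # \lambda_min(K(w)); positive here since X has full row rank
</syntaxhighlight>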
===Theorem 4.1 (Uniform conditioning implies PL* condition)===
If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) is \(\mu\text{-PL*}\) on \(B\).
{{ hidden | Proof |
Assume the squared loss <math>L(w) = \frac{1}{2}\Vert F(w) - y \Vert^2</math> and write <math>r = F(w) - y</math>. Then <math>\nabla L(w) = r^T \nabla F(w)</math>, so
<math display="block">\Vert \nabla L(w) \Vert^2 = r^T K(w) r \geq \lambda_{\min}(K(w)) \Vert r \Vert^2 \geq 2\mu L(w),</math>
i.e. <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>, which is the <math display="inline">\mu</math>-PL* condition.
}}
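A minimal numerical check of this inequality on the same hypothetical toy model (the targets <code>y</code> are chosen arbitrarily), using the squared loss from the proof:
<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (3, 4))
F = lambda w: jnp.tanh(X @ w)
y = jnp.array([0.1, 0.2, 0.3])                 # hypothetical targets
L = lambda w: 0.5 * jnp.sum((F(w) - y) ** 2)   # squared loss

w = jnp.zeros(4)
J = jax.jacobian(F)(w)
mu = jnp.linalg.eigvalsh(J @ J.T)[0]           # \lambda_min(K(w))
g = jax.grad(L)(w)
print(0.5 * jnp.sum(g ** 2) >= mu * L(w))      # True, as Theorem 4.1 predicts
</syntaxhighlight>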
===Theorem 4.2 (Local PL* implies existence of a solution + fast convergence)===
Assume <math>L(w)</math> is <math display="inline">\beta</math>-smooth and satisfies the <math display="inline">\mu</math>-PL* condition on the ball <math>B(w_0, R)</math> with <math>R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}</math>.
Then:
# A global solution <math>w^*</math> with <math>L(w^*) = 0</math> exists in the ball <math>B(w_0, R)</math>.
# Gradient descent with step size <math>\eta \leq \frac{1}{\beta}</math> converges to it at an exponential (linear) rate: <math>L(w_t) \leq (1 - \eta\mu)^t L(w_0)</math>, as in the sketch below.
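A toy gradient descent run on the same hypothetical setup illustrates the exponential decay of the loss (the step size here is simply chosen small by hand, standing in for <math>\eta \leq 1/\beta</math>):
<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (3, 4))
F = lambda w: jnp.tanh(X @ w)
y = jnp.array([0.1, 0.2, 0.3])
L = lambda w: 0.5 * jnp.sum((F(w) - y) ** 2)
grad_L = jax.grad(L)

w = jnp.zeros(4)
eta = 0.05                     # hand-picked small step size
for t in range(2000):
    w = w - eta * grad_L(w)    # plain gradient descent
print(L(w))                    # near zero: loss decays like (1 - eta*mu)^t, per Theorem 4.2
</syntaxhighlight>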
===Beginning of Lecture 3 (September 8, 2020)===
Last time (lecture 2), we saw why we expect zero training error under overparameterization.
If our loss satisfies the PL condition (and some other conditions, such as smoothness), then gradient descent converges.
===Intuition of PL===
Assume:
# Loss <math>L(w)</math> for <math>w \in B</math> is <math display="inline">\mu</math>-PL