==DL Optimization==
(Lectures 2-3, September 3)
The role of "over-parameterization".
In general, you can have poor local minima and saddle points (where the Hessian has both positive and negative eigenvalues).
However, in practice GD & SGD work pretty well.
Lecture 2 (Sept 3) is about Liu ''et al.'' <ref name="liu2020towards"></ref>.
Suppose we have a classification problem with \(c\) labels and \(N\) samples total. This gives us \(n=Nc\) constraints.
Interpolation is possible if the number of parameters \(m\) is greater than the number of constraints \(n=Nc\).
This is called the over-parameterized regime.
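A quick sanity check of this parameter counting (the network shape and sizes below are illustrative assumptions, not from the lecture):
<syntaxhighlight lang="python">
# Compare the parameter count m of a small two-layer network with the
# number of interpolation constraints n = N*c (each sample contributes
# c scalar constraints when fitting c-dimensional targets).
d, h, c = 100, 2048, 10   # input dim, hidden width, number of labels (illustrative)
N = 1000                  # number of training samples (illustrative)

m = d * h + h + h * c + c  # weights + biases of both layers
n = N * c                  # total scalar constraints

print(f"parameters m = {m}, constraints n = {n}")
print("over-parameterized (interpolation possible):", m > n)
</syntaxhighlight>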
Note that <math>\eta \mu</math> is the inverse of the condition number.
}}
;Beginning of Lecture 3 (September 8, 2020)
Last time (lecture 2) we saw that over-parameterization leads us to expect zero training error.
This is the interpolation regime.
Here the number of parameters \(m\) is greater than the number of constraints \(n=Nc\).
Here the manifold of interpolating solutions \(w^*\) is the intersection of the constraint sets, e.g. \(\{w : F(w, x_1)=Y_1\}\) and \(\{w : F(w, x_2)=Y_2\}\).
Assume we use squared loss \(L(w)=\frac{1}{2}\Vert F(w)-Y \Vert^2\).
In general, our loss function is non-convex, even locally.
PL condition: if the squared norm of the gradient is at least a constant multiple \(\mu\) of the loss, i.e. \(\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)\), then we say the loss is \(\mu\)-PL.
If our loss satisfies the PL condition and some smoothness conditions, then gradient descent converges to zero loss.
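As a sanity check of the definition (an illustrative toy model, not from the lecture), the sketch below estimates the PL constant of a least-squares loss \(L(w)=\frac{1}{2}\Vert Aw-Y \Vert^2\) by evaluating the ratio \(\frac{1}{2}\Vert \nabla L(w)\Vert^2 / L(w)\) at random points; for this linear toy model the exact PL constant is the smallest eigenvalue of \(AA^T\).
<syntaxhighlight lang="python">
import numpy as np

# Over-parameterized least squares: L(w) = 0.5 * ||A w - Y||^2 with more
# parameters (m) than constraints (n), so zero loss is attainable.
rng = np.random.default_rng(0)
n, m = 50, 200
A = rng.standard_normal((n, m))
Y = rng.standard_normal(n)

loss = lambda w: 0.5 * np.sum((A @ w - Y) ** 2)
grad = lambda w: A.T @ (A @ w - Y)

# mu-PL asks 0.5 * ||grad L(w)||^2 >= mu * L(w); the smallest observed ratio
# is an empirical estimate of (an upper bound on) the best constant mu.
ratios = [0.5 * np.dot(g, g) / loss(w)
          for w in rng.standard_normal((1000, m))
          for g in [grad(w)]]
print("empirical mu (min ratio):    ", min(ratios))
print("lambda_min(A A^T) (exact mu):", np.linalg.eigvalsh(A @ A.T)[0])
</syntaxhighlight>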
;Intuition of PL:
Assume:
# Loss <math display="inline">L(w)</math> is <math display="inline">\mu</math>-PL for <math display="inline">w \in B</math>
# Loss is <math display="inline">\beta</math>-smooth
# The Hessian is bounded by <math display="inline">\beta</math>: <math display="inline">H(w) \preceq \beta I</math>
# Step size <math display="inline">\eta=\frac{1}{\beta}</math>
# Gradient descent update: <math display="inline">w_{t+1}=w_t-\eta \nabla L(w_t)</math>
By Taylor's expansion:
<math display="block">
\begin{aligned}
L(w_{t+1}) &= L(w_t) + (w_{t+1}-w_t)^T \nabla L(w_t) + \frac{1}{2}(w_{t+1}-w_t)^T H(w)(w_{t+1}-w_{t})\\
&= L(w_t) + (-\eta \nabla L(w_t))^T \nabla L(w_t) + \frac{1}{2}(-\eta \nabla L(w_t))^T H(w)(-\eta \nabla L(w_t))\\
&= L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\eta^2}{2} \nabla L(w_t)^T H(w) \nabla L(w_t)\\
&\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 \left(1-\frac{\eta \beta}{2}\right) &\text{by assumption 3}\\
&= L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2 &\text{by assumption 4}\\
&\leq L(w_t) - \eta \mu L(w_t) &\text{by the }\mu\text{-PL assumption}\\
&= (1-\eta \mu)L(w_t)
\end{aligned}
</math>
This implies our loss at any iteration is <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>.
Thus we see a geometric or exponential decrease in our loss function with convergence rate <math>(1-\eta \mu)</math>.
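A minimal numerical check of this rate (an illustrative least-squares example, not from the lecture): run gradient descent with \(\eta=1/\beta\) and compare \(L(w_t)\) against the bound \((1-\eta\mu)^t L(w_0)\).
<syntaxhighlight lang="python">
import numpy as np

# Gradient descent on L(w) = 0.5 * ||A w - Y||^2. Here beta = lambda_max(A A^T)
# (smoothness / Hessian bound) and mu = lambda_min(A A^T) (PL constant).
rng = np.random.default_rng(1)
n, m = 50, 200
A = rng.standard_normal((n, m))
Y = rng.standard_normal(n)

eigs = np.linalg.eigvalsh(A @ A.T)
mu, beta = eigs[0], eigs[-1]
eta = 1.0 / beta

w = np.zeros(m)
L0 = 0.5 * np.sum((A @ w - Y) ** 2)
for t in range(1, 201):
    w -= eta * A.T @ (A @ w - Y)   # w_{t+1} = w_t - eta * grad L(w_t)
    if t % 50 == 0:
        Lt = 0.5 * np.sum((A @ w - Y) ** 2)
        bound = (1 - eta * mu) ** t * L0
        print(f"t={t:3d}  L(w_t)={Lt:.3e}  bound (1-eta*mu)^t L(w_0)={bound:.3e}")
# The observed loss should stay at or below the geometric bound.
</syntaxhighlight>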
==Misc==