Deep Learning


==DL Optimization==
(Lectures 2-3, September 3 and 8, 2020)
The role of "over-parameterization".
In general, the loss landscape can have poor local minima and saddle points (where the Hessian has both positive and negative eigenvalues).
However, in practice GD & SGD work pretty well.
Lecture 2 (Sept 3) is about Liu ''et al.'' <ref name="liu2020towards" />.


Suppose we have a classification problem with \(c\) labels and \(N\) samples total. This gives us \(n=Nc\) constraints.
Interpolation (zero training loss) is possible if the number of parameters \(m\) is greater than the number of constraints \(n=Nc\).
This is called the over-parameterized regime.
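As a quick sanity check of this counting (a minimal sketch; the two-layer network and the sizes below are illustrative assumptions, not from the lecture):
<syntaxhighlight lang="python">
# Compare the number of parameters m of a small MLP with the number of
# constraints n = N*c imposed by interpolating N labeled samples.
N, c, d, h = 1000, 10, 784, 2048   # samples, classes, input dim, hidden width

# Two-layer MLP: d -> h -> c, with biases.
m = (d * h + h) + (h * c + c)      # total trainable parameters
n = N * c                          # one constraint per sample per output

print(f"parameters m = {m}, constraints n = {n}")
print("over-parameterized" if m > n else "under-parameterized")
</syntaxhighlight>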
Note that <math>\eta \mu</math> is the inverse of the condition number.
}}
;Beginning of Lecture 3 (September 8, 2020)
Last time (lecture 2) we saw that over-parameterization leads us to expect zero training error.
This is the interpolation regime.
Here the number of parameters \(m\) is greater than the number of constraints \(n=Nc\).
The set of interpolating solutions \(w^*\) forms a manifold: the intersection over all training samples of the sets \(\{w : F(w, x_i)=Y_i\}\).
Assume we use squared loss \(L(w)=\frac{1}{2}\Vert F(w)-Y \Vert^2\).
In general, our loss function is non-convex, even locally.
PL condition: if \(\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)\) on some region, we say the loss is \(\mu\)-PL on that region.
If our loss satisfies the PL condition (plus the smoothness conditions below), then gradient descent converges to zero loss at a geometric rate.
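A simple concrete instance (an illustrative example, not from the lecture): for a linear model \(F(w)=Xw\) in the over-parameterized regime, where the \(n \times m\) matrix \(X\) has full row rank, the squared loss is \(\mu\)-PL with \(\mu = \lambda_{\min}(XX^T) > 0\):
<math display="block">
\Vert \nabla L(w) \Vert^2 = \Vert X^T (Xw - Y) \Vert^2 \geq \lambda_{\min}(XX^T) \Vert Xw - Y \Vert^2 = 2\lambda_{\min}(XX^T) L(w).
</math>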
;Intuition of PL:
Assume:
# Loss <math display="inline">L(w)</math> is <math display="inline">\mu</math>-PL on a ball <math display="inline">B</math>
# Loss is <math display="inline">\beta</math>-smooth
# The Hessian is bounded by <math display="inline">\beta</math>: <math display="inline">H(w) \preceq \beta I</math>
# <math display="inline">\eta=\frac{1}{\beta}</math>
# Gradient descent update: <math display="inline">w_{t+1}=w_t-\eta \nabla L(w_t)</math>
By Taylor's expansion:
<math display="block">
\begin{aligned}
L(w_{t+1}) &= L(w_t) + (w_{t+1}-w_t)^T \nabla L(w_t) + \frac{1}{2}(w_{t+1}-w_t)^T H(w)(w_{t+1}-w_{t})\\
&=L(w_t) + (-\eta)\nabla L(w_t)^T \nabla L(w_t) + \frac{1}{2}(-\eta)\nabla L(w_t)^T H(w)(-\eta) \nabla L(w_t)\\
&= L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\eta^2}{2} \nabla L(w_t)^T H(w) \nabla L(w_t)\\
&\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 (1-\frac{\eta \beta}{2}) &\text{by assumption 3}\\
&\leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2 &\text{by assumption 4}\\
&\leq L(w_t) - \eta \mu L(w_t) &\text{by }\mu\text{-PL assumption}\\
&= (1-\eta \mu)L(w_t)
\end{aligned}
</math>
This implies our loss at any iteration is <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>. 
Thus we see a geometric or exponential decrease in our loss function with convergence rate <math>(1-\eta \mu)</math>.
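A quick numerical check of this rate (a minimal sketch, not from the lecture; it uses an over-parameterized least-squares problem, for which the loss is \(\mu\)-PL with \(\mu=\lambda_{\min}(XX^T)\) and the Hessian is bounded by \(\beta=\lambda_{\max}(X^TX)\)):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 100                       # constraints n, parameters m (over-parameterized: m > n)
X = rng.standard_normal((n, m))
Y = rng.standard_normal(n)

loss = lambda w: 0.5 * np.sum((X @ w - Y) ** 2)
grad = lambda w: X.T @ (X @ w - Y)

# Hessian of the loss is X^T X; its nonzero eigenvalues equal those of X X^T.
eigs = np.linalg.eigvalsh(X @ X.T)   # ascending order
mu, beta = eigs[0], eigs[-1]         # PL constant and smoothness constant
eta = 1.0 / beta                     # step size from assumption 4

w = np.zeros(m)
L0 = loss(w)
for t in range(1, 201):
    w = w - eta * grad(w)
    # geometric decay bound L(w_t) <= (1 - eta*mu)^t L(w_0)
    assert loss(w) <= (1 - eta * mu) ** t * L0 + 1e-9

print(f"final loss: {loss(w):.3e}, bound: {(1 - eta * mu) ** 200 * L0:.3e}")
</syntaxhighlight>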


==Misc==