Deep Learning


==DL Optimization==
(Lectures 2-3, September 3 and 8, 2020)
The role of "over-parameterization".
In general, the loss landscape can have poor local minima and saddle points (where the Hessian has both positive and negative eigenvalues).
However, in practice GD & SGD work pretty well.
Lecture 2 (Sept 3) is about Liu ''et al.'' <ref name="liu2020towards" />.


Suppose we have a classification problem with \(c\) labels and \(N\) samples total. This gives us \(n=Nc\) constraints.
Interpolation (zero training loss) is possible if the number of parameters \(m\) is greater than the number of constraints \(n=Nc\).
This is called the over-parameterized regime.
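As a quick sanity check of this counting (a minimal sketch; the two-layer network and the sizes below are illustrative assumptions, not from the lecture):
<syntaxhighlight lang="python">
# Compare the number of parameters m of a small MLP with the number of
# constraints n = N*c imposed by interpolating N labeled samples.
N, c, d, h = 1000, 10, 784, 2048   # samples, classes, input dim, hidden width

# Two-layer MLP: d -> h -> c, with biases.
m = (d * h + h) + (h * c + c)      # total trainable parameters
n = N * c                          # one constraint per sample per output

print(f"parameters m = {m}, constraints n = {n}")
print("over-parameterized" if m > n else "under-parameterized")
</syntaxhighlight>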
Note that <math>\eta \mu</math> is the inverse of the condition number.
}}
;Beginning of Lecture 3 (September 8, 2020)
Last time (lecture 2) we saw that over-parameterization leads us to expect zero training error.
This is the interpolation regime.
Here the number of parameters \(m\) is greater than the number of constraints \(n=Nc\).
The set of interpolating solutions \(w^*\) forms a manifold: the intersection over all training samples of the sets \(\{w : F(w, x_i)=Y_i\}\).
Assume we use squared loss \(L(w)=\frac{1}{2}\Vert F(w)-Y \Vert^2\).
In general, our loss function is non-convex, even locally.
PL condition: if \(\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)\) on some region, we say the loss is \(\mu\)-PL on that region.
If our loss satisfies the PL condition (plus the smoothness conditions below), then gradient descent converges to zero loss at a geometric rate.
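A simple concrete instance (an illustrative example, not from the lecture): for a linear model \(F(w)=Xw\) in the over-parameterized regime, where the \(n \times m\) matrix \(X\) has full row rank, the squared loss is \(\mu\)-PL with \(\mu = \lambda_{\min}(XX^T) > 0\):
<math display="block">
\Vert \nabla L(w) \Vert^2 = \Vert X^T (Xw - Y) \Vert^2 \geq \lambda_{\min}(XX^T) \Vert Xw - Y \Vert^2 = 2\lambda_{\min}(XX^T) L(w).
</math>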
;Intuition of PL:
Assume:
# Loss <math display="inline">L(w)</math> is <math display="inline">\mu</math>-PL on a ball <math display="inline">B</math>
# Loss is <math display="inline">\beta</math>-smooth
# The Hessian is bounded by <math display="inline">\beta</math>: <math display="inline">H(w) \preceq \beta I</math>
# <math display="inline">\eta=\frac{1}{\beta}</math>
# Gradient descent update: <math display="inline">w_{t+1}=w_t-\eta \nabla L(w_t)</math>
By Taylor's expansion:
<math display="block">
\begin{aligned}
L(w_{t+1}) &= L(w_t) + (w_{t+1}-w_t)^T \nabla L(w_t) + \frac{1}{2}(w_{t+1}-w_t)^T H(w)(w_{t+1}-w_{t})\\
&=L(w_t) + (-\eta)\nabla L(w_t)^T \nabla L(w_t) + \frac{1}{2}(-\eta)\nabla L(w_t)^T H(w)(-\eta) \nabla L(w_t)\\
&= L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\eta^2}{2} \nabla L(w_t)^T H(w) \nabla L(w_t)\\
&\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 (1-\frac{\eta \beta}{2}) &\text{by assumption 3}\\
&\leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2 &\text{by assumption 4}\\
&\leq L(w_t) - \eta \mu L(w_t) &\text{by }\mu\text{-PL assumption}\\
&= (1-\eta \mu)L(w_t)
\end{aligned}
</math>
This implies our loss at any iteration is <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>. 
Thus we see a geometric or exponential decrease in our loss function with convergence rate <math>(1-\eta \mu)</math>.
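A quick numerical check of this rate (a minimal sketch, not from the lecture; it uses an over-parameterized least-squares problem, for which the loss is \(\mu\)-PL with \(\mu=\lambda_{\min}(XX^T)\) and the Hessian is bounded by \(\beta=\lambda_{\max}(X^TX)\)):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 100                       # constraints n, parameters m (over-parameterized: m > n)
X = rng.standard_normal((n, m))
Y = rng.standard_normal(n)

loss = lambda w: 0.5 * np.sum((X @ w - Y) ** 2)
grad = lambda w: X.T @ (X @ w - Y)

# Hessian of the loss is X^T X; its nonzero eigenvalues equal those of X X^T.
eigs = np.linalg.eigvalsh(X @ X.T)   # ascending order
mu, beta = eigs[0], eigs[-1]         # PL constant and smoothness constant
eta = 1.0 / beta                     # step size from assumption 4

w = np.zeros(m)
L0 = loss(w)
for t in range(1, 201):
    w = w - eta * grad(w)
    # geometric decay bound L(w_t) <= (1 - eta*mu)^t L(w_0)
    assert loss(w) <= (1 - eta * mu) ** t * L0 + 1e-9

print(f"final loss: {loss(w):.3e}, bound: {(1 - eta * mu) ** 200 * L0:.3e}")
</syntaxhighlight>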


==Misc==