Deep Learning: Difference between revisions

From David's Wiki
No edit summary
Line 238: Line 238:
<ref name="liu2020towards">Chaoyue Liu, Libin Zhu, Mikhail Belkin (2020). Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning [https://arxiv.org/abs/2003.00307 https://arxiv.org/abs/2003.00307]</ref>
<ref name="liu2020towards">Chaoyue Liu, Libin Zhu, Mikhail Belkin (2020). Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning [https://arxiv.org/abs/2003.00307 https://arxiv.org/abs/2003.00307]</ref>
<ref name="du2019gradient">Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh (2019). Gradient Descent Provably Optimizes Over-parameterized Neural Networks (ICLR 2019) [https://arxiv.org/abs/1810.02054 https://arxiv.org/abs/1810.02054]</ref>
<ref name="du2019gradient">Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh (2019). Gradient Descent Provably Optimizes Over-parameterized Neural Networks (ICLR 2019) [https://arxiv.org/abs/1810.02054 https://arxiv.org/abs/1810.02054]</ref>
<ref name="soudry2018implicit">Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro (2018) The Implicit Bias of Gradient Descent on Separable Data ''The Journal of Machine Learning Research'' 2018 [https://arxiv.org/abs/1710.10345 https://arxiv.org/abs/1710.10345]</ref>
}}
}}

Revision as of 15:12, 10 September 2020

Notes for CMSC 828W: Foundations of Deep Learning (Fall 2020) taught by Soheil Feizi.

Basics

A refresher of Machine Learning and Supervised Learning.

Empirical risk minimization (ERM)

Minimize loss function over your data: \(\displaystyle \min_{W} \frac{1}{N} \sum_{i=1}^{N} l(f_{W}(x_i), y_i))\)

Loss functions

For regression, can use quadratic loss: \(\displaystyle l(f_W(x), y) = \frac{1}{2}\Vert f_W(x)-y \Vert^2\)

For 2-way classification, can use hinge-loss: \(\displaystyle l(f_W(x), y) = \max(0, 1-yf_W(x))\)

For multi-way classification, can use cross-entropy loss:
\(\displaystyle g(z)=\frac{1}{1+e^{-z}}\)
\(\displaystyle \min_{W} \left[-\sum_{i=1}^{N}\left[y_i\log(y(f_W(x_i)) + (1-y_i)\log(1-g(f_W(x_i))\right] \right]\)

Nonlinear functions

Given an activation function \(\phi()\), \(\phi w^tx + b\) is a nonlinear function.

Models

Multi-layer perceptron (MLP): Fully-connected feed-forward network.

Convolutional neural network

Optimization

Apply gradient descent or stochastic gradient descent to find \(W^*\).

Stochastic GD:

  1. Sample some batch \(B\)
  2. \(w^{(t+1)} = w^{(t)} - \eta \frac{1}{|B|} \sum_{i \in B} \nabla_{W} l(f_{W}(x_i), y_i)\)

Optimizers/Solvers:

  • Momentum
  • RMSProp
  • Adam

DL Optimization

(Lectures 2-3, September 3)
The role of "over-parameterization".

In general, you can have poor local minimums and saddle points (with pos+neg Hessian).
However, in practice GD & SGD work pretty well.
Lecture 2 (Sept 3) is about Liu et al. [1].

Suppose we have a classification problem with \(c\) labels and \(N\) samples total. This gives us \(n=N*c\) constraints.
Interpolation is possible if the number of parameters \(m\) is greater than the number of samples \(n=N*c\). This is called the over-parameterized regime.

In the exact interpolation regime, we have: \[ \begin{aligned} f_W(x_1) &= y_1 = \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^c\\ &\vdots\\ f_W(x_1) &= y_1 = \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^c\\ \end{aligned} \] This can be rewritten as \(\displaystyle F(w)=y\) where x is implicit.

In the under-parameterized regime, we get poor locals. (See Fig 1a of [1]).
The poor-locals are locally convex in the under-parameterized regime but not in the over-parameterized.
Over-parameterized models have essential non-convexity in their loss landscape.

Minimizers of convex functions form convex sets.
Manifold of \(w^*\) have some curvature.
Manifold of \(w^*\) should be an affine subspace. This is in general for non-linear functions.
So we cannot rely on convex analysis to understand over-parameterized systems.

Instead of convexity, we use PL-condition (Polyak-Lojasiewicz, 1963):
For \(\displaystyle w \in B\), \(\displaystyle \frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)\) which implies exponential (linear) convergence of GD.

Tangent Kernels

Suppose our model is \(F(w)=y\) where \(w \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\).
Then our tangent kernel is:
\[K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}\] where \(\nabla F(w) \in \mathbb{R}^{n \times m}\)

Theorem 4.1 (Uniform conditioning implies PL* condition)

If \(\lambda_{\min} (K(w)) \geq \mu\) then \(\mu\text{-PL*}\) on \(B\).

Proof

\(\displaystyle \begin{aligned} \frac{1}{2}\Vert \nabla f(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\ &=\frac{1}{2}(F(w)=y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\ &\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\ &= \lambda_{\min}(K(w)) L(w)\\ &\geq \mu L(w) \end{aligned} \)

Informal convergence result:
GD converges exp fast to a solution with the rate controlled by a condition number: \[ \kappa_{F} = \frac{\sup_{B} \lambda_{\max}(H)}{\inf_{B} \lambda_{\min}(K)}\] Ideally we want this to be small.

Example


\(\displaystyle F(w) = Aw = y \in \mathbb{R}^n\)
\(\displaystyle L(w) = \frac{1}{2} \Vert Aw-y\Vert^2\)
Our tangent kernel is: \(\displaystyle K(w)=AA^T\) and our hessian is \(\displaystyle H=A^TA\).

\(\displaystyle \kappa_{F} = \frac{\lambda_{\max}(K)}{\lambda_{\min}(K)} \geq 1\).

Our intuition is: \(\displaystyle K=AA^T \implies E[\log K_F] \approx \log(\frac{M}{|m-n|+1})\).
So as \(\displaystyle m \to \infty\), \(\displaystyle \kappa_{F} \to 1\).

Theorem 4.2 (Local PL* implies existence of a solution + fast convergence)

Assume \(\displaystyle L(w)\) is \(\beta\)-smooth and satisfies \(\mu\)-PL condition around a ball \(\displaystyle B(w_0, R)\) with \(\displaystyle R = \frac{2\sqrt{w\beta L(w_0)}{\mu}\). Then:

  • Existence of a solution: There exists a global minimum of \(\displaystyle L\), \(\displaystyle \mathbf{w}^* \in B(\mathbf{w}_0, R)\) s.t. \(\displaystyle F(\mathbf{w}^*)=\mathbf{y}\).
  • Convergence of GD: GD with step size \(\displaystyle \eta = 1/\sup_{\mathbf{w}}\lambda_{\max}(H)\) converges with exponential convergence rate:
    \[L(w_t) \leq \left(1-\kappa^{-1}(B(\mathbf{w}, R))\right)^t L(\mathbf{w}_0)\]
Convergence proof:


We want to prove:

  • A solution exists.
  • \(\displaystyle L(w_t) \leq (1-\eta \mu)^t L(w_0)\)

Note that \(\displaystyle \eta \mu\) is the inverse of the condition number.

Beginning of Lecture 3 (September 8, 2020)

Last time (lecture 2) we expect zero training error from overparameterization. This is the interpolation regime.
Here the number of parameters \(m\) is greater than the number of samples \(n=Nc\).
Here the manifold of \(w^*\) is the intersection of functions where \(F(w, x_1)=Y_1\) and \(F(w,x_2)=Y_2\).
Assume we use squared loss \(L(w)=\frac{1}{2}\Vert F(w)-Y \Vert^2\).
In general, our loss function is non convex, even locally.
PL-condition: If the magnitude of gradient is greater than a constant multiple \(\mu\) of loss then we say it is \(\mu\)-PL.
If our loss satisfies PL and some other conditions, then it will converge with gradient descent.

Intuition of PL

Assume:

  1. Loss \(\displaystyle L(w)_{w \in B}\) is \(\mu\)-PL
  2. Loss is \(\beta\)-smooth
  3. The hessian is bounded by \(\beta\)
  4. \(\eta=\frac{1}{\beta}\)
  5. Assume gradient descent: \(w_{t+1}=w_t-\mu \nabla L(w_t)\)

By Taylor's expansion: \[ \begin{aligned} L(w_{t+1}) &= L(w_t) + (w_{t+1}-w_t)^T \nabla f(w_t) + \frac{1}{2}(w_{t+1}-w_t)^T H(w')(w_{t+1}-w_{t})\\ &=L(w_t) + (-\eta)\nabla L(w_t)^T \nabla f(w_t) + \frac{1}{2}(-\eta)\nabla L(w_t)^T H(w')(-\eta) \nabla L(w_t)\\ &\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\eta^2}{2} \nabla L(w_t)^T H(w') \nabla L(w_t)\\ &\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 (1-\frac{\eta \beta}{2}) &\text{by assumption 3}\\ &\leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2 &\text{by assumption 4}\\ &\leq L(w_t) - \eta \mu L(w_t) &\text{by }\mu\text{-PL assumption}\\ &= (1-\eta \mu)L(w_t) \end{aligned} \] This implies our loss at any iteration is \(\displaystyle L(w_t) \leq (1-\eta \mu)^t L(w_0)\).
Thus we see a geometric or exponential decrease in our loss function with convergence rate \(\displaystyle (1-\eta \mu)\).
Similar results hold for SGD.

Q&A

If we don't have convergence, we're not sure if this means we're violating the \(\mu\)-PL condition. It is possible one of the other assumptions is violated (e.g. if learning rate is too large).
These arguments should hold for ReLU networks since the non-differentiable points are measure 0 but it would require a more careful analysis.

Caveat:
Here we assume our loss is in a ball \(\displaystyle B(w_0, R) = \{w | \Vert w - w_0 \Vert \leq R\}\) where \(\displaystyle R \leq \frac{2\sqrt{2 \beta L(w_0)}}{\mu}\).
We need to show that our gradient updates never causes us to leave this ball.
The lengths of the update vectors are \(\displaystyle \Vert w_{t+1} - w_{t}\Vert = \eta \Vert \nabla L(w_t)\Vert\).
To show that we never leave the ball, we need to have an upper bound on the length of our gradients \(\displaystyle \Vert \nabla L(w_t) \Vert\).
There is a tradeoff because for PL, we want a large \(\displaystyle \mu\) to lower bound the gradients but that would require satisfying PL over a large ball. If \(\displaystyle \mu\) is small, then we have fast (large) updates but over a small ball.
From the proof above, we have: \(\displaystyle L(w_{t+1}) \leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2\).
We can use this to prove we are in the ball.

Why do neural networks satisfy the conditioning assumptions?

Hessian Control
We can show \(mu\)-PL by showing the smallest eigenvalue of the tangent kernel is bounded: \(\displaystyle \lambda_{\min}(K(w_0)) \geq \mu\) and \(\displaystyle \sup_{B} \Vert H(F)\Vert\). The tangent kernel is \(\displaystyle \nabla F(w) \nabla F(w)^T\). If hessian is bounded then gradients don't change too fast so if we are \(\displaystyle \mu\)-PL at the initialization then we are \(\displaystyle \mu\)-PL in a ball around the initialization.

Suppose we have a NN: \(\displaystyle x \in \mathbb{R} \to y\).
\(\displaystyle f(w, x) = \frac{1}{\sqrt{m}}\sum_{i=1}^{m} v_i \sigma(w_i, x)\).

Can we prove convergence of GD for this NN?

\(\displaystyle \nabla_{w_i} f(w, x) = \frac{1}{\sqrt{m}} v_i x \sigma'(w_i x)\)
\(\displaystyle K(w, x_j, x_j) = \frac{1}{m}\sum_{i=1}^{m} v_i^2 x^2 \left(\sigma'(w_i x)\right)^2 \equiv O(1)\) are the j-th diagonal terms of the tangent kernel: \(\displaystyle K(w) \in \mathbb{R}^{n \times n}\).
Then the trace of the tangent kernel is also \(\displaystyle O(1)\) so \(\displaystyle \Vert K(w) \Vert = O(1)\).
\(\displaystyle H_{ij} = \frac{1}{\sqrt{m}} v_i \sigma '' (w_j x) x^2 1_{i=j}\)
The hessian is a diagonal matrix.
\(\displaystyle \Vert H \Vert_2 = \max_{i \in [m]} H_{ii} = \frac{x^2}{\sqrt{m}} \max_{i \in [m]} | v_i \sigma '' (w_j x)| = O(\frac{1}{\sqrt{m}})\)
As m goes to infinity, our hessian \(\displaystyle H\) goes to 0 and tangent kernel \(\displaystyle K\) goes to a constant.
Thus Hessian control implies convergence of GD/SGD.

  • However this argument assumes the model is almost linear.
  • This hessian control can be extended to L-layer NN with \(\displaystyle m=\omega(\frac{n}{\mu})\)

Example:
\(\displaystyle g(w, x) = \phi(\frac{1}{\sqrt{m}}\sum_{i=1}^{m}v_i \sigma(w_i x))\) then

  • \(\displaystyle \Vert K(w) \Vert = O(1)\)
  • \(\displaystyle \Vert H \Vert = O(1)\) if \(\displaystyle (\phi'' \neq 0)\)

so we cannot use hessian control.

Lemma

If f is mu-PL then \(\displaystyle \phi \circ f\) is \(\displaystyle \mu \rho^2\)-PL (\(\displaystyle | \phi'(f(w,x))| \gt \phi\)).
implies \(\displaystyle L(w_t) \leq (1- \eta \mu \rho^2)L(w_0)\).
GD converges even though our model does not go to a linear model.

Take-away

Over-parameterization does not lead to linearization. Over-parameterization leads to good conditioning which leads to PL and convergence of GD/SGD.

Other papers:

  • Simon Du et al.[2]

Misc

Visible to::users

Resources

  1. 1.0 1.1 Chaoyue Liu, Libin Zhu, Mikhail Belkin (2020). Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning https://arxiv.org/abs/2003.00307
  2. Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh (2019). Gradient Descent Provably Optimizes Over-parameterized Neural Networks (ICLR 2019) https://arxiv.org/abs/1810.02054
Cite error: <ref> tag with name "soudry2018implicit" defined in <references> is not used in prior text.