Notes for CMSC 828W: Foundations of Deep Learning (Fall 2020) taught by Soheil Feizi.

  • Course Website: http://www.cs.umd.edu/class/fall2020/cmsc828W/

My notes are intended to be a concise reference for myself, not a comprehensive replacement for lecture.
I may omit portions covered in other wiki pages or things I find less interesting.

Basics

A refresher on machine learning and supervised learning.

Empirical risk minimization (ERM)

Minimize the loss function over your data: \(\displaystyle \min_{W} \frac{1}{N} \sum_{i=1}^{N} l(f_{W}(x_i), y_i)\)
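
As a quick illustration (my own sketch, not from the lecture), the empirical risk is just an average of per-example losses; the linear model and quadratic loss below are arbitrary placeholders.

  import numpy as np

  def empirical_risk(W, X, Y, loss):
      # (1/N) * sum_i l(f_W(x_i), y_i) for a generic per-example loss
      return np.mean([loss(X[i] @ W, Y[i]) for i in range(len(X))])

  # Placeholder model f_W(x) = W^T x with quadratic loss, on random data.
  quadratic = lambda pred, y: 0.5 * np.sum((pred - y) ** 2)
  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 5))   # N = 100 samples, 5 features
  Y = rng.normal(size=(100, 3))   # c = 3 outputs
  W = rng.normal(size=(5, 3))
  print(empirical_risk(W, X, Y, quadratic))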

Loss functions

For regression, can use quadratic loss: \(\displaystyle l(f_W(x), y) = \frac{1}{2}\Vert f_W(x)-y \Vert^2\)

For 2-way classification, can use hinge-loss: \(\displaystyle l(f_W(x), y) = \max(0, 1-yf_W(x))\)

For classification, can use cross-entropy loss. With the sigmoid \(\displaystyle g(z)=\frac{1}{1+e^{-z}}\), the binary cross-entropy is:
\(\displaystyle \min_{W} \left[-\sum_{i=1}^{N}\left[y_i\log(g(f_W(x_i))) + (1-y_i)\log(1-g(f_W(x_i)))\right] \right]\)
For multi-way classification, this generalizes to the softmax cross-entropy.
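
A minimal NumPy sketch of these losses (my own illustration; it assumes scalar scores, labels in {-1, +1} for the hinge loss and {0, 1} for the cross-entropy, and clips probabilities only for numerical safety):

  import numpy as np

  def quadratic_loss(pred, y):
      return 0.5 * np.sum((pred - y) ** 2)          # 1/2 ||f_W(x) - y||^2

  def hinge_loss(score, y):
      return max(0.0, 1.0 - y * score)              # y in {-1, +1}

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))               # g(z)

  def binary_cross_entropy(score, y):
      p = np.clip(sigmoid(score), 1e-12, 1 - 1e-12) # avoid log(0)
      return -(y * np.log(p) + (1 - y) * np.log(1 - p))

  print(quadratic_loss(np.array([0.2, 0.8]), np.array([0.0, 1.0])))
  print(hinge_loss(0.3, +1), binary_cross_entropy(2.0, 1))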

Nonlinear functions

Given an activation function \(\phi(\cdot)\), \(\phi(w^T x + b)\) is a nonlinear function.
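
For example, with ReLU as \(\phi\) (an arbitrary choice of weights, just for illustration):

  import numpy as np

  relu = lambda z: np.maximum(0.0, z)   # phi

  w = np.array([1.0, -2.0, 0.5])
  b = -0.1
  x = np.array([0.3, 0.2, 1.0])
  print(relu(w @ x + b))                # phi(w^T x + b)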

Models

Multi-layer perceptron (MLP): Fully-connected feed-forward network.

Convolutional neural network
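
A minimal sketch of an MLP forward pass in NumPy (one hidden layer with ReLU; the sizes are made up):

  import numpy as np

  def mlp_forward(x, W1, b1, W2, b2):
      # f_W(x) = W2 * relu(W1 x + b1) + b2
      h = np.maximum(0.0, W1 @ x + b1)   # hidden layer
      return W2 @ h + b2                 # output scores/logits

  rng = np.random.default_rng(0)
  d, h, c = 5, 16, 3                     # input dim, hidden width, #classes
  W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
  W2, b2 = rng.normal(size=(c, h)), np.zeros(c)
  print(mlp_forward(rng.normal(size=d), W1, b1, W2, b2))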

Optimization

Apply gradient descent or stochastic gradient descent to find \(W^*\).

Stochastic GD:

  1. Sample some batch \(B\)
  2. \(W^{(t+1)} = W^{(t)} - \eta \frac{1}{|B|} \sum_{i \in B} \nabla_{W} l(f_{W^{(t)}}(x_i), y_i)\)
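
A minimal sketch of this mini-batch SGD loop for a linear model with quadratic loss (the data, step size \(\eta\), and batch size are placeholders I picked for illustration):

  import numpy as np

  rng = np.random.default_rng(0)
  N, d = 200, 10
  X = rng.normal(size=(N, d))
  w_true = rng.normal(size=d)
  y = X @ w_true                     # targets from a made-up linear model

  w = np.zeros(d)
  eta, batch_size = 0.05, 32
  for t in range(500):
      B = rng.choice(N, size=batch_size, replace=False)   # 1. sample a batch
      grad = X[B].T @ (X[B] @ w - y[B]) / batch_size      # 2. averaged gradient of 1/2||f_W(x)-y||^2
      w = w - eta * grad
  print(np.linalg.norm(w - w_true))  # close to 0: SGD recovers w_true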

Optimizers/Solvers:

  • Momentum
  • RMSProp
  • Adam
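
As a sketch, the standard momentum and Adam update rules (textbook forms with typical default hyperparameters, not values from the course):

  import numpy as np

  def momentum_step(w, v, grad, eta=0.01, beta=0.9):
      # heavy-ball momentum: accumulate a velocity, then step
      v = beta * v + grad
      return w - eta * v, v

  def adam_step(w, m, v, grad, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
      # Adam: bias-corrected running estimates of the gradient and its square
      m = b1 * m + (1 - b1) * grad
      v = b2 * v + (1 - b2) * grad ** 2
      m_hat = m / (1 - b1 ** t)          # t starts at 1
      v_hat = v / (1 - b2 ** t)
      return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

  # Tiny usage example on L(w) = 1/2 ||w||^2, whose gradient is w itself.
  w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
  for t in range(1, 2001):
      w, m, v = adam_step(w, m, v, grad=w, t=t)
  print(w)                               # has shrunk from 1.0 toward 0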

DL Optimization

The role of "over-parameterization".
In general, the loss landscape can have poor local minima and saddle points (where the Hessian has both positive and negative eigenvalues).
However, in practice GD & SGD work pretty well.
Lecture 2 (Sept 3) is about Liu et al. [1].

Suppose we have a classification problem with \(c\) labels and \(N\) samples total.
Interpolation (fitting the training data exactly) is possible if the number of parameters \(m\) is greater than the number of constraints \(n=Nc\) (one equation per output coordinate per sample). This is called the over-parameterized regime.
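
For example, counting parameters of a one-hidden-layer MLP against \(n = Nc\) (the sizes below are made up):

  d, h, c, N = 32, 1024, 10, 1000  # input dim, hidden width, #classes, #samples

  m = (d * h + h) + (h * c + c)    # weights + biases of a 1-hidden-layer MLP
  n = N * c                        # one equation per output coordinate per sample
  print(f"m = {m}, n = {n}, over-parameterized: {m > n}")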

In the exact interpolation regime, we have:
\[
\begin{aligned}
f_W(x_1) &= y_1 \in \mathbb{R}^c\\
&\vdots\\
f_W(x_N) &= y_N \in \mathbb{R}^c
\end{aligned}
\]
where each \(y_i\) is a one-hot label vector such as \(\begin{bmatrix} 0 & \cdots & 1 \end{bmatrix}^T\). This can be rewritten as \(\displaystyle F(w)=y\), where the data \(x\) is implicit in \(F\).

In the under-parameterized regime, we get poor (spurious) local minima (see Fig. 1a of [1]).
Around its minima, the loss landscape is locally convex in the under-parameterized regime but not in the over-parameterized regime.
Over-parameterized models have essential non-convexity in their loss landscape.

Minimizers of convex functions form convex sets, so if the loss were convex (even locally), the manifold of minimizers \(w^*\) would have no curvature, i.e. it would locally be an affine subspace.
For a generic non-linear \(F\), however, the solution manifold \(\{w : F(w)=y\}\) is curved.
So we cannot rely on convex analysis to understand over-parameterized systems.

Instead of convexity, we use the PL condition (Polyak-Łojasiewicz, 1963):
\(L\) is \(\mu\)-PL on a set \(B\) if \(\displaystyle \frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu (L(w) - L^*)\) for all \(w \in B\) (here \(L^* = 0\) at interpolation), which implies exponential (linear) convergence of GD.
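
A quick numerical illustration (my own, not from the lecture): \(f(x) = x^2 + 3\sin^2(x)\) is a commonly used example of a non-convex function satisfying a PL inequality; the code estimates the PL constant on an interval and checks that GD drives the loss to zero geometrically.

  import numpy as np

  f      = lambda x: x**2 + 3 * np.sin(x)**2    # non-convex, minimum f* = 0 at x = 0
  grad_f = lambda x: 2*x + 3 * np.sin(2*x)

  # Estimate the PL constant on [-10, 10]: mu_hat = min_x (1/2)|f'(x)|^2 / f(x).
  xs = np.linspace(-10, 10, 100001)
  xs = xs[np.abs(xs) > 1e-3]                    # skip the minimizer (0/0)
  mu_hat = np.min(0.5 * grad_f(xs)**2 / f(xs))
  print("estimated PL constant:", mu_hat)       # strictly positive

  # Gradient descent converges linearly (geometrically) despite non-convexity.
  x = 2.5
  for _ in range(300):
      x -= 0.1 * grad_f(x)
  print("f after 300 GD steps:", f(x))          # essentially 0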

Tangent Kernels: Suppose our model is \(F(w)=y\) where \(w \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\).
Then our tangent kernel is:
\(K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}\), where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) is the Jacobian of \(F\).
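
A sketch of the tangent kernel computed with a finite-difference Jacobian for a small one-hidden-layer network (the model, sizes, and step size are my own placeholders):

  import numpy as np

  def model_outputs(w, X, d=3, h=8):
      # hypothetical F(w): 1-hidden-layer tanh network with scalar outputs, w flattened
      W1, w2 = w[:d*h].reshape(h, d), w[d*h:]
      return np.tanh(X @ W1.T) @ w2               # shape (n,)

  def tangent_kernel(w, X, eps=1e-5):
      # K(w) = dF(w) dF(w)^T, with the n x m Jacobian dF(w) from finite differences
      F0 = model_outputs(w, X)
      J = np.zeros((len(F0), len(w)))
      for j in range(len(w)):
          w_pert = w.copy()
          w_pert[j] += eps
          J[:, j] = (model_outputs(w_pert, X) - F0) / eps
      return J @ J.T

  rng = np.random.default_rng(0)
  d, h, n = 3, 8, 5
  X = rng.normal(size=(n, d))
  w = rng.normal(size=d * h + h)                  # m = 32 parameters, n = 5 outputs
  K = tangent_kernel(w, X)
  print(K.shape, np.linalg.eigvalsh(K).min())     # (5, 5) and lambda_min(K)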

Lemma

For the square loss \(L(w)=\frac{1}{2}\Vert F(w)-y \Vert^2\): if \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) is \(\mu\)-PL on \(B\).
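
A minimal check of the lemma in the simplest over-parameterized case, a linear model \(F(w) = Xw\) (my own example): here \(K = XX^T\) is constant, \(\mu = \lambda_{\min}(XX^T) > 0\), and GD on the square loss should contract by at least a factor \(1 - \eta\mu\) per step.

  import numpy as np

  rng = np.random.default_rng(0)
  n, m = 20, 100                       # n equations, m parameters (over-parameterized)
  X = rng.normal(size=(n, m))
  y = rng.normal(size=n)

  K = X @ X.T                          # tangent kernel of F(w) = Xw (constant in w)
  mu = np.linalg.eigvalsh(K).min()     # PL constant from the lemma
  beta = np.linalg.eigvalsh(K).max()   # smoothness constant of the square loss

  L = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
  w, eta = np.zeros(m), 1.0 / beta
  losses = [L(w)]
  for _ in range(100):
      w -= eta * X.T @ (X @ w - y)     # gradient step on the square loss
      losses.append(L(w))

  ratios = [losses[t + 1] / losses[t] for t in range(100)]
  print("worst per-step decay:", max(ratios), "<= 1 - eta*mu =", 1 - eta * mu)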


Resources


  1. Chaoyue Liu, Libin Zhu, Mikhail Belkin (2020). Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning. https://arxiv.org/abs/2003.00307