Deep Learning
Notes for CMSC 828W: Foundations of Deep Learning (Fall 2020) taught by Soheil Feizi.
- Course Website: http://www.cs.umd.edu/class/fall2020/cmsc828W/
My notes are intended to be a concise reference for myself, not a comprehensive replacement for lecture.
I may omit portions covered in other wiki pages or things I find less interesting.
Basics
A refresher on machine learning and supervised learning.
Empirical risk minimization (ERM)
Minimize the average loss over your training data: \(\displaystyle \min_{W} \frac{1}{N} \sum_{i=1}^{N} l(f_{W}(x_i), y_i)\)
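To make the ERM objective concrete, here is a minimal NumPy sketch (not from the lecture); the linear model, quadratic loss, and random data are illustrative placeholders:

```python
import numpy as np

def empirical_risk(model, loss, W, X, Y):
    """Average per-sample loss of model(W, x) over the dataset (X, Y)."""
    return np.mean([loss(model(W, x), y) for x, y in zip(X, Y)])

# Illustrative setup: a linear model f_W(x) = W x with quadratic loss.
model = lambda W, x: W @ x
loss = lambda pred, y: 0.5 * np.sum((pred - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # N = 100 samples, 5 features each
Y = rng.normal(size=(100, 3))   # 3-dimensional targets
W = np.zeros((3, 5))            # model parameters

print(empirical_risk(model, loss, W, X, Y))
```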
Loss functions
For regression, can use quadratic loss: \(\displaystyle l(f_W(x), y) = \frac{1}{2}\Vert f_W(x)-y \Vert^2\)
For 2-way classification, can use hinge-loss: \(\displaystyle l(f_W(x), y) = \max(0, 1-yf_W(x))\)
For multi-way classification, can use cross-entropy loss (the binary form with sigmoid \(g\) is shown below; the multi-class version replaces \(g\) with a softmax):
\(\displaystyle g(z)=\frac{1}{1+e^{-z}}\)
\(\displaystyle \min_{W} \left[-\sum_{i=1}^{N}\left[y_i\log(g(f_W(x_i))) + (1-y_i)\log(1-g(f_W(x_i)))\right] \right]\)
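A minimal NumPy sketch of these losses for a single sample (illustrative, not from the lecture); it assumes \(y \in \{-1, +1\}\) for the hinge loss and \(y \in \{0, 1\}\) for the cross-entropy:

```python
import numpy as np

def quadratic_loss(pred, y):
    """l(f_W(x), y) = 1/2 ||f_W(x) - y||^2, for regression."""
    return 0.5 * np.sum((pred - y) ** 2)

def hinge_loss(score, y):
    """l(f_W(x), y) = max(0, 1 - y * f_W(x)); assumes labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(score, y, eps=1e-12):
    """-[y log g(f_W(x)) + (1 - y) log(1 - g(f_W(x)))]; assumes y in {0, 1}."""
    p = np.clip(sigmoid(score), eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

Summing any of these over the \(N\) training samples gives the corresponding ERM objective above.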
Nonlinear functions
Given an activation function \(\phi(\cdot)\), \(\phi(w^Tx + b)\) is a nonlinear function.
Models
Multi-layer perceptron (MLP): Fully-connected feed-forward network.
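To tie the last two definitions together, a minimal sketch of an MLP forward pass, where each hidden layer computes \(\phi(Wx + b)\); the ReLU activation and layer sizes are illustrative assumptions:

```python
import numpy as np

def relu(z):
    """One possible activation phi."""
    return np.maximum(0.0, z)

def mlp_forward(x, params, phi=relu):
    """Fully-connected feed-forward pass: h <- phi(W h + b) at each hidden layer."""
    h = x
    for W, b in params[:-1]:
        h = phi(W @ h + b)
    W_out, b_out = params[-1]        # output layer left linear here
    return W_out @ h + b_out

# Illustrative 2-hidden-layer MLP: 5 -> 16 -> 16 -> 3
rng = np.random.default_rng(0)
sizes = [5, 16, 16, 3]
params = [(rng.normal(size=(m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=5)
print(mlp_forward(x, params))        # a vector in R^3
```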
Optimization
Apply gradient descent or stochastic gradient descent to find \(W^*\).
Stochastic GD:
- Sample some batch \(B\)
- \(w^{(t+1)} = w^{(t)} - \eta \frac{1}{|B|} \sum_{i \in B} \nabla_{w} l(f_{w^{(t)}}(x_i), y_i)\)
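A minimal sketch of this mini-batch SGD update for a linear model with quadratic loss (illustrative assumptions: closed-form gradient, batch size 10, fixed learning rate):

```python
import numpy as np

def sgd(X, Y, W, lr=0.1, batch_size=10, steps=500, seed=0):
    """Mini-batch SGD on the quadratic loss l = 1/2 ||W x - y||^2."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    for _ in range(steps):
        batch = rng.choice(N, size=batch_size, replace=False)
        # grad_W of 1/(2|B|) sum ||W x_i - y_i||^2 is 1/|B| sum (W x_i - y_i) x_i^T
        residual = X[batch] @ W.T - Y[batch]        # shape (|B|, out_dim)
        grad = residual.T @ X[batch] / batch_size   # shape (out_dim, in_dim)
        W = W - lr * grad
    return W

rng = np.random.default_rng(1)
W_true = rng.normal(size=(3, 5))
X = rng.normal(size=(200, 5))
Y = X @ W_true.T                      # noiseless targets, so interpolation is possible
W_hat = sgd(X, Y, np.zeros((3, 5)))
print(np.linalg.norm(W_hat - W_true)) # should be small after training
```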
Optimizers/Solvers:
- Momentum
- RMSProp
- Adam
DL Optimization
The role of "over-parameterization".
In general, the loss can have poor local minima and saddle points (where the Hessian has both positive and negative eigenvalues).
However, in practice GD & SGD work pretty well.
Lecture 2 (Sept 3) is about Liu et al. [1].
Suppose we have a classification problem with \(c\) labels and \(N\) samples total.
Interpolation is possible if the number of parameters \(m\) is greater than the number of scalar constraints \(n = Nc\) (each of the \(N\) samples contributes \(c\) output equations).
This is called the over-parameterized regime.
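For a sense of scale (illustrative numbers, not from the lecture): with \(N = 50{,}000\) samples and \(c = 10\) classes, \(n = Nc = 500{,}000\), so any model with more than \(500{,}000\) parameters is over-parameterized in this sense.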
In the exact interpolation regime, we have:
\[
\begin{aligned}
f_W(x_1) &= y_1 = \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^c\\
&\;\;\vdots\\
f_W(x_N) &= y_N = \begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^c
\end{aligned}
\]
This can be rewritten as \(\displaystyle F(w)=y\), where the data \(x_1, \dots, x_N\) is absorbed into \(F\) and \(y \in \mathbb{R}^n\) stacks the labels.
In the under-parameterized regime, we get poor (non-global) local minima (see Fig. 1a of [1]). Around such minima the loss is locally convex; in the over-parameterized regime even this local convexity is lost.
Over-parameterized models have essential non-convexity in their loss landscape.
Minimizers of convex functions form convex sets, so if the loss were convex, even just locally around its minimizers, the set of minimizers \(w^*\) would be (locally) convex; for a linear model it would be an affine subspace. For a general non-linear \(F\), however, the solution set \(\{w : F(w) = y\}\) is typically a curved manifold, not an affine subspace, so the loss cannot be convex, even locally, around it.
So we cannot rely on convex analysis to understand over-parameterized systems.
Instead of convexity, we use the PL condition (Polyak-Lojasiewicz, 1963):
We say \(L\) is \(\mu\)-PL on a set \(B\) if \(\displaystyle \frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)\) for all \(w \in B\); this implies exponential (i.e. linear-rate) convergence of GD to a global minimum.
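A short sketch of why the PL condition gives linear convergence, under the additional assumptions (not stated above) that \(L\) is \(\beta\)-smooth, \(\min L = 0\), and GD uses step size \(\eta = 1/\beta\):
\[
\begin{aligned}
L(w^{(t+1)}) &\leq L(w^{(t)}) - \eta \Vert \nabla L(w^{(t)}) \Vert^2 + \frac{\beta \eta^2}{2} \Vert \nabla L(w^{(t)}) \Vert^2 \\
&= L(w^{(t)}) - \frac{1}{2\beta} \Vert \nabla L(w^{(t)}) \Vert^2 \\
&\leq \left(1 - \frac{\mu}{\beta}\right) L(w^{(t)}),
\end{aligned}
\]
so \(L(w^{(t)}) \leq (1 - \mu/\beta)^t L(w^{(0)})\), i.e. the loss decays geometrically.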
Tangent Kernels:
Suppose our model is \(F(w)=y\) where \(w \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\).
Then our tangent kernel is:
\(K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}\) where \(\nabla F(w) \in \mathbb{R}^{n \times m}\)
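A small numerical sketch of the tangent kernel, using a finite-difference Jacobian and a toy map \(F\) standing in for a real network's stacked outputs (both are illustrative assumptions):

```python
import numpy as np

def tangent_kernel(F, w, eps=1e-5):
    """K(w) = dF(w) dF(w)^T, with the Jacobian dF(w) in R^{n x m}
    approximated by central finite differences."""
    n = F(w).shape[0]
    m = w.shape[0]
    J = np.zeros((n, m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = eps
        J[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return J @ J.T                       # n x n, positive semi-definite

# Toy map F: R^3 -> R^2 (stands in for the concatenated network outputs).
F = lambda w: np.array([np.sin(w[0]) + w[1] * w[2],
                        w[0] ** 2 + np.tanh(w[1])])
w = np.array([0.3, -1.2, 0.7])
K = tangent_kernel(F, w)
print(K.shape)                           # (2, 2)
print(np.linalg.eigvalsh(K).min())       # lambda_min(K(w)) >= 0
```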
- Lemma
If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then the squared loss \(L(w) = \frac{1}{2}\Vert F(w) - y \Vert^2\) is \(\mu\)-PL on \(B\).
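A sketch of why the lemma holds: for the squared loss, \(\nabla L(w) = \nabla F(w)^T (F(w) - y)\), so
\[
\begin{aligned}
\frac{1}{2}\Vert \nabla L(w) \Vert^2
&= \frac{1}{2}(F(w) - y)^T \nabla F(w) \nabla F(w)^T (F(w) - y) \\
&= \frac{1}{2}(F(w) - y)^T K(w) (F(w) - y) \\
&\geq \frac{1}{2}\lambda_{\min}(K(w)) \Vert F(w) - y \Vert^2 = \lambda_{\min}(K(w))\, L(w) \geq \mu L(w).
\end{aligned}
\]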
Misc
Resources
1. Chaoyue Liu, Libin Zhu, Mikhail Belkin (2020). "Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning." https://arxiv.org/abs/2003.00307