Deep Learning
Notes for CMSC 828W: Foundations of Deep Learning (Fall 2020) taught by Soheil Feizi.
My notes are intended to be a concise reference for myself, not a comprehensive replacement for lecture.
I may omit portions covered in other wiki pages or things I find less interesting.
Basics
A refresher of Machine Learning and Supervised Learning.
Empirical risk minimization (ERM)
Minimize the loss function over your data: \(\displaystyle \min_{W} \frac{1}{N} \sum_{i=1}^{N} l(f_{W}(x_i), y_i)\)
Loss functions
For regression, can use quadratic loss: \(\displaystyle l(f_W(x), y) = \frac{1}{2}\Vert f_W(x)-y \Vert^2\)
For 2-way classification, can use hinge-loss: \(\displaystyle l(f_W(x), y) = \max(0, 1-yf_W(x))\)
For multi-way classification, can use cross-entropy loss (shown here in its binary form with the sigmoid \(g\); the multi-class version replaces \(g\) with a softmax):
\(\displaystyle g(z)=\frac{1}{1+e^{-z}}\)
\(\displaystyle \min_{W} \left[-\sum_{i=1}^{N}\left[y_i\log(g(f_W(x_i))) + (1-y_i)\log(1-g(f_W(x_i)))\right] \right]\)
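As a quick numerical sanity check (my own sketch, not from lecture; the toy values are made up), these losses can be computed directly:

<syntaxhighlight lang="python">
import numpy as np

def quadratic_loss(f_wx, y):
    """Squared-error loss for regression: 1/2 ||f_W(x) - y||^2."""
    return 0.5 * np.sum((np.asarray(f_wx) - np.asarray(y)) ** 2)

def hinge_loss(f_wx, y):
    """Hinge loss for 2-way classification with labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * f_wx)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(f_wx, y):
    """Cross-entropy with the sigmoid g applied to the model output."""
    g = sigmoid(f_wx)
    return -(y * np.log(g) + (1 - y) * np.log(1 - g))

print(quadratic_loss([1.2], [1.0]))      # 0.02
print(hinge_loss(0.4, +1))               # 0.6
print(binary_cross_entropy(2.0, 1))      # ~0.127
</syntaxhighlight>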
Nonlinear functions
Given an activation function \(\phi(\cdot)\), \(\phi(w^T x + b)\) is a nonlinear function of \(x\).
Models
Multi-layer perceptron (MLP): Fully-connected feed-forward network.
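A minimal sketch of an MLP forward pass (my own illustration; the ReLU choice of \(\phi\) and the layer sizes are arbitrary assumptions, not from lecture):

<syntaxhighlight lang="python">
import numpy as np

def relu(z):
    # One common choice for the activation phi.
    return np.maximum(0.0, z)

def mlp_forward(x, params):
    """Fully-connected feed-forward pass: h <- phi(W h + b) for each hidden layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)
    W_out, b_out = params[-1]   # last layer is left linear here
    return W_out @ h + b_out

# Toy network with made-up sizes: input dim 4, two hidden layers of width 8, 3 outputs.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
print(mlp_forward(rng.normal(size=4), params))   # a vector in R^3
</syntaxhighlight>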
Optimization
Apply gradient descent or stochastic gradient descent to find \(W^*\).
Stochastic GD:
- Sample some batch \(B\)
- \(w^{(t+1)} = w^{(t)} - \eta \frac{1}{|B|} \sum_{i \in B} \nabla_{W} l(f_{W}(x_i), y_i)\)
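A sketch of one such update (illustrative only; loss_grad is a hypothetical helper returning the per-example gradient \(\nabla_{W} l(f_{W}(x_i), y_i)\)):

<syntaxhighlight lang="python">
import numpy as np

def sgd_step(w, X, Y, loss_grad, eta, batch_size, rng):
    """One stochastic gradient step: average the gradient over a random batch B."""
    idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
    grad = np.mean([loss_grad(w, X[i], Y[i]) for i in idx], axis=0)
    return w - eta * grad

# Example: quadratic loss for a linear model f_w(x) = w.x (made-up data).
loss_grad = lambda w, x, y: (w @ x - y) * x       # gradient of 1/2 (w.x - y)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
Y = X @ w_true
w = np.zeros(5)
for _ in range(200):
    w = sgd_step(w, X, Y, loss_grad, eta=0.1, batch_size=16, rng=rng)
print(np.linalg.norm(w - w_true))                 # should be close to 0
</syntaxhighlight>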
Optimizers/Solvers:
- Momentum
- RMSProp
- Adam
DL Optimization
The role of "over-parameterization".
In general, the loss can have poor local minima and saddle points (where the Hessian has both positive and negative eigenvalues).
However, in practice GD & SGD work pretty well.
Lecture 2 (Sept 3) is about Liu et al. [1].
Suppose we have a classification problem with \(c\) labels and \(N\) samples total.
Interpolation (fitting every training point exactly) is possible if the number of parameters \(m\) is greater than the number of constraints \(n = Nc\).
This is called the over-parameterized regime.
In the exact interpolation regime, we have: \[ \begin{aligned} f_W(x_1) &= y_1 \in \mathbb{R}^c\\ &\vdots\\ f_W(x_N) &= y_N \in \mathbb{R}^c \end{aligned} \] where each \(y_i\) is a one-hot vector such as \((0, \dots, 1)^T\). This can be rewritten as \(\displaystyle F(w)=y\) where the data \(x\) is implicit.
In the under-parameterized regime, we get poor local minima. (See Fig 1a of [1].)
The loss is locally convex around these minima in the under-parameterized regime, but not in the over-parameterized regime.
Over-parameterized models have essential non-convexity in their loss landscape.
Minimizers of convex functions form convex sets. For linear models the set of solutions \(w^*\) is an affine subspace, but for general non-linear \(F\) the manifold of solutions \(w^*\) has curvature and is therefore not a convex set.
So we cannot rely on convex analysis to understand over-parameterized systems.
Instead of convexity, we use the PL condition (Polyak-Lojasiewicz, 1963):
For \(\displaystyle w \in B\), \(\displaystyle \frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)\), which implies exponential (linear) convergence of GD.
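To see where the exponential rate comes from (a standard one-step argument, assuming in addition that \(L\) is \(\beta\)-smooth and GD uses step size \(\eta \leq 1/\beta\)):
\[
\begin{aligned}
L(w_{t+1}) &\leq L(w_t) + \nabla L(w_t)^T (w_{t+1} - w_t) + \frac{\beta}{2}\Vert w_{t+1} - w_t \Vert^2 \\
&= L(w_t) - \eta\Vert \nabla L(w_t) \Vert^2 + \frac{\beta \eta^2}{2}\Vert \nabla L(w_t) \Vert^2 \\
&\leq L(w_t) - \frac{\eta}{2}\Vert \nabla L(w_t) \Vert^2 \\
&\leq (1-\eta\mu) L(w_t),
\end{aligned}
\]
which gives \(L(w_t) \leq (1-\eta\mu)^t L(w_0)\) by induction.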
Tangent Kernels:
Suppose our model is \(F(w)=y\) where \(w \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\).
Then our tangent kernel is:
\[K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}\]
where \(\nabla F(w) \in \mathbb{R}^{n \times m}\)
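A sketch of computing \(K(w)\) numerically (my own illustration; the tiny \(\tanh\) model and the finite-difference Jacobian are just for demonstration):

<syntaxhighlight lang="python">
import numpy as np

def jacobian_fd(F, w, eps=1e-6):
    """Finite-difference Jacobian of F: R^m -> R^n, returned with shape (n, m)."""
    n, m = len(F(w)), len(w)
    J = np.zeros((n, m))
    for j in range(m):
        e = np.zeros(m)
        e[j] = eps
        J[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return J

def tangent_kernel(F, w):
    J = jacobian_fd(F, w)       # nabla F(w), shape (n, m)
    return J @ J.T              # K(w) = nabla F(w) nabla F(w)^T, shape (n, n)

# Tiny over-parameterized model F: R^m -> R^n (made up for illustration).
rng = np.random.default_rng(0)
n, m = 5, 50
X = rng.normal(size=(n, m))
F = lambda w: np.tanh(X @ w)

K = tangent_kernel(F, rng.normal(size=m))
print(K.shape, np.linalg.eigvalsh(K).min())   # (5, 5) and lambda_min(K(w))
</syntaxhighlight>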
- Lemma: If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) is \(\mu\)-PL on \(B\).
Proof: \(\displaystyle \begin{aligned} \frac{1}{2}\Vert \nabla L(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\ &=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\ &\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\ &= \lambda_{\min}(K(w)) L(w) \end{aligned}\)
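A quick numerical check of the lemma (illustrative; it uses the linear model \(F(w) = Aw\) from the example below, where \(\nabla F = A\) exactly):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 50                                 # over-parameterized linear model
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

K = A @ A.T                                  # tangent kernel (constant in w here)
mu = np.linalg.eigvalsh(K).min()             # candidate PL constant

for _ in range(3):                           # check at a few random points w
    w = rng.normal(size=m)
    r = A @ w - y                            # residual F(w) - y
    L = 0.5 * r @ r                          # L(w) = 1/2 ||F(w) - y||^2
    grad_sq = np.linalg.norm(A.T @ r) ** 2   # ||grad L(w)||^2
    assert 0.5 * grad_sq >= mu * L - 1e-9    # mu-PL inequality from the lemma
print("mu-PL holds with mu =", mu)
</syntaxhighlight>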
Informal convergence result:
GD converges exponentially fast to a solution, with the rate controlled by a condition number:
\[ \kappa_{F} = \frac{\sup_{B} \lambda_{\max}(H)}{\inf_{B} \lambda_{\min}(K)}\]
Ideally we want this to be small.
Example: \(\displaystyle F(w) = Aw\) with \(A \in \mathbb{R}^{n \times m}\), so we want to solve \(Aw = y \in \mathbb{R}^n\).
\(\displaystyle L(w) = \frac{1}{2} \Vert Aw-y\Vert^2\)
Our tangent kernel is:
\(\displaystyle K(w)=AA^T\)
and our Hessian is \(\displaystyle H=A^TA\). Since \(A^TA\) and \(AA^T\) share the same nonzero eigenvalues, \(\displaystyle \kappa_{F} = \frac{\lambda_{\max}(K)}{\lambda_{\min}(K)} \geq 1\).
Our intuition: if \(A\) has random i.i.d. entries, then \(\displaystyle E[\log \kappa_F] \approx \log\left(\frac{m}{|m-n|+1}\right)\).
So as \(\displaystyle m \to \infty\), \(\displaystyle \kappa_{F} \to 1\).
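A sketch checking this numerically (illustrative; a Gaussian random \(A\) and these particular sizes are my own assumptions):

<syntaxhighlight lang="python">
import numpy as np

def kappa_F(A):
    """Condition number of the tangent kernel K = A A^T for F(w) = Aw."""
    eig = np.linalg.eigvalsh(A @ A.T)
    return eig.max() / eig.min()

rng = np.random.default_rng(0)
n = 20
for m in [25, 100, 1000, 10000]:     # widen the model while n stays fixed
    A = rng.normal(size=(n, m))
    print(m, kappa_F(A))             # kappa_F shrinks toward 1 as m grows
</syntaxhighlight>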
Convergence (Theorem 4.2 of [1]): local PL implies existence of a solution and fast convergence of GD.
Assume \(\displaystyle L(w)\) is \(\beta\)-smooth and satisfies the \(\mu\)-PL condition on a ball \(\displaystyle B(w_0, R)\) with \(\displaystyle R = \frac{2\sqrt{2\beta L(w_0)}}{\mu}\).
We want to prove:
- A solution exists.
- \(\displaystyle L(w_t) \leq (1-\eta \mu)^t L(w_0)\)
Note that with step size \(\displaystyle \eta = 1/\beta\), \(\displaystyle \eta \mu = \mu/\beta\) is the inverse of the condition number.
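A numerical sanity check of this rate (my own sketch; it reuses the over-parameterized linear example above with \(\beta = \lambda_{\max}(K)\), \(\mu = \lambda_{\min}(K)\), and \(\eta = 1/\beta\), which are my choices, not part of the theorem statement):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 200                           # over-parameterized: solutions to Aw = y exist
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

eigs = np.linalg.eigvalsh(A @ A.T)       # nonzero spectrum shared with the Hessian A^T A
mu, beta = eigs.min(), eigs.max()
eta = 1.0 / beta                         # standard step size for a beta-smooth loss

loss = lambda w: 0.5 * np.linalg.norm(A @ w - y) ** 2
w = np.zeros(m)
L0 = loss(w)
for t in range(1, 201):
    w = w - eta * A.T @ (A @ w - y)      # gradient step on L(w) = 1/2 ||Aw - y||^2
    assert loss(w) <= (1 - eta * mu) ** t * L0 + 1e-9   # the claimed rate
print(loss(w), (1 - eta * mu) ** 200 * L0)   # actual loss vs. theoretical bound
</syntaxhighlight>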
Misc
Resources
[1] Chaoyue Liu, Libin Zhu, Mikhail Belkin (2020). Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning. https://arxiv.org/abs/2003.00307