Notes for CMSC 828W: Foundations of Deep Learning (Fall 2020) taught by Soheil Feizi.
* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
In general, the loss landscape can have poor local minima and saddle points (where the Hessian has both positive and negative eigenvalues).
However, in practice GD and SGD work quite well.
Lecture 2 (Sept 3) is about Liu ''et al.''<ref name="liu2020towards"></ref>.
Suppose we have a classification problem with \(c\) labels and \(N\) samples total.
Interpolation is possible if the number of parameters \(m\) is greater than the number of scalar constraints \(n = Nc\) (one equation per sample and output).
This is called the over-parameterized regime.
In the exact interpolation regime, we have:
\[
\begin{aligned}
f_W(x_1) &= y_1 \in \mathbb{R}^c\\
&\vdots\\
f_W(x_N) &= y_N \in \mathbb{R}^c
\end{aligned}
\]
where each \(y_i\) is a one-hot label vector, e.g. \(y_i = \begin{bmatrix} 0 & \cdots & 1 \end{bmatrix}^T\).
Stacking all \(N\) equations, this can be rewritten as <math>F(w)=y</math> with \(y \in \mathbb{R}^n\), where the inputs ''x'' are implicit in the definition of \(F\).
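As a minimal sketch of this stacking (the toy one-hidden-layer architecture, sizes, and function names below are illustrative assumptions, not from the lecture):
<syntaxhighlight lang="python">
import numpy as np

# Toy sizes (illustrative): N samples, c classes, d input dims, h hidden units.
N, c, d, h = 5, 3, 4, 20
m = h * d + c * h          # number of parameters
n = N * c                  # number of scalar constraints
assert m > n               # over-parameterized regime

rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))                   # inputs (implicit in F)
Y = np.eye(c)[rng.integers(0, c, size=N)]     # one-hot labels, shape (N, c)

def f(w, x):
    """Per-sample model f_W(x) in R^c for a one-hidden-layer tanh network."""
    W1 = w[:h * d].reshape(h, d)
    W2 = w[h * d:].reshape(c, h)
    return W2 @ np.tanh(W1 @ x)

def F(w):
    """Stack all N per-sample outputs into a single map F: R^m -> R^n."""
    return np.concatenate([f(w, x) for x in X])

y = Y.reshape(-1)            # stacked label vector in R^n
w = rng.normal(size=m) * 0.5
print(F(w).shape, y.shape)   # both (n,) = (N*c,)
# Interpolation means solving the non-linear system F(w) = y exactly.
</syntaxhighlight>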
In the ''under-parameterized regime'', we get poor local minima (see Fig. 1a of <ref name="liu2020towards"></ref>).
These local minima are locally convex (the loss is convex in a neighborhood around them) in the under-parameterized regime, but not in the over-parameterized regime.
Over-parameterized models have ''essential non-convexity'' in their loss landscape.
Minimizers of convex functions form convex sets.
In the over-parameterized regime, however, the manifold of interpolating solutions \(w^*\) generically has nonzero curvature when \(F\) is non-linear.
If the loss were convex, this solution set would have to be convex (an affine subspace in the linear case), which it is not in general for non-linear functions.
So we cannot rely on convex analysis to understand over-parameterized systems.
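A concrete toy illustration of this (my own example, not from the paper): for the single non-linear equation \(w_1 w_2 = 1\) (so \(m = 2 > n = 1\)), the set of solutions is a hyperbola, which is not convex, so the corresponding loss cannot be convex:
<syntaxhighlight lang="python">
import numpy as np

def L(w):
    """Squared loss for the single non-linear equation w1 * w2 = 1."""
    return 0.5 * (w[0] * w[1] - 1.0) ** 2

w_a = np.array([2.0, 0.5])   # a global minimizer (2 * 0.5 = 1)
w_b = np.array([0.5, 2.0])   # another global minimizer
w_mid = 0.5 * (w_a + w_b)    # midpoint of the two minimizers

print(L(w_a), L(w_b))        # both 0.0
print(L(w_mid))              # > 0: the set of minimizers is not convex,
                             # so L is not convex (essential non-convexity).
</syntaxhighlight>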
Instead of convexity, we use the PL condition (Polyak-Lojasiewicz, 1963):
For <math>w \in B</math>, <math>\tfrac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>, which implies exponential (linear) convergence of GD.
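A quick numerical illustration (assumed setup, not from the lecture): for the over-parameterized least-squares loss \(L(w) = \tfrac{1}{2}\Vert Aw - y \Vert^2\) with \(A \in \mathbb{R}^{n \times m}\), \(m > n\), and \(A\) of full row rank, the PL constant is \(\mu = \lambda_{\min}(AA^T)\), and GD with step size \(\eta = 1/\lambda_{\max}(AA^T)\) drives the loss down geometrically, at least as fast as \((1 - \eta\mu)^t\):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 50                         # over-parameterized: m > n
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

K = A @ A.T                           # tangent kernel (constant for a linear model)
mu = np.linalg.eigvalsh(K).min()      # PL constant
beta = np.linalg.eigvalsh(K).max()    # smoothness constant
eta = 1.0 / beta                      # step size

def L(w):
    return 0.5 * np.sum((A @ w - y) ** 2)

w = np.zeros(m)
losses = [L(w)]
for t in range(200):
    grad = A.T @ (A @ w - y)          # gradient of L
    w -= eta * grad
    losses.append(L(w))

# Linear (geometric) convergence: L(w_t) <= (1 - mu/beta)^t L(w_0).
rate = 1.0 - mu / beta
print(all(losses[t] <= rate ** t * losses[0] + 1e-12 for t in range(len(losses))))
</syntaxhighlight>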
;Tangent Kernels
Suppose our model is \(F(w)=y\) where \(w \in \mathbb{R}^m\) and \(y \in \mathbb{R}^n\).
Then our tangent kernel is:
\(K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}\), where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) is the Jacobian.
;Lemma
If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) is \(\mu\)-PL on \(B\).
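A numerical sanity check of this lemma (the toy model and finite-difference Jacobian below are my own choices, not from the lecture). With \(L(w) = \tfrac{1}{2}\Vert F(w) - y \Vert^2\) we have \(\nabla L(w) = \nabla F(w)^T (F(w) - y)\), so \(\tfrac{1}{2}\Vert \nabla L(w) \Vert^2 = \tfrac{1}{2}(F(w)-y)^T K(w)(F(w)-y) \geq \lambda_{\min}(K(w))\, L(w)\):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 5                      # parameters vs. equations (over-parameterized)
A = rng.normal(size=(n, m))
y = rng.normal(size=n)

def F(w):
    """A small non-linear model F: R^m -> R^n (illustrative choice)."""
    return np.tanh(A @ w)

def jacobian(F, w, eps=1e-6):
    """Finite-difference Jacobian dF/dw, shape (n, m)."""
    J = np.zeros((len(F(w)), len(w)))
    for j in range(len(w)):
        e = np.zeros(len(w)); e[j] = eps
        J[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return J

w = rng.normal(size=m) * 0.1
J = jacobian(F, w)                # nabla F(w), shape (n, m)
K = J @ J.T                       # tangent kernel, shape (n, n)
lam_min = np.linalg.eigvalsh(K).min()

r = F(w) - y                      # residual
L = 0.5 * r @ r                   # loss L(w)
grad_L = J.T @ r                  # nabla L(w)

# Lemma: lambda_min(K(w)) >= mu on B implies L is mu-PL on B.
print(0.5 * grad_L @ grad_L >= lam_min * L)   # True
</syntaxhighlight>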
==Misc==
{{reflist|refs=
<ref name="liu2020towards">Chaoyue Liu, Libin Zhu, Mikhail Belkin (2020). Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning. [https://arxiv.org/abs/2003.00307 https://arxiv.org/abs/2003.00307]</ref>
}}