Notes for CMSC 828W: Foundations of Deep Learning (Fall 2020) taught by Soheil Feizi.


* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
In general, the loss landscape can have poor local minima and saddle points (where the Hessian has both positive and negative eigenvalues).
However, in practice GD & SGD work pretty well.
Lecture 2 (Sept 3) is about Liu ''et al.''<ref name="liu2020towards"></ref>.
 
Suppose we have a classification problem with \(c\) labels and \(N\) samples total. 
Interpolation (fitting every training label exactly) is possible if the number of parameters \(m\) is greater than the number of constraints \(n = Nc\), since each of the \(N\) samples contributes \(c\) scalar output equations.
This is called the over-parameterized regime.
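For intuition, here is a small back-of-the-envelope check (the network shape and the numbers below are made-up values, not from the lecture): a one-hidden-layer network on \(d\)-dimensional inputs with hidden width \(h\) has \(m = (dh + h) + (hc + c)\) parameters, which we can compare against \(n = Nc\).
<syntaxhighlight lang="python">
# Hypothetical example: compare parameter count m of a small one-hidden-layer
# network against the number of interpolation constraints n = N * c.
N, c, d, h = 1000, 10, 784, 2048   # samples, classes, input dim, hidden width (assumed values)

# Parameters of f_W(x) = W2 @ relu(W1 @ x + b1) + b2
m = (d * h + h) + (h * c + c)

n = N * c                          # one scalar equation per sample per output coordinate
print(f"m = {m:,} parameters vs n = {n:,} constraints -> over-parameterized: {m > n}")
</syntaxhighlight>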
 
In the exact interpolation regime, we have:
\[
\begin{aligned}
f_W(x_1) &= y_1 =
\begin{bmatrix}
0 \\ \vdots \\ 1
\end{bmatrix} \in \mathbb{R}^c\\
&\vdots\\
f_W(x_N) &= y_N \in \mathbb{R}^c,
\end{aligned}
\]
where each \(y_i\) is a one-hot label vector.
Stacking all outputs into a single vector, this can be rewritten as <math>F(w)=y</math> with \(F:\mathbb{R}^m \to \mathbb{R}^n\), where the dependence on the data \(x\) is implicit.
 
In the ''under-parameterized regime'', we get poor local minima (see Fig 1a of <ref name="liu2020towards"></ref>). 
These local minima are locally convex in the under-parameterized regime, but not in the over-parameterized one. 
Over-parameterized models have ''essential non-convexity'' in their loss landscape: the loss is non-convex even in arbitrarily small neighborhoods of a global minimizer. 
 
Minimizers of a convex function form a convex set. 
For a linear model, the set of global minimizers \(w^*\) is an affine subspace; for a generic non-linear model it is a curved manifold, which is typically not convex. 
So we cannot rely on convex analysis to understand over-parameterized systems.
 
Instead of convexity, we use the PL condition (Polyak-Łojasiewicz, 1963): 
for all <math>w \in B</math>, <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>, which implies exponential (linear) convergence of GD.
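A one-step sketch of why the PL condition yields linear convergence, under the additional (assumed, not stated above) hypotheses that \(L\) is \(\beta\)-smooth and GD uses step size \(\eta = 1/\beta\):
\[
\begin{aligned}
L(w_{t+1}) &\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\beta \eta^2}{2} \Vert \nabla L(w_t) \Vert^2 \\
&= L(w_t) - \frac{1}{2\beta} \Vert \nabla L(w_t) \Vert^2 \\
&\leq \left(1 - \frac{\mu}{\beta}\right) L(w_t),
\end{aligned}
\]
so \(L(w_t) \leq (1 - \mu/\beta)^t L(w_0)\), i.e. the loss decays geometrically.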
 
Tangent Kernels:
Suppose our model is \(F(w)\) with \(w \in \mathbb{R}^m\), and we want to solve \(F(w)=y\) with \(y \in \mathbb{R}^n\). 
Then the tangent kernel is 
\(K(w) = \nabla F(w) \nabla F(w)^T \in \mathbb{R}^{n \times n}\), where \(\nabla F(w) \in \mathbb{R}^{n \times m}\) is the Jacobian of \(F\).
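A minimal numerical sketch of forming \(K(w)\) and checking its smallest eigenvalue; the toy model, the dimensions, and the finite-difference Jacobian below are illustrative assumptions, not the lecture's setup.
<syntaxhighlight lang="python">
import numpy as np

# Toy model: F(w) stacks the scalar outputs of a tiny one-hidden-layer net on N fixed inputs.
rng = np.random.default_rng(0)
d, h, N = 3, 5, 4                      # input dim, hidden width, number of samples (assumed)
X = rng.normal(size=(N, d))            # fixed data
m = d * h + h                          # number of parameters
n = N                                  # one scalar output per sample -> n equations

def F(w):
    W1 = w[:d * h].reshape(h, d)
    w2 = w[d * h:]
    return np.tanh(X @ W1.T) @ w2      # shape (n,)

def jacobian(F, w, eps=1e-6):
    """Central finite-difference Jacobian of F at w, shape (n, m)."""
    J = np.zeros((len(F(w)), len(w)))
    for j in range(len(w)):
        e = np.zeros_like(w); e[j] = eps
        J[:, j] = (F(w + e) - F(w - e)) / (2 * eps)
    return J

w = rng.normal(size=m)
J = jacobian(F, w)
K = J @ J.T                            # tangent kernel, shape (n, n)
print("lambda_min(K(w)) =", np.linalg.eigvalsh(K).min())
</syntaxhighlight>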
 
;Lemma
If \(\lambda_{\min}(K(w)) \geq \mu\) for all \(w \in B\), then \(L\) is \(\mu\)-PL on \(B\).
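A short proof sketch, assuming the square loss \(L(w) = \frac{1}{2}\Vert F(w) - y \Vert^2\) (consistent with the interpolation setup above), for which \(\nabla L(w) = \nabla F(w)^T (F(w)-y)\):
\[
\begin{aligned}
\Vert \nabla L(w) \Vert^2
&= (F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y) \\
&= (F(w)-y)^T K(w) (F(w)-y) \\
&\geq \lambda_{\min}(K(w)) \Vert F(w)-y \Vert^2 \geq 2\mu L(w),
\end{aligned}
\]
which is exactly the \(\mu\)-PL condition \(\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)\).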


==Misc==


{{reflist|refs=
<ref name="liu2020towards">Chaoyue Liu, Libin Zhu, Mikhail Belkin (2020). "Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning." [https://arxiv.org/abs/2003.00307 https://arxiv.org/abs/2003.00307]</ref>
}}