Belkin ''et al.''<ref name="belkin2019reconciling"></ref> observe that as models become more over-parameterized in the interpolation regime, test error begins to decrease again as the number of parameters grows. This is called ''double descent''.
;Intuition: In the over-parameterized regime, there are infinitely many interpolating solutions on the manifold of <math>f_{w^*}</math>.
SGD tends to find ''simple'' solutions (e.g. functions with small norms), which leads to better generalization, as illustrated in the sketch below.
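As a rough numerical illustration of this intuition (a minimal numpy sketch with assumed dimensions, not from the original notes), one can compare two interpolating solutions of an over-parameterized linear regression: the minimum-norm solution given by the pseudoinverse, and a larger-norm solution obtained by adding a null-space component. Both fit the training data exactly, but the smaller-norm solution typically has lower test error.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative sketch (assumed setup): over-parameterized linear regression
# with n < p, where many solutions interpolate the training data exactly.
rng = np.random.default_rng(0)
n, p = 20, 100                      # fewer samples than parameters
beta_true = rng.normal(size=p) / np.sqrt(p)

X_train = rng.normal(size=(n, p))
y_train = X_train @ beta_true       # noise-free labels

# Minimum-norm interpolating solution (the "simple", small-norm solution).
beta_min = np.linalg.pinv(X_train) @ y_train

# Another interpolating solution: add a component from the null space of X_train,
# which leaves the training fit unchanged but increases the norm.
_, _, Vt = np.linalg.svd(X_train)
null_dir = Vt[-1]                   # direction with X_train @ null_dir ~ 0
beta_big = beta_min + 5.0 * null_dir

X_test = rng.normal(size=(1000, p))
y_test = X_test @ beta_true

for name, b in [("min-norm", beta_min), ("large-norm", beta_big)]:
    train_err = np.mean((X_train @ b - y_train) ** 2)
    test_err = np.mean((X_test @ b - y_test) ** 2)
    print(f"{name}: ||b||={np.linalg.norm(b):.2f}, "
          f"train MSE={train_err:.2e}, test MSE={test_err:.3f}")
</syntaxhighlight>

Both solutions achieve essentially zero training error, but the large-norm solution pays an extra test-error penalty proportional to the size of its null-space component.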
===Can we analyze the double descent curve for some simple distributions or models? (Belkin ''et al.''<ref name="belkin2019reconciling"></ref>)===
Setup:
Our features are <math>x = (x_1, \ldots, x_d)</math>, where each <math>x_i</math> is drawn independently from a standard normal distribution.
Our labels are <math>y = x^t \beta</math> for some <math>\beta \in \mathbb{R}^d</math>; this is the noise-free case.
Our training set is <math>\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}</math>, written as <math>X =
\begin{pmatrix}
(x^{(1)})^t\\
\vdots\\
(x^{(n)})^t
\end{pmatrix}
</math>
Learning: We select <math>p</math> features, i.e. a subset <math>T \subseteq [d]</math> with <math>|T| = p</math>, and fit a linear model <math>\beta^{*}_{T} \in \mathbb{R}^{p}</math> on the selected features, setting <math>\beta^*_{T^c} = 0</math>.
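A small numerical sketch of this setup (dimensions, sample sizes, and the choice of <math>T</math> as the first <math>p</math> coordinates are illustrative assumptions, not the paper's exact settings): sample Gaussian features, generate noise-free labels <math>y = x^t \beta</math>, select <math>p</math> features, and fit the least-squares solution on those features, which for <math>p > n</math> is the minimum-norm interpolating solution. Sweeping <math>p</math> typically shows test error peaking near <math>p = n</math> and then decreasing again in the over-parameterized regime, i.e. the double descent curve.

<syntaxhighlight lang="python">
import numpy as np

# Sketch of the feature-selection experiment described above; the dimensions
# and sample sizes below are illustrative assumptions.
rng = np.random.default_rng(1)
d, n, n_test = 100, 40, 2000

beta = rng.normal(size=d) / np.sqrt(d)          # ground-truth coefficients
X_train = rng.normal(size=(n, d))
y_train = X_train @ beta                        # noise-free labels
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ beta

for p in [10, 20, 30, 40, 50, 70, 100]:
    T = np.arange(p)                            # select the first p features
    # Least-squares fit on the selected features; for p > n the pseudoinverse
    # gives the minimum-norm interpolating solution.
    beta_T = np.linalg.pinv(X_train[:, T]) @ y_train
    preds = X_test[:, T] @ beta_T
    test_mse = np.mean((preds - y_test) ** 2)
    print(f"p={p:3d}  test MSE={test_mse:.3f}")
</syntaxhighlight>

The unselected coordinates of <math>\beta</math> act as effective noise when <math>p < d</math>, which is what drives the spike in test error near the interpolation threshold <math>p = n</math>; at <math>p = d</math> the model recovers <math>\beta</math> exactly and the test error vanishes.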
==Misc==