Belkin ''et al.''<ref name="belkin2019reconciling"></ref> observe that once models become over-parameterized enough to enter the interpolation regime, test error begins to decrease again as the number of parameters grows. This phenomenon is called ''double descent''.
;Intuition: In the over-parameterized regime there are infinitely many solutions on the manifold of <math>f_{w^*}</math>, and SGD more easily finds ''simple'' solutions (e.g. functions with small norm), which leads to better generalization.
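The following is a minimal sketch of this intuition (not from Belkin ''et al.''; the dimensions, random seed, and use of NumPy are illustrative choices). With fewer samples than parameters there are infinitely many interpolating linear models; the minimum-norm interpolant, which gradient descent from zero initialization converges to for linear least squares and is computed here with the pseudoinverse, has lower test error than an arbitrary interpolant obtained by adding a null-space component.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer samples than parameters (illustrative sizes)
beta_true = rng.normal(size=d)       # hypothetical ground-truth linear model

X_train = rng.normal(size=(n, d))
y_train = X_train @ beta_true        # noise-free labels

# Minimum-norm interpolating solution via the pseudoinverse
# (the solution gradient descent from zero init converges to for least squares).
beta_min_norm = np.linalg.pinv(X_train) @ y_train

# Another interpolating solution: add a random component from the null space of X_train.
_, _, Vt = np.linalg.svd(X_train, full_matrices=True)
null_basis = Vt[n:]                                   # orthonormal rows spanning null(X_train)
beta_other = beta_min_norm + null_basis.T @ rng.normal(size=d - n)

X_test = rng.normal(size=(1000, d))
y_test = X_test @ beta_true
for name, b in [("min-norm", beta_min_norm), ("other interpolant", beta_other)]:
    train_mse = np.mean((X_train @ b - y_train) ** 2)   # ~0 for both: both interpolate
    test_mse = np.mean((X_test @ b - y_test) ** 2)
    print(f"{name}: train MSE {train_mse:.2e}, test MSE {test_mse:.2f}")
</syntaxhighlight>
Both solutions fit the training data exactly; only the small-norm one stays close to the true <math>\beta</math>, which is the sense in which ''simple'' solutions generalize better here.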
===Can we analyze the double descent curve for some simple distributions or models? (Belkin ''et al.''<ref name="belkin2019reconciling"></ref>)===
;Setup: Our features are <math>x = (x_1, \ldots, x_d)</math>, where each <math>x_i</math> is drawn from a standard normal distribution. Our labels are <math>y = x^t \beta</math> (the noise-free case). Our training set is <math>\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}</math>, written as <math>X = \begin{pmatrix} (x^{(1)})^t\\ \vdots\\ (x^{(n)})^t \end{pmatrix}</math>.
;Learning: We select <math>p</math> features <math>T \subseteq [d]</math> with <math>|T| = p</math> and fit a linear model <math>\beta^{*}_{T} \in \mathbb{R}^{p}</math>, setting <math>\beta^*_{T^c} = 0</math>. A simulation of this setup is sketched below.
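Below is a minimal simulation sketch of this setup (my own illustrative choices: <math>d = 100</math>, <math>n = 40</math>, a randomly drawn subset <math>T</math>, and the minimum-norm least-squares fit via the pseudoinverse; Belkin ''et al.'' study this model analytically). Sweeping <math>p</math> is expected to trace the double descent shape, with test error peaking near the interpolation threshold <math>p = n</math> and decreasing again as <math>p \to d</math>.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
d, n, n_test = 100, 40, 2000
beta = rng.normal(size=d) / np.sqrt(d)        # hypothetical ground-truth coefficients

X = rng.normal(size=(n, d))                   # rows are (x^(i))^t with standard normal entries
y = X @ beta                                  # noise-free training labels
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ beta

for p in [10, 20, 30, 38, 40, 42, 50, 70, 100]:
    T = rng.choice(d, size=p, replace=False)  # feature subset T with |T| = p
    # Least-squares fit restricted to the selected features
    # (minimum-norm solution when p > n), with beta on T^c kept at 0.
    beta_T = np.linalg.pinv(X[:, T]) @ y
    beta_hat = np.zeros(d)
    beta_hat[T] = beta_T
    test_mse = np.mean((X_test @ beta_hat - y_test) ** 2)
    print(f"p = {p:3d}: test MSE = {test_mse:.3f}")
</syntaxhighlight>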


==Misc==