Belkin ''et al.''<ref name="belkin2019reconciling"></ref> observe that once models become over-parameterized enough to enter the interpolation regime, test error begins to decrease again as the number of parameters grows. This phenomenon is called ''double descent''.
;Intuition: In the over-parameterized regime there are infinitely many solutions on the manifold of <math>f_{w^*}</math>, and SGD more easily finds ''simple'' solutions (e.g. functions with small norm), which leads to better generalization.
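The following is a minimal sketch of this intuition (not from Belkin ''et al.''; the dimensions, random seed, and use of NumPy are illustrative choices). With fewer samples than parameters there are infinitely many interpolating linear models; the minimum-norm interpolant, which gradient descent from zero initialization converges to for linear least squares and is computed here with the pseudoinverse, has lower test error than an arbitrary interpolant obtained by adding a null-space component.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer samples than parameters (illustrative sizes)
beta_true = rng.normal(size=d)       # hypothetical ground-truth linear model

X_train = rng.normal(size=(n, d))
y_train = X_train @ beta_true        # noise-free labels

# Minimum-norm interpolating solution via the pseudoinverse
# (the solution gradient descent from zero init converges to for least squares).
beta_min_norm = np.linalg.pinv(X_train) @ y_train

# Another interpolating solution: add a random component from the null space of X_train.
_, _, Vt = np.linalg.svd(X_train, full_matrices=True)
null_basis = Vt[n:]                                   # orthonormal rows spanning null(X_train)
beta_other = beta_min_norm + null_basis.T @ rng.normal(size=d - n)

X_test = rng.normal(size=(1000, d))
y_test = X_test @ beta_true
for name, b in [("min-norm", beta_min_norm), ("other interpolant", beta_other)]:
    train_mse = np.mean((X_train @ b - y_train) ** 2)   # ~0 for both: both interpolate
    test_mse = np.mean((X_test @ b - y_test) ** 2)
    print(f"{name}: train MSE {train_mse:.2e}, test MSE {test_mse:.2f}")
</syntaxhighlight>
Both solutions fit the training data exactly; only the small-norm one stays close to the true <math>\beta</math>, which is the sense in which ''simple'' solutions generalize better here.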
===Can we analyze the double descent curve for some simple distributions or models? (Belkin ''et al.''<ref name="belkin2019reconciling"></ref>)===
;Setup: Our features are <math>x = (x_1, \ldots, x_d)</math>, where each <math>x_i</math> is drawn from a standard normal distribution. Our labels are <math>y = x^t \beta</math> (the noise-free case). Our training set is <math>\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}</math>, written as <math>X = \begin{pmatrix} (x^{(1)})^t\\ \vdots\\ (x^{(n)})^t \end{pmatrix}</math>.
;Learning: We select <math>p</math> features <math>T \subseteq [d]</math> with <math>|T| = p</math> and fit a linear model <math>\beta^{*}_{T} \in \mathbb{R}^{p}</math>, setting <math>\beta^*_{T^c} = 0</math>. A simulation of this setup is sketched below.
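Below is a minimal simulation sketch of this setup (my own illustrative choices: <math>d = 100</math>, <math>n = 40</math>, a randomly drawn subset <math>T</math>, and the minimum-norm least-squares fit via the pseudoinverse; Belkin ''et al.'' study this model analytically). Sweeping <math>p</math> is expected to trace the double descent shape, with test error peaking near the interpolation threshold <math>p = n</math> and decreasing again as <math>p \to d</math>.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
d, n, n_test = 100, 40, 2000
beta = rng.normal(size=d) / np.sqrt(d)        # hypothetical ground-truth coefficients

X = rng.normal(size=(n, d))                   # rows are (x^(i))^t with standard normal entries
y = X @ beta                                  # noise-free training labels
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ beta

for p in [10, 20, 30, 38, 40, 42, 50, 70, 100]:
    T = rng.choice(d, size=p, replace=False)  # feature subset T with |T| = p
    # Least-squares fit restricted to the selected features
    # (minimum-norm solution when p > n), with beta on T^c kept at 0.
    beta_T = np.linalg.pinv(X[:, T]) @ y
    beta_hat = np.zeros(d)
    beta_hat[T] = beta_T
    test_mse = np.mean((X_test @ beta_hat - y_test) ** 2)
    print(f"p = {p:3d}: test MSE = {test_mse:.3f}")
</syntaxhighlight>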


==Misc==