Deep Learning: Difference between revisions

Line 437:

* Dropout

[[File:Belkin2019reconciling fig1.png.png|500px|thumb|Figure 1 from Belkin ''et al.'' In the over-parameterized interpolation regime, adding parameters leads to lower test error. This is called ''double descent''.]]

These types of explicit regularization improve generalization, but models still generalize well without them.

One possible reason is ''implicit regularization'' by SGD.
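
As a rough illustration of implicit regularization, the sketch below uses a simplified stand-in for SGD: plain full-batch gradient descent on an over-parameterized linear least-squares problem. The data, step size, and the function name <code>gradient_descent_least_squares</code> are assumptions made for this example and do not come from the cited papers. Started from zero, gradient descent stays in the row space of the data matrix, so among all weight vectors that fit the training data it settles on the one with the smallest norm, even though no explicit penalty is applied.

```python
import numpy as np

def gradient_descent_least_squares(X, y, lr=0.05, steps=2000):
    """Full-batch gradient descent on mean squared error, starting from w = 0."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n  # gradient of 0.5 * mean((Xw - y)^2)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))  # 20 samples, 100 parameters: over-parameterized
y = rng.normal(size=20)

w_gd = gradient_descent_least_squares(X, y)
w_min_norm = np.linalg.pinv(X) @ y  # minimum-norm interpolating solution

print("fits training data:", np.allclose(X @ w_gd, y, atol=1e-6))
print("matches min-norm solution:", np.allclose(w_gd, w_min_norm, atol=1e-6))
```

This only covers the linear regression case. The result of Soudry ''et al.''<ref name="soudry2018implicit"></ref> concerns a different setting (gradient descent on separable classification data) and is not reproduced here.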

Belkin ''et al.''<ref name="belkin2019reconciling"></ref> observe that once models are over-parameterized past the interpolation threshold, test error begins to decrease again as the number of parameters grows. This is called ''double descent''.
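
A small experiment in this spirit can be sketched with random ReLU features and a minimum-norm least-squares fit. This is a toy setup chosen for illustration, not the exact experiment of Belkin ''et al.''; the target function, sample sizes, noise level, and the helper names <code>random_relu_features</code>, <code>fit_min_norm</code>, and <code>sweep_widths</code> are all assumptions. The number of random features <code>p</code> plays the role of the parameter count, and the interpolation threshold sits near <code>p = n_train</code>.

```python
import numpy as np

def random_relu_features(X, W):
    """Map inputs through a fixed random layer followed by ReLU."""
    return np.maximum(X @ W, 0.0)

def fit_min_norm(Phi, y):
    """Least-squares fit; lstsq returns the minimum-norm solution when underdetermined."""
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return coef

def sweep_widths(widths, n_train=40, n_test=500, d=5, noise=0.1, seed=0):
    """Return (p, train_mse, test_mse) for each feature count p in widths."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    X_train = rng.normal(size=(n_train, d))
    X_test = rng.normal(size=(n_test, d))
    y_train = np.tanh(X_train @ w_true) + noise * rng.normal(size=n_train)
    y_test = np.tanh(X_test @ w_true)  # noise-free targets for evaluation

    results = []
    for p in widths:
        W = rng.normal(size=(d, p)) / np.sqrt(d)  # random, untrained first layer
        Phi_train = random_relu_features(X_train, W)
        Phi_test = random_relu_features(X_test, W)
        coef = fit_min_norm(Phi_train, y_train)
        train_mse = float(np.mean((Phi_train @ coef - y_train) ** 2))
        test_mse = float(np.mean((Phi_test @ coef - y_test) ** 2))
        results.append((p, train_mse, test_mse))
    return results

for p, train_mse, test_mse in sweep_widths([5, 10, 20, 40, 80, 160, 640]):
    print(f"p={p:4d}  train MSE={train_mse:.6f}  test MSE={test_mse:.4f}")
```

In runs of this kind, test error often peaks near <code>p = n_train</code> and falls again as <code>p</code> grows, but the exact numbers depend on the seed and settings, so no specific output is claimed here.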

==Misc==

Line 452:

Line 453:

<ref name="soudry2018implicit">Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro (2018) The Implicit Bias of Gradient Descent on Separable Data ''The Journal of Machine Learning Research'' 2018 [https://arxiv.org/abs/1710.10345 https://arxiv.org/abs/1710.10345]</ref>

<ref name="zhang2017understanding">Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals (2017) Understanding deep learning requires rethinking generalization (ICLR 2017) [https://arxiv.org/abs/1611.03530 https://arxiv.org/abs/1611.03530]</ref>

<ref name="belkin2019reconciling">Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal (2019) Reconciling modern machine learning practice and the bias-variance trade-off (PNAS 2019) [https://arxiv.org/abs/1812.11118 https://arxiv.org/abs/1812.11118]</ref>

}}

@@ Line 437: / Line 437: @@
 * Dropout
+[[File:Belkin2019reconciling fig1.png.png|500px|thumb|Figure 1 from Belkin ''et al.'' In the over-parameterized interpolation regime, adding parameters leads to lower test error. This is called ''double descent''.]]
 These types of explicit regularization improve generalization, but models still generalize well without them.
 One possible reason is ''implicit regularization'' by SGD.
-Belkin ''et al.''
+Belkin ''et al.''<ref name="belkin2019reconciling"></ref> observe that once models are over-parameterized past the interpolation threshold, test error begins to decrease again as the number of parameters grows. This is called ''double descent''.
 ==Misc==
@@ Line 452: / Line 453: @@
 <ref name="soudry2018implicit">Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro (2018) The Implicit Bias of Gradient Descent on Separable Data ''The Journal of Machine Learning Research'' 2018 [https://arxiv.org/abs/1710.10345 https://arxiv.org/abs/1710.10345]</ref>
 <ref name="zhang2017understanding">Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals (2017) Understanding deep learning requires rethinking generalization (ICLR 2017) [https://arxiv.org/abs/1611.03530 https://arxiv.org/abs/1611.03530]</ref>
+<ref name="belkin2019reconciling">Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal (2019) Reconciling modern machine learning practice and the bias-variance trade-off (PNAS 2019) [https://arxiv.org/abs/1812.11118 https://arxiv.org/abs/1812.11118]</ref>
 }}