Deep Learning

<math>H = \{ h(x) \mid h \text{ is a NN with some structure}\}</math>
If <math>R(H \circ S)</math> is small, then by the theorem we get good generalization performance.
Zhang ''et al.''<ref name="zhang2017understanding"></ref> perform a randomization test:
they replace the true training labels with uniformly random labels and observe that standard neural networks can still fit the training set perfectly.
Recall the empirical Rademacher complexity:
<math>R(H \circ S) = \frac{1}{n} E_{\sigma} \left[ \sup_{h \in H} \sum_{i=1}^{n} \sigma_i h(x_i) \right].</math>
Since neural networks can fit any sign pattern <math>\sigma</math>, for every <math>\sigma</math> there is some <math>h \in H</math> with <math>h(x_i) \approx \sigma_i</math> for all <math>i</math>, so the supremum is <math>\approx n</math> and <math>R(H \circ S) \approx 1</math>.
This shows that Rademacher complexity and VC-dimension give vacuous bounds for neural networks and cannot explain their generalization.
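A minimal numerical sketch of this effect (a toy version under assumed settings, not the paper's CIFAR-10/ImageNet experiments; the sizes <code>n</code>, <code>d</code>, the width, and the optimizer below are illustrative choices): training a bounded-output two-layer ReLU network to fit random signs <math>\sigma</math> directly maximizes <math>\frac{1}{n} \sum_i \sigma_i h(x_i)</math>, which lower-bounds the empirical Rademacher complexity and comes out close to 1.
<syntaxhighlight lang="python">
# Toy randomization test: maximize (1/n) * sum_i sigma_i * h(x_i) over an
# over-parameterized two-layer ReLU network with outputs squashed to [-1, 1].
# The achieved value lower-bounds the empirical Rademacher complexity R(H o S).
import torch

torch.manual_seed(0)
n, d, width = 200, 20, 1000                        # illustrative sizes
X = torch.randn(n, d)                              # a fixed sample S
sigma = torch.randint(0, 2, (n,)).float() * 2 - 1  # random +/-1 signs

model = torch.nn.Sequential(
    torch.nn.Linear(d, width),
    torch.nn.ReLU(),
    torch.nn.Linear(width, 1),
    torch.nn.Tanh(),                               # keep h(x) in [-1, 1]
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3000):
    h = model(X).squeeze(1)
    loss = -(sigma * h).mean()                     # maximize the correlation
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"(1/n) sum_i sigma_i h(x_i) = {-loss.item():.3f}")  # close to 1
</syntaxhighlight>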
;Theorem
There exists a two-layer neural network with ReLU activations and <math>2n+d</math> parameters that can represent any function on a sample of size <math>n</math> in <math>d</math> dimensions.
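A sketch of the construction behind this theorem (following Zhang ''et al.''<ref name="zhang2017understanding"></ref>; choosing <math>w</math> generically is one sufficient way to make the projections distinct): pick <math>w \in \mathbb{R}^d</math> so that the projections <math>z_i = w^T x_i</math> are all distinct, relabel the samples so that <math>z_1 < z_2 < \dots < z_n</math>, and choose offsets <math>b_j</math> with <math>b_1 < z_1 < b_2 < z_2 < \dots < b_n < z_n</math>. Consider

<math>h(x) = \sum_{j=1}^{n} a_j \max(w^T x - b_j, 0),</math>

which has <math>2n+d</math> parameters: <math>a \in \mathbb{R}^n</math>, <math>b \in \mathbb{R}^n</math>, and <math>w \in \mathbb{R}^d</math>. On the sample, <math>\max(z_i - b_j, 0) > 0</math> exactly when <math>j \le i</math>, so <math>h(x_i) = \sum_{j \le i} a_j (z_i - b_j)</math> is a lower-triangular linear system in <math>a</math> with positive diagonal entries <math>z_i - b_i</math>; it can therefore be solved for <math>a</math> to match any target values <math>y_1, \dots, y_n</math>.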


==Misc==
<ref name="du2019gradient">Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh (2019). Gradient Descent Provably Optimizes Over-parameterized Neural Networks (ICLR 2019) [https://arxiv.org/abs/1810.02054 https://arxiv.org/abs/1810.02054]</ref>
<ref name="du2019gradient">Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh (2019). Gradient Descent Provably Optimizes Over-parameterized Neural Networks (ICLR 2019) [https://arxiv.org/abs/1810.02054 https://arxiv.org/abs/1810.02054]</ref>
<ref name="soudry2018implicit">Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro (2018) The Implicit Bias of Gradient Descent on Separable Data ''The Journal of Machine Learning Research'' 2018 [https://arxiv.org/abs/1710.10345 https://arxiv.org/abs/1710.10345]</ref>
<ref name="soudry2018implicit">Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro (2018) The Implicit Bias of Gradient Descent on Separable Data ''The Journal of Machine Learning Research'' 2018 [https://arxiv.org/abs/1710.10345 https://arxiv.org/abs/1710.10345]</ref>
<ref name="zhang2017understanding">Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals (2017) Understanding deep learning requires rethinking generalization (ICLR 2017) [https://arxiv.org/abs/1611.03530 https://arxiv.org/abs/1611.03530]</ref>
}}