* Implicit bias exists in over-parameterized neural networks.
** Experimentally, Adam and Adagrad do not exhibit this implicit bias and have worse generalization error.
==DL Generalization==
Training set <math>S = \{z^{(i)} = (x^{(i)}, y^{(i)}) \}_{i=1}^{n}</math>.
Our samples are drawn from some unknown distribution: <math>z^{(i)} = (x^{(i)}, y^{(i)}) \sim P_{X, Y}</math>.
Minimizing the empirical loss over the hypothesis class gives <math>\min_{h \in H} L_{S}(h) \to h_{S}^{*}</math>.
In practice, we cannot compute our population loss: <math>L_{D}(h) = E_{z \sim P_{X, Y}} [l(h, z)]</math>.
<math>\min_{h \in H} L_{D}(h) \to h_{D}^*</math>.
The goal is to learn something from the training set which has a low population loss.
The estimation error is <math>L_{D}(h_{S}^*) - L_{D}(h_{D}^*)</math>, where <math>L_{D}(h_S^*)</math> is the population loss of the hypothesis learned from the training set and <math>L_{D}(h_D^*)</math> is the minimum population loss over the hypothesis class.
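As a toy illustration of these definitions (not from the lecture; the linear data model, squared loss, and weight grid are assumptions), the sketch below runs ERM over a small finite hypothesis class and estimates the estimation error by approximating <math>L_D</math> with a large fresh sample.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Assumed data model for illustration: x ~ N(0,1), y = 2x + noise,
# hypotheses h_w(x) = w * x, squared loss.
def sample(n):
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)
    return x, y

def empirical_loss(w, x, y):
    # L_S(h_w) = (1/n) * sum_i l(h_w, z_i) with squared loss
    return np.mean((w * x - y) ** 2)

# ERM over a finite grid of hypotheses H = {h_w : w in grid}
H = np.linspace(-5, 5, 201)
x_train, y_train = sample(50)
w_S = H[np.argmin([empirical_loss(w, x_train, y_train) for w in H])]

# L_D is approximated by a very large fresh sample (a Monte Carlo
# stand-in for the unknown population distribution P_{X,Y}).
x_big, y_big = sample(200_000)

def population_loss(w):
    return empirical_loss(w, x_big, y_big)

w_D = H[np.argmin([population_loss(w) for w in H])]

print("estimation error L_D(h_S*) - L_D(h_D*):",
      population_loss(w_S) - population_loss(w_D))
</syntaxhighlight>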
===Classical Learning Theory===
If for every <math>\epsilon, \delta > 0</math> there exists <math>n_0(\epsilon, \delta)</math> such that for all <math>P_{X, Y}</math> and all <math>n \geq n_0</math>,
<math>Pr(L_{D}(h_{S}^*) - L_{D}(h_{D}^*) \leq \epsilon) \geq 1-\delta</math>, then <math>H</math> is called PAC-learnable.
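A hedged numerical illustration of this statement (assumed setup, not from the lecture: 1-D threshold classifiers with 0/1 loss on uniform data), estimating how often the estimation error falls below <math>\epsilon</math> as <math>n</math> grows:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup for illustration: 1-D threshold classifiers
# H = {h_t(x) = sign(x - t)}, x ~ U[0, 1], y = sign(x - 0.3), 0/1 loss.
# The best hypothesis h_D* (t = 0.3) has population loss 0, and the
# population loss of h_t is |t - 0.3|.
thresholds = np.linspace(0, 1, 101)

def zero_one_loss(t, x, y):
    return np.mean(np.sign(x - t) != y)

def estimation_error(n):
    x = rng.uniform(size=n)
    y = np.sign(x - 0.3)
    t_S = thresholds[np.argmin([zero_one_loss(t, x, y) for t in thresholds])]
    return abs(t_S - 0.3)

eps, trials = 0.05, 200
for n in [10, 50, 200, 1000]:
    frac = np.mean([estimation_error(n) <= eps for _ in range(trials)])
    print(f"n={n:4d}  Pr[estimation error <= {eps}] ~ {frac:.2f}")
</syntaxhighlight>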
===Bias-variance tradeoff===
<math>L_{D}(h_{S}^*) = L_{D}(h_{D}^*) + [L_{D}(h_{S}^*) - L_{D}(h_{D}^*)]</math>
* <math>L_{D}(h_{D}^*)</math> is called the bias term, the loss of the best hypothesis in your hypothesis class. If the complexity of <math>H</math> increases, then the bias term will decrease.
* <math>L_{D}(h_{S}^*) - L_{D}(h_{D}^*)</math> is the estimation error or variance term.
* As the complexity of <math>H</math> increases, the bias decreases but the variance increases.
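A minimal illustrative sketch of this tradeoff (assumed setup, not part of the lecture: polynomial regression on synthetic sine data, with a large held-out sample standing in for the population loss):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup for illustration: fit polynomials of increasing degree
# to noisy samples of a sine curve, then compare the empirical loss with
# a large held-out sample standing in for the population loss.
def sample(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(30)
x_test, y_test = sample(100_000)   # Monte Carlo stand-in for P_{X,Y}

for degree in [1, 3, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: L_S = {train_mse:.3f}, L_D ~ {test_mse:.3f}")
</syntaxhighlight>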
However, this view is not complete.
;Notations:
<math>f = l \circ h</math>
<math>L_{S}(h) = \frac{1}{n} \sum_{i=1}^{n} f(z^{(i)})</math>
<math>L_{D}(h) = E_{z \sim P_{X, Y}}[ f(z) ]</math>
<math>F = l \circ H = \{ l \circ h, \forall h \in H \}</math>
<math>F \circ S = \left\{
\begin{bmatrix}
f(z^{(1)})\\
\vdots\\
f(z^{(n)})
\end{bmatrix}, \forall f \in F \right\}</math>
Generalization error: <math>L_{D}(h_{S}^*) - L_{S}(h_{S}^*)</math>
We want a uniform bound on this error: <math>G = \sup_{h \in H} \left[ L_{D}(h) - L_{S}(h) \right]</math>.
We don't have access to the population distribution <math>D</math>.
Instead, we split the training data: <math>S = S_1 \cup S_2</math> with <math>|S_1|=|S_2|=\frac{n}{2}</math>.
Now we have:<br>
<math>
\begin{aligned}
G \approx G' &= \sup_{h \in H} L_{S_1}(h) - L_{S_2}(h)\\
&= \sup_{h \in H} \frac{1}{n/2} \sum_{z^{(i)} \in S_1} f(z^{(i)}) - \frac{1}{n/2} \sum_{z^{(i)} \in S_2} f(z^{(i)})\\
&= \sup_{h \in H} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(z^{(i)}) & \text{where }\sigma_{i} = I[z^{(i)} \in S_1] - I[z^{(i)} \in S_2]
\end{aligned}
</math>
We can use randomized partitioning such that
<math>\sigma_i =
\begin{cases}
1 & \text{w.p. }1/2\\
-1 & \text{w.p. }1/2
\end{cases}
</math>
Then
<math>G \approx E_{\sigma}[G'] = \frac{2}{n} E_{\sigma} \left[ \sup_{h \in H} \sum_{i=1}^{n} \sigma_i f(z^{(i)}) \right]</math>
<math>R(A) = \frac{1}{n} E_{\sigma} \left[ \sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i \right]</math>
This is called the Rademacher complexity.
<math>\implies G \approx 2 R(F \circ S)</math>
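A Monte Carlo sketch of the empirical Rademacher complexity for a toy finite hypothesis class (the weight grid, squared loss, and data model are assumptions for illustration, not from the lecture):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy setup: linear predictors h_w(x) = w * x on a grid of
# weights, squared loss.  Each row of F_S is the vector
# (f(z_1), ..., f(z_n)) for one hypothesis, i.e. one element of F o S.
n = 40
x = rng.normal(size=n)
y = 2.0 * x + 0.5 * rng.normal(size=n)

weights = np.linspace(-4, 4, 81)
F_S = np.stack([(w * x - y) ** 2 for w in weights])   # shape (|H|, n)

def rademacher(A, num_samples=2000):
    # R(A) = (1/n) E_sigma[ sup_{a in A} sum_i sigma_i * a_i ],
    # estimated by averaging over random sign vectors sigma.
    total = 0.0
    for _ in range(num_samples):
        sigma = rng.choice([-1.0, 1.0], size=A.shape[1])
        total += np.max(A @ sigma)    # sup over the finite class
    return total / (num_samples * A.shape[1])

R = rademacher(F_S)
print("R(F o S) ~", R)
print("expected generalization gap G ~ 2 R(F o S) ~", 2 * R)
</syntaxhighlight>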
;Theorem
With probability <math>1 - \delta</math>, for every <math>h \in H</math>,
<math>L_D(h) - L_{S}(h) \leq 2 R(F \circ S) + c \sqrt{\frac{\log(4/\delta)}{n}}</math>
Example: <math>H = \{ h(x) = w^T x, \Vert w \Vert_2 \leq 1\}</math>
<math>R(H \circ S) \leq \frac{\max_i \Vert x^{(i)} \Vert_2}{\sqrt{n}}</math>
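A quick numerical check of this bound (assumed Gaussian data; it uses the closed form <math>\sup_{\Vert w \Vert_2 \leq 1} \sum_i \sigma_i w^T x^{(i)} = \Vert \sum_i \sigma_i x^{(i)} \Vert_2</math>):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)

# For H = {h(x) = w^T x : ||w||_2 <= 1} the inner sup has a closed form:
# sup_{||w||<=1} sum_i sigma_i w^T x_i = || sum_i sigma_i x_i ||_2,
# so R(H o S) can be estimated directly and compared with the bound.
n, d = 50, 10
X = rng.normal(size=(n, d))            # rows are the x^{(i)} (assumed data)

sigmas = rng.choice([-1.0, 1.0], size=(5000, n))
R_HS = np.mean(np.linalg.norm(sigmas @ X, axis=1)) / n

bound = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(n)
print(f"Monte Carlo R(H o S) ~ {R_HS:.4f}, bound = {bound:.4f}")
</syntaxhighlight>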
==Misc==