* Implicit bias exists in over-parameterized neural networks.
** Experimentally, Adam and Adagrad do not exhibit this implicit bias and have worse generalization error.
==DL Generalization==
Training set <math>S = \{z^{(i)} = (x^{(i)}, y^{(i)}) \}_{i=1}^{n}</math>.
Our samples are drawn from some unknown distribution: <math>z^{(i)} = (x^{(i)}, y^{(i)}) \sim P_{X, Y}</math>.
Minimizing the empirical loss over the hypothesis class gives <math>\min_{h \in H} L_{S}(h) \to h_{S}^{*}</math>.
In practice, we cannot compute our population loss: <math>L_{D}(h) = E_{z \sim P_{X, Y}} [l(h, z)]</math>.
<math>\min_{h \in H} L_{D}(h) \to h_{D}^*</math>.
The goal is to learn something from the training set which has a low population loss.
The estimation error is <math>L_{D}(h_{S}^*) - L_{D}(h_{D}^*)</math>, where <math>L_{D}(h_S^*)</math> is the population loss of the hypothesis learned from the training set and <math>L_{D}(h_D^*)</math> is the minimum population loss over the hypothesis class.
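As a toy illustration of these definitions (not from the lecture; the linear data model, squared loss, and weight grid are assumptions), the sketch below runs ERM over a small finite hypothesis class and estimates the estimation error by approximating <math>L_D</math> with a large fresh sample.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

# Assumed data model for illustration: x ~ N(0,1), y = 2x + noise,
# hypotheses h_w(x) = w * x, squared loss.
def sample(n):
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)
    return x, y

def empirical_loss(w, x, y):
    # L_S(h_w) = (1/n) * sum_i l(h_w, z_i) with squared loss
    return np.mean((w * x - y) ** 2)

# ERM over a finite grid of hypotheses H = {h_w : w in grid}
H = np.linspace(-5, 5, 201)
x_train, y_train = sample(50)
w_S = H[np.argmin([empirical_loss(w, x_train, y_train) for w in H])]

# L_D is approximated by a very large fresh sample (a Monte Carlo
# stand-in for the unknown population distribution P_{X,Y}).
x_big, y_big = sample(200_000)

def population_loss(w):
    return empirical_loss(w, x_big, y_big)

w_D = H[np.argmin([population_loss(w) for w in H])]

print("estimation error L_D(h_S*) - L_D(h_D*):",
      population_loss(w_S) - population_loss(w_D))
</syntaxhighlight>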
===Classical Learning Theory===
If for every <math>\epsilon, \delta > 0</math> there exists <math>n_0(\epsilon, \delta)</math> such that for all <math>P_{X, Y}</math> and all <math>n \geq n_0</math>,
<math>Pr(L_{D}(h_{S}^*) - L_{D}(h_{D}^*) \leq \epsilon) \geq 1-\delta</math>, then <math>H</math> is called PAC-learnable.
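A hedged numerical illustration of this statement (assumed setup, not from the lecture: 1-D threshold classifiers with 0/1 loss on uniform data), estimating how often the estimation error falls below <math>\epsilon</math> as <math>n</math> grows:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup for illustration: 1-D threshold classifiers
# H = {h_t(x) = sign(x - t)}, x ~ U[0, 1], y = sign(x - 0.3), 0/1 loss.
# The best hypothesis h_D* (t = 0.3) has population loss 0, and the
# population loss of h_t is |t - 0.3|.
thresholds = np.linspace(0, 1, 101)

def zero_one_loss(t, x, y):
    return np.mean(np.sign(x - t) != y)

def estimation_error(n):
    x = rng.uniform(size=n)
    y = np.sign(x - 0.3)
    t_S = thresholds[np.argmin([zero_one_loss(t, x, y) for t in thresholds])]
    return abs(t_S - 0.3)

eps, trials = 0.05, 200
for n in [10, 50, 200, 1000]:
    frac = np.mean([estimation_error(n) <= eps for _ in range(trials)])
    print(f"n={n:4d}  Pr[estimation error <= {eps}] ~ {frac:.2f}")
</syntaxhighlight>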
===Bias-variance tradeoff===
<math>L_{D}(h_{S}^*) = L_{D}(h_{D}^*) + [L_{D}(h_{S}^*) - L_{D}(h_{D}^*)]</math>
* <math>L_{D}(h_{D}^*)</math> is called the bias term, the loss of the best hypothesis in your hypothesis class. If the complexity of <math>H</math> increases, then the bias term will decrease.
* <math>L_{D}(h_{S}^*) - L_{D}(h_{D}^*)</math> is the estimation error or variance term.
* As the complexity of <math>H</math> increases, the bias decreases but the variance increases.
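A minimal illustrative sketch of this tradeoff (assumed setup, not part of the lecture: polynomial regression on synthetic sine data, with a large held-out sample standing in for the population loss):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup for illustration: fit polynomials of increasing degree
# to noisy samples of a sine curve, then compare the empirical loss with
# a large held-out sample standing in for the population loss.
def sample(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(30)
x_test, y_test = sample(100_000)   # Monte Carlo stand-in for P_{X,Y}

for degree in [1, 3, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: L_S = {train_mse:.3f}, L_D ~ {test_mse:.3f}")
</syntaxhighlight>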
However, this view is not complete.
;Notations:
<math>f = l \circ h</math>
<math>L_{S}(h) = \frac{1}{n} \sum_{i=1}^{n} f(z^{(i)})</math>
<math>L_{D}(h) = E_{z \sim P_{X, Y}}[ f(z) ]</math>
<math>F = l \circ H = \{ l \circ h, \forall h \in H \}</math>
<math>F \circ S = \left\{
\begin{bmatrix}
f(z^{(1)})\\
\vdots\\
f(z^{(n)})
\end{bmatrix}, \forall f \in F \right\}</math>
Generalization error: <math>L_{D}(h_{S}^*) - L_{S}(h_{S}^*)</math>
We want a uniform bound on this error: <math>G = \sup_{h \in H} \left[ L_{D}(h) - L_{S}(h) \right]</math>.
We don't have access to the population distribution <math>D</math>.
Instead, we split the training data: <math>S = S_1 \cup S_2</math> with <math>|S_1|=|S_2|=\frac{n}{2}</math>.
Now we have:<br>
<math>
\begin{aligned}
G \approx G' &= \sup_{h \in H} L_{S_1}(h) - L_{S_2}(h)\\
&= \sup_{h \in H} \frac{1}{n/2} \sum_{z^{(i)} \in S_1} f(z^{(i)}) - \frac{1}{n/2} \sum_{z^{(i)} \in S_2} f(z^{(i)})\\
&= \sup_{h \in H} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(z^{(i)}) & \text{where }\sigma_{i} = I[z^{(i)} \in S_1] - I[z^{(i)} \in S_2]
\end{aligned}
</math>
We can use randomized partitioning such that
<math>\sigma_i =
\begin{cases}
1 & \text{w.p. }1/2\\
-1 & \text{w.p. }1/2
\end{cases}
</math>
Then
<math>G \approx E_{\sigma}[G'] = \frac{2}{n} E_{\sigma} \left[ \sup_{h \in H} \sum_{i=1}^{n} \sigma_i f(z^{(i)}) \right]</math>
<math>R(A) = \frac{1}{n} E_{\sigma} \left[ \sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i \right]</math>
This is called the Rademacher complexity.
<math>\implies G \approx 2 R(F \circ S)</math>
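A Monte Carlo sketch of the empirical Rademacher complexity for a toy finite hypothesis class (the weight grid, squared loss, and data model are assumptions for illustration, not from the lecture):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy setup: linear predictors h_w(x) = w * x on a grid of
# weights, squared loss.  Each row of F_S is the vector
# (f(z_1), ..., f(z_n)) for one hypothesis, i.e. one element of F o S.
n = 40
x = rng.normal(size=n)
y = 2.0 * x + 0.5 * rng.normal(size=n)

weights = np.linspace(-4, 4, 81)
F_S = np.stack([(w * x - y) ** 2 for w in weights])   # shape (|H|, n)

def rademacher(A, num_samples=2000):
    # R(A) = (1/n) E_sigma[ sup_{a in A} sum_i sigma_i * a_i ],
    # estimated by averaging over random sign vectors sigma.
    total = 0.0
    for _ in range(num_samples):
        sigma = rng.choice([-1.0, 1.0], size=A.shape[1])
        total += np.max(A @ sigma)    # sup over the finite class
    return total / (num_samples * A.shape[1])

R = rademacher(F_S)
print("R(F o S) ~", R)
print("expected generalization gap G ~ 2 R(F o S) ~", 2 * R)
</syntaxhighlight>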
;Theorem
With probability <math>1 - \delta</math>, for every <math>h \in H</math>,
<math>L_D(h) - L_{S}(h) \leq 2 R(F \circ S) + c \sqrt{\frac{\log(4/\delta)}{n}}</math>
Example: <math>H = \{ h(x) = w^T x, \Vert w \Vert_2 \leq 1\}</math>
<math>R(H \circ S) \leq \frac{\max_i \Vert x^{(i)} \Vert_2}{\sqrt{n}}</math>
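A quick numerical check of this bound (assumed Gaussian data; it uses the closed form <math>\sup_{\Vert w \Vert_2 \leq 1} \sum_i \sigma_i w^T x^{(i)} = \Vert \sum_i \sigma_i x^{(i)} \Vert_2</math>):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)

# For H = {h(x) = w^T x : ||w||_2 <= 1} the inner sup has a closed form:
# sup_{||w||<=1} sum_i sigma_i w^T x_i = || sum_i sigma_i x_i ||_2,
# so R(H o S) can be estimated directly and compared with the bound.
n, d = 50, 10
X = rng.normal(size=(n, d))            # rows are the x^{(i)} (assumed data)

sigmas = rng.choice([-1.0, 1.0], size=(5000, n))
R_HS = np.mean(np.linalg.norm(sigmas @ X, axis=1)) / n

bound = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(n)
print(f"Monte Carlo R(H o S) ~ {R_HS:.4f}, bound = {bound:.4f}")
</syntaxhighlight>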
==Misc==