* Implicit bias exists in over-parameterized neural networks.
** Experimentally, Adam and Adagrad do not have implicit bias but have worse generalization error.
==DL Generalization==
Training set: <math>S = \{z^{(i)} = (x^{(i)}, y^{(i)}) \}_{i=1}^{n}</math>. 
Our samples are drawn i.i.d. from some unknown distribution: <math>z^{(i)} = (x^{(i)}, y^{(i)}) \sim P_{X, Y}</math>. 
Empirical risk minimization picks the hypothesis with the lowest training loss: <math>\min_{h \in H} L_{S}(h) \to h_{S}^{*}</math>. 
In practice, we cannot compute the population loss <math>L_{D}(h) = E_{z \sim P_{X, Y}} [l(h, z)]</math>, nor its minimizer <math>\min_{h \in H} L_{D}(h) \to h_{D}^*</math>. 
The goal is to learn something from the training set which has a low population loss. 
The estimation error is <math>L_{D}(h_{S}^*) - L_{D}(h_{D}^*)</math>, where <math>L_{D}(h_S^*)</math> is the population loss of the hypothesis obtained from the training set and <math>L_{D}(h_D^*)</math> is the minimum population loss over <math>H</math>.
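As a concrete illustration, the following is a minimal sketch of this setup, assuming squared loss, a linear hypothesis class, and a synthetic <math>P_{X, Y}</math>; the population loss is approximated with a large held-out sample, and all variable names are illustrative.
<syntaxhighlight lang="python">
# Minimal ERM sketch: minimize the training loss L_S over a linear hypothesis
# class, then approximate the population loss L_D with a large held-out sample
# drawn from the same (here synthetic) distribution P_{X,Y}.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 5, 100, 100_000
w_true = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)   # noisy labels
    return X, y

X_train, y_train = sample(n)        # training set S
X_test, y_test = sample(n_test)     # large sample used to approximate L_D

# h_S^*: empirical risk minimizer (least squares) over H = {x -> w^T x}
w_S = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

train_loss = np.mean((X_train @ w_S - y_train) ** 2)      # L_S(h_S^*)
approx_pop_loss = np.mean((X_test @ w_S - y_test) ** 2)   # approximately L_D(h_S^*)
print(train_loss, approx_pop_loss)
</syntaxhighlight>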
===Classical Learning Theory===
If for every <math>\epsilon, \delta > 0</math> there exists <math>n_0(\epsilon, \delta)</math> such that for all <math>P_{X, Y}</math> and all <math>n \geq n_0</math>,
<math>Pr(L_{D}(h_{S}^*) - L_{D}(h_{D}^*) \leq \epsilon) \geq 1-\delta</math>, then <math>H</math> is called PAC-learnable.
===Bias-variance tradeoff===
<math>L_{D}(h_{S}^*) = L_{D}(h_{D}^*) + [L_{D}(h_{S}^*) - L_{D}(h_{D}^*)]</math>
* <math>L_{D}(h_{D}^*)</math> is called the bias term, the loss of the best hypothesis in your hypothesis class. If the complexity of <math>H</math> increases, then the bias term will decrease.
* <math>L_{D}(h_{S}^*) - L_{D}(h_{D}^*)</math> is the estimation error or variance term.
* As the complexity of <math>H</math> increases, the bias decreases but the variance increases; a small numerical sketch of this tradeoff follows the list.
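The sketch below assumes squared loss and nested hypothesis classes indexed by polynomial degree; the data, noise level, and degrees are purely illustrative.
<syntaxhighlight lang="python">
# Bias-variance sketch: as the polynomial degree (complexity of H) grows, the
# training loss L_S(h_S^*) keeps dropping, while the held-out approximation of
# L_D(h_S^*) eventually increases again.
import numpy as np

rng = np.random.default_rng(1)
target = lambda x: np.sin(2 * np.pi * x)

def sample(m):
    x = rng.uniform(size=m)
    return x, target(x) + 0.2 * rng.normal(size=m)

x_train, y_train = sample(20)      # small training set S
x_test, y_test = sample(10_000)    # proxy for the population

for degree in [1, 3, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)   # ERM over degree-d polynomials
    train_loss = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_loss = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_loss, 4), round(test_loss, 4))
</syntaxhighlight>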
However, this view is not complete.
;Notations:
<math>f = l \circ h</math> 
<math>L_{S}(h) = \frac{1}{n} \sum_{i=1}^{n} f(z^{(i)})</math> 
<math>L_{D}(h) = E_{Z \sim P_{X, Y}}[ f(z) ] </math> 
<math>F = l \circ H = \{ l \circ h, \forall h \in H \}</math> 
<math>F \circ S = \left\{
\begin{bmatrix}
f(z^{(1)})\\
\vdots\\
f(z^{(n)})
\end{bmatrix}, \forall f \in F \right\}</math>
Generalization error: <math>L_{D}(h_{S}^*) - L_{S}(h_{S}^*)</math> 
We want to have a uniform bound on this error: <math>G = \sup_{h \in H} L_{D}(h) - L_{S}(h)</math>. 
We don't have the population distribution <math>D</math>. 
Instead, what we do is split the training data: <math>S = S_1 \cup S_2</math> with <math>|S_1|=|S_2|=\frac{n}{2}</math>. 
Now we have:<br>
<math>
\begin{aligned}
G \approx G' &= \sup_{h \in H} L_{S_1}(h) - L_{S_2}(h)\\
&= \sup_{h \in H} \frac{1}{n/2} \sum_{z^{(i)} \in S_1} f(z^{(i)}) - \frac{1}{n/2} \sum_{z^{(i)} \in S_2} f(z^{(i)})\\
&= \sup_{h \in H} \frac{2}{n} \sum_{i=1}^{n} \sigma_i f(z^{(i)}) & \text{where }\sigma_{i} = I[z^{(i)} \in S_1] - I[z^{(i)} \in S_2]
\end{aligned}
</math>
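A minimal sketch of computing <math>G'</math> for one fixed split, assuming a finite hypothesis class of threshold classifiers under 0-1 loss (the class and data are illustrative):
<syntaxhighlight lang="python">
# Sketch of G' = sup_{h in H} L_{S1}(h) - L_{S2}(h) for a small finite
# hypothesis class of threshold classifiers under 0-1 loss.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = (x > 0.3).astype(int)            # labels from an unknown threshold

perm = rng.permutation(n)            # random split S = S1 ∪ S2
S1, S2 = perm[: n // 2], perm[n // 2 :]

thresholds = np.linspace(-2, 2, 101)  # finite H = {x -> 1[x > t]}
G_prime = max(
    np.mean((x[S1] > t).astype(int) != y[S1]) - np.mean((x[S2] > t).astype(int) != y[S2])
    for t in thresholds
)
print(G_prime)
</syntaxhighlight>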
We can choose the partition at random so that, independently for each <math>i</math>,
<math>\sigma_i =
\begin{cases}
1 & \text{w.p. }1/2\\
-1 & \text{w.p. }1/2
\end{cases}
</math>
<math>G \approx E_{\sigma}[G'] = \frac{2}{n} E_{\sigma} \left[ \sup_{h \in H} \sum_{i=1}^{n} \sigma_i f(z^{(i)}) \right]</math>
<math>R(A) = \frac{1}{n} E_{\sigma} \left[ \sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i \right]</math> 
This quantity, defined for a set <math>A \subseteq \mathbb{R}^n</math>, is called the Rademacher complexity.
<math>\implies G \approx 2 R(F \circ S)</math>
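A Monte Carlo sketch of this estimate, representing <math>F \circ S</math> as a matrix with one row per hypothesis and one column per training example (continuing the illustrative threshold-classifier class under 0-1 loss):
<syntaxhighlight lang="python">
# Monte Carlo estimate of R(A) = (1/n) E_sigma[ sup_{a in A} sum_i sigma_i a_i ],
# where A = F∘S is a matrix: one row per hypothesis, one column per example,
# entry = loss of that hypothesis on that example, i.e. f(z^{(i)}).
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = (x > 0.3).astype(int)

thresholds = np.linspace(-2, 2, 101)
A = np.array([((x > t).astype(int) != y).astype(float) for t in thresholds])

num_draws = 2000
sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))   # Rademacher signs
# For each sigma draw, take the sup over rows of A of sum_i sigma_i * A[j, i]
sups = (sigma @ A.T).max(axis=1)
R = sups.mean() / n
print(R, 2 * R)   # R(F∘S) and the approximate bound 2 R(F∘S) on the generalization gap
</syntaxhighlight>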
;Theorem
With probability at least <math>1 - \delta</math>, for all <math>h \in H</math>, 
<math>L_D(h) - L_{S}(h) \leq 2 R(F \circ S) + c \sqrt{\frac{\log(4/\delta)}{n}}</math>
Example: <math>H = \{ h(x) = w^T x : \Vert w \Vert_2 \leq 1\}</math> 
<math>R(H \circ S) \leq \frac{\max_i \Vert x^{(i)} \Vert_2}{\sqrt{n}}</math>
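One standard way to verify this bound uses Cauchy–Schwarz to evaluate the supremum over <math>w</math>, Jensen's inequality, and <math>E_{\sigma}[\sigma_i \sigma_j] = \delta_{ij}</math>:
<math>
\begin{aligned}
R(H \circ S) &= \frac{1}{n} E_{\sigma} \left[ \sup_{\Vert w \Vert_2 \leq 1} \sum_{i=1}^{n} \sigma_i w^T x^{(i)} \right]
= \frac{1}{n} E_{\sigma} \left[ \left\Vert \sum_{i=1}^{n} \sigma_i x^{(i)} \right\Vert_2 \right]\\
&\leq \frac{1}{n} \sqrt{ E_{\sigma} \left[ \left\Vert \sum_{i=1}^{n} \sigma_i x^{(i)} \right\Vert_2^2 \right] }
= \frac{1}{n} \sqrt{ \sum_{i=1}^{n} \Vert x^{(i)} \Vert_2^2 }
\leq \frac{\max_i \Vert x^{(i)} \Vert_2}{\sqrt{n}}
\end{aligned}
</math>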


==Misc==