Under assumptions 1-3, GD behaves as:
<math>w(t) = w_{svm} \log(t) + \rho(t)</math> where <math>\Vert \rho(t) \Vert = o(\log \log t)</math>.
This implies <math>\lim_{t \to \infty} \frac{w(t)}{\Vert w(t) \Vert} = \frac{w_{svm}}{\Vert w_{svm} \Vert}</math>.
{{ hidden | Proof |
Consider exponential loss:
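A sketch of this step, under the usual convention that labels are absorbed into the data points (<math>x_i \leftarrow y_i x_i</math>), so the loss is <math>L(w) = \sum_{i=1}^{N} e^{-w^\top x_i}</math> and its negative gradient is
<math>
-\nabla L(w) = \sum_{i=1}^{N} e^{-w^\top x_i} x_i.
</math>
As training proceeds, <math>\Vert w(t) \Vert \to \infty</math> while the direction stabilizes, so the weights <math>e^{-w^\top x_i}</math> of points with larger margins vanish exponentially faster and, asymptotically, only the minimum-margin points (the support vectors) contribute to the gradient: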
<math>
\begin{aligned}
-\nabla L(w) &= \sum_{i \in SV} \alpha_i' x_i\\
\implies w_{\infty} &= \sum_{i \in SV} \alpha_i'' x_i
\end{aligned}
</math>
<math>
\begin{aligned}
\hat{w} &= \sum_{i=1}^{N} \alpha_i x_i\\
\alpha_i &= 0 && x_i \notin SV\\
\alpha_i &\neq 0 && x_i \in SV
\end{aligned}
</math>
These are the KKT conditions for SVM.
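For reference, a sketch of the KKT conditions being invoked, written with the same label-absorbed convention so the margin constraints read <math>w^\top x_i \geq 1</math>:
<math>
\begin{aligned}
&\text{primal: } \min_{w} \tfrac{1}{2} \Vert w \Vert^2 \quad \text{s.t. } w^\top x_i \geq 1 \text{ for all } i\\
&\text{stationarity: } w_{SVM} = \sum_{i=1}^{N} \alpha_i x_i, \quad \alpha_i \geq 0\\
&\text{complementary slackness: } \alpha_i \left( w_{SVM}^\top x_i - 1 \right) = 0
\end{aligned}
</math>
Complementary slackness forces <math>\alpha_i = 0</math> whenever <math>w_{SVM}^\top x_i > 1</math>, i.e. for every point that is not a support vector, which is exactly the pattern above.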
<math>
\implies \frac{w_\infty}{\Vert w_\infty \Vert} = \frac{w_{SVM}}{\Vert w_{SVM} \Vert}
</math>
}}
The rate of convergence is ''very'' slow:
<math>\left\Vert \frac{w(t)}{\Vert w(t) \Vert} - \frac{w_{SVM}}{\Vert w_{SVM} \Vert} \right\Vert = O\left(\frac{1}{\log(t)}\right)</math>
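The slow rate can be seen numerically. Below is a minimal sketch (not taken from the lecture): gradient descent on the exponential loss for a toy separable dataset whose hard-margin solution is known in closed form. The dataset, learning rate, and checkpoints are arbitrary choices for illustration.
<syntaxhighlight lang="python">
# Toy check of the statements above: on linearly separable data, GD on the
# exponential loss keeps growing ||w|| while the direction w/||w|| drifts toward
# the hard-margin SVM direction, and the gap shrinks roughly like 1/log(t).
import numpy as np

# Labels are absorbed into the data points (x_i <- y_i x_i), so separability
# means some w has w^T x_i > 0 for every i.
X = np.array([[1.0, 0.5],
              [1.0, -1.0],
              [4.0, 3.0]])

# For this particular dataset the hard-margin solution of
#   min ||w||^2  s.t.  w^T x_i >= 1
# is w_svm = (1, 0): the first two points are the support vectors (margin exactly 1)
# and (1, 0) = (2/3) x_1 + (1/3) x_2, which satisfies the KKT conditions.
w_svm = np.array([1.0, 0.0])
dir_svm = w_svm / np.linalg.norm(w_svm)

def grad_loss(w):
    """Gradient of L(w) = sum_i exp(-w^T x_i)."""
    return -(X.T @ np.exp(-X @ w))

w = np.zeros(2)
lr = 0.02
for t in range(1, 10**6 + 1):
    w -= lr * grad_loss(w)
    if t in (10**2, 10**3, 10**4, 10**5, 10**6):
        gap = np.linalg.norm(w / np.linalg.norm(w) - dir_svm)
        # ||w|| grows like log(t), the training loss goes to 0, and gap * log(t)
        # stays roughly constant, i.e. the directional gap is O(1/log t).
        print(f"t={t:>8d}  ||w||={np.linalg.norm(w):6.2f}  "
              f"loss={np.exp(-X @ w).sum():.2e}  "
              f"gap={gap:.4f}  gap*log(t)={gap * np.log(t):.3f}")
</syntaxhighlight>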
===Takeaway===
* Even when the training loss is already very small, continuing GD moves the solution toward the max-margin direction, which improves generalization.
* Validation or test loss may grow without bound, as fast as <math>\Omega(\log (t))</math>, so it is better to monitor classification error on the validation/test set rather than the raw loss value.
** Even if the validation loss is increasing, you should not stop GD, because the direction of <math>w</math> is still approaching the max-margin solution (see the sketch after this list).
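A minimal sketch of this monitoring point, under an assumed toy setup: the validation set contains one point that the max-margin classifier necessarily misclassifies, so the validation loss keeps growing (roughly like <math>\log t</math>, since the misclassified point's loss scales with <math>\Vert w \Vert</math>) while the validation 0-1 error does not change. Training here uses the logistic loss, for which the same implicit-bias result holds.
<syntaxhighlight lang="python">
# Validation loss can grow even though validation error stays flat.
import numpy as np

# Labels absorbed into the points (x_i <- y_i x_i).
X_train = np.array([[1.0, 0.5], [1.0, -1.0], [4.0, 3.0]])
X_val = np.array([[2.0, 0.5],     # consistent with the training data: loss -> 0
                  [-0.3, -0.2]])  # inconsistent point: always misclassified

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
lr = 0.05
for t in range(1, 10**6 + 1):
    # GD on the logistic loss L(w) = sum_i log(1 + exp(-w^T x_i))
    w += lr * (X_train.T @ sigmoid(-X_train @ w))
    if t in (10**2, 10**3, 10**4, 10**5, 10**6):
        margins = X_val @ w
        val_loss = np.mean(np.log1p(np.exp(-margins)))   # keeps growing ~ log t
        val_err = np.mean(margins <= 0)                   # stays constant
        print(f"t={t:>8d}  val loss={val_loss:7.3f}  val 0-1 error={val_err:.2f}")
</syntaxhighlight>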
;More recent work
* Implicit bias exists in over-parameterized neural networks.
** Experimentally, Adam and Adagrad do not exhibit this implicit bias and have worse generalization error.


==Misc==