Under assumptions 1-3, GD behaves as:
<math>w(t) = w_{svm} \log(t) + \rho(t)</math> where <math>\Vert \rho(t) \Vert = o(\log \log t)</math>.
This implies <math>\lim_{t \to \infty} \frac{w(t)}{\Vert w(t) \Vert} = \frac{w_{svm}}{\Vert w_{svm} \Vert}</math>.
{{ hidden | Proof |
Consider exponential loss:
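With labels absorbed into the <math>x_i</math> (a sketch of this intermediate step; the notation is assumed), the loss and its negative gradient are
<math>
\begin{aligned}
L(w) &= \sum_{i=1}^{N} e^{-w^T x_i}\\
-\nabla L(w) &= \sum_{i=1}^{N} e^{-w^T x_i} x_i
\end{aligned}
</math>
Since <math>w(t) \approx w_{svm} \log(t)</math>, each weight behaves like <math>e^{-w(t)^T x_i} \approx t^{-w_{svm}^T x_i}</math>: examples with margin <math>w_{svm}^T x_i > 1</math> decay strictly faster than the support vectors with margin exactly <math>1</math>, so asymptotically only the support vectors contribute to the direction of the gradient: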
<math>
\begin{aligned}
&& -\nabla L(w) &= \sum_{i \in SV} \alpha_i' x_i\\
\implies && w_{\infty} &= \sum_{i \in SV} \alpha_i'' x_i
\end{aligned}
</math>
<math>
\begin{aligned}
\hat{w} &= \sum_{i=1}^{N} \alpha_i x_i\\
\alpha_i &= 0 & x_i \notin SV\\
\alpha_i &\neq 0 & x_i \in SV\\
\end{aligned}
</math>
These are the KKT conditions for SVM.
<math>
\implies \frac{w_\infty}{\Vert w_\infty \Vert} = \frac{w_{SVM}}{\Vert w_{SVM} \Vert}
</math>
}}
Rate of convergence is ''very'' slow:
<math>\left\Vert \frac{w(t)}{\Vert w(t) \Vert} - \frac{w_{SVM}}{\Vert w_{SVM} \Vert } \right\Vert = O\left(\frac{1}{\log(t)}\right)</math>
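This slow directional convergence can be seen numerically. Below is a minimal sketch (the toy data, the learning rate, and the analytically known <math>w_{svm} = (1, 0)</math> for this data are assumptions of the example, not part of the notes):
<syntaxhighlight lang="python">
import numpy as np

# Toy separable data with labels absorbed into the points (z_i = y_i * x_i).
# For this data the hard-margin SVM direction (no bias term) is w_svm = (1, 0):
# the first two points are the support vectors, the last two lie beyond the margin.
Z = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [2.0,  0.3],
              [2.5, -0.7]])
w_svm = np.array([1.0, 0.0])

def neg_grad(w):
    """Negative gradient of the exponential loss L(w) = sum_i exp(-w . z_i)."""
    weights = np.exp(-Z @ w)               # per-example weights e^{-w . z_i}
    return (weights[:, None] * Z).sum(axis=0)

w = np.zeros(2)
lr = 0.1
for t in range(1, 100_001):
    w += lr * neg_grad(w)                  # gradient descent step
    if t in (10, 100, 1_000, 10_000, 100_000):
        gap = np.linalg.norm(w / np.linalg.norm(w) - w_svm)
        print(f"t={t:>7}  ||w||={np.linalg.norm(w):6.2f}  direction gap={gap:.4f}")
# ||w(t)|| keeps growing (roughly like log t) while the gap between w(t)/||w(t)||
# and w_svm/||w_svm|| shrinks only at an O(1/log t) rate, i.e. very slowly.
</syntaxhighlight>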
===Takeaway===
* Even when the loss is very small, if we continue GD optimization we approach the max-margin solution, which improves generalization.
* Validation or test loss may grow to infinity at a rate of <math>\Omega(\log (t))</math>. It is better to look at the classification error on the test/validation set rather than the pure loss value.
** Even if validation loss is increasing, you should not stop GD because it will approach the max-margin solution.
;More recent work
* Implicit bias exists in over-parameterized neural networks.
** Experimentally, Adam and Adagrad do not exhibit this implicit bias and have worse generalization error.
==Misc==