Under assumptions 1-3, GD behaves as:
<math>w(t) = w_{svm} \log(t) + \rho(t)</math> where <math>\Vert \rho(t) \Vert = o(\log \log t)</math>.
This implies <math>\lim_{t \to \infty} \frac{w(t)}{\Vert w(t) \Vert} = \frac{w_{svm}}{\Vert w_{svm} \Vert}</math>.
{{ hidden | Proof |
Consider exponential loss:
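With labels absorbed into the <math>x_i</math> (a sketch of this intermediate step; the notation is assumed), the loss and its negative gradient are
<math>
\begin{aligned}
L(w) &= \sum_{i=1}^{N} e^{-w^T x_i}\\
-\nabla L(w) &= \sum_{i=1}^{N} e^{-w^T x_i} x_i
\end{aligned}
</math>
Since <math>w(t) \approx w_{svm} \log(t)</math>, each weight behaves like <math>e^{-w(t)^T x_i} \approx t^{-w_{svm}^T x_i}</math>: examples with margin <math>w_{svm}^T x_i > 1</math> decay strictly faster than the support vectors with margin exactly <math>1</math>, so asymptotically only the support vectors contribute to the direction of the gradient: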
<math>
\begin{aligned}
&& -\nabla L(w) &= \sum_{i \in SV} \alpha_i' x_i\\
\implies && w_{\infty} &= \sum_{i \in SV} \alpha_i'' x_i
\end{aligned}
</math>
<math>
\begin{aligned}
\hat{w} &= \sum_{i=1}^{N} \alpha_i x_i\\
\alpha_i &= 0 & x_i \notin SV\\
\alpha_i &\neq 0 & x_i \in SV\\
\end{aligned}
</math>
These are the KKT conditions for SVM.
<math>
\implies \frac{w_\infty}{\Vert w_\infty \Vert} = \frac{w_{SVM}}{\Vert w_{SVM} \Vert}
</math>
}}
Rate of convergence is ''very'' slow:
<math>\left\Vert \frac{w(t)}{\Vert w(t) \Vert} - \frac{w_{SVM}}{\Vert w_{SVM} \Vert } \right\Vert = O\left(\frac{1}{\log(t)}\right)</math>
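This slow directional convergence can be seen numerically. Below is a minimal sketch (the toy data, the learning rate, and the analytically known <math>w_{svm} = (1, 0)</math> for this data are assumptions of the example, not part of the notes):
<syntaxhighlight lang="python">
import numpy as np

# Toy separable data with labels absorbed into the points (z_i = y_i * x_i).
# For this data the hard-margin SVM direction (no bias term) is w_svm = (1, 0):
# the first two points are the support vectors, the last two lie beyond the margin.
Z = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [2.0,  0.3],
              [2.5, -0.7]])
w_svm = np.array([1.0, 0.0])

def neg_grad(w):
    """Negative gradient of the exponential loss L(w) = sum_i exp(-w . z_i)."""
    weights = np.exp(-Z @ w)               # per-example weights e^{-w . z_i}
    return (weights[:, None] * Z).sum(axis=0)

w = np.zeros(2)
lr = 0.1
for t in range(1, 100_001):
    w += lr * neg_grad(w)                  # gradient descent step
    if t in (10, 100, 1_000, 10_000, 100_000):
        gap = np.linalg.norm(w / np.linalg.norm(w) - w_svm)
        print(f"t={t:>7}  ||w||={np.linalg.norm(w):6.2f}  direction gap={gap:.4f}")
# ||w(t)|| keeps growing (roughly like log t) while the gap between w(t)/||w(t)||
# and w_svm/||w_svm|| shrinks only at an O(1/log t) rate, i.e. very slowly.
</syntaxhighlight>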
===Takeaway===
* Even when the loss is very small, if we continue GD optimization we approach the max-margin solution, which improves generalization.
* Validation or test loss may grow to infinity at a rate of <math>\Omega(\log (t))</math>. It is better to look at the classification error on the test/validation set rather than the pure loss value.
** Even if validation loss is increasing, you should not stop GD because it will approach the max-margin solution.
;More recent work
* Implicit bias exists in over-parameterized neural networks.
** Experimentally, Adam and Adagrad do not exhibit this implicit bias and have worse generalization error.
==Misc==