Deep Learning
* Simon Du ''et al.''<ref name="du2019gradient"></ref>


===Start of Lecture 4 (September 10, 2020)===
This lecture is about Soudry ''et al.''<ref name="soudry2018implicit"></ref>.


This implies <math>\Vert w(t) \Vert</math> goes to infinity and the loss converges to 0.
Since the loss converges to 0, GD converges to a global min.
Assumption 3: The loss function has an exponential tail.
The exponential and logistic losses satisfy this assumption.
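For instance (a standard computation, not spelled out in these notes), the logistic loss <math>l(u) = \log(1 + e^{-u})</math> has an exponential tail because, for large <math>u</math>,
<math>
\log(1 + e^{-u}) = e^{-u} - \frac{1}{2} e^{-2u} + \dots \approx e^{-u}.
</math>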
;Theorem
Under assumptions 1-3, GD behaves as:
<math>w(t) = w_{svm} \log(t) + \rho(t)</math> where <math>\Vert \rho(t) \Vert = o(\log \log t)</math>. 
This implies <math>\lim_{t \to \infty} \frac{w(t)}{\Vert w(t) \Vert} = \frac{w_{svm}}{\Vert w_{svm} \Vert}</math>.
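To see why the last implication holds (a short check, not spelled out in these notes), divide the numerator and denominator by <math>\log(t)</math> and use <math>\Vert \rho(t) \Vert = o(\log t)</math>:
<math>
\frac{w(t)}{\Vert w(t) \Vert} = \frac{w_{svm} + \rho(t)/\log(t)}{\left\Vert w_{svm} + \rho(t)/\log(t) \right\Vert} \to \frac{w_{svm}}{\Vert w_{svm} \Vert}.
</math>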
{{ hidden | Proof |
Consider exponential loss: 
<math>l(u) = e^{-u}</math> 
In the asymptotic regime of GD on separable data, <math>w(t)^t x_i \to \infty</math> for every <math>i</math>. 
Assume the normalized iterates converge to a limit point <math>w_{\infty}</math>:
<math>\frac{w(t)}{\Vert w(t) \Vert} \to w_{\infty}</math>. 
This implies <math>w(t) = g(t) w_{\infty} + \rho(t)</math>, where <math>g(t) \to \infty</math> and <math>\rho(t)</math> is a lower-order residual.
<math>
\begin{aligned}
-\nabla L(w) &= \sum_{i=1}^{N} \exp(-w(t)^t x_i) x_i\\
&\to \sum_{i=1}^{N} \exp(-g(t)w_{\infty}^t x_i) x_i\\
\end{aligned}
</math>
Here the exponents <math>-g(t) w_{\infty}^t x_i</math> go to negative infinity, since <math>g(t) \to \infty</math> and <math>w_{\infty}^t x_i > 0</math> for every <math>i</math>. 
Asymptotically, only the samples with the smallest margin <math>w_{\infty}^t x_i</math>, i.e. <math>\operatorname{argmin}_{i} w_{\infty}^t x_i</math>, contribute to the gradient; these are exactly the support vectors. 
Thus: 
<math>
\begin{aligned}
&-\nabla L(w) = \sum_{i \in SV} \alpha_i' x_i\\
\implies &w_{\infty} = \sum_{i \in SV} \alpha_i'' x_i
\end{aligned}
</math>
with <math>\alpha_i', \alpha_i'' \geq 0</math>. Since <math>w_{\infty}</math> is a non-negative combination of the support vectors, which all attain the minimal margin, it satisfies the KKT conditions of the hard-margin SVM, so <math>w_{\infty} \propto w_{svm}</math>.
}}
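A minimal numerical sketch of this behavior (not from the lecture; the dataset, learning rate, and iteration counts below are illustrative choices): run GD with the exponential loss on a tiny separable dataset whose hard-margin SVM solution through the origin can be checked by hand to be <math>w_{svm} = (0.5, 0.5)</math>, with the single support vector <math>(1, 1)</math>. The normalized iterate should drift, logarithmically slowly, toward <math>w_{svm}/\Vert w_{svm} \Vert</math> while <math>\Vert w(t) \Vert</math> keeps growing, matching the <math>\log(t)</math> scaling in the theorem.
<syntaxhighlight lang="python">
import numpy as np

# Tiny separable dataset (labels y in {+1, -1}); the separator passes through the origin.
# For these points the hard-margin SVM solution (no bias) is w_svm = (0.5, 0.5):
# the single support vector is x = (1, 1) with y = +1, and the other two points
# satisfy their margin constraints strictly (KKT conditions checked by hand).
X = np.array([[ 1.0,  1.0],
              [ 3.0,  3.0],
              [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w_svm_dir = np.array([1.0, 1.0]) / np.sqrt(2.0)  # max-margin direction

def grad(w):
    """Gradient of the exponential loss L(w) = sum_i exp(-y_i w^T x_i).
    (Labels are written explicitly here; the notes absorb y_i into x_i.)"""
    margins = y * (X @ w)
    return -(np.exp(-margins) * y) @ X

w = np.zeros(2)
lr = 0.01
checkpoints = {10**2, 10**3, 10**4, 10**5, 10**6}
for t in range(1, 10**6 + 1):
    w -= lr * grad(w)
    if t in checkpoints:
        w_dir = w / np.linalg.norm(w)
        angle = np.arccos(np.clip(w_dir @ w_svm_dir, -1.0, 1.0))
        print(f"t={t:>7}  ||w(t)||={np.linalg.norm(w):7.3f}  "
              f"w/||w||={np.round(w_dir, 4)}  angle to w_svm={angle:.4f}")
</syntaxhighlight>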


==Misc==