</math>
This implies that at any iteration <math>t</math>, <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>.
Thus the loss decreases geometrically (i.e., exponentially fast), with convergence rate <math>(1-\eta \mu)</math>.
Similar results hold for SGD.
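This rate is easy to check numerically. Below is a minimal sketch (illustrative, not from the notes): gradient descent on the quadratic loss <math>L(w) = \tfrac{1}{2} w^\top A w</math>, which satisfies PL with <math>\mu = \lambda_{\min}(A)</math> and is <math>\beta</math>-smooth with <math>\beta = \lambda_{\max}(A)</math>; the matrix <math>A</math>, the dimension, and the step size <math>\eta = 1/\beta</math> are arbitrary illustrative choices.
<syntaxhighlight lang="python">
import numpy as np

# Minimal sketch (illustrative, not from the notes): gradient descent on the
# quadratic L(w) = 0.5 * w^T A w, which satisfies PL with mu = lambda_min(A)
# and is beta-smooth with beta = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.5 * np.eye(5)          # symmetric positive definite

eigs = np.linalg.eigvalsh(A)           # eigenvalues in ascending order
mu, beta = eigs[0], eigs[-1]           # PL constant and smoothness constant
eta = 1.0 / beta                       # step size eta <= 1/beta

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = rng.standard_normal(5)
L0 = loss(w)
for t in range(1, 51):
    w = w - eta * grad(w)
    # Verify the geometric bound L(w_t) <= (1 - eta*mu)^t * L(w_0).
    assert loss(w) <= (1 - eta * mu) ** t * L0 + 1e-12
print("Bound holds for 50 iterations; final loss:", loss(w))
</syntaxhighlight>
The choice <math>\eta = 1/\beta</math> matches the step-size condition under which the descent inequality in the Q&A below holds.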


{{hidden | Q&A |
There is a tradeoff in the choice of <math>\mu</math>: a large <math>\mu</math> gives a strong lower bound on the gradient norm, and hence large updates and a fast rate <math>(1-\eta \mu)</math>, but PL with a large <math>\mu</math> is a stronger condition that must hold over the entire ball containing the iterates. A small <math>\mu</math> is easier to satisfy, but the guaranteed decrease per step is smaller and convergence is slower.
From the proof above, we have:
<math>L(w_{t+1}) \leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2</math>.
We can use this inequality to show that the iterates never leave the ball, as sketched below.
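A sketch of that argument (assuming additionally that <math>L</math> is <math>\beta</math>-smooth with minimum value 0, so that <math>\Vert \nabla L(w) \Vert^2 \leq 2 \beta L(w)</math>; the constant <math>\beta</math> is not named explicitly above): the total distance traveled by gradient descent is bounded by
<math>
\Vert w_T - w_0 \Vert \leq \sum_{t=0}^{T-1} \Vert w_{t+1} - w_t \Vert = \eta \sum_{t=0}^{T-1} \Vert \nabla L(w_t) \Vert \leq \eta \sum_{t=0}^{T-1} \sqrt{2 \beta L(w_t)} \leq \eta \sqrt{2 \beta L(w_0)} \sum_{t=0}^{\infty} (1-\eta \mu)^{t/2} \leq \frac{2 \sqrt{2 \beta L(w_0)}}{\mu},
</math>
using <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math> and <math>1 - \sqrt{1-x} \geq x/2</math>. So all iterates stay in a ball of radius <math>O(\sqrt{L(w_0)}/\mu)</math> around <math>w_0</math>, and it suffices for PL to hold on that ball; formally this is combined with an induction over <math>t</math>, since the geometric decay itself assumes PL held at the earlier iterates.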