</math>
This implies that at any iteration <math>t</math>, <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>.
Thus the loss decreases geometrically (i.e., exponentially fast), with convergence rate <math>(1-\eta \mu)</math>.
Similar results hold for SGD.
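This rate is easy to check numerically. Below is a minimal sketch (illustrative, not from the notes): gradient descent on the quadratic loss <math>L(w) = \tfrac{1}{2} w^\top A w</math>, which satisfies PL with <math>\mu = \lambda_{\min}(A)</math> and is <math>\beta</math>-smooth with <math>\beta = \lambda_{\max}(A)</math>; the matrix <math>A</math>, the dimension, and the step size <math>\eta = 1/\beta</math> are arbitrary illustrative choices.
<syntaxhighlight lang="python">
import numpy as np

# Minimal sketch (illustrative, not from the notes): gradient descent on the
# quadratic L(w) = 0.5 * w^T A w, which satisfies PL with mu = lambda_min(A)
# and is beta-smooth with beta = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.5 * np.eye(5)          # symmetric positive definite

eigs = np.linalg.eigvalsh(A)           # eigenvalues in ascending order
mu, beta = eigs[0], eigs[-1]           # PL constant and smoothness constant
eta = 1.0 / beta                       # step size eta <= 1/beta

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = rng.standard_normal(5)
L0 = loss(w)
for t in range(1, 51):
    w = w - eta * grad(w)
    # Verify the geometric bound L(w_t) <= (1 - eta*mu)^t * L(w_0).
    assert loss(w) <= (1 - eta * mu) ** t * L0 + 1e-12
print("Bound holds for 50 iterations; final loss:", loss(w))
</syntaxhighlight>
The choice <math>\eta = 1/\beta</math> matches the step-size condition under which the descent inequality in the Q&A below holds.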


{{hidden | Q&A |
There is a tradeoff in the choice of <math>\mu</math>: a large <math>\mu</math> gives a strong lower bound on the gradient norm, and hence large updates and a fast rate <math>(1-\eta \mu)</math>, but PL with a large <math>\mu</math> is a stronger condition that must hold over the entire ball containing the iterates. A small <math>\mu</math> is easier to satisfy, but the guaranteed decrease per step is smaller and convergence is slower.
From the proof above, we have:
<math>L(w_{t+1}) \leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2</math>.
We can use this inequality to show that the iterates never leave the ball, as sketched below.
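A sketch of that argument (assuming additionally that <math>L</math> is <math>\beta</math>-smooth with minimum value 0, so that <math>\Vert \nabla L(w) \Vert^2 \leq 2 \beta L(w)</math>; the constant <math>\beta</math> is not named explicitly above): the total distance traveled by gradient descent is bounded by
<math>
\Vert w_T - w_0 \Vert \leq \sum_{t=0}^{T-1} \Vert w_{t+1} - w_t \Vert = \eta \sum_{t=0}^{T-1} \Vert \nabla L(w_t) \Vert \leq \eta \sum_{t=0}^{T-1} \sqrt{2 \beta L(w_t)} \leq \eta \sqrt{2 \beta L(w_0)} \sum_{t=0}^{\infty} (1-\eta \mu)^{t/2} \leq \frac{2 \sqrt{2 \beta L(w_0)}}{\mu},
</math>
using <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math> and <math>1 - \sqrt{1-x} \geq x/2</math>. So all iterates stay in a ball of radius <math>O(\sqrt{L(w_0)}/\mu)</math> around <math>w_0</math>, and it suffices for PL to hold on that ball; formally this is combined with an induction over <math>t</math>, since the geometric decay itself assumes PL held at the earlier iterates.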