</math>
This implies our loss at any iteration is <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>.
Thus we see a geometric or exponential decrease in our loss function with convergence rate <math>(1-\eta \mu)</math>.
Similar results hold for SGD.
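As a quick numerical sanity check (a sketch added here, not part of the proof), gradient descent on the quadratic <math>L(w) = \frac{\mu}{2} \Vert w \Vert^2</math>, which satisfies the PL inequality with constant <math>\mu</math>, exhibits exactly this geometric decrease; the values of <math>\mu</math>, <math>\eta</math>, and the dimension below are arbitrary illustrative choices.

<syntaxhighlight lang="python">
# Illustrative sketch (not from the notes): GD on L(w) = (mu/2)||w||^2,
# which satisfies the PL inequality (1/2)||grad L(w)||^2 >= mu * L(w).
# mu, eta, T, and the dimension are arbitrary choices for illustration.
import numpy as np

mu, eta, T = 0.5, 0.1, 50
w = np.ones(10)                      # starting point w_0

loss = lambda v: 0.5 * mu * (v @ v)  # L(w), with minimum value 0
grad = lambda v: mu * v              # gradient of L

L0 = loss(w)
for t in range(1, T + 1):
    w = w - eta * grad(w)            # gradient descent step
    # geometric decrease: L(w_t) <= (1 - eta*mu)^t * L(w_0)
    assert loss(w) <= (1 - eta * mu) ** t * L0 + 1e-12
print("loss after", T, "steps:", loss(w))
</syntaxhighlight>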
{{hidden | Q&A |
There is a tradeoff: a large <math>\mu</math> gives a stronger lower bound on the gradient norms, and thus fast (large) updates, but PL with a large <math>\mu</math> is a stronger condition that is harder to satisfy over the whole ball. A small <math>\mu</math> is easier to satisfy, but the updates, and hence convergence, are slower.
From the proof above, we have:
<math>L(w_{t+1}) \leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2</math>.
We can use this to prove the iterates stay in the ball: the per-step decrease in <math>L</math> bounds the gradient norms, which in turn bounds the total distance <math>\sum_t \eta \Vert \nabla L(w_t) \Vert</math> traveled from <math>w_0</math>.
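To make that argument concrete, the sketch below (my own illustration, not from the notes) numerically checks the per-step decrease <math>L(w_{t+1}) \leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2</math> and tracks <math>\Vert w_t - w_0 \Vert</math>, showing the iterates never leave a bounded ball around the initialization; the objective and all constants are assumptions made for this example.

<syntaxhighlight lang="python">
# Illustrative sketch: verify the per-step decrease and that the iterates
# stay inside a ball around w_0. The objective L(w) = (1/2)||w||^2 is
# 1-smooth, so any eta <= 1 satisfies the step-size condition.
import numpy as np

eta, T = 0.1, 200
w0 = np.array([1.0, -2.0])
w = w0.copy()

loss = lambda v: 0.5 * (v @ v)
grad = lambda v: v

max_dist = 0.0
for _ in range(T):
    g = grad(w)
    w_new = w - eta * g
    # per-step decrease from the proof: L(w_{t+1}) <= L(w_t) - (eta/2)||g||^2
    assert loss(w_new) <= loss(w) - 0.5 * eta * (g @ g) + 1e-12
    w = w_new
    max_dist = max(max_dist, np.linalg.norm(w - w0))

print("max distance from w_0:", max_dist)  # stays bounded: iterates remain in a ball
</syntaxhighlight>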