<math display="block">
\begin{aligned}
L(w_{t+1}) &= L(w_t) + (w_{t+1}-w_t)^T \nabla L(w_t) + \frac{1}{2}(w_{t+1}-w_t)^T H(w')(w_{t+1}-w_t)\\
&= L(w_t) + (-\eta)\nabla L(w_t)^T \nabla L(w_t) + \frac{1}{2}(-\eta)\nabla L(w_t)^T H(w')(-\eta) \nabla L(w_t)\\
&= L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 + \frac{\eta^2}{2} \nabla L(w_t)^T H(w') \nabla L(w_t)\\
&\leq L(w_t) - \eta \Vert \nabla L(w_t) \Vert^2 \left(1-\frac{\eta \beta}{2}\right) &\text{by assumption 3}\\
&\leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2 &\text{by assumption 4}
\end{aligned}
</math>
Applying the <math>\mu</math>-PL condition, <math>\frac{1}{2}\Vert \nabla L(w_t) \Vert^2 \geq \mu L(w_t)</math> (taking the optimal loss to be <math>0</math>), to the last inequality gives <math>L(w_{t+1}) \leq (1-\eta \mu) L(w_t)</math>.
This implies our loss at any iteration is <math>L(w_t) \leq (1-\eta \mu)^t L(w_0)</math>.
Thus we see a geometric or exponential decrease in our loss function with convergence rate <math>(1-\eta \mu)</math>.
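To make the rate concrete, here is a minimal numerical sketch (not part of the original derivation; the toy loss and all constants in it are illustrative assumptions) using a quadratic loss <math>L(w) = \frac{1}{2} w^T A w</math> with symmetric positive definite <math>A</math>. This loss is <math>\beta</math>-smooth with <math>\beta = \lambda_{\max}(A)</math> and satisfies the <math>\mu</math>-PL condition with <math>\mu = \lambda_{\min}(A)</math> (under the convention <math>\frac{1}{2}\Vert \nabla L(w) \Vert^2 \geq \mu L(w)</math>), so gradient descent with <math>\eta = 1/\beta</math> should stay below the envelope <math>(1-\eta \mu)^t L(w_0)</math>.
<syntaxhighlight lang="python">
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w with symmetric positive definite A
# (illustrative choice). For this loss, mu = lambda_min(A) and beta = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.1 * np.eye(5)

def L(w):
    return 0.5 * w @ A @ w

def grad_L(w):
    return A @ w

eigvals = np.linalg.eigvalsh(A)        # eigenvalues in ascending order
mu, beta = eigvals[0], eigvals[-1]     # PL constant and smoothness constant
eta = 1.0 / beta                       # step size satisfying eta <= 1/beta

w = rng.standard_normal(5)
L0 = L(w)
for t in range(1, 51):
    w = w - eta * grad_L(w)            # gradient descent step
    bound = (1 - eta * mu) ** t * L0   # geometric envelope from the derivation
    assert L(w) <= bound + 1e-12       # the bound holds at every iteration

print(f"mu = {mu:.3f}, beta = {beta:.3f}, final loss {L(w):.3e} <= bound {bound:.3e}")
</syntaxhighlight>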
If we don't observe convergence, we can't immediately conclude that the <math>\mu</math>-PL condition is violated.
It is possible that one of the other assumptions is violated instead (e.g. the learning rate is too large, as in the sketch below).
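As an illustration of this caveat (again a toy sketch with assumed constants, not from the original notes): on the same quadratic loss the <math>\mu</math>-PL condition still holds, but choosing a step size larger than <math>2/\beta</math> violates the learning-rate assumption and the loss grows instead of shrinking.
<syntaxhighlight lang="python">
import numpy as np

# Same toy quadratic as above: L(w) = 0.5 * w^T A w with SPD A.
# The mu-PL condition holds for this loss, but the step size below breaks the
# eta <= 1/beta assumption (it even exceeds 2/beta), so the iterates diverge.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.1 * np.eye(5)
beta = np.linalg.eigvalsh(A)[-1]       # smoothness constant (largest Hessian eigenvalue)

def L(w):
    return 0.5 * w @ A @ w

w = rng.standard_normal(5)
eta_bad = 2.5 / beta                   # too large: violates the learning-rate assumption
losses = [L(w)]
for _ in range(20):
    w = w - eta_bad * (A @ w)          # gradient descent step with the oversized step size
    losses.append(L(w))

print(f"initial loss {losses[0]:.3e}, loss after 20 steps {losses[-1]:.3e}")
</syntaxhighlight>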
==Misc==