Thus we see a geometric or exponential decrease in our loss function with convergence rate <math>(1-\eta \mu)</math>.
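
A minimal numerical sketch of this rate, assuming a simple quadratic loss <math display="inline">L(w) = \tfrac{1}{2} w^\top A w</math> whose eigenvalues lie between <math display="inline">\mu</math> and a smoothness constant <math display="inline">\beta</math> (so it satisfies the <math display="inline">\mu</math>-PL condition); the constants and variable names below are illustrative, not taken from the derivation above:

<syntaxhighlight lang="python">
import numpy as np

# Illustrative quadratic loss L(w) = 1/2 w^T A w with eigenvalues in [mu, beta].
# It satisfies the mu-PL condition, so gradient descent with eta <= 1/beta
# should keep L(w_t) at or below (1 - eta*mu)^t * L(w_0).
mu, beta = 0.5, 2.0              # assumed PL and smoothness constants
A = np.diag([mu, beta])
eta = 1.0 / beta                 # step size within the smoothness bound

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = np.array([1.0, 1.0])
envelope = loss(w)
for t in range(10):
    w = w - eta * grad(w)        # gradient descent step
    envelope *= 1 - eta * mu     # geometric envelope (1 - eta*mu)^t * L(w_0)
    print(f"step {t + 1}: loss = {loss(w):.6f}, bound = {envelope:.6f}")
# The observed loss stays at or below the geometric bound at every step.
</syntaxhighlight>
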
{{hidden | Q&A |
If we don't observe convergence, we cannot immediately conclude that the <math display="inline">\mu</math>-PL condition is violated.
It is possible that one of the other assumptions is violated instead (e.g. the learning rate is too large).
These arguments should hold for ReLU networks, since the non-differentiable points have measure 0, but this would require a more careful analysis.
}}
==Misc==