*<math>\Vert K(w) \Vert = O(1)</math>
*<math>\Vert H \Vert = O(1)</math> if <math>\phi'' \neq 0</math>
so we cannot use Hessian control: <math>\Vert H \Vert</math> does not vanish, so the transition-to-linearity argument fails (see the sketch below).
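To see why <math>\phi'' \neq 0</math> rules out Hessian control, here is a sketch via the chain rule (assuming a scalar output; <math>H_f</math> denotes the Hessian of <math>f</math>):

:<math>H_{\phi \circ f}(w) = \phi''(f(w))\, \nabla f(w) \nabla f(w)^\top + \phi'(f(w))\, H_f(w).</math>

The rank-one first term has spectral norm <math>|\phi''(f(w))| \cdot \Vert \nabla f(w) \Vert^2 = |\phi''(f(w))| \cdot \Vert K(w) \Vert = O(1)</math>, so even if <math>\Vert H_f \Vert</math> vanishes with width, <math>\Vert H_{\phi \circ f} \Vert</math> stays <math>O(1)</math>.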
;Lemma:
If <math>f</math> is <math>\mu</math>-PL and <math>|\phi'(f(w,x))| \geq \rho > 0</math>, then <math>\phi \circ f</math> is <math>\mu \rho^2</math>-PL.
This implies <math>L(w_t) \leq (1 - \eta \mu \rho^2)^t L(w_0)</math>.
Thus GD converges even though the model does not transition to a linear model.
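As a minimal numerical sketch (not from the lecture), the function <math>L(w) = w^2 + 3\sin^2(w)</math> is a standard example of a non-convex loss that satisfies the PL inequality (with <math>\mu = 1/32</math>) and is <math>\beta</math>-smooth with <math>\beta = 8</math>, so GD with <math>\eta = 1/\beta</math> should obey the geometric bound above:

<syntaxhighlight lang="python">
import numpy as np

# Non-convex but PL objective: L(w) = w^2 + 3 sin^2(w).
# It satisfies the PL inequality with mu = 1/32 and is
# beta-smooth with beta = 8, since L''(w) = 2 + 6 cos(2w) <= 8.
def loss(w):
    return w ** 2 + 3.0 * np.sin(w) ** 2

def grad(w):
    # d/dw [w^2 + 3 sin^2(w)] = 2w + 3 sin(2w)
    return 2.0 * w + 3.0 * np.sin(2.0 * w)

beta, mu = 8.0, 1.0 / 32.0
eta = 1.0 / beta   # standard step size for a beta-smooth loss
w = 2.5            # arbitrary starting point
L0 = loss(w)

for t in range(1, 2001):
    w -= eta * grad(w)
    # PL theory predicts L(w_t) <= (1 - eta * mu)^t * L(w_0).
    assert loss(w) <= (1.0 - eta * mu) ** t * L0 + 1e-12

print(f"L(w_0) = {L0:.3e}, L(w_2000) = {loss(w):.3e}")
</syntaxhighlight>

The assert checks the geometric bound at every step; since the loss is non-convex, the convergence here comes from the PL condition rather than from convexity.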
==Take-away==
Over-parameterization does not lead to linearization.
Over-parameterization leads to good conditioning, which yields the PL condition and hence convergence of GD/SGD.
Other papers:
* Simon Du ''et al.''<ref name="du2019gradient"></ref>


==Misc==