*<math>\Vert K(w) \Vert = O(1)</math>
*<math>\Vert H \Vert = O(1)</math> if <math>\phi'' \neq 0</math>
so we cannot use Hessian control: <math>\Vert H \Vert</math> does not vanish, so the transition-to-linearity argument fails (see the sketch below).
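To see why <math>\phi'' \neq 0</math> rules out Hessian control, here is a sketch via the chain rule (assuming a scalar output; <math>H_f</math> denotes the Hessian of <math>f</math>):

:<math>H_{\phi \circ f}(w) = \phi''(f(w))\, \nabla f(w) \nabla f(w)^\top + \phi'(f(w))\, H_f(w).</math>

The rank-one first term has spectral norm <math>|\phi''(f(w))| \cdot \Vert \nabla f(w) \Vert^2 = |\phi''(f(w))| \cdot \Vert K(w) \Vert = O(1)</math>, so even if <math>\Vert H_f \Vert</math> vanishes with width, <math>\Vert H_{\phi \circ f} \Vert</math> stays <math>O(1)</math>.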
;Lemma:
If <math>f</math> is <math>\mu</math>-PL and <math>|\phi'(f(w,x))| \geq \rho > 0</math>, then <math>\phi \circ f</math> is <math>\mu \rho^2</math>-PL.
This implies <math>L(w_t) \leq (1 - \eta \mu \rho^2)^t L(w_0)</math>.
Thus GD converges even though the model does not transition to a linear model.
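As a minimal numerical sketch (not from the lecture), the function <math>L(w) = w^2 + 3\sin^2(w)</math> is a standard example of a non-convex loss that satisfies the PL inequality (with <math>\mu = 1/32</math>) and is <math>\beta</math>-smooth with <math>\beta = 8</math>, so GD with <math>\eta = 1/\beta</math> should obey the geometric bound above:

<syntaxhighlight lang="python">
import numpy as np

# Non-convex but PL objective: L(w) = w^2 + 3 sin^2(w).
# It satisfies the PL inequality with mu = 1/32 and is
# beta-smooth with beta = 8, since L''(w) = 2 + 6 cos(2w) <= 8.
def loss(w):
    return w ** 2 + 3.0 * np.sin(w) ** 2

def grad(w):
    # d/dw [w^2 + 3 sin^2(w)] = 2w + 3 sin(2w)
    return 2.0 * w + 3.0 * np.sin(2.0 * w)

beta, mu = 8.0, 1.0 / 32.0
eta = 1.0 / beta   # standard step size for a beta-smooth loss
w = 2.5            # arbitrary starting point
L0 = loss(w)

for t in range(1, 2001):
    w -= eta * grad(w)
    # PL theory predicts L(w_t) <= (1 - eta * mu)^t * L(w_0).
    assert loss(w) <= (1.0 - eta * mu) ** t * L0 + 1e-12

print(f"L(w_0) = {L0:.3e}, L(w_2000) = {loss(w):.3e}")
</syntaxhighlight>

The assert checks the geometric bound at every step; since the loss is non-convex, the convergence here comes from the PL condition rather than from convexity.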
==Take-away==
Over-parameterization does not lead to linearization.
Over-parameterization leads to good conditioning, which yields the PL condition and hence convergence of GD/SGD.
Other papers:
* Simon Du ''et al.''<ref name="du2019gradient"></ref>


==Misc==