These arguments should hold for ReLU networks, since the non-differentiable points have measure zero, but it would require a more careful analysis.
}}
Caveat:
Here we assume the PL condition holds only in a ball <math>B(w_0, R) = \{w \mid \Vert w - w_0 \Vert \leq R\}</math> where <math>R \leq \frac{2\sqrt{2 \beta L(w_0)}}{\mu}</math>.
We need to show that our gradient updates never cause us to leave this ball.
The lengths of the update vectors are <math>\Vert w_{t+1} - w_{t}\Vert = \eta \Vert \nabla L(w_t)\Vert</math>.
To show that we never leave the ball, we need an upper bound on the gradient norms <math>\Vert \nabla L(w_t) \Vert</math>.
There is a tradeoff in the choice of <math>\mu</math>: a large <math>\mu</math> gives a stronger lower bound on the gradient norm (and hence faster convergence) and a smaller required radius <math>R</math>, but the PL inequality itself becomes harder to satisfy; a small <math>\mu</math> is easier to satisfy pointwise, but convergence is slower and, since <math>R</math> scales as <math>1/\mu</math>, the PL condition must hold over a larger ball.
From the proof above, we have:
<math>L(w_{t+1}) \leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2</math>.
Combined with the PL-based rate <math>L(w_t) \leq (1 - \eta\mu)^t L(w_0)</math> and the smoothness bound <math>\Vert \nabla L(w_t) \Vert \leq \sqrt{2 \beta L(w_t)}</math>, the step lengths form a geometric series whose sum is at most <math>\frac{2\sqrt{2 \beta L(w_0)}}{\mu} = R</math>, so the iterates never leave the ball.
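As a numerical sanity check, the ball argument can be illustrated with gradient descent on a one-dimensional nonconvex PL function. The test function <math>f(w) = w^2 + 3\sin^2(w)</math> and its constants <math>\beta = 8</math>, <math>\mu = 1/32</math> are assumptions of this sketch, not from the text:

```python
import numpy as np

# Illustrative sketch (assumed example): gradient descent on the
# nonconvex PL function f(w) = w^2 + 3 sin^2(w), which is
# beta-smooth with beta = 8 (since f'' = 2 + 6 cos(2w)) and
# satisfies the PL inequality with mu = 1/32.

def f(w):
    return w**2 + 3 * np.sin(w)**2

def grad(w):
    return 2 * w + 3 * np.sin(2 * w)

mu, beta = 1 / 32, 8.0
eta = 1 / beta                            # step size eta = 1/beta
w0 = 3.0
R = 2 * np.sqrt(2 * beta * f(w0)) / mu    # ball radius from the text

w = w0
for t in range(200):
    w_new = w - eta * grad(w)
    # per-step decrease: L(w_{t+1}) <= L(w_t) - (eta/2) ||grad L(w_t)||^2
    assert f(w_new) <= f(w) - (eta / 2) * grad(w)**2 + 1e-12
    w = w_new
    assert abs(w - w0) <= R               # iterates stay in B(w0, R)

print(f(w))  # loss after 200 steps, near the global minimum 0
```

In this example the radius bound is very loose: the iterates move only a few units from <math>w_0</math>, while <math>R</math> is in the hundreds.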
===Why do neural networks satisfy the ''conditioning'' assumptions?===
==Misc==