
These arguments should also hold for ReLU networks, since the set of non-differentiable points has measure zero, but this would require a more careful analysis.
}}
Caveat: 
Here we assume the PL condition holds for our loss on the ball <math>B(w_0, R) = \{w | \Vert w - w_0 \Vert \leq R\}</math> with <math>R = \frac{2\sqrt{2 \beta L(w_0)}}{\mu}</math>. 
We need to show that the gradient updates never cause us to leave this ball. 
The lengths of the update vectors are <math>\Vert w_{t+1} - w_{t}\Vert = \eta \Vert \nabla L(w_t)\Vert</math>. 
To show that we never leave the ball, we need to have an upper bound on the length of our gradients <math>\Vert \nabla L(w_t) \Vert</math>. 
There is a tradeoff in the choice of <math>\mu</math>: a large <math>\mu</math> gives a stronger lower bound on the gradient norm and requires PL only over a small ball, but <math>\mu</math>-PL with a large <math>\mu</math> is a harder condition to satisfy pointwise. A small <math>\mu</math> is easier to satisfy pointwise, but then PL must hold over a larger ball and each step makes less guaranteed progress. 
From the proof above, we have:
<math>L(w_{t+1}) \leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2</math>.
We can use this to prove that we stay in the ball: by smoothness, <math>\Vert \nabla L(w_t) \Vert \leq \sqrt{2 \beta L(w_t)}</math>, and linear convergence gives <math>L(w_t) \leq (1 - \eta \mu)^t L(w_0)</math>, so the total path length <math>\sum_t \eta \Vert \nabla L(w_t) \Vert</math> is bounded by a geometric series that sums to at most <math>\frac{2\sqrt{2 \beta L(w_0)}}{\mu} = R</math>.
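The following is a minimal numerical sketch of this argument (not part of the original notes): gradient descent on a synthetic quadratic loss <math>L(w) = \tfrac{1}{2} w^\top A w</math>, which is <math>\beta</math>-smooth and satisfies <math>\mu</math>-PL globally with <math>\beta = \lambda_{\max}(A)</math> and <math>\mu = \lambda_{\min}(A)</math>. The quadratic, dimension, and step size are illustrative assumptions; the script checks the descent inequality at every step and that the iterates never move farther than <math>R</math> from <math>w_0</math>.

<syntaxhighlight lang="python">
# Minimal sketch: gradient descent on a quadratic L(w) = 1/2 w^T A w, which is
# beta-smooth (beta = lambda_max(A)) and mu-PL (mu = lambda_min(A)) everywhere.
# We check the descent inequality and that the iterates stay in B(w_0, R).
import numpy as np

rng = np.random.default_rng(0)
d = 10
M = rng.standard_normal((d, d))
A = M.T @ M + 0.1 * np.eye(d)             # positive definite Hessian
eigvals = np.linalg.eigvalsh(A)           # ascending order
mu, beta = eigvals[0], eigvals[-1]        # PL constant, smoothness constant

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w0 = rng.standard_normal(d)
R = 2.0 * np.sqrt(2.0 * beta * loss(w0)) / mu   # ball radius from the argument above
eta = 1.0 / beta                                # step size eta <= 1/beta

w = w0.copy()
max_dist = 0.0
for t in range(2000):
    g = grad(w)
    w_new = w - eta * g
    # Descent inequality: L(w_{t+1}) <= L(w_t) - (eta/2) ||grad L(w_t)||^2
    assert loss(w_new) <= loss(w) - 0.5 * eta * (g @ g) + 1e-9
    w = w_new
    max_dist = max(max_dist, np.linalg.norm(w - w0))

print(f"R = {R:.2f}, max distance from w0 = {max_dist:.2f}, final loss = {loss(w):.2e}")
</syntaxhighlight>

In practice the iterates typically stay far inside the ball; the radius <math>\frac{2\sqrt{2 \beta L(w_0)}}{\mu}</math> is only an upper bound on the total path length.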
===Why do neural networks satisfy the ''conditioning'' assumptions?===


==Misc==