These arguments should hold for ReLU networks, since the non-differentiable points have measure zero, but it would require a more careful analysis.
}}
Caveat:
Here we assume the PL condition holds only in a ball <math>B(w_0, R) = \{w \mid \Vert w - w_0 \Vert \leq R\}</math> where <math>R \leq \frac{2\sqrt{2 \beta L(w_0)}}{\mu}</math>.
We need to show that our gradient updates never cause us to leave this ball.
The lengths of the update vectors are <math>\Vert w_{t+1} - w_{t}\Vert = \eta \Vert \nabla L(w_t)\Vert</math>.
To show that we never leave the ball, we need an upper bound on the gradient norms <math>\Vert \nabla L(w_t) \Vert</math>.
There is a tradeoff in the choice of <math>\mu</math>: a large <math>\mu</math> gives a stronger lower bound on the gradient norm (and hence faster convergence) and a smaller required radius <math>R</math>, but the PL inequality itself becomes harder to satisfy; a small <math>\mu</math> is easier to satisfy pointwise, but convergence is slower and, since <math>R</math> scales as <math>1/\mu</math>, the PL condition must hold over a larger ball.
From the proof above, we have:
<math>L(w_{t+1}) \leq L(w_t) - \frac{\eta}{2} \Vert \nabla L(w_t) \Vert^2</math>.
Combined with the PL-based rate <math>L(w_t) \leq (1 - \eta\mu)^t L(w_0)</math> and the smoothness bound <math>\Vert \nabla L(w_t) \Vert \leq \sqrt{2 \beta L(w_t)}</math>, the step lengths form a geometric series whose sum is at most <math>\frac{2\sqrt{2 \beta L(w_0)}}{\mu} = R</math>, so the iterates never leave the ball.
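As a numerical sanity check, the ball argument can be illustrated with gradient descent on a one-dimensional nonconvex PL function. The test function <math>f(w) = w^2 + 3\sin^2(w)</math> and its constants <math>\beta = 8</math>, <math>\mu = 1/32</math> are assumptions of this sketch, not from the text:

```python
import numpy as np

# Illustrative sketch (assumed example): gradient descent on the
# nonconvex PL function f(w) = w^2 + 3 sin^2(w), which is
# beta-smooth with beta = 8 (since f'' = 2 + 6 cos(2w)) and
# satisfies the PL inequality with mu = 1/32.

def f(w):
    return w**2 + 3 * np.sin(w)**2

def grad(w):
    return 2 * w + 3 * np.sin(2 * w)

mu, beta = 1 / 32, 8.0
eta = 1 / beta                            # step size eta = 1/beta
w0 = 3.0
R = 2 * np.sqrt(2 * beta * f(w0)) / mu    # ball radius from the text

w = w0
for t in range(200):
    w_new = w - eta * grad(w)
    # per-step decrease: L(w_{t+1}) <= L(w_t) - (eta/2) ||grad L(w_t)||^2
    assert f(w_new) <= f(w) - (eta / 2) * grad(w)**2 + 1e-12
    w = w_new
    assert abs(w - w0) <= R               # iterates stay in B(w0, R)

print(f(w))  # loss after 200 steps, near the global minimum 0
```

In this example the radius bound is very loose: the iterates move only a few units from <math>w_0</math>, while <math>R</math> is in the hundreds.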
===Why do neural networks satisfy the ''conditioning'' assumptions?===
==Misc==