5,337
edits
Line 199: | Line 199: | ||
;Can we prove convergence of GD for this NN? | ;Can we prove convergence of GD for this NN? | ||
<math>\nabla_{w_i} f(w, x) = \frac{1}{\sqrt{m}} v_i x \sigma'(w_i x)</math> | <math>\nabla_{w_i} f(w, x) = \frac{1}{\sqrt{m}} v_i x \sigma'(w_i x)</math> | ||
<math>K(w, | <math>K(w, x_j, x_j) = \frac{1}{m}\sum_{i=1}^{m} v_i^2 x^2 \left(\sigma'(w_i x)\right)^2 \equiv O(1)</math> are the j-th diagonal terms of the tangent kernel: | ||
<math>K(w) \in \mathbb{R}^{n \times n}</math>. | <math>K(w) \in \mathbb{R}^{n \times n}</math>. | ||
Then the trace of the tangent kernel is also <math>O(1)</math> so <math>\Vert K(w) \Vert = O(1)</math>. | Then the trace of the tangent kernel is also <math>O(1)</math> so <math>\Vert K(w) \Vert = O(1)</math>. | ||
<math>H_{ij} = \frac{1}{\sqrt{m}} v_i \sigma '' (w_j x) x^2 1_{i=j}</math> | <math>H_{ij} = \frac{1}{\sqrt{m}} v_i \sigma '' (w_j x) x^2 1_{i=j}</math> | ||
The hessian is a diagonal matrix. | The hessian is a diagonal matrix. | ||
<math>\Vert H \Vert_2 = \max_{i \in [m]} H_{ii} = \frac{x^2}{\sqrt{m}} \ | <math>\Vert H \Vert_2 = \max_{i \in [m]} H_{ii} = \frac{x^2}{\sqrt{m}} \max_{i \in [m]} | v_i \sigma '' (w_j x)| = O(\frac{1}{\sqrt{m}})</math> | ||
As m goes to infinity, our hessian <math>H</math> goes to 0 and tangent kernel <math>K</math> goes to a constant. | As m goes to infinity, our hessian <math>H</math> goes to 0 and tangent kernel <math>K</math> goes to a constant. | ||
Thus Hessian control implies convergence of GD/SGD. | Thus Hessian control implies convergence of GD/SGD. | ||
* However this argument assumes the model is ''almost linear''. | |||
* This hessian control can be extended to L-layer NN with <math>m=\omega(\frac{n}{\mu})</math> | |||
Example: | |||
<math>g(w, x) = \phi(\frac{1}{\sqrt{m}}\sum_{i=1}^{m}v_i \sigma(w_i x))</math> | |||
then | |||
*<math>\Vert K(w) \Vert = O(1)</math> | |||
*<math>\Vert H \Vert = O(1)</math> if <math>(\phi'' \neq 0)</math> | |||
==Misc== | ==Misc== |