===Why do neural networks satisfy the ''conditioning'' assumptions?===
;Hessian Control
We can show <math>\mu</math>-PL by showing two things: the smallest eigenvalue of the tangent kernel is bounded below at initialization, <math>\lambda_{\min}(K(w_0)) \geq \mu</math>, and the spectral norm of the Hessian, <math>\sup_{B} \Vert H(F)\Vert</math>, is small over a ball <math>B</math> around the initialization.
The tangent kernel is <math>\nabla F(w) \nabla F(w)^T</math>.
If the Hessian is bounded, then the gradients don't change too quickly, so if we are <math>\mu</math>-PL at the initialization, then we are <math>\mu</math>-PL in a ball around the initialization.
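One way to make this quantitative (a sketch, using the quantities above): <math display="inline">\lambda_{\min}(K(w)) = \sigma_{\min}(\nabla F(w))^2</math>, and singular values move by at most the spectral norm of the perturbation, so for any <math>w</math> in a ball <math>B</math> of radius <math>R</math> around <math>w_0</math>,

<math>\sigma_{\min}(\nabla F(w)) \geq \sigma_{\min}(\nabla F(w_0)) - \Vert \nabla F(w) - \nabla F(w_0) \Vert \geq \sqrt{\lambda_{\min}(K(w_0))} - R \sup_{B} \Vert H(F) \Vert.</math>

Hence if <math display="inline">\sup_{B} \Vert H(F) \Vert \leq \left(\sqrt{\lambda_{\min}(K(w_0))} - \sqrt{\mu}\right)/R</math>, then <math display="inline">\lambda_{\min}(K(w)) \geq \mu</math> throughout <math>B</math>.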
Suppose we have a NN with one-dimensional input <math>x \in \mathbb{R}</math> and output <math>y</math>:
<math>f(w, x) = \frac{1}{\sqrt{m}}\sum_{i=1}^{m} v_i \sigma(w_i x)</math>.
;Can we prove convergence of GD for this NN?
<math>\nabla_{w_i} f(w, x) = \frac{1}{\sqrt{m}} v_i x \sigma'(w_i x)</math>
<math>K(w, x, x) = \frac{1}{m}\sum_{i=1}^{m} v_i^2 x^2 \left(\sigma'(w_i x)\right)^2 = O(1)</math> are the diagonal terms of the tangent kernel
<math>K(w) \in \mathbb{R}^{n \times n}</math>.
Then the trace of the tangent kernel is also <math>O(1)</math> (with respect to the width <math>m</math>), so <math>\Vert K(w) \Vert = O(1)</math>.
<math>H_{ij} = \frac{1}{\sqrt{m}} v_i \sigma '' (w_i x) x^2 \, 1_{i=j}</math>
The Hessian is a diagonal matrix.
<math>\Vert H \Vert_2 = \max_{i \in [m]} |H_{ii}| = \frac{x^2}{\sqrt{m}} \max_{i \in [m]} |v_i \sigma '' (w_i x)| = O\left(\frac{1}{\sqrt{m}}\right)</math>
As <math>m \to \infty</math>, the Hessian <math>H</math> goes to <math>0</math> and the tangent kernel <math>K</math> tends to a constant.
Thus Hessian control implies convergence of GD/SGD.
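The two scalings above can be checked numerically. The sketch below (assumed setup: <math display="inline">\sigma = \tanh</math>, fixed second-layer weights <math display="inline">v_i = \pm 1</math>, Gaussian <math>w_i</math>, scalar input <math>x</math>; the function name is our own) computes the tangent-kernel entry <math>K(w, x, x)</math> and the Hessian norm <math>\Vert H \Vert_2</math> for growing width <math>m</math>:

```python
import numpy as np

def tangent_kernel_and_hessian_norm(m, x=1.0, seed=0):
    """For f(w, x) = (1/sqrt(m)) * sum_i v_i * tanh(w_i * x), return
    (K(w, x, x), ||H||_2) at a random Gaussian initialization w."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(m)           # first-layer weights at init
    v = rng.choice([-1.0, 1.0], size=m)  # fixed second-layer weights

    # Gradient wrt w_i: (1/sqrt(m)) * v_i * x * sigma'(w_i x), sigma' = 1 - tanh^2
    sig_p = 1.0 - np.tanh(w * x) ** 2
    grad = v * x * sig_p / np.sqrt(m)
    K = grad @ grad                      # K(w, x, x) = ||grad_w f||^2

    # Hessian is diagonal: H_ii = (1/sqrt(m)) * v_i * sigma''(w_i x) * x^2
    t = np.tanh(w * x)
    sig_pp = -2.0 * t * (1.0 - t ** 2)   # second derivative of tanh
    H_norm = np.max(np.abs(v * sig_pp * x ** 2)) / np.sqrt(m)
    return K, H_norm

for m in [100, 10_000, 1_000_000]:
    K, H_norm = tangent_kernel_and_hessian_norm(m)
    print(f"m={m:>9}: K = {K:.4f}   ||H||_2 = {H_norm:.6f}")
```

As the width grows, <math>K</math> hovers around a fixed constant while <math>\Vert H \Vert_2</math> shrinks by roughly a factor of <math display="inline">10 = \sqrt{100}</math> per row, matching the <math display="inline">O(1/\sqrt{m})</math> rate.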
==Misc==