Deep Learning

<math>K(w, x_j, x_j) = \frac{1}{m}\sum_{i=1}^{m} v_i^2 x_j^2 \left(\sigma'(w_i x_j)\right)^2 = O(1)</math> is the j-th diagonal term of the tangent kernel <math>K(w) \in \mathbb{R}^{n \times n}</math>.
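A quick numerical sanity check of this claim (a minimal sketch, not part of the original notes): it assumes the toy model <math>f(w, x) = \frac{1}{\sqrt{m}}\sum_i v_i \sigma(w_i x)</math> with <math>\sigma = \tanh</math> and <math>v_i, w_i</math> drawn i.i.d. from a standard normal, and checks that the diagonal kernel entry stays <math>O(1)</math> as the width <math>m</math> grows.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative assumptions: sigma = tanh, v_i, w_i ~ N(0, 1), scalar input x.
rng = np.random.default_rng(0)

def kernel_diagonal(m, x):
    """K(w; x, x) = (1/m) * sum_i v_i^2 * x^2 * sigma'(w_i x)^2."""
    v = rng.standard_normal(m)
    w = rng.standard_normal(m)
    sigma_prime = 1.0 - np.tanh(w * x) ** 2   # tanh'(z) = 1 - tanh(z)^2
    return np.mean(v ** 2 * x ** 2 * sigma_prime ** 2)

# The diagonal entry concentrates around a constant as m grows.
for m in [10, 1_000, 100_000]:
    print(m, kernel_diagonal(m, x=1.5))
</syntaxhighlight>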
Then the trace of the tangent kernel is also <math>O(1)</math>, so the eigenvalues are bounded: <math>\Vert K(w) \Vert = O(1)</math>.
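Spelling out this step (assuming, as usual, that the tangent kernel is positive semi-definite, so its largest eigenvalue is at most its trace):

<math>\Vert K(w) \Vert_2 = \lambda_{\max}(K(w)) \le \operatorname{tr}(K(w)) = \sum_{j=1}^{n} K(w, x_j, x_j) = O(1),</math>

where the <math>O(\cdot)</math> is with respect to the width <math>m</math>, with the number of samples <math>n</math> held fixed.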
<math>H_{ij} = \frac{1}{\sqrt{m}} v_i \sigma '' (w_i x) x^2 \, 1_{i=j}</math>
The Hessian is a diagonal matrix. The spectral norm of the Hessian (its largest eigenvalue in absolute value) is the largest absolute value among the diagonal entries:
<math>\Vert H \Vert_2 = \max_{i \in [m]} |H_{ii}| = \frac{x^2}{\sqrt{m}} \max_{i \in [m]} | v_i \sigma '' (w_i x)| = O\!\left(\frac{1}{\sqrt{m}}\right)</math>
As <math>m</math> goes to infinity, the Hessian <math>H</math> goes to 0 and the tangent kernel <math>K</math> goes to a constant.
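To see both scalings side by side, here is a minimal numerical sketch under the same illustrative assumptions as above (<math>\sigma = \tanh</math>, <math>v_i, w_i \sim N(0,1)</math>, a single scalar input): the spectral norm of the diagonal Hessian shrinks roughly like <math>1/\sqrt{m}</math>, while the kernel entry stays roughly constant.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative assumptions: sigma = tanh, v_i, w_i ~ N(0, 1), scalar input x.
rng = np.random.default_rng(0)
x = 1.5

for m in [100, 10_000, 1_000_000]:
    v = rng.standard_normal(m)
    w = rng.standard_normal(m)
    t = np.tanh(w * x)
    sigma_prime = 1.0 - t ** 2                 # tanh'
    sigma_second = -2.0 * t * (1.0 - t ** 2)   # tanh''
    # The Hessian of f w.r.t. w is diagonal: H_ii = v_i * sigma''(w_i x) * x^2 / sqrt(m).
    hessian_norm = np.max(np.abs(v * sigma_second * x ** 2)) / np.sqrt(m)
    kernel_entry = np.mean(v ** 2 * x ** 2 * sigma_prime ** 2)
    print(f"m={m:>9}  ||H||_2 ~ {hessian_norm:.4f}  K(w;x,x) ~ {kernel_entry:.4f}")
</syntaxhighlight>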