The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).


Go back to our 2-layer NN:<br>
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum b_i \sigma(a_i^t x)</math><br>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>
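As a concrete sanity check, here is a minimal NumPy sketch of this two-layer network and the gradient formulas above, assuming <math>\sigma</math> is ReLU and Gaussian initialization (names such as <code>two_layer</code> and <code>grads</code> are just illustrative):

<syntaxhighlight lang="python">
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0).astype(z.dtype)

def two_layer(a, b, x):
    """f_m(w, x) = (1/sqrt(m)) * sum_i b_i * sigma(a_i^t x); rows of a are the a_i."""
    m = len(b)
    return (b @ relu(a @ x)) / np.sqrt(m)

def grads(a, b, x):
    """Per-parameter gradients of f_m w.r.t. a_i (rows of grad_a) and b_i."""
    m = len(b)
    pre = a @ x                                                        # a_i^t x, shape (m,)
    grad_a = (b * relu_prime(pre))[:, None] * x[None, :] / np.sqrt(m)  # shape (m, d)
    grad_b = relu(pre) / np.sqrt(m)                                    # shape (m,)
    return grad_a, grad_b

# example: random Gaussian initialization
d, m = 5, 1000
rng = np.random.default_rng(0)
a = rng.standard_normal((m, d))
b = rng.standard_normal(m)
x = rng.standard_normal(d)
print(two_layer(a, b, x))
</syntaxhighlight>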


<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^t x) \sigma'(a_i^t x') (x x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>
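Continuing the sketch above (same assumptions, reusing <code>relu</code>, <code>relu_prime</code>, and <code>grads</code>), the empirical kernel can be computed directly and checked against the inner product of the full gradient vectors:

<syntaxhighlight lang="python">
def empirical_ntk(a, b, x, xp):
    """Empirical NTK K_m(x, x') = K_m^(a) + K_m^(b), from the closed-form gradients."""
    pre, pre_p = a @ x, a @ xp
    K_a = np.mean(b**2 * relu_prime(pre) * relu_prime(pre_p)) * (x @ xp)
    K_b = np.mean(relu(pre) * relu(pre_p))
    return K_a + K_b

xp = rng.standard_normal(d)

# K_m(x, x') should equal <grad f(x), grad f(x')> taken over all parameters
ga, gb = grads(a, b, x)
gap, gbp = grads(a, b, xp)
inner = np.sum(ga * gap) + gb @ gbp
print(empirical_ntk(a, b, x, xp), inner)   # the two numbers agree up to float error
</syntaxhighlight>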


* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
By the law of large numbers, as <math>m \to \infty</math>,<br>
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x x') \right]</math><br>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>


For ReLU activation with rotationally invariant <math>a</math> (e.g. Gaussian initialization), these expectations have closed forms, where <math>\theta(x, x')</math> is the angle between <math>x</math> and <math>x'</math>:<br>
<math>K^{(a)}(x, x') = \frac{(x x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} \left((\pi - \theta(x, x')) \cos\theta + \sin\theta\right)</math>
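These closed forms can be checked numerically. The sketch below assumes ReLU, <math>a \sim \mathcal{N}(0, I_d)</math>, and <math>E[b^2] = 1</math> (so both prefactors <math>E[b^2]</math> and <math>E[\Vert a \Vert^2]/d</math> equal 1), and reuses <code>empirical_ntk</code> and the variables from the sketches above:

<syntaxhighlight lang="python">
def analytic_ntk(x, xp):
    """Closed-form K^(a) + K^(b) for ReLU, a ~ N(0, I_d), E[b^2] = 1."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    theta = np.arccos(np.clip(x @ xp / (nx * nxp), -1.0, 1.0))
    K_a = (x @ xp) * (np.pi - theta) / (2 * np.pi)
    K_b = nx * nxp * ((np.pi - theta) * np.cos(theta) + np.sin(theta)) / (2 * np.pi)
    return K_a + K_b

# law of large numbers: the empirical kernel approaches the closed form as m grows
for m_wide in (100, 10_000, 1_000_000):
    a_wide = rng.standard_normal((m_wide, d))
    b_wide = rng.standard_normal(m_wide)
    print(m_wide, empirical_ntk(a_wide, b_wide, x, xp))
print("analytic:", analytic_ntk(x, xp))
</syntaxhighlight>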


;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues (''Hessian control'').
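One way to see this (a sketch of the standard argument, not spelled out above): expand <math>f</math> around the initialization <math>w_0</math>,<br>
<math>f(w, x) = f(w_0, x) + \nabla_w f(w_0, x)^t (w - w_0) + \frac{1}{2} (w - w_0)^t H_f(\tilde{w}, x) (w - w_0)</math><br>
for some <math>\tilde{w}</math> between <math>w</math> and <math>w_0</math>. The linear (kernel) term dominates whenever the quadratic remainder is negligible, i.e. when the Hessian eigenvalues stay bounded and <math>w</math> stays close to <math>w_0</math> during training.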


;Analyze GD:
As <math>\eta \to 0</math>, gradient descent becomes gradient flow:<br>
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br>
which in the limit <math>\eta \to 0</math> becomes the ODE <math>\frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>.
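A quick numerical illustration of this limit on a toy one-dimensional loss (an illustration only, not taken from the notes): as <math>\eta</math> shrinks, the GD iterate at time <math>T</math> approaches the gradient-flow solution.

<syntaxhighlight lang="python">
import numpy as np

# toy loss L(w) = 0.5 * w^2, so gradient flow is dw/dt = -w with solution w(t) = w(0) * exp(-t)
w0, T = 2.0, 1.0
for eta in (0.5, 0.1, 0.01, 0.001):
    w = w0
    for _ in range(int(T / eta)):
        w = w - eta * w                         # w(t+1) = w(t) - eta * dL/dw
    print(eta, w, "vs gradient flow:", w0 * np.exp(-T))
</syntaxhighlight>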


For the squared loss <math>L(w) = \frac{1}{2} \Vert \hat{y}(w) - y \Vert^2</math>, the gradient flow becomes<br>
<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math><br>
and, by the chain rule,<br>
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w} \hat{y}(w)^t \, \frac{dw(t)}{dt} \\
&= -\nabla_{w} \hat{y}(w)^t \, \nabla_{w} \hat{y}(w) \, (\hat{y}(w) - y) \\
&= -K(w) \, (\hat{y}(w) - y).
\end{aligned}
</math>


If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>, using that in this regime <math>K(w(t)) \approx K(w_0)</math>.<br>
This ODE implies <math>u(t) = e^{-K(w_0) t} u(0)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0</math> (positive definite), so <math>u(t) \to 0</math>.
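Finally, a small numerical sketch of this linearized dynamics (reusing <code>two_layer</code>, <code>grads</code>, and <code>empirical_ntk</code> from the earlier sketches; the squared loss and the time scaling <math>t = \eta \cdot \text{steps}</math> are assumptions of the illustration): the residual after gradient descent should roughly match <math>e^{-K(w_0) t} u(0)</math> in the over-parameterized regime.

<syntaxhighlight lang="python">
n = 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# NTK Gram matrix at initialization: K(w_0)[i, j] = K_m(x_i, x_j)
K0 = np.array([[empirical_ntk(a, b, X[i], X[j]) for j in range(n)] for i in range(n)])

def predict(a_cur, b_cur):
    return np.array([two_layer(a_cur, b_cur, X[i]) for i in range(n)])

u0 = predict(a, b) - y

# gradient descent on L(w) = 0.5 * ||y_hat(w) - y||^2
a_t, b_t = a.copy(), b.copy()
eta, steps = 0.01, 100
for _ in range(steps):
    u = predict(a_t, b_t) - y
    grad_a_total = np.zeros_like(a_t)
    grad_b_total = np.zeros_like(b_t)
    for i in range(n):
        gai, gbi = grads(a_t, b_t, X[i])
        grad_a_total += u[i] * gai
        grad_b_total += u[i] * gbi
    a_t -= eta * grad_a_total
    b_t -= eta * grad_b_total

# kernel prediction u(t) = exp(-K(w_0) t) u(0), via eigendecomposition of the symmetric K0
t = eta * steps
lam, V = np.linalg.eigh(K0)
u_pred = V @ (np.exp(-lam * t) * (V.T @ u0))

print(predict(a_t, b_t) - y)   # residual after training
print(u_pred)                  # should be close in the over-parameterized (lazy) regime
</syntaxhighlight>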