===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w(t), x_i) - y_i) \nabla_w f(w(t), x_i)</math>
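As a rough illustration (not from the notes), here is a minimal JAX sketch of this setup: the two-layer network with the <math display="inline">1/\sqrt{m}</math> scaling, the quadratic loss, and one gradient-descent step. The activation <math display="inline">\sigma</math> is taken to be ReLU, and the dimensions, data, and step size are arbitrary choices.

<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

def f(w, x):
    # Two-layer network: f(w, x) = (1/sqrt(m)) * sum_i b_i * sigma(a_i^t x), with sigma = ReLU.
    a, b = w                                   # a: (m, d) first layer, b: (m,) second layer
    return jnp.dot(b, jax.nn.relu(a @ x)) / jnp.sqrt(b.shape[0])

def loss(w, X, y):
    # Quadratic loss L(w) = (1/2) * sum_i (f(w, x_i) - y_i)^2
    preds = jax.vmap(lambda x: f(w, x))(X)
    return 0.5 * jnp.sum((preds - y) ** 2)

def gd_step(w, X, y, eta):
    # One gradient-descent update: w <- w - eta * grad L(w)
    grads = jax.grad(loss)(w, X, y)
    return jax.tree_util.tree_map(lambda p, g: p - eta * g, w, grads)

# N(0,1) initialization, as in the notes; sizes are illustrative.
key = jax.random.PRNGKey(0)
d, m, n = 5, 1000, 20
ka, kb, kx = jax.random.split(key, 3)
w0 = (jax.random.normal(ka, (m, d)), jax.random.normal(kb, (m,)))
X = jax.random.normal(kx, (n, d))
y = jnp.ones(n)

w1 = gd_step(w0, X, y, eta=1e-3)
</syntaxhighlight>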


# Initialize the weights from <math>N(0,1)</math>.
# Our weights update along a trajectory: <math>w(0), w(1), \ldots</math>
# Each <math>w(t)</math> is a weight matrix.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training.


* Not always the case, especially for small <math>m</math>! (See the sketch below.)
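One way to see lazy training empirically (an illustrative sketch, not from the notes; widths, data, step size, and step count are arbitrary) is to train a narrow and a wide network with the same recipe and compare how far the weights move relative to their initialization.

<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

def f(w, x):
    a, b = w
    return jnp.dot(b, jax.nn.relu(a @ x)) / jnp.sqrt(b.shape[0])

def loss(w, X, y):
    return 0.5 * jnp.sum((jax.vmap(lambda x: f(w, x))(X) - y) ** 2)

def train(w, X, y, eta=1e-3, steps=200):
    # Plain gradient descent for a fixed number of steps.
    for _ in range(steps):
        g = jax.grad(loss)(w, X, y)
        w = jax.tree_util.tree_map(lambda p, gi: p - eta * gi, w, g)
    return w

def flat(w):
    return jnp.concatenate([p.ravel() for p in w])

d, n = 5, 20
kx = jax.random.PRNGKey(0)
X = jax.random.normal(kx, (n, d))
y = jnp.sin(X[:, 0])

for m in (10, 10_000):
    ka, kb = jax.random.split(jax.random.PRNGKey(m))
    w0 = (jax.random.normal(ka, (m, d)), jax.random.normal(kb, (m,)))   # N(0,1) init
    wT = train(w0, X, y)
    # Relative movement of the weights; it shrinks as the width m grows (lazy training).
    rel = jnp.linalg.norm(flat(wT) - flat(w0)) / jnp.linalg.norm(flat(w0))
    print(f"m={m:6d}  relative weight change ~ {rel:.4f}")
</syntaxhighlight>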


Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0) + \ldots</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).
These features do not change during the optimization process because they are evaluated at the fixed initialization <math display="inline">w_0</math>.
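The fixed feature map and the resulting kernel can be computed directly. Below is an illustrative sketch (arbitrary sizes, ReLU as <math display="inline">\sigma</math>) that builds <math display="inline">\phi(x) = \nabla_w f(w_0, x)</math> with <code>jax.grad</code> and assembles the Gram matrix <math display="inline">K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle</math>.

<syntaxhighlight lang="python">
import jax
import jax.numpy as jnp

def f(w, x):
    a, b = w
    return jnp.dot(b, jax.nn.relu(a @ x)) / jnp.sqrt(b.shape[0])

def phi(w0, x):
    # Feature map: gradient of the scalar output w.r.t. all weights at w0, flattened.
    g = jax.grad(f)(w0, x)
    return jnp.concatenate([p.ravel() for p in jax.tree_util.tree_leaves(g)])

def ntk(w0, X):
    feats = jax.vmap(lambda x: phi(w0, x))(X)   # shape (n, number_of_parameters)
    return feats @ feats.T                      # K[i, j] = <phi(x_i), phi(x_j)>

key = jax.random.PRNGKey(0)
d, m, n = 5, 1000, 8
ka, kb, kx = jax.random.split(key, 3)
w0 = (jax.random.normal(ka, (m, d)), jax.random.normal(kb, (m,)))
X = jax.random.normal(kx, (n, d))

K = ntk(w0, X)    # fixed throughout training, since phi is evaluated at w_0
print(K.shape)    # (8, 8)
</syntaxhighlight>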


Go back to our 2-layer NN:<br>