===Linear Regression===
Assume we have a dataset:<br>
<math>\{(x_i, y_i)\}_{i=1}^{n}</math><br>
<math>y_i \in \mathbb{R}</math><br>
<math>x_i \in \mathbb{R}^d</math><br>
<math>f(w, x) = w^t x</math><br>
<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br>
<math>\min_{w} L(w)</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w(t))</math> where our gradient is:<br>
<math>\nabla L(w(t)) = \sum_{i=1}^{n}(f(w(t), x_i) - y_i) \nabla_{w} f(w(t), x_i) = \sum_{i=1}^{n}(f(w(t), x_i) - y_i) x_i</math>
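As a minimal illustration of these updates, here is a short NumPy sketch of gradient descent for linear regression on synthetic data (the step size <math>\eta</math>, the data sizes, and all variable names are illustrative choices, not part of the notes):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                  # rows are the x_i
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)    # y_i = w_true^T x_i + noise

w = np.zeros(d)
eta = 0.005                                  # fixed step size eta_t = eta
for t in range(1000):
    residual = X @ w - y                     # f(w, x_i) - y_i
    grad = X.T @ residual                    # gradient of (1/2) * sum of squared errors
    w = w - eta * grad

print(np.linalg.norm(w - w_true))            # should be small after training
</syntaxhighlight>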
===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w(t), x_i) - y_i) \nabla_w f(w(t), x_i)</math>
# Initialize the weights from <math>N(0, 1)</math>.
# The weights update along a trajectory: <math>w(0), w(1), \ldots</math>
# Each <math>w(t)</math> is a weight matrix.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training; a small numerical check is sketched below.
* Not always the case! Especially for small <math>m</math>.
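A minimal NumPy sketch of this setup, assuming a ReLU <math>\sigma</math>, synthetic data, and a fixed step size (all sizes and variable names are illustrative): it trains the two-layer network with gradient descent for two widths and reports how far the hidden-layer weights move from their <math>N(0,1)</math> initialization.
<syntaxhighlight lang="python">
import numpy as np

def train_and_measure(m, X, y, eta=1e-3, steps=1000, seed=0):
    """Train f(w, x) = (1/sqrt(m)) * sum_i b_i * relu(a_i^T x) by gradient descent
    and return the relative movement of the hidden weights from initialization."""
    rng = np.random.default_rng(seed)
    A0 = rng.normal(size=(m, X.shape[1]))        # a_i ~ N(0, 1)
    b0 = rng.normal(size=m)                      # b_i ~ N(0, 1)
    A, b = A0.copy(), b0.copy()
    for _ in range(steps):
        H = np.maximum(X @ A.T, 0.0)             # sigma(a_i^T x_j), shape n x m
        r = H @ b / np.sqrt(m) - y               # residuals f(w, x_j) - y_j
        # hand-written gradients of the quadratic loss
        gb = H.T @ r / np.sqrt(m)
        gA = ((r[:, None] * (H > 0)) * b).T @ X / np.sqrt(m)
        A, b = A - eta * gA, b - eta * gb
    return np.linalg.norm(A - A0) / np.linalg.norm(A0)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)
for m in (10, 2000):
    # with these illustrative settings, the relative movement is typically
    # much smaller for the wider network ("lazy" training)
    print(m, train_and_measure(m, X, y))
</syntaxhighlight>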
Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0)</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).
These features will not change during the optimization process because they are evaluated at the initialization <math display="inline">w_0</math>.
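To make the feature map concrete, here is a minimal NumPy sketch that forms <math>\phi(x) = \nabla_{w} f(w_0, x)</math> by hand for the two-layer ReLU network above (gradients with respect to both the <math>a_i</math> and the <math>b_i</math> at an <math>N(0,1)</math> initialization) and assembles the kernel matrix; all sizes and names are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def ntk_features(A0, b0, X):
    """phi(x) = grad_w f(w0, x) for f(w, x) = (1/sqrt(m)) * sum_i b_i * relu(a_i^T x)."""
    m, d = A0.shape
    H = np.maximum(X @ A0.T, 0.0)                    # sigma(a_i^T x_j), shape n x m
    dfdb = H / np.sqrt(m)                            # df/db_i = sigma(a_i^T x) / sqrt(m)
    # df/da_i = (1/sqrt(m)) * b_i * sigma'(a_i^T x) * x, with sigma'(z) = 1[z > 0]
    dfdA = ((H > 0) * b0)[:, :, None] * X[:, None, :] / np.sqrt(m)   # shape n x m x d
    return np.concatenate([dfdA.reshape(len(X), -1), dfdb], axis=1)  # one row per x_i

rng = np.random.default_rng(0)
n, d, m = 20, 5, 500
X = rng.normal(size=(n, d))
A0 = rng.normal(size=(m, d))                         # a_i ~ N(0, 1)
b0 = rng.normal(size=m)                              # b_i ~ N(0, 1)

Phi = ntk_features(A0, b0, X)
K = Phi @ Phi.T                                      # K[i, j] = <phi(x_i), phi(x_j)>
print(K.shape)                                       # (n, n)
</syntaxhighlight>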
Go back to our 2-layer NN:<br>