<math>f(w, x) = w^t \phi(x)</math>
Is this model linear in w? Yes!
Is this model linear in x? No!
<math>\min_w \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math>
Apply GD or convex optimization.
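Since the model is linear in <math>w</math>, the objective is convex with a closed-form least-squares solution. A minimal sketch in NumPy (the degree-2 feature map and toy data are illustrative assumptions, not from the notes):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    # Hypothetical fixed feature map: degree-2 polynomial features of a scalar x.
    return np.array([1.0, x, x**2])

# Toy data generated as y = 3x^2 - x + noise.
X = rng.uniform(-1, 1, size=50)
y = 3 * X**2 - X + 0.1 * rng.standard_normal(50)

Phi = np.stack([phi(x) for x in X])          # n x D design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # convex problem: solved in closed form
print(w)                                     # approximately [0, -1, 3]
</syntaxhighlight>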
Q: What is the issue here?
* <math>\phi</math> is fixed!
* <math>D = O(d^k)</math>
** For ImageNet, <math>d \approx 10^5</math>, so <math>D = O(10^{15})</math>.
;Kernel Trick:
We may have a closed-form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.
This is called the kernel function <math>K(x_i, x_j)</math>; evaluated on all pairs of data points, it gives the kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.
K is a PSD matrix.
Idea: In many cases, we can compute <math>K(x_i, x_j)</math> without "explicit" computation of <math>\phi(x_i)</math>.
;Polynomial Kernels
<math>K(x_i, x_j) = (c + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math>
Here <math>D = O(d^k)</math>, but computing <math>K(x_i, x_j)</math> directly is only <math>O(d)</math>.
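A minimal sketch in NumPy for <math>k=2</math>, <math>c=1</math> (the explicit map below is the standard construction, written out for illustration): the <math>O(d)</math> kernel evaluation equals the inner product of a <math>D = 1 + d + d^2</math> dimensional feature map that never needs to be formed.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d = 5
x, z = rng.standard_normal(d), rng.standard_normal(d)

def phi(v):
    # Explicit feature map for (1 + x^t z)^2: constant, linear, and quadratic terms.
    return np.concatenate([[1.0], np.sqrt(2) * v, np.outer(v, v).ravel()])

k_explicit = phi(x) @ phi(z)            # D = 1 + d + d^2 coordinates
k_trick = (1 + x @ z) ** 2              # O(d) work, no phi needed
print(np.isclose(k_explicit, k_trick))  # True
</syntaxhighlight>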
Many classical techniques can be ''kernelized'':
* SVM to Kernel SVM
* Ridge regression to Kernel ridge regression
* PCA to Kernel PCA
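For example, kernel ridge regression only needs the kernel matrix: the dual coefficients are <math>\alpha = (K + \lambda I)^{-1} y</math>, and predictions are kernel evaluations against the training points. A minimal sketch in NumPy (the toy data and hyperparameters are assumptions):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def K(A, B, c=1.0, k=3):
    # Pairwise polynomial kernel (c + x_i^t x_j)^k.
    return (c + A @ B.T) ** k

X = rng.uniform(-1, 1, (40, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2

lam = 1e-3
alpha = np.linalg.solve(K(X, X) + lam * np.eye(len(X)), y)  # dual coefficients

X_test = rng.uniform(-1, 1, (5, 2))
y_hat = K(X_test, X) @ alpha  # predict without ever forming phi
print(y_hat)
</syntaxhighlight>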
===Neural Networks===
Consider a two-layer neural network.
We can write the output as:
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math>
We use the quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w(t), x_i) - y_i) \nabla_w f(w(t), x_i)</math>
Initialize the weights i.i.d. from <math>N(0,1)</math>.
Our weights update along a trajectory: <math>w(0), w(1), \ldots</math>
Each <math>w(t)</math> is a weight matrix.
Empirical Observation: When the width <math>m</math> of the network is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training (see the sketch below).
* Not always the case! Especially for small <math>m</math>.
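A minimal sketch in NumPy of this observation (the toy data, ReLU activation, step size, and updating only the first layer are simplifying assumptions): the relative movement of the weights shrinks as <math>m</math> grows.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

for m in [10, 100, 1000, 10000]:
    A = rng.standard_normal((m, d))  # rows are a_i ~ N(0, I)
    b = rng.standard_normal(m)       # b_i ~ N(0, 1)
    A0 = A.copy()
    eta = 0.01
    for _ in range(500):
        Z = X @ A.T                            # n x m preactivations a_i^t x_j
        f = np.maximum(Z, 0) @ b / np.sqrt(m)  # network outputs
        r = f - y                              # residuals
        # grad wrt a_i: (1/sqrt(m)) sum_j r_j b_i sigma'(a_i^t x_j) x_j
        G = ((r[:, None] * (Z > 0)) * b).T @ X / np.sqrt(m)
        A -= eta * G
    rel = np.linalg.norm(A - A0) / np.linalg.norm(A0)
    print(f"m={m:6d}  relative weight change = {rel:.4f}")
</syntaxhighlight>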
Since the change in the model weights is not large, we can write the first-order Taylor approximation:
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0)</math>
This model is linear in <math>w</math>, with feature map
<math>\phi(x) = \nabla_{w} f(w_0, x)</math>
The kernel <math>K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).
Go back to our 2-layer NN:
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>
<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^t x) \sigma'(a_i^t x') (x^t x')</math>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>
* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
By the law of large numbers, as <math>m \to \infty</math>,
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x^t x') \right]</math>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>
For <math>\sigma = \text{ReLU}</math> and Gaussian initialization, these expectations have closed forms, with <math>\theta = \theta(x, x')</math> the angle between <math>x</math> and <math>x'</math>:
<math>K^{(a)}(x, x') = \frac{(x^t x') E[b^2]}{2 \pi} (\pi - \theta)</math>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta) \cos\theta + \sin\theta)</math>
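A minimal sketch in NumPy checking these limits at standard Gaussian initialization, where <math>E[b^2] = 1</math> and <math>E[\Vert a \Vert^2] = d</math> (the test points are arbitrary):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d = 3
x, xp = rng.standard_normal(d), rng.standard_normal(d)
theta = np.arccos(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)))

# Closed-form limits for sigma = ReLU, a ~ N(0, I), b ~ N(0, 1).
Ka = (x @ xp) * (np.pi - theta) / (2 * np.pi)
Kb = np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi) \
     * ((np.pi - theta) * np.cos(theta) + np.sin(theta))

for m in [100, 10_000, 1_000_000]:
    A = rng.standard_normal((m, d))
    b = rng.standard_normal(m)
    za, zb = A @ x, A @ xp
    # Empirical K_m^(a) and K_m^(b) as averages over the m hidden units.
    Ka_m = np.mean(b**2 * (za > 0) * (zb > 0)) * (x @ xp)
    Kb_m = np.mean(np.maximum(za, 0) * np.maximum(zb, 0))
    print(f"m={m:7d}  K_m = {Ka_m + Kb_m:.4f}  (limit {Ka + Kb:.4f})")
</syntaxhighlight>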
;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues. (Hessian Control)
;Analyze GD:
As <math>\eta \to 0</math>, GD becomes gradient flow:
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math>
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>
For the quadratic loss, <math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math>
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w} \hat{y}(w(t))^t \frac{dw(t)}{dt} \\
&= - \nabla_{w} \hat{y}(w(t))^t \nabla_{w} \hat{y}(w(t)) (\hat{y}(w(t)) - y)\\
&\approx -K(w_0) (\hat{y}(w(t))-y)
\end{aligned}
</math>
If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>.
This ODE implies <math>u(t) = \exp(-K(w_0) t) u(0)</math>.
In the over-parameterized case, <math>K(w_0) > 0</math> (positive definite), so the residual <math>u(t) \to 0</math> and GD fits the training data.
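A minimal sketch in NumPy of these dynamics, using a random positive-definite matrix as a stand-in for <math>K(w_0)</math>: the closed-form solution <math>u(t) = \exp(-K t) u(0)</math> agrees with small-step GD (the Euler discretization of the gradient flow), and both decay to zero.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
K = B @ B.T + 0.1 * np.eye(n)  # positive-definite stand-in for K(w_0)
u0 = rng.standard_normal(n)

# Closed form u(t) = exp(-K t) u(0) via the eigendecomposition of K.
t = 2.0
evals, evecs = np.linalg.eigh(K)
u_exact = evecs @ (np.exp(-evals * t) * (evecs.T @ u0))

# Euler discretization of du/dt = -K u, i.e., GD with a small step size.
eta = 1e-4
u = u0.copy()
for _ in range(int(t / eta)):
    u = u - eta * (K @ u)

print(np.linalg.norm(u - u_exact))  # small discretization error
print(np.linalg.norm(u_exact))      # -> 0 as t grows, since K > 0
</syntaxhighlight>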
* SOTA NNs often outperform kernel methods (even based on NTKs) | |||
==Misc==