Deep Learning

* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
* [https://www.youtube.com/user/soheilfeiz/videos Lecture Videos]


==Basics==
A refresher of [[Machine Learning]] and [[Supervised Learning]].


===Empirical risk minimization (ERM)===


===Linear Regression===
Assume we have a dataset:<br>
<math>\{(x_i, y_i)\}_{i=1}^{n}</math><br>
<math>y_i \in \mathbb{R}</math><br>
<math>x_i \in \mathbb{R}^d</math><br>
<math>f(w, x) = w^t x</math>


<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br>
<math>\min_{w} L(w)</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w_t)</math> where our gradient is:<br>
<math>\nabla L(w) = -\sum_{i=1}^{n}(y_i - f(w, x_i)) \nabla_{w} f(w, x_i) = -\sum_{i=1}^{n}(y_i - f(w, x_i)) x_i</math>
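A minimal NumPy sketch of this update on synthetic data (the data sizes, step size, and iteration count are illustrative choices, not from the notes):
<syntaxhighlight lang="python">
import numpy as np

# Synthetic data: n samples in d dimensions with a planted weight vector.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))          # rows are the x_i
w_true = rng.normal(size=d)
y = X @ w_true

# grad L(w) = -sum_i (y_i - w^t x_i) x_i = -X^t (y - X w)
w = np.zeros(d)
eta = 0.005
for t in range(1000):
    w += eta * (X.T @ (y - X @ w))   # w(t+1) = w(t) - eta * grad L(w(t))

print(np.linalg.norm(w - w_true))    # should be near 0
</syntaxhighlight>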




<math>f(w, x) = w^t \phi(x)</math><br>
Is this model linear in w? Yes!<br>
Is this model linear in x? No!<br>

<math>\min \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math><br>
Apply GD or convex optimization.
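For example, with the illustrative degree-2 feature map <math display="inline">\phi(x) = (1, x, x^2)^t</math> in one dimension, the model is linear in <math display="inline">w</math>, so ordinary least squares recovers a quadratic target exactly:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 2.0 - 3.0 * x + 0.5 * x**2        # a quadratic target

# phi(x) = (1, x, x^2): w^t phi(x) is linear in w, nonlinear in x.
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                              # approx [2.0, -3.0, 0.5]
</syntaxhighlight>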


* <math>\phi</math> is fixed!
* <math>D = O(d^k)</math>
** For ImageNet, <math display="inline">d</math> is approx <math display="inline">10^5</math> so <math display="inline">D=O(10^{15})</math>


;Kernel Trick:
We may have a closed form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.<br>
This is called the kernel function <math>K(x_i, x_j)</math> or kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.<br>
<math display="inline">K</math> is a PSD matrix.

Idea: In many cases, we can compute <math>K(x_i, x_j)</math> without "explicit" computation of <math>\phi(x_i)</math>.


;Polynomial Kernels
<math>K(x_i, x_j) = (1 + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math><br>
Here <math>D=O(d^k)</math>, but computing <math>K(x_i, x_j)</math> directly is only <math>O(d)</math>.
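A sketch of this cost gap (the constant 1 in the kernel and the sizes are illustrative choices):
<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations_with_replacement
from math import factorial

rng = np.random.default_rng(2)
d, k = 4, 3
xi, xj = rng.normal(size=d), rng.normal(size=d)

# Kernel trick: one inner product, O(d) work.
K = (1.0 + xi @ xj) ** k

def phi(x, k):
    # Explicit feature map via the lift z = (x, 1):
    # (z^t z')^k = sum_{|alpha|=k} multinomial(k; alpha) z^alpha z'^alpha,
    # so phi_alpha(z) = sqrt(multinomial(k; alpha)) z^alpha.
    z = np.append(x, 1.0)
    feats = []
    for idx in combinations_with_replacement(range(len(z)), k):
        counts = np.bincount(idx, minlength=len(z))
        coef = factorial(k) / np.prod([factorial(c) for c in counts])
        feats.append(np.sqrt(coef) * np.prod(z ** counts))
    return np.array(feats)

# Same value, but phi lives in D = O(d^k) dimensions.
print(np.isclose(K, phi(xi, k) @ phi(xj, k)))   # True
</syntaxhighlight>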


Many classical techniques can be ''kernelized'':
* SVM to Kernel SVM
* Ridge regression to Kernel ridge regression
* PCA to Kernel PCA


===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w_t, x_i) - y_i) \nabla_w f(w_t, x_i)</math>


# Initialize weights from <math display="inline">N(0,1)</math>.
# Our weights update along a trajectory: w(0), w(1), ...
# Each <math>w</math> is a weight matrix.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training.


* Not always the case! Especially for small <math>m</math>.
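A hedged sketch reproducing the lazy-training observation with a ReLU network (all sizes and the step size are illustrative, not from the lecture):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 20, 5, 5000
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Init N(0,1), as in the notes; the a_i are rows of A.
A = rng.normal(size=(m, d))
b = rng.normal(size=m)
A0 = A.copy()

eta = 0.01
for t in range(1000):
    H = np.maximum(X @ A.T, 0.0)        # sigma(a_i^t x_j), shape (n, m)
    S = (X @ A.T > 0).astype(float)     # sigma'(a_i^t x_j)
    r = H @ b / np.sqrt(m) - y          # residuals f(w, x_j) - y_j
    gA = ((S * r[:, None]).T * b[:, None]) @ X / np.sqrt(m)
    gb = H.T @ r / np.sqrt(m)
    A -= eta * gA
    b -= eta * gb

# Lazy training: for large m the weights barely move from initialization.
print(np.linalg.norm(A - A0) / np.linalg.norm(A0))
</syntaxhighlight>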


Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0)</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).


These features will not change during optimization because they are evaluated at <math display="inline">w_0</math>.

Go back to our 2-layer NN:<br>
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>


<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^t x) \sigma'(a_i^t x') (x^t x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>
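A short sketch of these finite-width sums at a random ReLU initialization (sizes are illustrative):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 100_000
a = rng.normal(size=(m, d))   # rows a_i ~ N(0, I_d)
b = rng.normal(size=m)        # b_i ~ N(0, 1)
x, xp = rng.normal(size=d), rng.normal(size=d)

# sigma = ReLU, so sigma'(z) = 1[z > 0].
Ka_m = np.mean(b**2 * (a @ x > 0) * (a @ xp > 0)) * (x @ xp)
Kb_m = np.mean(np.maximum(a @ x, 0) * np.maximum(a @ xp, 0))
print(Ka_m + Kb_m)            # the empirical NTK K_m(x, x')
</syntaxhighlight>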


* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
By the law of large numbers, as <math display="inline">m \to \infty</math>,<br>
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x^t x') \right]</math><br>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>


<math>K^{(a)}(x, x') = \frac{(x^t x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta(x, x')) \cos(\theta) + \sin \theta)</math>
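The finite-width sums from the sketch above can be checked against these closed forms, which correspond to ReLU <math display="inline">\sigma</math>; with <math display="inline">b \sim N(0,1)</math> and <math display="inline">a \sim N(0, I_d)</math> we have <math display="inline">E[b^2]=1</math> and <math display="inline">E[\Vert a \Vert^2]=d</math>:
<syntaxhighlight lang="python">
import numpy as np

# Same setup (and seed) as the sketch above: a ~ N(0, I_d), b ~ N(0, 1).
rng = np.random.default_rng(4)
d, m = 5, 100_000
a, b = rng.normal(size=(m, d)), rng.normal(size=m)
x, xp = rng.normal(size=d), rng.normal(size=d)
Ka_m = np.mean(b**2 * (a @ x > 0) * (a @ xp > 0)) * (x @ xp)
Kb_m = np.mean(np.maximum(a @ x, 0) * np.maximum(a @ xp, 0))

# Closed forms in the m -> infinity limit (ReLU sigma).
theta = np.arccos((x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp)))
Ka = (x @ xp) / (2 * np.pi) * (np.pi - theta)
Kb = (np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi)) \
     * ((np.pi - theta) * np.cos(theta) + np.sin(theta))
print(abs(Ka_m - Ka), abs(Kb_m - Kb))   # both shrink as m grows
</syntaxhighlight>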


;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues. (Hessian Control)


;Analyze GD:
As <math display="inline">\eta \to 0</math>, GD becomes gradient flow:<br>
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br>
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>


<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math><br>
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w} \hat{y}(w)^t \frac{dw(t)}{dt}\\
&= -\nabla_{w} \hat{y}(w)^t \nabla_{w} \hat{y}(w) (\hat{y}(w) - y)\\
&= -K(w) (\hat{y}(w) - y)
\end{aligned}
</math>


If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>.<br>
This ODE implies <math>u(t) = \exp(-K(w_0)t) u(0)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0 </math> (positive definite).
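A numerical check of this ODE, using a random positive-definite matrix in place of an actual NTK (all values illustrative):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.normal(size=(n, n))
K = M @ M.T + 0.1 * np.eye(n)       # a fixed positive-definite kernel matrix
u0 = rng.normal(size=n)             # initial residual u(0) = y_hat - y

# Forward-Euler integration of du/dt = -K u.
dt, steps = 1e-3, 5000
u = u0.copy()
for _ in range(steps):
    u = u - dt * (K @ u)

# Closed form u(t) = exp(-K t) u(0) via the eigendecomposition of K.
t = steps * dt
evals, evecs = np.linalg.eigh(K)
u_exact = evecs @ (np.exp(-evals * t) * (evecs.T @ u0))
print(np.linalg.norm(u - u_exact))  # small: Euler matches the exponential decay
</syntaxhighlight>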




;Intuition
<math>E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math><br>
Formalizing ''trial & error''. [Finn & Levine, ICML]
===Issues with Policy Gradient===
;High variance of gradient estimation
;Solutions
* Subtract a baseline
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau^{(i)}) R(\tau^{(i)})\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
\end{aligned}
</math>
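A hedged NumPy sketch of the final estimator, combining reward-to-go with the baseline; the array shapes and the helper name are assumptions for illustration:
<syntaxhighlight lang="python">
import numpy as np

def policy_gradient(grad_logp, rewards):
    """Monte-Carlo policy gradient with reward-to-go and a baseline.

    grad_logp: (N, T, P) array of grad_theta log pi(a_t | s_t) per step.
    rewards:   (N, T) array of rewards R(s_t, a_t).
    Shapes and names are illustrative, not from the lecture.
    """
    # Reward-to-go: sum_{t'=t}^{T} R(s_t', a_t') for each step t.
    rtg = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    # Baseline b = average total return over the N trajectories.
    b = rewards.sum(axis=1).mean()
    # Average over trajectories of sum_t grad log pi * (reward-to-go - b).
    return (grad_logp * (rtg - b)[:, :, None]).sum(axis=1).mean(axis=0)
</syntaxhighlight>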
;Some parameters can change <math>\pi_{\theta}</math> more than others, so it's hard to choose a fixed learning rate.
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math display="inline">F</math> is the Fisher information matrix.
===Actor-critic algorithms===
Have an actor <math>\pi_{\theta}</math> and a critic <math>V_{\phi}</math> or <math>Q_{\phi}</math>.<br>
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
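In the same illustrative setup as the sketch above, the critic's advantage <math display="inline">Q - V</math> replaces the Monte-Carlo return:
<syntaxhighlight lang="python">
import numpy as np

def actor_critic_gradient(grad_logp, Q, V):
    """Advantage-weighted policy gradient, with Q and V from the critic.

    grad_logp: (N, T, P) per-step grad_theta log pi(a_t | s_t);
    Q, V: (N, T) critic estimates. Names are illustrative.
    """
    adv = Q - V                     # advantage A(s_t, a_t) = Q - V
    return (grad_logp * adv[:, :, None]).sum(axis=1).mean(axis=0)
</syntaxhighlight>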
===Other topics in RL===
* Inverse RL
* Multi-agent RL
* Model-based RL
==Summary of Course==
;What we covered
* Supervised DL
* Unsupervised DL (GANs, VAEs)
* Self-supervised DL
* Meta-Learning
* Learning with Attention (Transformers)
* Deep RL
* Optimization
* Generalization
* Robustness
* Interpretability
;What we didn't cover
* Fairness
* Privacy & Ethics
* Bayesian DL
* Federated Learning
* Graph NNs


==Misc==
;Things which may be on the final
* Transformers
* Wasserstein distance
* Kernel methods


==Resources==