Deep Learning


* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
* [https://www.youtube.com/user/soheilfeiz/videos Lecture Videos]


==Basics==
A refresher of [[Machine Learning]] and [[Supervised Learning]].


===Empirical risk minimization (ERM)===
<math>
\begin{aligned}
\frac{1}{2}\Vert \nabla f(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)
\end{aligned}
</math>


===Linear Regression===
Assume we have a dataset:<br>
<math>\{(x_i, y_i)\}_{i=1}^{n}</math><br>
<math>y_i \in \mathbb{R}</math><br>
<math>x_i \in \mathbb{R}^d</math><br>
<math>f(w, x) = w^t x</math>

<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br>
<math>\min_{w} L(w)</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w_t)</math> where our gradient is:<br>
<math>\nabla L(w_t) = -\sum_{i=1}^{n}(y_i - f(w_t, x_i)) \nabla_{w} f(w_t, x_i) = -\sum_{i=1}^{n}(y_i - f(w_t, x_i)) x_i</math>
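To make the update concrete, here is a minimal numpy sketch of this gradient-descent loop on synthetic data; the dataset, step size, and iteration count are illustrative assumptions rather than anything from the lecture.

<syntaxhighlight lang="python">
import numpy as np

# Synthetic data: y_i = w_true^T x_i + noise (illustrative assumption)
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# L(w) = 1/2 * sum_i (y_i - w^T x_i)^2, so grad L(w) = -sum_i (y_i - w^T x_i) x_i
def grad(w):
    return -X.T @ (y - X @ w)

w = np.zeros(d)
eta = 0.01  # fixed step size (assumption)
for t in range(500):
    w = w - eta * grad(w)

print("distance to w_true:", np.linalg.norm(w - w_true))
</syntaxhighlight>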




<math>f(w, x) = w^t \phi(x)</math><br>
Is this model linear in w? Yes!<br>
Is this model linear in x? No!<br>

<math>\min \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math><br>
Apply GD or convex optimization.


* <math>\phi</math> is fixed!
* <math>D = O(d^k)</math>
** For ImageNet, <math display="inline">d</math> is approx <math display="inline">10^5</math> so <math display="inline">D=O(10^{15})</math>


;Kernel Trick:
We may have a closed form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.<br>
This is called the kernel function <math>K(x_i, x_j)</math> or kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.<br>
<math display="inline">K</math> is a PSD matrix.

Idea: In many cases, we can compute <math>K(x_i, x_j)</math> without "explicitly" computing <math>\phi(x_i)</math>.


;Polynomial Kernels
<math>K(x_i, x_j) = (c + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math><br>
Here <math>D=O(d^k)</math> but computing <math>K(x_i, x_j)</math> is <math>O(d)</math>.
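As a sanity check on the kernel trick, the following sketch (my own illustration, not from the notes) compares the degree-2 polynomial kernel <math>K(x, x') = (1 + x^t x')^2</math> against an explicit inner product in the corresponding feature space for <math display="inline">d = 2</math>:

<syntaxhighlight lang="python">
import numpy as np

def poly_kernel(x, xp, c=1.0, k=2):
    # K(x, x') = (c + x^T x')^k, computed with O(d) work
    return (c + x @ xp) ** k

def phi(x):
    # Explicit degree-2 feature map for d = 2 (feature dimension D = 6)
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

print(poly_kernel(x, xp))   # kernel trick: never forms phi
print(phi(x) @ phi(xp))     # explicit features: same value
</syntaxhighlight>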


Many classical techniques can be ''kernelized'':
* SVM to Kernel SVM
* Ridge regression to Kernel ridge regression
* PCA to Kernel PCA


===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w_t, x_i) - y_i) \nabla_w f(w_t, x_i)</math>


# Initialize weights from N(0,1).
# Our weights update along a trajectory: w(0), w(1), ...
# Each <math>w</math> is a weight matrix.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training.


* Not always the case! Especially for small <math>m</math>.


Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0) + ...</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).


These features will not change during the optimization process because we use <math display="inline">w_0</math>.

Go back to our 2-layer NN:<br>
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum b_i \sigma(a_i^t x)</math><br>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>


<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^t x) \sigma'(a_i^t x') (x^t x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>


* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
By the law of large numbers, as <math>m \to \infty</math>,<br>
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x^t x') \right]</math><br>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>


<math>K^{(a)}(x, x') = \frac{(x^t x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta(x, x')) \cos(\theta) + \sin \theta)</math>
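The finite-width kernels <math>K_m^{(a)}</math> and <math>K_m^{(b)}</math> are easy to form directly. Below is a small numpy sketch (my own illustration, assuming a ReLU activation and standard normal initialization) that builds both for a random two-layer network and checks that their sum is PSD:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 10, 5000          # n inputs, input dim d, width m (assumptions)

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
A = rng.normal(size=(m, d))                      # a_i ~ N(0, I)
b = rng.normal(size=m)                           # b_i ~ N(0, 1)

pre = X @ A.T                    # entries a_i^t x
sigma = np.maximum(pre, 0.0)     # ReLU sigma(a_i^t x)   (assumed activation)
dsigma = (pre > 0).astype(float) # sigma'(a_i^t x)

# K_m^(a)(x, x') = (1/m) sum_i b_i^2 sigma'(a_i^t x) sigma'(a_i^t x') (x^t x')
K_a = ((dsigma * b ** 2) @ dsigma.T / m) * (X @ X.T)
# K_m^(b)(x, x') = (1/m) sum_i sigma(a_i^t x) sigma(a_i^t x')
K_b = sigma @ sigma.T / m
K = K_a + K_b

print("min eigenvalue of K_m:", np.linalg.eigvalsh(K).min())  # nonnegative up to roundoff
</syntaxhighlight>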


;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues. (Hessian Control)


;Analyze GD:
As <math>\eta \to 0</math>, GD becomes gradient flow:<br>
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br>
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>


<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math><br>
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w} \hat{y}(w)^T \frac{dw(t)}{dt}\\
&= -\nabla_{w} \hat{y}(w)^T \nabla_{w} \hat{y}(w) (\hat{y}(w) - y)\\
&= -K(w(t)) (\hat{y}(w) - y)
\end{aligned}
</math>


If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>.<br>
This ODE implies <math>u(t) = u(0)\exp(-K(w_0)t)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0 </math> (positive definite).
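Here is a small numerical illustration of these linear dynamics (my own sketch for a fixed linear/linearized model, where the kernel is exactly constant): gradient descent with a small step size tracks the matrix-exponential solution of the ODE.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, D = 10, 50
Phi = rng.normal(size=(n, D))        # fixed features phi(x_i) of the linearized model
y = rng.normal(size=n)
K = Phi @ Phi.T                      # kernel matrix K_ij = <phi(x_i), phi(x_j)>

w = np.zeros(D)
eta, T = 1e-3, 200                   # small step size (assumption)
u0 = Phi @ w - y                     # initial residual u(0) = y_hat - y
for _ in range(T):
    w -= eta * Phi.T @ (Phi @ w - y) # GD on 1/2 ||Phi w - y||^2
u_gd = Phi @ w - y

# ODE prediction: u(t) = exp(-K t) u(0) with t = eta * T
evals, evecs = np.linalg.eigh(K)
u_ode = evecs @ (np.exp(-evals * eta * T) * (evecs.T @ u0))

print(np.linalg.norm(u_gd - u_ode))  # small: GD follows the gradient-flow solution
</syntaxhighlight>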


==Self-Supervised Learning==
We need pairs of similar images and dissimilar images.


;SimCLR [Chen ''et al.'' 2020]
# Create two correlated views of an image <math>x</math>: <math>\tilde{x}_i</math> and <math>\tilde{x}_j</math>.
#* Random cropping + resize
Training is <math>\min_{f,g} L = \frac{1}{N} \sum_{k=1}^{N} \frac{l(2k-1,2k) + l(2k, 2k-1)}{2}</math>.
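The per-pair loss <math>l(i,j)</math> is the normalized temperature-scaled cross entropy (NT-Xent) over the batch. Below is a minimal numpy sketch of the batch loss, written with 0-indexed pairs and an assumed temperature <math display="inline">\tau = 0.5</math>; it is an illustration, not the authors' code.

<syntaxhighlight lang="python">
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """NT-Xent loss for 2N projections z (rows); rows (2k, 2k+1) are the two
    augmented views of example k (0-indexed)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude the self-pair
    # log-softmax of each row over all other samples in the batch
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    idx = np.arange(z.shape[0])
    pos = idx ^ 1                                      # partner view: 0<->1, 2<->3, ...
    # averaging l(i, partner(i)) over all rows equals the symmetric sum above
    return -log_prob[idx, pos].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))   # 2N = 8 projections g(f(x)) of dimension 16
print(nt_xent_loss(z))
</syntaxhighlight>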


Practical observations:
* Composition of data augmentation is important.
* Larger batch sizes and longer training help self-supervised learning!
* Optimization over the MLP projection head <math>g</math> helps!
** <math>f</math> is able to keep some information useful for classification (e.g. color or orientation).
Empirical results:
* For ImageNet, top-1 accuracy increases as the number of parameters increases.
* After learning embeddings <math>f</math>, you don't need much labeled data for the downstream supervised training.
 
===Theory of self-supervised learning===
;[Arora ''et al.'' 2019]
 
Modeling of similar pairs:
* <math>x \sim D_{C1}(x)</math>
* <math>x^+ \sim D_{C1}(x)</math> is semantically similar
* <math>x^- \sim D_{C2}(x)</math> negative samples
Downstream task:
* pairwise classification
** nature picks two classes C1, C2
** generate samples from C1 & C2 and evaluate the classification loss
* Assume <math>m \to \infty</math> so just look at population loss
 
Notations:
* <math>L_{sup}(f)</math> is the supervised learning loss with optimal last layer
* <math>L_{un}(f)=E_{(x, x^+) \sim D_{sim}, x^- \sim D_{neg}}\left[\log(1+\exp(f(x)^T f(x^-) - f(x)^T f(x^+)))\right]</math> is the unsupervised (contrastive) logistic loss
 
Result 1:
* <math>L_{sup}^{mean}(f) \leq \frac{1}{1-\tau}(L_{un}(f)-\tau)</math> for all <math>f \in F</math>
** <math>\tau</math> is the collision probability for pair of random classes.
 
Ideal result:
* <math>L_{sup}(f^*) \leq \alpha L_{sup}(f)</math> for all <math>f</math>.
* In general it is impossible to get this.
 
 
;[Tosh et al. 2020]
This work connects self-supervised learning with multi-view representation theory. 
Start with (x, z, y) where x,z are two views of the data and y is the label. 
* x and z should share redundant information w.r.t. y. I.e. predicting y from x or z individually should be equivalent to predicting y from both.
* <math>e_x = E(E[Y|X] - E[Y|X,Z])^2</math> and <math>e_z = E(E[Y|Z] - E[Y|X,Z])^2</math> should be small
* Formulate contrastive learning as an artificial ''classification'' problem:
** Classify between <math>(x, z, 1)</math> and <math>(x, \tilde{z}, 0)</math>.
** <math>g^*(x,z) = \operatorname{argmin}(\text{classification loss}) = \frac{P_{X,Z}(x,z)}{P_X(x) P_Z(z)}</math>.
 
* x, z have redundant information w.r.t. y, so if we first predict z from x and then use that prediction to predict y, we should do well:
<math>
\begin{aligned}
\mu(x) &= E[E[Y|Z]|X=x] \\
&= \int E[Y|Z=z] P_{Z|X}(z|x) dz\\
&= \int E[Y|Z=z] g^*(x,z) P_Z(z) dz
\end{aligned}
</math>
 
Lemma: <math>E[(\mu(x) - E[Y | x,z])^2] \leq e_x + e_z + 2\sqrt{e_x e_z}</math>
 
Landmark embedding: <math>g^*</math> is computed via contrastive learning. 
How to embed x? <math>\phi(x) = (g^*(x, z_1),...,g^*(x,z_m))</math> 
z's are randomly generated from <math>P_Z</math>. 
Each <math>g^*</math> output is a single number.
 
* <math>\phi(\cdot)</math> is a good embedding if a linear classifier on <math>\phi</math> approximates <math>\mu(x)</math>.
** I.e. <math>\exists w : w^t \phi(x) \approx \mu(x)</math>
** <math>w^t \phi(x) = \frac{1}{m} \sum_{i=1}^{m} E[Y|Z_i] g^*(x, z_i) \stackrel{m \to \infty}{\rightarrow} \int E[Y|Z=z] g^*(x,z) P_{Z}(z) dz = \mu(x)</math>
 
Direct embedding 
Instead of learning a bivariate <math>g(x,z)</math>, learn <math>\eta(x)^T \psi(z)</math>. 
 
Lemma: For every <math>\eta: x \to \mathbb{R}^m</math> and <math>\psi:z \to \mathbb{R}^m</math>, there exists <math>w \in \mathbb{R}^m</math> such that <math>E[(w^t \eta(x) - \mu(x))^2] \leq E[Y^2] \, \epsilon_{direct}(\eta, \psi)</math> where <math>\epsilon_{direct}(\eta, \psi) = E[(\eta(x)^t \psi(z) - g^*(x,z))^2]</math>. 
Can we write <math>g^*(x,z)</math> as <math>\eta(x)^T \psi(z)</math>? 
* If there is a hidden variable H s.t. X and Z are conditionally independent given H.
# Case 1: If H is a discrete variable, then <math>g^*(x, z) = \eta^*(x)^t \psi^*(z)</math>.
# Case 2: There exists <math>\eta,\psi</math> such that <math>E(\eta(x)^t \psi(z) - g^*(x,z))^2 \leq o(\frac{1}{m})</math>.
 
==Meta Learning==
In many applications, we don't have a large training dataset. However, humans can adapt and learn ''on the fly''. 
The key is to use prior knowledge in performing new tasks. 
How can we train AI/ML models similarly? 
Goal of meta learning: Train a model on different learning tasks such that it can solve new tasks using only a small number of training samples.
 
Few shot classification problem: 
Inputs: <math>D_{metatrain} = \{(D_i^{train}, D_i^{test})\}_{i=1}^{n}</math>. 
<math>
\begin{cases}
D_{i}^{train} = \{(x_j^i, y_j^i) \}_{j=1}^{k}\\
D_{i}^{test} = \{(x_j^i, y_j^i) \}_{j=1}^{k'}
\end{cases}
</math>
* <math>i</math> is the index of the task
* <math>j</math> is the index of the sample
 
K-shot classification means k training samples per label. Some papers use k as the size of the whole training set. 
Q: How to use <math>D_{metatrain}</math> to solve meta-test tasks more effectively?
 
* Use <math>D_{metatrain}</math> to learn meta parameters <math>\theta</math> such that:
* Base learner <math>A</math> outputs task-specific model parameters <math>\phi_i = A(D_{i}^{train}, \theta)</math> good for <math>D_{i}^{test}</math>.
* Training procedure:
** Loss: <math>\min_{\theta} \sum_{i=1}^{n} loss(D_{i}^{test}, \phi_i)</math> where <math>\phi_i = A(D_{i}^{train}, \theta)</math>.
* Test time: given a new task <math>(D^{train}, D^{test})</math>. Apply <math>A(D^{train}, \theta^*)</math> to get <math>\phi</math>.
 
[Finn et al. 2017]
* Idea: train a model that is easy to fine-tune on new tasks
* one-step fine tuning: <math>\theta - \alpha \nabla L(\theta, D_{i}^{train})</math> gives us <math>\phi_{i}</math>
* Evaluate <math>\phi_i</math> on <math>D_i^{test}</math>
* <math>\min_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \sum_{i=1}^{n} L(\theta - \alpha \nabla L(\theta, D_i^{train}), D_i^{test})</math>
 
1. Model-agnostic meta learning (MAML)
* Use GD to optimize over <math>\theta</math>
** <math>\nabla_\theta = \sum_{i=1}^{n}(\nabla_{\theta} \phi_i) \nabla_{\phi} L(\phi_i, D_i^{test})</math>
** <math>(\nabla_{\theta} \phi_i)</math> involves second-order derivatives which are expensive.
* First-order MAML: just ignore the <math>\nabla_{\theta} \phi_i</math> term, i.e., replace it with the identity matrix (see the sketch below).
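Below is a minimal numpy sketch of first-order MAML on a toy family of linear-regression tasks; the task distribution, one-step inner loop, and step sizes are my own illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 10
alpha, beta = 0.01, 0.01          # inner / outer step sizes (assumptions)
w_center = rng.normal(size=d)     # tasks are centered at w_center (toy setup)

def sample_task():
    """Toy task: linear regression y = w_task^T x with w_task near w_center."""
    w_task = w_center + 0.1 * rng.normal(size=d)
    X_tr, X_te = rng.normal(size=(k, d)), rng.normal(size=(k, d))
    return (X_tr, X_tr @ w_task), (X_te, X_te @ w_task)

def grad(w, X, y):
    return X.T @ (X @ w - y)      # gradient of 1/2 ||Xw - y||^2

theta = np.zeros(d)
for _ in range(1000):
    (X_tr, y_tr), (X_te, y_te) = sample_task()
    # inner step: phi_i = theta - alpha * grad L(theta, D_i^train)
    phi = theta - alpha * grad(theta, X_tr, y_tr)
    # first-order MAML: drop d(phi)/d(theta) and use grad_phi L(phi, D_i^test) directly
    theta -= beta * grad(phi, X_te, y_te)

print(np.linalg.norm(theta - w_center))   # the meta-initialization drifts to the task center
</syntaxhighlight>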
 
Reptile [Nichol et al. 2018]
2. <math>A</math> is a simple linear/non-parametric learning on data embeddings computed via <math>f_{\theta}</math>
 
[Lee et al. 2019]
* <math>f_{\theta}</math> is used to compute embeddings
* <math>A</math> is a linear classifier (e.g. SVM)
* Use dual form of SVM so # of optimization variables = # training samples * # classes
 
3. <math>A</math> is a non-parametric learning
* Embedding: <math>\tilde{x} = f_{\theta}(x)</math>
[Snell et al. 2017]
* Define prototypes (cluster centers); see the sketch after this list:
** <math>c_k = \frac{1}{|S_k|} \sum_{(x,y) \in S_k} f_{\theta}(x)</math>
** <math>P_{\theta}(y=k|x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}</math>
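A minimal numpy sketch of this classification rule, assuming the embeddings <math>f_{\theta}(x)</math> have already been computed and using squared Euclidean distance for <math>d(\cdot, \cdot)</math>:

<syntaxhighlight lang="python">
import numpy as np

def proto_predict(support_emb, support_labels, query_emb, n_classes):
    """Prototypical-network prediction: softmax over negative distances to prototypes."""
    protos = np.stack([support_emb[support_labels == k].mean(axis=0)
                       for k in range(n_classes)])                    # c_k
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # d(f(x), c_k)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)                       # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)                           # P_theta(y = k | x)

rng = np.random.default_rng(0)
# Pretend f_theta has already mapped the images to 16-d embeddings (assumption)
support_emb = rng.normal(size=(10, 16))
support_labels = np.repeat(np.arange(5), 2)    # 5-way, 2-shot support set
query_emb = rng.normal(size=(3, 16))
print(proto_predict(support_emb, support_labels, query_emb, n_classes=5))
</syntaxhighlight>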
 
4. <math>A</math> is a black box (e.g. LSTM).
 
==LSTMs and transformers==
Lecture 23 (November 19, 2020)
 
===Recurrent Neural Networks (RNNs)===
Hidden state: <math>h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)</math> 
Prediction at t: <math>y_t = W_{hy} h_{t}</math>
 
;Backpropagation through time
If <math>W</math> has largest singular value < 1, then gradient vanishes. 
If <math>W</math> has largest singular value > 1, then gradient explodes. 
Typically, gradient vanishes because of initialization of <math>W</math>.
 
===Long Short Term Memory (LSTMs)===
Goal is to solve the vanishing and exploding gradient problem.
 
LSTM has several gates:
* Input gate: <math>i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)</math>
* Forget gate: <math>f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)</math>
* Output Gate: <math>o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)</math>
* Candidate cell: <math>\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)</math>
* Cell state: <math>c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t</math>
* Hidden state: <math>h_t = o_t \odot \tanh(c_t)</math>
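A minimal numpy sketch of one LSTM step implementing the gates above; the weight shapes and small random initialization are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the gate equations above."""
    Wxi, Whi, bi, Wxf, Whf, bf, Wxo, Who, bo, Wxc, Whc, bc = params
    i_t = sigmoid(Wxi @ x_t + Whi @ h_prev + bi)        # input gate
    f_t = sigmoid(Wxf @ x_t + Whf @ h_prev + bf)        # forget gate
    o_t = sigmoid(Wxo @ x_t + Who @ h_prev + bo)        # output gate
    c_tilde = np.tanh(Wxc @ x_t + Whc @ h_prev + bc)    # candidate cell
    c_t = f_t * c_prev + i_t * c_tilde                  # cell state
    h_t = o_t * np.tanh(c_t)                            # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_h, d_in), (d_h, d_h), (d_h,)] * 4]
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):   # run over a length-5 sequence
    h, c = lstm_step(x_t, h, c, params)
print(h.shape)
</syntaxhighlight>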
 
 
;Bidirectional RNNs
First LSTM takes input in correct order. 
Second LSTM takes input in reverse order. 
Concatenate outputs from both LSTMs.
 
===Attention===
Goal: To help memorize long source sentences in machine translation.
 
;Encoder-decoder attention
 
;Self-Attention
 
===Transformer===
;Positional encoding
 
;Self-Attention
Have queries, keys, and values. 
Multiply queries with keys, pass the result through a softmax, and then multiply by the values. 
This yields the attention of every word with respect to every other word. 
The original transformer used 8 heads, giving 8 sets of queries, keys, and values.
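A minimal single-head numpy sketch of this computation (the scaling by <math>\sqrt{d_k}</math> follows the original transformer; the shapes are illustrative assumptions):

<syntaxhighlight lang="python">
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X (n x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # query-key similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)             # softmax: attention weights
    return A @ V                                  # weighted sum of the values

rng = np.random.default_rng(0)
n, d, d_k = 6, 16, 8          # 6 tokens, model dim 16, head dim 8 (assumptions)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (6, 8): one output vector per token

# Multi-head attention repeats this with (e.g.) 8 independent Wq, Wk, Wv triples
# and concatenates the per-head outputs.
</syntaxhighlight>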
 
;Architecture
Stack encoders.
 
==Interpretability==
 
;Interpretability Methods
* Built-in model interpretability
* Feature level interpretability
* Instance based explanations
We will focus on feature level interpretability.
 
===Feature Level Interpretability===
These are given through saliency maps. 
* Perturbation-based: Perturb the input to get another output and compute the difference. 
* Gradient-based
 
===Gradient-based Methods===
Take the derivative of the output with respect to the input.
 
;Limitations
* Too local and sensitive to slight perturbations
* Saturated outputs lead to unintuitive gradients
* Discontinuous gradients are problematic
 
;SmoothGrad
* Add gaussian noise to input and average the gradient.
 
;Integrated Gradients
* Average the gradients along a path from a baseline to the input (see the sketch below).
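A minimal numpy sketch of integrated gradients (and, for comparison, SmoothGrad) on a toy differentiable model; the sigmoid model, zero baseline, and 50 steps are my own illustrative assumptions, not part of the lecture.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)

def f(x):
    return 1.0 / (1.0 + np.exp(-w @ x))    # toy model standing in for a network

def grad_f(x):
    s = f(x)
    return s * (1.0 - s) * w               # analytic gradient of the toy model

def integrated_gradients(x, baseline, steps=50):
    """Average the gradient along the straight path from baseline to x,
    then scale by (x - baseline)."""
    alphas = np.linspace(0.0, 1.0, steps)
    avg = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg

def smoothgrad(x, sigma=0.1, n=50):
    """Average the gradient over Gaussian perturbations of the input."""
    return np.mean([grad_f(x + sigma * rng.normal(size=x.shape)) for _ in range(n)], axis=0)

x = rng.normal(size=10)
ig = integrated_gradients(x, baseline=np.zeros(10))
# completeness check: attributions approximately sum to f(x) - f(baseline)
print(ig.sum(), f(x) - f(np.zeros(10)))
print(smoothgrad(x))
</syntaxhighlight>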
 
;DeepLift
* We care not about the gradient itself but about the slope relative to a ''reference'' state.
 
;Limitations
* Models must be able to compute the gradient of the output with respect to the input
* Interpretation of neural networks is fragile
** Saliency maps are uninterpretable for adversarial examples on clean models and adversarially trained models.
* Needs white-box gradient access to the model.
 
===Evaluation of interpretability methods===
* Human evaluation
** Can humans evaluate saliency?
* Accuracy drop after removing ''salient'' features
* Sanity checks
** Model parameter randomization test - compare the output of the saliency method on a trained vs. an untrained model to make sure saliency depends on the model parameters.
* Synthetic Data
* Data randomization test
** Train on random labels and see if saliency depends on relationship between input & output.
 
;Temporal Saliency Rescaling
* If you remove this feature, how does the gradient change?
 
==Reinforcement Learning==
Lecture 27 (December 1, 2020)
 
;Goal: Sequential decision making problems
* Current decisions can affect future decisions
Examples: games, robotics, self-driving, finance
 
The agent takes an action <math>a_t</math> in the environment, which leads to a new state <math>s_{t+1}</math>. 
 
;Definitions
* <math>S</math> is the state space. This can be discrete or continuous.
* <math>O</math> are the observations.
* <math>A</math> is the set of actions which can be discrete or continuous.
* <math>P</math> are the transition probabilities which model the environment.
 
Example:
* <math>|S|</math> = 2
* <math>|A|</math> = 2
In general, <math>S_t \sim P(S_t=s | s_0, a_0, s_1, a_1, ..., a_{t-1})</math>. 
The given variables are the trajectory <math>T_{t-1}</math>. 
Modeling this can be very complex for large <math>T</math>. 
Thus, we apply a Markov assumption:
* <math>S_t \sim P(S_t=s | a_{t-1}, s_{t-1})</math>
 
Objective: To maximize the rewards
* <math>R_t \stackrel{\triangle}{=} R(s_t, a_t)</math>
Two regimes for rewards
* Finite horizon; T is finite
* Infinite horizon; T goes to infinity
* Discount factor
 
===Classical RL===
Finite Markov Decision Process (MDP)
* S: Finite number of states
* A: Finite number of actions
* <math>R_t = R(s_t, a_t)</math>
* <math>\gamma</math>: discount factor
 
Goal: Choose actions to maximize the total rewards. 
* Policy function <math>\pi</math> determines how actions should be taken.
* The policy function could yield a distribution over actions or it can be deterministic. We will focus on deterministic policies.
* Assume this is time invariant.
* Value function determines the expected reward if we start at state s and follow policy <math>\pi</math>.
* Q function (action-value function)
** <math>Q_\pi (s, a) = E[R_0 + \gamma R_1+... | A_0=a, S_0=s, \pi]</math>
* If we have <math>\pi</math>, can we evaluate <math>V_{\pi}(s)</math>?
** <math>V_{\pi}(s) = E[R_0 + \gamma R_1 + ... | S_0 = s, \pi]</math>
* Use Bellman's equation
** <math>V_{\pi}(s) = R_0 + \gamma \sum_{s'} P_{\pi}(s' | s) V_{\pi}(s')</math>
** <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1}R</math>
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric so eigenvalues can be complex. 
All eigenvalues satisfy <math>|\lambda_i| \leq 1</math>, with at least one eigenvalue of norm exactly 1. 
Since <math>\gamma < 1</math>, this implies <math>I - \gamma P_{\pi}</math> is always invertible. 
However, computing the inverse is computationally expensive since it scales as <math>O(|S|^3)</math>.
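For a small MDP both routes are easy to check numerically. The sketch below (a toy random MDP of my own) compares the closed-form solution with repeated application of the Bellman operator:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 4, 0.9
# Random row-stochastic transition matrix under a fixed policy (toy MDP)
P_pi = rng.random(size=(S, S))
P_pi /= P_pi.sum(axis=1, keepdims=True)
R = rng.random(size=S)                        # reward per state

# Closed form: V_pi = (I - gamma * P_pi)^{-1} R   (O(|S|^3))
V_closed = np.linalg.solve(np.eye(S) - gamma * P_pi, R)

# Fixed-point iteration: repeatedly apply L_pi v = R + gamma * P_pi v
V = np.zeros(S)
for _ in range(500):
    V = R + gamma * P_pi @ V

print(np.max(np.abs(V - V_closed)))           # both give the same value function
</syntaxhighlight>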
 
===Value-Iteration (Bellman Recursion)===
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math> 
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math> 
<math>v_{\pi} = L_{\pi} v_{\pi}</math> so <math>v_{\pi}</math> is a fixed point of <math>L_{\pi}</math>. 
 
;Claim: <math>L_{\pi}</math> is a contraction.
Proof: Take <math>v_1, v_2 \in \mathbb{R}^d</math>. 
<math>
\begin{aligned}
L_{\pi}v_1 - L_{\pi}v_2 &= (R + \gamma P_{\pi}v_1) - (R + \gamma P_{\pi}v_2) \\
&= \gamma P_{\pi}(v_1 - v_2)
\end{aligned}
</math> 
<math>
\begin{aligned}
\Vert L_{\pi}v_1 - L_{\pi}v_2 \Vert &= \Vert \gamma P_{\pi}(v_1 - v_2) \Vert\\
& \leq \gamma \Vert P_{\pi} \Vert \Vert v_1 - v_2 \Vert \\
& \leq \gamma \Vert v_1 - v_2 \Vert
\end{aligned}
</math> 
Since <math>\gamma < 1</math>, <math>L_{\pi}</math> is a contraction, so by Banach's Fixed Point Theorem we can converge to the fixed point by repeatedly applying <math>L_{\pi}</math>.
 
===Optimal Policy===
Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>.
 
Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>. 
The policy may not be unique but if two policies are equal, they have the same value function.
 
Intermediate questions:
* Given <math>v^*</math>, can we compute <math>Q^*</math>?
<math>
\begin{aligned}
Q^*(s, a) &= \max_{\pi} Q(s,a)\\
&= \max_{\pi} R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s')\\
&= R(s) + \gamma \sum_{s'} P(s' | s, a) \max_{\pi}(v_{\pi}(s'))\\
\end{aligned}
</math>
 
* Given <math>Q^*</math>, can we compute <math>v^*</math>?
Bellman's optimality states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>. 
<math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>
 
;How to compute the optimal policy?
First approach: Value Iteration
<math>V^*(s) = \max_{a} Q^*(s, a) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^*(s')]</math> 
Define an operator: <math>L(v) = \max_{\pi}[R + \gamma P_{\pi} v]</math> 
<math>v^*</math> is a fixed point of <math>L(v)</math>. 
Claim: <math>L</math> is also a contraction. 
Thus <math>v^*</math> can be computed by repeated application of <math>L</math>. 
 
===Value Iteration===
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{r}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math> (see the sketch below)
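A minimal numpy sketch of value iteration on a toy random MDP (states, actions, rewards, and the stopping tolerance are illustrative assumptions):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 5, 3, 0.9, 1e-8
P = rng.random(size=(A, S, S))                 # P[a, s, s'] for a toy random MDP
P /= P.sum(axis=2, keepdims=True)
R = rng.random(size=S)                         # reward R(s)

v = rng.random(size=S)                         # random v^(0)
while True:
    # v^(r+1)(s) = max_a [ R(s) + gamma * sum_s' P(s'|s,a) v^(r)(s') ]
    v_new = (R + gamma * P @ v).max(axis=0)
    if np.linalg.norm(v_new - v) <= eps:
        break
    v = v_new

pi_star = (R + gamma * P @ v).argmax(axis=0)   # greedy policy extracted from v*
print(v, pi_star)
</syntaxhighlight>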
 
===Policy Iteration===
Directly optimize policy.
 
* Start with a random policy <math>\pi^{(0)}</math>
* Evaluate the policy <math>v_{\pi}</math> using closed form or value iteration
* Evaluate the Q-Function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s'|s,a) V_{\pi}(s')</math>
* Find the best action to be taken at state <math>s</math>:
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math>
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math>
* Repeat until convergence.
 
Policy iteration is guaranteed to converge to an optimal policy. 
It often converges faster than value iteration.
 
===Deep Reinforcement Learning===
 
;Relaxing some unrealistic assumptions
# Evaluate <math>v_{\pi}(s)</math>
#* <math>Q_{\pi}(s, a) = R(s, a) + \gamma E_{s' \sim P(s'|s,a)} v_{\pi}(s')</math>
# Improve the policy
#* <math>\operatorname{argmax}_{a_t} Q_{\pi}(s_t, a_t)</math>
#* Assumption <math>|S|=d</math>
# How to represent <math>V(s)</math>?
#* Can we use a neural network to represent <math>V(s)</math>? Yes
 
;How to train <math>v_{\phi}</math>?
* Start with an old <math>v_{\phi}</math>, compute <math>Q_{\pi}(s,a)</math>. 
** <math>Q_{\pi}(s,a) = R(s,a) + \gamma E[v_{\phi}^{old}(s')]</math> 
* Fit <math>v_{\phi}</math> to <math>\max_{a}Q_{\pi}(s,a)</math> using a quadratic loss: 
** <math>L(\phi) = \frac{1}{2} \Vert V_{\phi}(s) - \max_{a} Q_{\pi}(s,a) \Vert^2</math> 
* Iterate
 
;Similarly we can parameterize the Q function
Compute the target <math>y_i \leftarrow R(s_i, a_i) + \gamma E[v_{\phi}(s_i')]</math>. 
We need to know transition probabilities <math>P(s'|s,a)</math> to compute the expectation (model-based RL).
 
With model-free RL: 
We can approximate <math>E[v(s_i')] \approx v(s_i') = \max_{a'} Q(s_i', a')</math>. 
This is called ''Q-Learning''.
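A tabular sketch of the resulting update on a toy MDP: the agent only samples transitions and never uses <math>P(s'|s,a)</math> directly. The environment, learning rate, and epsilon-greedy exploration are illustrative assumptions.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, lr, eps = 5, 2, 0.9, 0.1, 0.1
P = rng.random(size=(S, A, S)); P /= P.sum(axis=2, keepdims=True)  # hidden dynamics
R = rng.random(size=(S, A))                                        # reward R(s, a)

Q = np.zeros((S, A))
s = 0
for _ in range(20000):
    # epsilon-greedy action selection
    a = rng.integers(A) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(S, p=P[s, a])            # sample from the environment
    # model-free target: r + gamma * max_a' Q(s', a'), no access to P needed
    target = R[s, a] + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])
    s = s_next

print(Q.argmax(axis=1))    # greedy policy learned from samples
</syntaxhighlight>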
 
;What if we have continuous actions:
* Approach 1: Use a function class such that <math>\max_{a}Q(s,a)</math> is easy to solve
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer: <math>\operatorname{argmax}_{a'} Q(s,a')</math>
 
===Policy Gradient Method===
Lecture 29 (Dec 8, 2020) 
 
Probability of observing a trajectory: 
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math> 
Reward for a trajectory: 
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + ... + R(s_T, a_T)</math> 
The average reward is: 
<math>J(\theta) = E[R(\tau)] = \sum E[R(s_t, a_t)]</math>
 
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>.
 
Gradient of the average reward: 
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau \\
&= \int P_{\theta}(\tau) \nabla_{\theta} \log P_\theta(\tau) R(\tau) d\tau\\
&= E[\nabla_{\theta} \log P_\theta(\tau) R(\tau)]
\end{aligned}
</math>
 
<math>
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math> 
Implies, 
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})\right) \left(\sum_{t=1}^{T} R(s_t^{(i)}, a_t^{(i)})\right)
\end{aligned}
</math>
 
;Summary
* Sample trajectories
* Approximate <math>\nabla_{\theta} J(\theta)</math>
* <math>\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)</math> (see the sketch below)
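A compact sketch of this procedure on a toy MDP with a tabular softmax policy; the MDP, horizon, batch size, and step size are illustrative assumptions, and no baseline or reward-to-go is used yet.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, T, N, lr = 3, 2, 10, 64, 0.05
P = rng.random(size=(S, A, S)); P /= P.sum(axis=2, keepdims=True)  # toy MDP
R = rng.random(size=(S, A))
theta = np.zeros((S, A))                       # tabular softmax policy parameters

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

for _ in range(500):
    grad = np.zeros_like(theta)
    for _ in range(N):                         # sample N trajectories
        s, glp, ret = 0, np.zeros_like(theta), 0.0
        for t in range(T):
            p = policy(s)
            a = rng.choice(A, p=p)
            glp[s] += np.eye(A)[a] - p         # grad log pi(a|s) for a softmax policy
            ret += R[s, a]
            s = rng.choice(S, p=P[s, a])
        grad += glp * ret                      # (sum_t grad log pi) * (sum_t reward)
    theta += lr * grad / N                     # gradient ascent on J(theta)

print(policy(0), policy(1), policy(2))         # learned action probabilities per state
</syntaxhighlight>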
 
;Intuition
<math>E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math> 
Formalizing ''trial & error''. 
[Finn & Levine, ICML]
 
===Issues with Policy Gradient===
;High variance of gradient estimation
 
;Solutions
* Subtract a baseline
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
\end{aligned}
</math>
 
;Some parameters can change <math>\pi_{\theta}</math> more than others, so it's hard to choose a fixed learning rate.
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math>F</math> is the Fisher information matrix.
 
===Actor-critic algorithms===
Have an actor <math>\pi_{\theta}</math>. 
Have a critic <math>V_{\phi}</math> (or <math>Q</math>).
 
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
 
===Other topics in RL===
* Inverse RL
* Multi-agent RL
* Model-based RL
 
==Summary of Course==
;What we covered
* Supervised DL
* Unsupervised DL (GANs, VAEs)
* Self-supervised DL
* Meta-Learning
* Learning with Attention (Transformers)
* Deep RL
 
* Optimization
* Generalization
* Robustness
* Interpretability
 
;What we didn't cover
* Fairness
* Privacy & Ethics
* Bayesian DL
* Federated Learning
* Graph NNs
 
;Things which may be on the final
* Transformers
* Wasserstein distance
* Kernel methods


==Resources==