Deep Learning

* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
* [https://www.youtube.com/user/soheilfeiz/videos Lecture Videos]


==Basics==
A refresher of [[Machine Learning]] and [[Supervised Learning]].


===Empirical risk minimization (ERM)===

With <math>L(w) = f(w) = \frac{1}{2}\Vert F(w)-y \Vert^2</math> and <math>K(w) = \nabla F(w) \nabla F(w)^T</math>, the gradient satisfies
<math>
\begin{aligned}
\frac{1}{2}\Vert \nabla f(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)
\end{aligned}
</math>
Line 502: Line 503:


===Linear Regression===
Assume we have a dataset:<br>
<math>\{(x_i, y_i)\}_{i=1}^{n}</math><br>
<math>y_i \in \mathbb{R}</math><br>
<math>x_i \in \mathbb{R}^d</math><br>
<math>f(w, x) = w^t x</math>

<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br>
<math>\min_{w} L(w)</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w_t)</math> where our gradient is:<br>
<math>\nabla L(w_t) = -\sum_{i=1}^{n}(y_i - f(w_t, x_i)) \nabla_{w} f(w_t, x_i) = -\sum_{i=1}^{n}(y_i - f(w_t, x_i)) x_i</math>
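As a quick self-contained illustration (not from the lecture), here is a minimal NumPy sketch of this gradient-descent update on synthetic data; the data, step size, and iteration count are arbitrary choices.

<syntaxhighlight lang="python">
import numpy as np

# Synthetic data: y_i = w_true^T x_i + noise (illustrative values only).
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss(w):
    # L(w) = 1/2 * sum_i (y_i - w^T x_i)^2
    r = y - X @ w
    return 0.5 * np.sum(r ** 2)

def grad(w):
    # gradient of L(w): -sum_i (y_i - w^T x_i) x_i
    return -X.T @ (y - X @ w)

# Gradient descent: w(t+1) = w(t) - eta * grad L(w(t))
w, eta = np.zeros(d), 1e-3
for t in range(2000):
    w = w - eta * grad(w)

print(loss(w))  # should approach the noise floor
</syntaxhighlight>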




<math>f(w, x) = w^t \phi(x)</math><br>
Is this model linear in w? Yes!<br>
Is this model linear in x? No!<br>

<math>\min \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math><br>
Apply GD or convex optimization.


* <math>\phi</math> is fixed!
* <math>D = O(d^k)</math>
** For ImageNet, <math display="inline">d</math> is approx <math display="inline">10^5</math> so <math display="inline">D=O(10^{15})</math>


;Kernel Trick:
We may have a closed form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.<br>
This is called the kernel function <math>K(x_i, x_j)</math> or kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.<br>
<math display="inline">K</math> is a PSD matrix.

Idea: In many cases, we can compute <math>K(x_i, x_j)</math> without ''explicitly'' computing <math>\phi(x_i)</math>.


;Polynomial Kernels:
<math>K(x_i, x_j) = (c + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math><br>
Here <math>D=O(d^k)</math> but computing <math>K(x_i, x_j)</math> only takes <math>O(d)</math> time.
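A minimal NumPy sketch of the kernel trick for a polynomial kernel; the constant <math>c</math>, degree <math>k</math>, and data sizes below are illustrative choices, not from the lecture.

<syntaxhighlight lang="python">
import numpy as np

def poly_kernel(X1, X2, c=1.0, k=3):
    """Polynomial kernel K(x, x') = (c + x^T x')^k, computed in O(d) per pair
    without ever forming the O(d^k)-dimensional feature map phi(x)."""
    return (c + X1 @ X2.T) ** k

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))     # n = 50 points in d = 10 dimensions
K = poly_kernel(X, X)             # n x n kernel (Gram) matrix

# K should be positive semi-definite (up to numerical error).
print(np.linalg.eigvalsh(K).min())
</syntaxhighlight>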


Many classical techniques can be ''kernelized'':
* SVM to Kernel SVM
* Ridge regression to Kernel ridge regression
* PCA to Kernel PCA


===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w, x_i) - y_i) \nabla_w f(w_t, x_i)</math>


# Initialize weights from N(0,1).
# The weights update along a trajectory: w(0), w(1), ...
# Each <math>w</math> is a weight matrix.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training.


* Not always the case! Especially for small <math>m</math>.


Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0) + ...</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).


These features will not change during the optimization process because we use <math display="inline">w_0</math>.

Go back to our 2-layer NN:<br>
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum b_i \sigma(a_i^t x)</math><br>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>


<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^tx) \sigma'(a_i^tx') (x x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>
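The empirical kernel <math>K_m</math> can be computed directly from these gradient formulas. Below is a NumPy sketch for a randomly initialized two-layer ReLU network; the width, input dimension, and inputs are arbitrary.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 5000                      # input dimension, network width
a = rng.normal(size=(m, d))          # a_i ~ N(0, I_d)
b = rng.normal(size=m)               # b_i ~ N(0, 1)

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

def empirical_ntk(x, xp):
    """K_m(x, x') = K_m^(a) + K_m^(b) for f_m(w, x) = (1/sqrt(m)) sum_i b_i sigma(a_i^t x)."""
    za, zb = a @ x, a @ xp
    K_a = np.mean(b ** 2 * drelu(za) * drelu(zb)) * (x @ xp)
    K_b = np.mean(relu(za) * relu(zb))
    return K_a + K_b

x, xp = rng.normal(size=d), rng.normal(size=d)
print(empirical_ntk(x, xp))   # approaches the limiting kernel as m grows
</syntaxhighlight>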


* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
Based on the law of large numbers, as <math>m</math> goes to infinity,<br>
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x x') \right]</math><br>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>


<math>K^{(a)}(x, x') = \frac{(x x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta(x, x')) \cos(\theta) + \sin \theta)</math>


;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues. (Hessian Control)


;Analyze GD:
<math>\eta \to 0</math> gives gradient flow:<br>
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br>
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>


<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math><br>
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w} \hat{y}(w)^T \frac{dw(t)}{dt}\\
&= -\nabla_{w} \hat{y}(w)^T \nabla_{w} \hat{y}(w) (\hat{y}(w) - y)\\
&= -K(w_t) (\hat{y}(w) - y)
\end{aligned}
</math>

If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>, since <math>K(w_t) \approx K(w_0)</math> in the lazy regime.<br>
This ODE implies <math>u(t) = \exp(-K(w_0)t) u(0)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0 </math> (positive definite).


An <math>f</math> function which is element-wise would have a diagonal Jacobian. This is not very expressive.

===Upper-triangular Jacobian===
RealNVP by [Dinh et al] considers an upper-triangular Jacobian.
In this case, the determinant is still the product of the diagonal entries:
<math>
J = \begin{pmatrix}
I_d & *\\
0 & \operatorname{diag}(S_\theta)
\end{pmatrix}
</math>
where <math>*</math> denotes the block coupling the two partitions, and <math>det(J) = \prod_i (S_\theta)_i</math>.
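A minimal NumPy sketch of a RealNVP-style affine coupling layer, illustrating why the log-determinant reduces to the sum of the scales; the two-way split and the toy scale/translation functions are illustrative stand-ins for the real networks.

<syntaxhighlight lang="python">
import numpy as np

def coupling_forward(x, scale_fn, translate_fn):
    """Affine coupling: keep x1 fixed, transform x2 elementwise.
    The Jacobian is triangular, so log|det J| = sum(s)."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s, t = scale_fn(x1), translate_fn(x1)
    z2 = x2 * np.exp(s) + t
    log_det = np.sum(s, axis=-1)
    return np.concatenate([x1, z2], axis=-1), log_det

def coupling_inverse(z, scale_fn, translate_fn):
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    s, t = scale_fn(z1), translate_fn(z1)
    x2 = (z2 - t) * np.exp(-s)
    return np.concatenate([z1, x2], axis=-1)

# Toy stand-ins for the scale/translation networks (any functions of x1 work).
scale_fn = lambda h: 0.1 * h
translate_fn = lambda h: h ** 2

x = np.random.default_rng(0).normal(size=(4, 6))
z, log_det = coupling_forward(x, scale_fn, translate_fn)
x_rec = coupling_inverse(z, scale_fn, translate_fn)
print(np.allclose(x, x_rec), log_det)   # invertible, with cheap log-determinant
</syntaxhighlight>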
Is this expressive enough? No.
 
===Composition of transformations===
<math>f = f_1 \circ f_2 \circ ... \circ f_k</math> 
A sequence of invertible transformations is called a ''normalizing flow''.
 
Here, we have the following properties
* <math>(f_1 \circ f_2)^{-1} = f_2^{-1} \circ f_1^{-1}</math>
* <math>\nabla(f_2 \circ f_1)(x) = \nabla f_2(f_1(x)) \nabla f_1(x)</math>
* <math>det(J_1 J_2) = det(J_1) det(J_2)</math>
 
RealNVP uses two types of partitioning
* Checkerboard
* Channel partitioning
 
In Glow [Kingma et al], they use an invertible 1x1 convolution, which is equivalent to a matrix multiplication. 
Use <math>W = P L (U+diag(s))</math> so that <math>\log | det(W)| = \sum \log |s_i|</math>.
 
===Dequantization===
Fitting a ''density'' function to discrete data can produce arbitrarily sharp peaks. 
For dequantization, add uniform noise to the data to get a more stable density function.
 
<math>
\begin{aligned}
\log \int p(x_i +\delta) p(\delta)\,d\delta &= \log E_\delta [p(x_i + \delta)]\\
&\geq E_{\delta}[\log p(x_i + \delta)] \quad \text{(Jensen's inequality)}\\
&\approx \log p(x_i + \delta)
\end{aligned}
</math>
 
* We have exact likelihood estimates.
** In principle, we can use the likelihood for out-of-distribution anomaly detection.
 
However, in practice, a model trained on CIFAR assigns higher likelihood to MNIST. 
This behavior is not specific to flow-based models. 
Suppose <math>P_{\theta}</math> is <math>N(0, I_d)</math>. 
Our typical example has <math>\Vert x_i \Vert^2 = O(d)</math>. 
Consider <math>x^{test} = 0</math> then <math>P_{\theta}(x^{test}) > P_{\theta}(x_1)</math>.
 
==Domain Adaptation==
So far, we have a training set <math>\{(x_i^{(train)}, y_i^{(train)})\}</math> from distribution <math>Q_{X,Y}</math>. 
We learn optimal parameters <math>\theta^*</math> via ERM. 
Then at test time, our test samples come from the same distribution <math>Q_{X,Y}</math>. 
However in practice, the training distribution can be different from the test distribution.  
The training distribution is the source domain. The test distribution is the target domain. 
 
;Examples
Q may be synthetic samples and P may be real samples. 
Q contains samples with a white background but P has samples with real backgrounds.
 
In training: 
For the source domain, we have labeled samples <math>\{(x_i^S, y_i^S)\}_{i=1}^{m_S} \sim Q_{X,Y}</math>. 
For the target domain, we only have unlabeled samples <math>\{x_i^t\} \sim P_{X}</math>. 
This is ''unsupervised'' domain adaptation.
 
In ''semi-supervised'' domain adaptation, your target samples are mostly unlabeled but contain a few labeled samples.
 
If no target samples are available during training, the scenario is called ''domain generalization'' or ''out-of-distribution'' generalization.
 
===Unsupervised domain adaptation===
Given <math>m_s = m_t = m</math>. 
* m labeled samples from source domain Q
* m unlabeled samples from target P
 
* This problem is ill-defined
 
;Practical assumptions.
# Covariate shift: P and Q satisfy the covariate shift assumption if the conditional label distribution doesn't change between source and target.
#* I.e. <math>P(y|x) = Q(y|x)</math>
# Similarity of source and target marginal distributions.
# If we had labeled target samples, the joint error (over source + target samples) should be small.
 
[Ben-David et al.] consider binary classification. 
;H-divergence
<math>2 \sup_{h \in H}|Pr_{x \sim Q_{X}}(h(x)=1) - Pr_{x \sim P_{X}}(h(x)=1)| = d_{H}(Q_X, P_X)</math>
 
;Lemma
<math>d_{H}(Q_X, P_X)</math> can be estimated by m samples from source and target domains. 
If <math>VC(H)=d</math>, then with probability <math>1-\delta</math>: 
<math>d_{H}(Q_X, P_X) \leq d_{H}(Q_{X}^{(m)}, P_{X}^{(m)}) + 4 \sqrt{\frac{d \log(2m) + \log(2/\delta)}{m}}</math>
 
# Label source examples as 1 and label target samples as 0.
# Train a classifier to classify source and target.
# If loss is small then divergence is large. <math>\frac{1}{2} d_{H}(Q_X^{(m)}, P_X^{(m)}) = 1-loss_{class}</math>
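A sketch of this procedure with a plain logistic-regression domain classifier written in NumPy; the Gaussian source/target data and the training details are illustrative, and in practice one would use held-out samples and a richer hypothesis class.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
m, d = 500, 5
source = rng.normal(loc=0.0, size=(m, d))   # samples from Q_X, labeled 1
target = rng.normal(loc=1.0, size=(m, d))   # samples from P_X, labeled 0

X = np.vstack([source, target])
y = np.concatenate([np.ones(m), np.zeros(m)])

# Train a linear domain classifier with logistic regression (plain gradient descent).
w, b = np.zeros(d), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
err = np.mean((p > 0.5) != y)               # classification error ("loss_class")
print("proxy for (1/2) d_H(Q, P):", 1.0 - err)
</syntaxhighlight>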
 
===Recap===
Beginning of Lecture 18 (Oct. 29, 2020)
 
Given labeled examples from the source domain: <math>Q_{X,Y} = \{(x_i^S, y_i^S)\}_{i=1}^{m_s}</math>. 
Target domain: <math>P_{X} = \{x_i^t\}_{i=1}^{m_t}</math>. 
Learn a function <math>h \in H</math>. 
<math>\epsilon^T(h) = E_{(x,y) \sim P_{X,Y}}[ l(h(x), y) ]</math>.
 
H-divergence: 
<math>d_H(Q_X, P_X) = 2\sup_{h \in H} | P_{Q}(h(x)=1) - P_{P}(h(x)=1)| = 2(1-loss_{classifier})</math>. 
This can be estimated by training a classifier to distinguish between train and test samples.
 
Def: 
For the hypothesis class H, the ''symmetric difference hypothesis space'' <math>H \triangle H</math> is the set of disagreements between any two hypotheses in H: 
<math>H \triangle H = \{g(x) = h(x) \oplus h'(x) \mid \forall h, h' \in H\}</math>.
 
;Main Result
<math>H</math> is a hypothesis class with <math>VC(H)=d</math>. 
With probability <math>1-\delta</math>, <math>\forall h \in H</math>: 
<math>\epsilon_{T}(h) \leq \epsilon_{S}(h) + \frac{1}{2} d_{H \triangle H}(Q_X^{(m)}, P_X^{(m)}) + \epsilon_{joint}</math>. 
That is, target error ≤ source error + divergence + joint error.
 
===Practical Domain Adaptation Methods===
 
;Classical DA methods (pre deep learning)
* Metric Learning
* Subspace alignment
* MMD-based distribution matching
* Sample reweighting & selection
 
;Modern DA methods (i.e. deep domain adaptation)
 
[https://arxiv.org/abs/1409.7495 Ganin & Lempitsky] 
The idea is to train an embedding function using an adversarial domain classifier to extract common features from the source and target domains.
* Input <math>x</math> goes to an embedding function <math>F</math> to get features. 
* Features go to a classification network <math>C_1</math> to get labels.
* Features also go to a domain classifier <math>C_2</math>.
* Training: <math>\min_{F, C_1} \max_{C_2} E_{(x, y) \sim Q_{X,Y}^{(m)}} [\ell(C_1 \circ F(x), y)] - \lambda L(C_2)</math>, where <math>L(C_2)</math> is the domain-classification loss.
 
* In general, we want to find a mapping (embedding) <math>F</math> such that <math>F(Q_X) \approx F(P_X)</math>.
*: The domain classifier penalizes the distance between <math>F(Q_X)</math> and <math>F(P_X)</math>.
 
Example 1: MMD distance (Maximum mean discrepancy) 
Define <math>\tilde{x}_i = F(x_i)</math>. 
<math>D_{MMD}(Q^{(m)}_{\tilde{x}}, P^{(m)}_{\tilde{x}}) \stackrel{\triangle}{=} \Vert \frac{1}{m}\sum \phi(\tilde{x}_i^S) - \frac{1}{m}\sum \phi(\tilde{x}_i^T) \Vert</math> 
Here <math>\phi: \mathbb{R}^r \to \mathbb{R}^D</math> is a fixed feature map. 
We square D to apply the kernel trick: 
<math>D^2_{MMD}(Q^{(m)}_{\tilde{x}}, P^{(m)}_{\tilde{x}}) = \frac{1}{m^2}\left( \sum_{i,j=1}^{m}K(x_i^s, x_j^s) + \sum_{i,j=1}^{m}K(x_i^t, x_j^t) - 2 \sum_{i,j=1}^{m}K(x_i^s, x_j^t) \right)</math> 
MMD-based DA (Tzeng ''et al.'' 2014): 
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda D^2_{MMD}(F(x^s), F(x^t))</math>
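A NumPy sketch of the (biased) squared-MMD estimate using an RBF kernel as the fixed kernel; the bandwidth <math>\gamma</math> and the synthetic features are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=0.5):
    """Biased estimate of D^2_MMD between source and target features."""
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2 * rbf_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 8))   # F(x^s): source features
Xt = rng.normal(0.5, 1.0, size=(200, 8))   # F(x^t): target features
print(mmd2(Xs, Xt))                        # larger when the feature distributions differ
</syntaxhighlight>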
 
Example 2: Wasserstein distance 
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda W(F(x^s), F(x^t))</math> 
The Wasserstein distance is computed using the Kantorovich duality. 
This is also called an IPM (integral probability metric) distance.
 
* We can also use improved & robust versions of the Wasserstein distance in DA.
** Robust Wasserstein [Balaji ''et al.'' Neurips 20]
** Normalized Wasserstein [....ICCV]
 
===CycleGAN===
[Zhu et al 2017] 
Another approach for image-to-image translation. 
 
Source: <math>(x^s, y^s)</math> 
Target: <math>x^t</math> 
Train two functions: <math>G_{S \to T}</math> and <math>G_{T \to S}</math>. 
Losses:
* <math>L_{GAN}(x^s, x^t, G_{S\to T}, D^T) = E_{x^t}\left[\log D^T(x^t)\right] + E_{x^s}\left[\log(1-D^T(G_{S\to T}(x^s))) \right]</math>
* <math>L_{GAN}(x^s, x^t, G_{T \to S}, D^S)</math> defined similarly
* Cycle consistency: <math>L_{cyc} = E\left[ \Vert G_{T\to S}(G_{S \to T}(x^s)) - x^s \Vert \right] + E \left[ \Vert G_{S \to T}(G_{T\to S}(x^t)) - x^t \Vert \right]</math>
 
Other tricks:
* Domain-specific batch norms
* Entropy based regularization
 
===Are assumptions necessary?===
Assumptions:
* Covariate shift
* <math>d_H(Q_x, P_x)</math> is small
* <math>\epsilon_{joint}</math> small
 
See [Ben-David ''et al.'']. 
* Covariate shift assumption is not sufficient.
* Necessity of small <math>d_{H}(P,Q)</math> for DA.
* Necessity of small joint training error.
 
===Domain Generalization===
Also known as out-of-dist (OOD) generalization.
 
Training: <math>|E|</math> training domains (environments) 
<math>\{(x_i^e, y_i^e)\}_{i=1}^{m_e} \sim P^{(e)}</math> with <math>1 \leq e \leq |E|</math>. 
Goal: Find <math>h \in H</math> that performs well on an unseen domain (domain <math>k+1</math>, where <math>k = |E|</math>). 
At test time, samples come from the unseen domain: <math>\{(x_i^{(k+1)}, y_i^{(k+1)})\}_{i=1}^{m_{k+1}} \sim P^{(k+1)}</math>. 
The target risk is <math>R^{(k+1)}(h) = E_{(x,y) \sim P^{(k+1)}}[\ell(h(x), y)]</math>.
 
;Example datasets
* [http://ai.bu.edu/M3SDA/ DomainNet]
** 6 domains: painting, photo, sketches, drawings, clipart, infograph
* PACS
** 4 domains: art, cartoon, photo, sketch
* Some Simpler datasets
** Rotated MNIST
** Color MNIST
 
One baseline method is to do nothing special: just train normally with ERM.
 
===Domain-adversarial neural networks (DANN)===
* Train a feature extractor <math>\phi</math> and a classifier <math>w</math> to yield <math>f=w \circ \phi</math>.
* Domain classifier <math>c</math>.
* <math>loss = \frac{1}{k} \sum_{j=1}^{k} E[\ell(w \circ \phi(x), y)] + \lambda L(domain\_classification)</math>.
* We optimize <math>\min_{\phi, w} loss</math>
* We can use Wasserstein distance:
** <math>loss = \frac{1}{k} \sum_{j=1}^{k} E[\ell(w \circ \phi(x), y)] + \lambda \sum_{j_1, j_2} W(P_{X|Y}^{j_1}, P_{X|Y}^{j_2})</math>
** This is solved using alternating gradient descent.
 
===Meta Learning for Domain Generalization===
[Li ''et al.'' 2017]
 
Idea: Split the training domains into meta-train and meta-test domains. 
;Meta-train
* Loss: <math>L_{\text{meta-train}}(\theta) = \frac{1}{K-K_1} \sum_{j} E[\ell(f_{\theta}(x), y)]</math> over the meta-train domains
* Take one gradient step to update the model:
** <math>\theta' = \theta - \eta \nabla L_{\text{meta-train}}(\theta)</math>
 
Overall objective:
* <math>\min_{\theta} L_{\text{meta-train}}(\theta) + \beta L_{\text{meta-test}}(\theta')</math>
 
To update <math>\theta</math>, we need the gradient of the overall objective, which involves the Hessian w.r.t. <math>\theta</math> (through <math>\theta'</math>). This can be computed using a ''Hessian-vector product'' without explicitly forming the Hessian, which could be very large.
 
===Invariant Risk Minimization (IRM)===
[Arjovsky ''et al.'' (2019)]
 
Idea: Find a feature extractor <math>\phi()</math> such that the optimal classifier on top of it is the same for every domain. 

Define: <math>R^e(\phi, w) = E[\ell (w \circ \phi(x), y)]</math>. 
Objective: <math>\min_{\phi, \hat{w}} \frac{1}{k} \sum_e R^e (\phi, \hat{w})</math> s.t. <math>\forall e</math>, <math>\hat{w} \in \operatorname{argmin}_{\beta} R^e(\phi, \beta)</math>
 
This is a bi-level optimization which is difficult to solve. The constraint depends on another optimization.
 
The paper uses a Lagrangian relaxation:
<math>\min_{\phi, \hat{w}} \frac{1}{k} \sum R^e(\phi, \hat{w}) + \lambda \Vert \nabla_{\hat{w}} R^e(\phi, \hat{w}) \Vert^2_2</math>.
 
Argument: If we can solve the optimization, such a function will use only invariant features, since non-invariant features have a different conditional distribution with the label across domains.
 
[Rosenfeld ''et al.'' Oct 2020] 
This argument does not hold in general, as IRM can fail to recover the invariant predictor.
 
===Which method for generalization works the best?===
[Gulrajani & Lopez-Paz] offer the following empirical observations:
* Model selection is critical in domain generalization.
 
* Training-domain validation set
* Leave-one-domain out validation set
* Oracle selection
 
;Data augmentation is important in domain generalization
* random crops, random horizontal flips, random color jitter
* image-to-image neural translation
 
;ERM (doing nothing) outperforms other DG methods!!
* Possible with careful model selection
* Larger models help with domain generalization.
* See DomainBed for code and data.
 
==Self-supervised Learning==
Lecture 21 (November 10, 2020)
 
Given data <math>x</math>, we want to find a good representation <math>f(x)</math>. 
We can use <math>f(x)</math> to solve the classification problem more efficiently (e.g. using linear classifiers).
 
Task 1: Learn a good <math>f(x)</math> from ''unlabeled'' samples. 
Task 2: Use <math>f(x)</math> + a few labels to solve classification problem using linear models.
 
Note that in semi-supervised learning, you have unlabeled examples and a few labelled examples but you know what the task is. 
In self-supervised learning, we use ''structure'' in unlabeled data to create artificial supervised learning problems solved via deep models. 
In this process, the learning method ''hopefully'' will create internal representations for the data <math>f(x)</math> useful for downstream tasks.
 
===Image embedding===
Surprising observation for image embedding: 
[Gidaris ''et al.'' ICLR 2018] + [Zhang et al. 2019] 
# Rotate images and use the angle of rotation as labels (e.g. <math>\theta = 0, 90, 180, 270</math>).
# Train a CNN to predict the rotation angle from images.
# Use <math>f(x)</math> with linear classification models for the true labels.
;Why should <math>f(x)</math> be a good representation for images?
This is an open question.
 
===Contrastive Learning===
[Logeswaran & Lee ICLR 2018] use a text corpus (Wikipedia) to train a deep representation <math>f(x)</math>:  
<math>x, x^{+}</math> are adjacent sentences. 
<math>x, x^{-}</math> are random sentences.
 
Optimization:
<math>\min_{f} E[\log(1 + \exp(f(x)^T f(x^-) - f(x)^T f(x^+)))] \approx E[f(x)^T f(x^-) - f(x)^T f(x^+)]</math>.
This is known as contrastive learning.
 
Sentence embeddings capture human notions of similarities: 
E.g. "the tiger rules this jungle" is similar to "a lion hunts in a forest."
 
;Can we use contrastive learning to obtain powerful data representations for images? 
We need pairs of similar images and dissimilar images.
 
;SimCLR [Chen ''et al.'' 2020]
# Create two correlated views of an image <math>x</math>: <math>\tilde{x}_i</math> and <math>\tilde{x}_j</math>.
#* Random cropping + resize
#* Random color distortion
#* Random Gaussian blur
# Use base encoder (ResNet) to map <math>\tilde{x}_i,\tilde{x}_j</math> to embeddings <math>h_i, h_j</math>.
# Train a projection head <math>g()</math> (one hidden layer MLP) to map h's to z's which maximize the agreement between z's.
# Similarity: <math>sim(z_i, z_j) = \frac{z_i^t z_j}{\Vert z_i \Vert \Vert z_j \Vert}</math>
 
Randomly select <math>N</math> samples and add their augmentations to get 2N samples. 
Compute similarity matrix <math>S \in \mathbb{R}^{2N \times 2N}</math>. 
<math>S_{ij}=\exp(sim(z_i, z_j)) =
\begin{cases}
1 & \text{if }i=j\\
high & \text{if }{j=2k\text{ and }i=2k-1}\\
low & otherwise
\end{cases}
</math>
 
Training is <math>\min_{f,g} L = \frac{1}{N} \sum_{k=1}^{N} \frac{l(2k-1,2k) + l(2k, 2k-1)}{2}</math>.
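A NumPy sketch of this contrastive (NT-Xent style) loss on a batch of <math>2N</math> embeddings, where rows <math>2k</math> and <math>2k+1</math> are the two augmented views of image <math>k</math>; the temperature and placeholder embeddings are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def nt_xent_loss(Z, temperature=0.5):
    """Z has shape (2N, dim); rows 2k and 2k+1 are two views of the same image."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # cosine similarities
    S = Z @ Z.T / temperature                          # 2N x 2N similarity matrix
    np.fill_diagonal(S, -np.inf)                       # exclude self-similarity
    twoN = Z.shape[0]
    pos = np.arange(twoN) ^ 1                          # index of each row's positive pair
    log_prob = S[np.arange(twoN), pos] - np.log(np.sum(np.exp(S), axis=1))
    return -np.mean(log_prob)

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 16))    # N = 4 images, 2 views each (placeholder embeddings)
print(nt_xent_loss(Z))
</syntaxhighlight>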
 
Practical observations:
* Composition of data augmentation is important.
* Larger batch sizes and longer training help self-supervised learning!
* Optimization over the MLP projection head <math>g</math> helps!
** f is able to keep some information useful for the classification (e.g. color or orientation)
Empirical results:
* For ImageNet, top-1 accuracy increases as the number of parameters increases.
* After learning embeddings <math>f</math>, you don't need much labeled data for supervised-training.
 
===Theory of self-supervised learning===
;[Arora ''et al.'' 2019]
 
Modeling of similar pairs:
* <math>x \sim D_{C1}(x)</math>
* <math>x^+ \sim D_{C1}(x)</math> is semantically similar
* <math>x^- \sim D_{C2}(x)</math> negative samples
Downstream task:
* pairwise classification
** nature picks two classes C1, C2
** generate samples from C1 & C2 and evaluate the classification loss
* Assume <math>m \to \infty</math> so just look at population loss
 
Notations:
* <math>L_{sup}(f)</math> is the supervised learning loss with optimal last layer
* <math>L_{un}(f)=E_{(x, x^+) \sim D_{sim}, x^- \sim D_{neg}}[\log(1+\exp(f(x)^T f(x^-) - f(x)^T f(x^+)))]</math> is the unsupervised (logistic) contrastive loss
 
Result 1:
* <math>L_{sup}^{mean}(f) \leq \frac{1}{1-\tau}(L_{un}(f)-\tau)</math> for all <math>f \in F</math>
** <math>\tau</math> is the collision probability for a pair of random classes.
 
Ideal result:
* <math>L_{sup}(f^*) \leq \alpha L_{sup}(f)</math> for all <math>f</math>.
* In general, it is impossible to get this.
 
 
;[Tosh et al. 2020]
This work connects self-supervised learning with multi-view representation theory. 
Start with (x, z, y) where x,z are two views of the data and y is the label. 
* x and z should share redundant information w.r.t. y. I.e. predicting y from x or z individually should be equivalent to predicting y from both.
* <math>e_x = E(E[Y|X] - E[Y|X,Z])^2</math> and <math>e_z = E(E[Y|Z] - E[Y|X,Z])^2</math> should be small
* Formulate contrastive learning as an artificial ''classification'' problem:
** Classify between <math>(x, z, 1)</math> and <math>(x, \tilde{z}, 0)</math>.
** <math>g^*(x,z) = \operatorname{argmin}(\text{classification loss}) = \frac{P_{X,Z}(x,z)}{P_X(x) P_Z(z)}</math>.
 
* x, z have redundant information about Y, so if we first predict z from x and then use the result to predict Y, we should do well:
<math>
\begin{aligned}
\mu(x) &= E[E[Y|Z]|X=x] \\
&= \int E[Y|Z=z] P_{Z|X}(z|x) dz\\
&= \int E[Y|Z=z] g^*(x,z) P_Z(z) dz
\end{aligned}
</math>
 
Lemma: <math>E[(\mu(x) - E[Y | x,z])^2] \leq e_x + e_z + 2\sqrt{e_x e_z}</math>
 
Landmark embedding: <math>g^*</math> is computed via contrastive learning. 
How to embed x? <math>\phi(x) = (g^*(x, z_1),...,g^*(x,z_m))</math> 
z's are randomly generated from <math>P_Z</math>. 
Each <math>g^*</math> output is a single number.
 
* <math>\phi()</math> is good embedding if a linear classifier of <math>\phi</math> is approx <math>\mu(x)</math>.
** I.e. <math>\exists w : w^t \phi(x) \approx \mu(x)</math>
** <math>w^t \phi(x) = \frac{1}{m} \sum_{i=1}^{m} E[Y|Z_i] g^*(x, z_i) \stackrel{m \to \infty}{\rightarrow} \int E[Y|Z=z] g^*(x,z)P_{Z}(z) dz = \mu(x)</math>
 
Direct embedding 
Instead of learning a bivariate <math>g(x,z)</math>, learn <math>\eta(x)^T \psi(z)</math>. 
 
Lemma: For every <math>\eta: x \to \mathbb{R}^m</math> and <math>\psi: z \to \mathbb{R}^m</math>, there exists <math>w \in \mathbb{R}^m</math> such that <math>E[(w^t \eta(x) - \mu(x))^2] \leq E[Y^2] \epsilon_{direct}(\eta, \psi)</math> where <math>\epsilon_{direct}(\eta, \psi) = E[(\eta(x)^t \psi(z) - g^*(x,z))^2]</math>. 
Can we write <math>g^*(x,z)</math> as <math>\eta(x)^T \psi(z)</math>? 
* If there is a hidden variable H s.t. X and Z are conditionally independent given H.
# Case 1: If H is a discrete variable, then <math>g^*(x, z) = \eta^*(x)^t \psi^*(z)</math> exactly.
# Case 2: There exists <math>\eta,\psi</math> such that <math>E(\eta(x)^t \psi(z) - g^*(x,z))^2 \leq o(\frac{1}{m})</math>.
 
==Meta Learning==
In many applications, we don't have a large training dataset. However, humans can adapt and learn ''on the fly''. 
The key is to use prior knowledge in performing new tasks. 
How can we train AI/ML models similarly? 
Goal of meta learning: Train a model on different learning tasks such that it can solve new tasks using only a small number of training samples.
 
Few shot classification problem: 
Inputs: <math>D_{metatrain} = \{(D_i^{train}, D_i^{test})\}_{i=1}^{n}</math>. 
<math>
\begin{cases}
D_{i}^{train} = \{(x_j^i, y_j^i) \}_{j=1}^{k}\\
D_{i}^{test} = \{(x_j^i, y_j^i) \}_{j=1}^{k'}
\end{cases}
</math>
* <math>i</math> is the index of the task
* <math>j</math> is the index of the sample
 
K-shot classification means k training samples per label. Some papers use k as the size of the whole training set. 
Q: How to use <math>D_{metatrain}</math> to solve meta-test tasks more effectively?
 
* Use <math>D_{metatrain}</math> to learn meta parameters <math>\theta</math> such that:
* Base learner <math>A</math> outputs task-specific model parameters <math>\phi_i = A(D_{i}^{train}, \theta)</math> good for <math>D_{i}^{test}</math>.
* Training procedure:
** Loss: <math>\min_{\theta} \sum_{i=1}^{n} loss(D_{i}^{test}, \phi_i)</math> where <math>\phi_i = A(D_{i}^{train}, \theta)</math>.
* Test time: given a new task <math>(D^{train}, D^{test})</math>. Apply <math>A(D^{train}, \theta^*)</math> to get <math>\phi</math>.
 
[Finn et al. 2017] 
* Idea: train a model that can be easy to fine-tune on new tasks
* one-step fine tuning: <math>\theta - \alpha \nabla L(\theta, D_{i}^{train})</math> gives us <math>\phi_{i}</math>
* Evaluate <math>\phi_i</math> on <math>D_i^{test}</math>
* <math>\min_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \min_{\theta} \sum_{i=1}^{n} L(\theta - \alpha \nabla L(\theta, D_i^{train}), D_i^{test})</math>
 
1. Model-agnostic meta learning (MAML)
* Use GD to optimize over <math>\theta</math>
** <math>\nabla_\theta = \sum_{i=1}^{n}(\nabla_{\theta} \phi_i) \nabla_{\phi} L(\phi_i, D_i^{test})</math>
** <math>(\nabla_{\theta} \phi_i)</math> involves second-order derivatives which are expensive.
* First-order MAML: just ignore <math>\nabla_{\theta} \phi_i</math> term. Replace it with the identity matrix.
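A first-order MAML sketch on toy least-squares tasks, with the inner and outer gradients written analytically so no autodiff is needed; the task distribution, step sizes <math>\alpha, \beta</math>, and model are illustrative.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, alpha, beta = 5, 0.05, 0.01

def sample_task():
    """A toy regression task: y = w_task^T x + noise, split into D_train / D_test."""
    w_task = rng.normal(size=d)
    X = rng.normal(size=(20, d))
    y = X @ w_task + 0.1 * rng.normal(size=20)
    return (X[:10], y[:10]), (X[10:], y[10:])

def grad(w, X, y):
    # gradient of the mean squared error 1/(2n) * ||Xw - y||^2
    return X.T @ (X @ w - y) / len(y)

theta = np.zeros(d)
for step in range(1000):
    (Xtr, ytr), (Xte, yte) = sample_task()
    phi = theta - alpha * grad(theta, Xtr, ytr)   # one inner adaptation step
    # First-order MAML: treat d(phi)/d(theta) as the identity and use the gradient at phi.
    theta = theta - beta * grad(phi, Xte, yte)

print(theta.round(3))   # meta-learned initialization
</syntaxhighlight>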
 
Reptile [Nichol et al. 2018] is a related first-order method.
2. <math>A</math> is a simple linear/non-parametric learning on data embeddings computed via <math>f_{\theta}</math>
 
[Lee et al. 2019]
* <math>f_{\theta}</math> is used to compute embeddings
* <math>A</math> is a linear classifier (e.g. SVM)
* Use dual form of SVM so # of optimization variables = # training samples * # classes
 
3. <math>A</math> is a non-parametric learning
* Embedding: <math>\tilde{x} = f_{\theta}(x)</math>
[Snell et al. 2017]
* Define prototypes (cluster centers)
** <math>c_k = \frac{1}{|S_k|} \sum_{(x,y) \in S_k} f_{\theta}(x)</math>
** <math>P_{\theta}(y=k|x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}</math>
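A NumPy sketch of prototype-based classification given precomputed embeddings <math>f_{\theta}(x)</math>; the embeddings and episode sizes are synthetic placeholders.

<syntaxhighlight lang="python">
import numpy as np

def prototype_predict(support_emb, support_labels, query_emb, num_classes):
    """Compute class prototypes from the support set, then classify queries by a
    softmax over negative squared Euclidean distances to the prototypes."""
    protos = np.stack([support_emb[support_labels == k].mean(axis=0)
                       for k in range(num_classes)])
    d2 = np.sum((query_emb[:, None, :] - protos[None, :, :]) ** 2, axis=-1)
    logits = -d2
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
emb_dim, k_shot, num_classes = 16, 5, 3
support_labels = np.repeat(np.arange(num_classes), k_shot)
support_emb = rng.normal(size=(num_classes * k_shot, emb_dim)) + support_labels[:, None]
query_emb = rng.normal(size=(6, emb_dim)) + 2.0
print(prototype_predict(support_emb, support_labels, query_emb, num_classes).round(2))
</syntaxhighlight>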
 
4. <math>A</math> is a black box (e.g. LSTM).
 
==LSTMs and transformers==
Lecture 23 (November 19, 2020)
 
===Recurrent Neural Networks (RNNs)===
Hidden state: <math>h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)</math> 
Prediction at t: <math>y_t = W_{hy} h_{t}</math>
 
;Backpropagation through time
If <math>W</math> has largest singular value < 1, then gradient vanishes. 
If <math>W</math> has largest singular value > 1, then gradient explodes. 
Typically, gradient vanishes because of initialization of <math>W</math>.
 
===Long Short Term Memory (LSTMs)===
Goal is to solve the vanishing and exploding gradient problem.
 
LSTM has several gates:
* Input gate: <math>i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)</math>
* Forget gate: <math>f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)</math>
* Output Gate: <math>o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)</math>
* Candidate cell: <math>\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)</math>
* Cell state: <math>c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t</math>
* Hidden state: <math>h_t = o_t \odot \tanh(c_t)</math>
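A single LSTM cell step in NumPy following the gate equations above, with all four gates computed from one stacked weight matrix; the shapes and random weights are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [x_t; h_prev] to the stacked [i, f, o, c~] pre-activations."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])             # input gate
    f = sigmoid(z[H:2 * H])         # forget gate
    o = sigmoid(z[2 * H:3 * H])     # output gate
    c_tilde = np.tanh(z[3 * H:])    # candidate cell state
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 8, 16
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):   # a length-5 input sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
</syntaxhighlight>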
 
 
;Bidirectional RNNs
First LSTM takes input in correct order. 
Second LSTM takes input in reverse order. 
Concatenate outputs from both LSTMs.
 
===Attention===
Goal: To help memorize long source sentences in machine translation.
 
;Encoder-decoder attention
 
;Self-Attention
 
===Transformer===
;Positional encoding
 
;Self-Attention
Have queries, keys, and values. 
Multiply queries with keys, pass through a softmax, and then multiply by the values. 
This yields the attention of every word with respect to every other word. 
The original Transformer used <math>h=8</math> heads, giving 8 sets of queries, keys, and values.
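A NumPy sketch of scaled dot-product self-attention for a single head; the projection matrices and dimensions are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # attention of each token w.r.t. the others
    return A @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 32, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (seq_len, d_k)
</syntaxhighlight>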
 
;Architecture
Stack encoders.
 
==Interpretability==
 
;Interpretability Methods
* Built-in model interpretability
* Feature level interpretability
* Instance based explanations
We will focus on feature level interpretability.
 
===Feature Level Interpretability===
These are given through saliency maps. 
* Perturbation-based: Perturb the input to get another output and compute the difference. 
* Gradient-based
 
===Gradient-based Methods===
Take the derivative of the output with respect to the input.
 
;Limitations
* Too local and sensitive to slight perturbations
* Saturated outputs lead to unintuitive gradients
* Discontinuous gradients are problematic
 
;SmoothGrad
* Add Gaussian noise to the input and average the gradients, as in the sketch below.
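A sketch of SmoothGrad given any function that returns input gradients; the gradient oracle, noise level, and sample count below are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def smoothgrad(grad_fn, x, sigma=0.1, n_samples=50, seed=0):
    """Average input gradients over Gaussian perturbations of the input."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x)
    for _ in range(n_samples):
        total += grad_fn(x + sigma * rng.normal(size=x.shape))
    return total / n_samples

# Toy model f(x) = sum(x^2); its input gradient is 2x.
grad_fn = lambda x: 2.0 * x
x = np.linspace(-1.0, 1.0, 10)
print(smoothgrad(grad_fn, x))
</syntaxhighlight>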
 
;Integrated Gradients
* Average the gradients along path from baseline to input.
 
;DeepLift
* We don't care about gradient but the slope relative to the ''reference'' state
 
;Limitations
* Models must be able to compute the gradient of the output with respect to the input
* Interpretation of neural networks is fragile
** Saliency maps are uninterpretable for adversarial examples on clean models and adversarially trained models.
* Needs white-box gradient access to the model.
 
===Evaluation of interpretability methods===
* Human evaluation
** Can humans evaluate saliency?
* Accuracy drop after removing ''salient'' features
* Sanity checks
** Model parameter randomization test - compare the output of the saliency method on a trained vs. an untrained (randomly initialized) model to make sure saliency depends on the model parameters.
* Synthetic Data
* Data randomization test
** Train on random labels and see if saliency depends on relationship between input & output.
 
;Temporal Saliency Rescaling
* If you remove this feature, how is the gradient going to change?
 
==Reinforcement Learning==
Lecture 27 (December 1, 2020)
 
;Goal: Sequential decision making problems
* Current decisions can affect future decisions
Examples: games, robotics, self-driving, finance
 
Agent takes an action <math>a_t</math> in the environment which leads to a new state <math>s_{t+1}</math>. 
 
;Definitions
* <math>S</math> is the state space. This can be discrete or continuous.
* <math>O</math> are the observations.
* <math>A</math> is the set of actions which can be discrete or continuous.
* <math>P</math> are the transition probabilities which model the environment.
 
Example:
* <math>|S|</math> = 2
* <math>|A|</math> = 2
In general, <math>S_t \sim P(S_t=s | s_0, a_0, s_1, a_1, ..., a_{t-1})</math>. 
The given variables are the trajectory <math>T_{t-1}</math>. 
Modeling this can be very complex for large <math>T</math>. 
Thus, we apply a Markov assumption:
* <math>S_t \sim P(S_t=s | a_{t-1}, s_{t-1})</math>
 
Objective: To maximize the rewards
* <math>R_t \stackrel{\triangle}{=} R(s_t, a_t)</math>
Two regimes for rewards:
* Finite horizon: T is finite
* Infinite horizon: T goes to infinity
* A discount factor <math>\gamma</math> is used to keep the infinite-horizon sum finite
 
===Classical RL===
Finite Markov Decision Process (MDP)
* S: Finite number of states
* A: Finite number of actions
* <math>R_t = R(a_t, S_t)</math>
* <math>\gamma</math>: discount factor
 
Goal: Choose actions to maximize the total rewards. 
* Policy function <math>\pi</math> determines how actions should be taken.
* The policy function could yield a dist on actions or can be deterministic. We will focus on deterministic actions.
* Assume this is time invariant.
* Value function determines the expected reward if we start at state s and follow policy <math>\pi</math>.
* Q function (action-value function)
** <math>Q_\pi (s, a) = E[R_0 + \gamma R_1+... | A_0=a, S_0=s, \pi]</math>
* If we have <math>\pi</math>, can we evaluate <math>V_{\pi}(s)</math>?
** <math>V_{\pi}(s) = E[R_0 + \gamma R_1 + ... | S_0 = s, \pi]</math>
* Use Bellman's equation
** <math>V_{\pi}(s) = R(s) + \gamma \sum_{s'} P_{\pi}(s' | s) V_{\pi}(s')</math>
** <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1}R</math>
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric so eigenvalues can be complex. 
All eigenvalues satisfy <math>| \lambda_i | \leq 1</math>, with at least one eigenvalue of norm exactly 1. 
This implies <math>I - \gamma P_{\pi}</math> is always invertible. 
However, computing the inverse is computationally expensive since it scales with <math>|S|^3</math>.
 
===Value-Iteration (Bellman Recursion)===
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math> 
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math> 
<math>v_{\pi} = L_{\pi} v_{\pi}</math>, so <math>v_{\pi}</math> is a fixed point of <math>L_{\pi}</math>. 
 
;Claim: <math>L_{\pi}</math> is a contraction.
Proof: Take <math>v_1, v_2 \in \mathbb{R}^d</math>. 
<math>
\begin{aligned}
L_{\pi}v_1 - L_{\pi}v_2 &= (R + \gamma P_{\pi}v_1) - (R + \gamma P_{\pi}v_2) \\
&= \gamma P_{\pi}(v_1 - v_2)
\end{aligned}
</math> 
<math>
\begin{aligned}
\Vert L_{\pi}v_1 - L_{\pi}v_2 \Vert &= \Vert \gamma P_{\pi}(v_1 - v_2) \Vert\\
& \leq \gamma \Vert P_{\pi} \Vert_{\infty} \Vert v_1 - v_2 \Vert \\
& \leq \gamma \Vert v_1 - v_2 \Vert
\end{aligned}
</math> 
By Banach's Fixed Point Theorem, we can converge to the fixed point iteratively by repeatedly applying <math>L_{\pi}</math>.
 
===Optimal Policy===
Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>.
 
Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>. 
The optimal policy may not be unique, but all optimal policies share the same value function.
 
Intermediate questions:
* Given <math>v^*</math>, can we compute <math>Q^*</math>?
<math>
\begin{aligned}
Q^*(s, a) &= \max_{\pi} Q_{\pi}(s,a)\\
&= \max_{\pi} \left[ R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s') \right]\\
&= R(s) + \gamma \sum_{s'} P(s' | s, a) \max_{\pi} v_{\pi}(s')\\
\end{aligned}
</math>
 
* Given <math>Q^*</math>, can we compute <math>v^*</math>?
Bellman's optimality states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>. 
<math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>
 
;How to compute optimal policies?
First approach: Value Iteration
<math>V^*(s) = \max_{a} Q^*(s, a) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^*(s')]</math> 
Define an operator: <math>L(v) = \max_{\pi}[R + \gamma P_{\pi} v]</math> 
<math>v^*</math> is a fixed point of <math>L(v)</math>. 
Claim: <math>L</math> is also a contraction. 
Thus <math>v^*</math> can be computed by repeated application of <math>L</math>. 
 
===Value Iteration===
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{r}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math>
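A NumPy sketch of value iteration on a small randomly generated MDP; the transition model and rewards are synthetic.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 5, 3, 0.9, 1e-8
P = rng.random(size=(A, S, S))
P /= P.sum(axis=2, keepdims=True)    # P[a, s, s'] = p(s' | s, a)
R = rng.random(size=S)               # reward R(s)

v = rng.random(size=S)               # random v^(0)
while True:
    Q = R[None, :] + gamma * (P @ v) # Q[a, s] = R(s) + gamma * sum_s' p(s'|s,a) v(s')
    v_new = Q.max(axis=0)
    if np.linalg.norm(v_new - v) <= eps:
        break
    v = v_new

pi_star = Q.argmax(axis=0)           # greedy policy w.r.t. the converged values
print(v_new.round(3), pi_star)
</syntaxhighlight>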
 
===Policy Iteration===
Directly optimize policy.
 
* Start with a random policy <math>\pi^{(0)}</math>
* Evaluate the policy <math>v_{\pi}</math> using closed form or value iteration
* Evaluate the Q-Function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s'|s,a) V_{\pi}(s')</math>
* Find the best action to be taken at state <math>s</math>:
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math>
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math>
* Repeat until convergence.
 
Policy iteration is guaranteed to converge to an optimal policy. 
It often converges faster than value iteration.
 
===Deep Reinforcement Learning===
 
;Relaxing some unrealistic assumptions
# Evaluate <math>v_{\pi}(s)</math>
#* <math>Q_{\pi}(s, a) = R(s, a) + \gamma E_{s' \sim P(s'|s,a)} v_{\pi}(s')</math>
# Improve the policy
#* <math>\operatorname{argmax}_{a_t} Q_{\pi}(s_t, a_t)</math>
#* Assumption <math>|S|=d</math>
# How to represent <math>V(s)</math>?
#* Can we use a neural network to represent <math>V(s)</math>? Yes
 
;How to train <math>v_{\phi}</math>?
* Start with an old <math>v_{\phi}</math>, compute <math>Q_{\pi}(s,a)</math>. 
** <math>Q_{\pi}(s,a) = R(s,a) + \gamma E[v_{\phi}^{old}(s')]</math> 
* Fit <math>v_{\phi}</math> to <math>\max_{a}Q_{\pi}(s,a)</math> using a quadratic loss: 
** <math>L(\phi) = \frac{1}{2} \Vert V_{\phi}(s) - \max_{a} Q_{\pi}(s,a) \Vert^2</math> 
* Iterate
 
;Similarly we can parameterize the Q function
Compute the target <math>y_i \leftarrow R(s_i, a_i) + \gamma E[v_{\phi}(s_i')]</math> 
We need to know transition probabilities <math>P(s'|s,a)</math> to compute the expectation (model-based RL).
 
With model-free RL: 
We can approximate the expectation with a single sample: <math>E[v(s_i')] \approx v(s_i') = \max_{a'} Q(s_i', a')</math> 
This is called ''Q-Learning''.
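A sketch of the corresponding tabular Q-learning update with epsilon-greedy exploration; the toy environment, learning rate, and exploration rate are illustrative.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, lr, eps = 5, 3, 0.9, 0.1, 0.2
Q = np.zeros((S, A))

# A toy random environment standing in for the unknown dynamics.
P = rng.random(size=(S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(size=(S, A))

s = 0
for step in range(5000):
    a = rng.integers(A) if rng.random() < eps else Q[s].argmax()   # epsilon-greedy action
    s_next = rng.choice(S, p=P[s, a])                              # sampled transition
    target = R[s, a] + gamma * Q[s_next].max()                     # max_a' Q(s', a')
    Q[s, a] += lr * (target - Q[s, a])                             # model-free update
    s = s_next

print(Q.argmax(axis=1))   # greedy policy from the learned Q
</syntaxhighlight>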
 
;What if we have continuous actions:
* Approach 1: Use a function class such that <math>\max_{a}Q(s,a)</math> is easy to solve
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer: <math>\operatorname{argmax}_{a'} Q(s,a')</math>
 
===Policy Gradient Method===
Lecture 29 (Dec 8, 2020) 
 
Probability of observing a trajectory: 
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math> 
Reward for a trajectory: 
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + ... + R(s_T, a_T)</math> 
The average reward is: 
<math>J(\theta) = E[R(\tau)] = \sum_t E[R(s_t, a_t)]</math>
 
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>.
 
Gradient of the average reward: 
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau \\
&= \int P_{\theta}(\tau) \nabla_{\theta} \log P_\theta(\tau) R(\tau) d\tau\\
&= E[\nabla_{\theta} \log P_\theta(\tau) R(\tau)]
\end{aligned}
</math>
 
<math>
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math> 
This implies 
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]\\
&\approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})\right) R(\tau^{(i)})
\end{aligned}
</math>
 
;Summary
* Sample trajectories
* Approximate <math>\nabla_{\theta} J(\theta)</math>
* <math>\theta \leftarrow \theta + \alpha \nabla J(\theta)</math>
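A sketch of this procedure (REINFORCE) for a state-independent softmax policy on a toy problem, where <math>\nabla_{\theta} \log \pi_{\theta}(a)</math> has a closed form; the rewards, horizon, and step size are illustrative.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
A, T, N, alpha = 3, 10, 64, 0.05
theta = np.zeros(A)                        # logits of a state-independent softmax policy
true_reward = np.array([1.0, 0.2, -0.5])   # toy expected reward per action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for it in range(200):
    grad = np.zeros(A)
    for _ in range(N):                     # sample N trajectories
        pi = softmax(theta)
        actions = rng.choice(A, size=T, p=pi)
        rewards = true_reward[actions] + 0.1 * rng.normal(size=T)
        R_tau = rewards.sum()
        for a_t in actions:                # grad log pi(a_t) = e_{a_t} - pi
            g = -pi.copy(); g[a_t] += 1.0
            grad += g * R_tau
    theta += alpha * grad / N              # theta <- theta + alpha * grad J(theta)

print(softmax(theta).round(2))             # mass should concentrate on the best action
</syntaxhighlight>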
 
;Intuition
<math>E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math> 
Formalizing ''trial & error''. 
[Finn & Levine, ICML]
 
===Issues with Policy Gradient===
;High variance of gradient estimation
 
;Solutions
* Subtract a baseline
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
\end{aligned}
</math>
 
;Some parameters can change <math>\pi_{\theta}</math> more than others so it's hard to choose a fixed learning rate.
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math>F</math> is the Fisher information matrix.
 
===Actor-critic algorithms===
Have an actor <math>\pi_{\theta}</math>. 
Have a critic <math>V_{\phi}/Q</math>
 
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
 
===Other topics in RL===
* Inverse RL
* Multi-agent RL
* Model-based RL
 
==Summary of Course==
;What we covered
* Supervised DL
* Unsupervised DL (GANs, VAEs)
* Self-supervised DL
* Meta-Learning
* Learning with Attention (Transformers)
* Deep RL
 
* Optimization
* Generalization
* Robustness
* Interpretability
 
;What we didn't cover
* Fairness
* Privacy & Ethics
* Bayesian DL
* Federated Learning
* Graph NNs


==Misc==
;Things which may be on the final
* Transformers
* Wasserstein distance
* Kernel methods


==Resources==