Deep Learning

* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
* [https://www.youtube.com/user/soheilfeiz/videos Lecture Videos]


==Basics==
A refresher of [[Machine Learning]] and [[Supervised Learning]].


===Empirical risk minimization (ERM)===

With <math>L(w) = f(w) = \frac{1}{2}\Vert F(w)-y \Vert^2</math> and <math>K(w) = \nabla F(w) \nabla F(w)^T</math>, the gradient satisfies
<math>
\begin{aligned}
\frac{1}{2}\Vert \nabla f(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)
\end{aligned}
</math>
Line 502: Line 503:


===Linear Regression===
Assume we have a dataset:<br>
<math>\{(x_i, y_i)\}_{i=1}^{n}</math><br>
<math>y_i \in \mathbb{R}</math><br>
<math>x_i \in \mathbb{R}^d</math><br>
<math>f(w, x) = w^t x</math>

<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br>
<math>\min_{w} L(w)</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w_t)</math> where our gradient is:<br>
<math>\nabla L(w_t) = -\sum_{i=1}^{n}(y_i - f(w_t, x_i)) \nabla_{w} f(w_t, x_i) = -\sum_{i=1}^{n}(y_i - f(w_t, x_i)) x_i</math>
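As a quick self-contained illustration (not from the lecture), here is a minimal NumPy sketch of this gradient-descent update on synthetic data; the data, step size, and iteration count are arbitrary choices.

<syntaxhighlight lang="python">
import numpy as np

# Synthetic data: y_i = w_true^T x_i + noise (illustrative values only).
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss(w):
    # L(w) = 1/2 * sum_i (y_i - w^T x_i)^2
    r = y - X @ w
    return 0.5 * np.sum(r ** 2)

def grad(w):
    # gradient of L(w): -sum_i (y_i - w^T x_i) x_i
    return -X.T @ (y - X @ w)

# Gradient descent: w(t+1) = w(t) - eta * grad L(w(t))
w, eta = np.zeros(d), 1e-3
for t in range(2000):
    w = w - eta * grad(w)

print(loss(w))  # should approach the noise floor
</syntaxhighlight>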




<math>f(w, x) = w^t \phi(x)</math><br>
Is this model linear in w? Yes!<br>
Is this model linear in x? No!<br>

<math>\min \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math><br>
Apply GD or convex optimization.


* <math>\phi</math> is fixed!
* <math>D = O(d^k)</math>
** For ImageNet, <math display="inline">d</math> is approx <math display="inline">10^5</math> so <math display="inline">D=O(10^{15})</math>


;Kernel Trick:
We may have a closed form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.<br>
This is called the kernel function <math>K(x_i, x_j)</math> or kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.<br>
<math display="inline">K</math> is a PSD matrix.

Idea: In many cases, we can compute <math>K(x_i, x_j)</math> without ''explicitly'' computing <math>\phi(x_i)</math>.


;Polynomial Kernels:
<math>K(x_i, x_j) = (c + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math><br>
Here <math>D=O(d^k)</math> but computing <math>K(x_i, x_j)</math> only takes <math>O(d)</math> time.
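A minimal NumPy sketch of the kernel trick for a polynomial kernel; the constant <math>c</math>, degree <math>k</math>, and data sizes below are illustrative choices, not from the lecture.

<syntaxhighlight lang="python">
import numpy as np

def poly_kernel(X1, X2, c=1.0, k=3):
    """Polynomial kernel K(x, x') = (c + x^T x')^k, computed in O(d) per pair
    without ever forming the O(d^k)-dimensional feature map phi(x)."""
    return (c + X1 @ X2.T) ** k

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))     # n = 50 points in d = 10 dimensions
K = poly_kernel(X, X)             # n x n kernel (Gram) matrix

# K should be positive semi-definite (up to numerical error).
print(np.linalg.eigvalsh(K).min())
</syntaxhighlight>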


Many classical techniques can be ''kernelized'':
* SVM to Kernel SVM
* Ridge regression to Kernel ridge regression
* PCA to Kernel PCA


===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w, x_i) - y_i) \nabla_w f(w_t, x_i)</math>


# Initialize weights from N(0,1).
# The weights update along a trajectory: w(0), w(1), ...
# Each <math>w</math> is a weight matrix.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training.


* Not always the case! Especially for small <math>m</math>.


Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0) + ...</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).


These features will not change during the optimization process because we use <math display="inline">w_0</math>.

Go back to our 2-layer NN:<br>
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum b_i \sigma(a_i^t x)</math><br>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>


<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^tx) \sigma'(a_i^tx') (x x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>
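The empirical kernel <math>K_m</math> can be computed directly from these gradient formulas. Below is a NumPy sketch for a randomly initialized two-layer ReLU network; the width, input dimension, and inputs are arbitrary.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 5000                      # input dimension, network width
a = rng.normal(size=(m, d))          # a_i ~ N(0, I_d)
b = rng.normal(size=m)               # b_i ~ N(0, 1)

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

def empirical_ntk(x, xp):
    """K_m(x, x') = K_m^(a) + K_m^(b) for f_m(w, x) = (1/sqrt(m)) sum_i b_i sigma(a_i^t x)."""
    za, zb = a @ x, a @ xp
    K_a = np.mean(b ** 2 * drelu(za) * drelu(zb)) * (x @ xp)
    K_b = np.mean(relu(za) * relu(zb))
    return K_a + K_b

x, xp = rng.normal(size=d), rng.normal(size=d)
print(empirical_ntk(x, xp))   # approaches the limiting kernel as m grows
</syntaxhighlight>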


* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
Based on the law of large numbers, as <math>m</math> goes to infinity,<br>
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x x') \right]</math><br>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>


<math>K^{(a)}(x, x') = \frac{(x x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta(x, x')) \cos(\theta) + \sin \theta)</math>


;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues. (Hessian Control)


;Analyze GD:
<math>\eta \to 0</math> gives gradient flow:<br>
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br>
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>


<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math><br>
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w} \hat{y}(w)^T \frac{dw(t)}{dt}\\
&= -\nabla_{w} \hat{y}(w)^T \nabla_{w} \hat{y}(w) (\hat{y}(w) - y)\\
&= -K(w_t) (\hat{y}(w) - y)
\end{aligned}
</math>

If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>, since <math>K(w_t) \approx K(w_0)</math> in the lazy regime.<br>
This ODE implies <math>u(t) = \exp(-K(w_0)t) u(0)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0 </math> (positive definite).


An <math>f</math> function which is element-wise would have a diagonal Jacobian. This is not very expressive.

===Upper-triangular Jacobian===
RealNVP by [Dinh et al] considers an upper-triangular Jacobian.
In this case, the determinant is still the product of the diagonal entries:
<math>
J = \begin{pmatrix}
I_d & *\\
0 & \operatorname{diag}(S_\theta)
\end{pmatrix}
</math>
where <math>*</math> denotes the block coupling the two partitions, and <math>det(J) = \prod_i (S_\theta)_i</math>.
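A minimal NumPy sketch of a RealNVP-style affine coupling layer, illustrating why the log-determinant reduces to the sum of the scales; the two-way split and the toy scale/translation functions are illustrative stand-ins for the real networks.

<syntaxhighlight lang="python">
import numpy as np

def coupling_forward(x, scale_fn, translate_fn):
    """Affine coupling: keep x1 fixed, transform x2 elementwise.
    The Jacobian is triangular, so log|det J| = sum(s)."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s, t = scale_fn(x1), translate_fn(x1)
    z2 = x2 * np.exp(s) + t
    log_det = np.sum(s, axis=-1)
    return np.concatenate([x1, z2], axis=-1), log_det

def coupling_inverse(z, scale_fn, translate_fn):
    d = z.shape[-1] // 2
    z1, z2 = z[..., :d], z[..., d:]
    s, t = scale_fn(z1), translate_fn(z1)
    x2 = (z2 - t) * np.exp(-s)
    return np.concatenate([z1, x2], axis=-1)

# Toy stand-ins for the scale/translation networks (any functions of x1 work).
scale_fn = lambda h: 0.1 * h
translate_fn = lambda h: h ** 2

x = np.random.default_rng(0).normal(size=(4, 6))
z, log_det = coupling_forward(x, scale_fn, translate_fn)
x_rec = coupling_inverse(z, scale_fn, translate_fn)
print(np.allclose(x, x_rec), log_det)   # invertible, with cheap log-determinant
</syntaxhighlight>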
Is this expressive enough? No.
 
===Composition of transformations===
<math>f = f_1 \circ f_2 \circ ... \circ f_k</math> 
A sequence of invertible transformations is called a ''normalizing flow''.
 
Here, we have the following properties
* <math>(f_1 \circ f_2)^{-1} = f_2^{-1} \circ f_1^{-1}</math>
* <math>\nabla(f_2 \circ f_1)(x) = \nabla f_2(f_1(x)) \nabla f_1(x)</math>
* <math>det(J_1 J_2) = det(J_1) det(J_2)</math>
 
RealNVP uses two types of partitioning
* Checkerboard
* Channel partitioning
 
In Glow [Kingma et al], they use an invertible 1x1 convolution, which is equivalent to a matrix multiplication. 
Use <math>W = P L (U+diag(s))</math> so that <math>\log | det(W)| = \sum \log |s_i|</math>.
 
===Dequantization===
Fitting a ''density'' function to discrete data can produce arbitrarily sharp peaks. 
For dequantization, add uniform noise to the data to get a more stable density function.
 
<math>
\begin{aligned}
\log \int p(x_i +\delta) p(\delta)\,d\delta &= \log E_\delta [p(x_i + \delta)]\\
&\geq E_{\delta}[\log p(x_i + \delta)] \quad \text{(Jensen's inequality)}\\
&\approx \log p(x_i + \delta)
\end{aligned}
</math>
 
* We have exact likelihood estimates.
** In principle, we can use the likelihood for out-of-distribution anomaly detection.
 
However, in practice, a model trained on CIFAR assigns higher likelihood to MNIST. 
This behavior is not specific to flow-based models. 
Suppose <math>P_{\theta}</math> is <math>N(0, I_d)</math>. 
Our typical example has <math>\Vert x_i \Vert^2 = O(d)</math>. 
Consider <math>x^{test} = 0</math> then <math>P_{\theta}(x^{test}) > P_{\theta}(x_1)</math>.
 
==Domain Adaptation==
So far, we have a training set <math>\{(x_i^{(train)}, y_i^{(train)})\}</math> from distribution <math>Q_{X,Y}</math>. 
We learn optimal parameters <math>\theta^*</math> via ERM. 
Then at test time, our test samples come from the same distribution <math>Q_{X,Y}</math>. 
However in practice, the training distribution can be different from the test distribution.  
The training distribution is the source domain. The test distribution is the target domain. 
 
;Examples
Q may be synthetic samples and P may be real samples. 
Q contains samples with a white background but P has samples with real backgrounds.
 
In training: 
For the source domain, we have labeled samples <math>\{(x_i^S, y_i^S)\}_{i=1}^{m_S} \sim Q_{X,Y}</math>. 
For the target domain, we only have unlabeled samples <math>\{x_i^t\} \sim P_{X}</math>. 
This is ''unsupervised'' domain adaptation.
 
In ''semi-supervised'' domain adaptation, your target samples are mostly unlabeled but contain a few labeled samples.
 
If no target samples are available during training, the scenario is called ''domain generalization'' or ''out-of-distribution'' generalization.
 
===Unsupervised domain adaptation===
Given <math>m_s = m_t = m</math>. 
* m labeled samples from source domain Q
* m unlabeled samples from target P
 
* This problem is ill-defined
 
;Practical assumptions.
# Covariate shift: P and Q satisfy the covariate shift assumption if the conditional label distribution doesn't change between source and target.
#* I.e. <math>P(y|x) = Q(y|x)</math>
# Similarity of source and target marginal distributions.
# If we had labeled target samples, the joint error (over source + target samples) should be small.
 
[Ben-David et al.] consider binary classification. 
;H-divergence
<math>2 \sup_{h \in H}|Pr_{x \sim Q_{X}}(h(x)=1) - Pr_{x \sim P_{X}}(h(x)=1)| = d_{H}(Q_X, P_X)</math>
 
;Lemma
<math>d_{H}(Q_X, P_X)</math> can be estimated by m samples from source and target domains. 
If <math>VC(H)=d</math>, then with probability <math>1-\delta</math>: 
<math>d_{H}(Q_X, P_X) \leq d_{H}(Q_{X}^{(m)}, P_{X}^{(m)}) + 4 \sqrt{\frac{d \log(2m) + \log(2/\delta)}{m}}</math>
 
# Label source examples as 1 and label target samples as 0.
# Train a classifier to classify source and target.
# If loss is small then divergence is large. <math>\frac{1}{2} d_{H}(Q_X^{(m)}, P_X^{(m)}) = 1-loss_{class}</math>
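A sketch of this procedure with a plain logistic-regression domain classifier written in NumPy; the Gaussian source/target data and the training details are illustrative, and in practice one would use held-out samples and a richer hypothesis class.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
m, d = 500, 5
source = rng.normal(loc=0.0, size=(m, d))   # samples from Q_X, labeled 1
target = rng.normal(loc=1.0, size=(m, d))   # samples from P_X, labeled 0

X = np.vstack([source, target])
y = np.concatenate([np.ones(m), np.zeros(m)])

# Train a linear domain classifier with logistic regression (plain gradient descent).
w, b = np.zeros(d), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y)) / len(y)
    b -= 0.1 * np.mean(p - y)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
err = np.mean((p > 0.5) != y)               # classification error ("loss_class")
print("proxy for (1/2) d_H(Q, P):", 1.0 - err)
</syntaxhighlight>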
 
===Recap===
Beginning of Lecture 18 (Oct. 29, 2020)
 
Given labeled examples from the source domain: <math>Q_{X,Y} = \{(x_i^S, y_i^S)\}_{i=1}^{m_s}</math>. 
Target domain: <math>P_{X} = \{x_i^t\}_{i=1}^{m_t}</math>. 
Learn a function <math>h \in H</math>. 
<math>\epsilon^T(h) = E_{(x,y) \sim P_{X,Y}}[ l(h(x), y) ]</math>.
 
H-divergence: 
<math>d_H(Q_X, P_X) = 2\sup_{h \in H} | P_{Q}(h(x)=1) - P_{P}(h(x)=1)| = 2(1-loss_{classifier})</math>. 
This can be estimated by training a classifier to distinguish between train and test samples.
 
Def: 
For the hypothesis class H, the ''symmetric difference hypothesis space'' <math>H \triangle H</math> is the set of disagreements between any two hypotheses in H: 
<math>H \triangle H = \{g(x) = h(x) \oplus h'(x) \mid \forall h, h' \in H\}</math>.
 
;Main Result
<math>H</math> is a hypothesis class with <math>VC(H)=d</math>. 
With probability <math>1-\delta</math>, <math>\forall h \in H</math>: 
<math>\epsilon_{T}(h) \leq \epsilon_{S}(h) + \frac{1}{2} d_{H \triangle H}(Q_X^{(m)}, P_X^{(m)}) + \epsilon_{joint}</math>. 
That is, target error ≤ source error + divergence + joint error.
 
===Practical Domain Adaptation Methods===
 
;Classical DA methods (pre deep learning)
* Metric Learning
* Subspace alignment
* MMD-based distribution matching
* Sample reweighting & selection
 
;Modern DA methods (i.e. deep domain adaptation)
 
[https://arxiv.org/abs/1409.7495 Ganin & Lempitsky] 
The idea is to train an embedding function using an adversarial domain classifier to extract common features from the source and target domains.
* Input <math>x</math> goes to an embedding function <math>F</math> to get features. 
* Features go to a classification network <math>C_1</math> to get labels.
* Features also go to a domain classifier <math>C_2</math>.
* Training: <math>\min_{F, C_1} \max_{C_2} E_{(x, y) \sim Q_{X,Y}^{(m)}} [\ell(C_1 \circ F(x), y)] - \lambda L(C_2)</math>, where <math>L(C_2)</math> is the domain-classification loss.
 
* In general, we want to find a mapping (embedding) <math>F</math> such that <math>F(Q_X) \approx F(P_X)</math>.
*: The domain classifier penalizes the distance between <math>F(Q_X)</math> and <math>F(P_X)</math>.
 
Example 1: MMD distance (Maximum mean discrepancy) 
Define <math>\tilde{x}_i = F(x_i)</math>. 
<math>D_{MMD}(Q^{(m)}_{\tilde{x}}, P^{(m)}_{\tilde{x}}) \stackrel{\triangle}{=} \Vert \frac{1}{m}\sum \phi(\tilde{x}_i^S) - \frac{1}{m}\sum \phi(\tilde{x}_i^T) \Vert</math> 
Here <math>\phi: \mathbb{R}^r \to \mathbb{R}^D</math> is a fixed feature map. 
We square D to apply the kernel trick: 
<math>D^2_{MMD}(Q^{(m)}_{\tilde{x}}, P^{(m)}_{\tilde{x}}) = \frac{1}{m^2}\left( \sum_{i,j=1}^{m}K(x_i^s, x_j^s) + \sum_{i,j=1}^{m}K(x_i^t, x_j^t) - 2 \sum_{i,j=1}^{m}K(x_i^s, x_j^t) \right)</math> 
MMD-based DA (Tzeng ''et al.'' 2014): 
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda D^2_{MMD}(F(x^s), F(x^t))</math>
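A NumPy sketch of the (biased) squared-MMD estimate using an RBF kernel as the fixed kernel; the bandwidth <math>\gamma</math> and the synthetic features are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=0.5):
    """Biased estimate of D^2_MMD between source and target features."""
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2 * rbf_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 8))   # F(x^s): source features
Xt = rng.normal(0.5, 1.0, size=(200, 8))   # F(x^t): target features
print(mmd2(Xs, Xt))                        # larger when the feature distributions differ
</syntaxhighlight>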
 
Example 2: Wasserstein distance 
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda W(F(x^s), F(x^t))</math> 
The Wasserstein distance is computed using the Kantorovich duality. 
This is also called an IPM (integral probability metric) distance.
 
* We can also use improved & robust versions of the Wasserstein distance in DA.
** Robust Wasserstein [Balaji ''et al.'' Neurips 20]
** Normalized Wasserstein [....ICCV]
 
===CycleGAN===
[Zhu et al 2017] 
Another approach for image-to-image translation. 
 
Source: <math>(x^s, y^s)</math> 
Target: <math>x^t</math> 
Train two functions: <math>G_{S \to T}</math> and <math>G_{T \to S}</math>. 
Losses:
* <math>L_{GAN}(x^s, x^t, G_{S\to T}, D^T) = E_{x^t}\left[\log D^T(x^t)\right] + E_{x^s}\left[\log(1-D^T(G_{S\to T}(x^s))) \right]</math>
* <math>L_{GAN}(x^s, x^t, G_{T \to S}, D^S)</math> defined similarly
* Cycle consistency: <math>L_{cyc} = E\left[ \Vert G_{T\to S}(G_{S \to T}(x^s)) - x^s \Vert \right] + E \left[ \Vert G_{S \to T}(G_{T\to S}(x^t)) - x^t \Vert \right]</math>
 
Other tricks:
* Domain-specific batch norms
* Entropy based regularization
 
===Are assumptions necessary?===
Assumptions:
* Covariate shift
* <math>d_H(Q_x, P_x)</math> is small
* <math>\epsilon_{joint}</math> small
 
See [Ben-David ''et al.'']. 
* Covariate shift assumption is not sufficient.
* Necessity of small <math>d_{H}(P,Q)</math> for DA.
* Necessity of small joint training error.
 
===Domain Generalization===
Also known as out-of-dist (OOD) generalization.
 
Training: <math>|E|</math> training domains (environments) 
<math>\{(x_i^e, y_i^e)\}_{i=1}^{m_e} \sim P^{(e)}</math> with <math>1 \leq e \leq |E|</math>. 
Goal: Find <math>h \in H</math> that performs well on an unseen domain (domain <math>k+1</math>, where <math>k = |E|</math>). 
At test time, samples come from the unseen domain: <math>\{(x_i^{(k+1)}, y_i^{(k+1)})\}_{i=1}^{m_{k+1}} \sim P^{(k+1)}</math>. 
The target risk is <math>R^{(k+1)}(h) = E_{(x,y) \sim P^{(k+1)}}[\ell(h(x), y)]</math>.
 
;Example datasets
* [http://ai.bu.edu/M3SDA/ DomainNet]
** 6 domains: painting, photo, sketches, drawings, clipart, infograph
* PACS
** 4 domains: art, cartoon, photo, sketch
* Some Simpler datasets
** Rotated MNIST
** Color MNIST
 
One baseline method is to do nothing special: just train normally with ERM.
 
===Domain-adversarial neural networks (DANN)===
* Train a feature extractor <math>\phi</math> and a classifier <math>w</math> to yield <math>f=w \circ \phi</math>.
* Domain classifier <math>c</math>.
* <math>loss = \frac{1}{k} \sum_{j=1}^{k} E[\ell(w \circ \phi(x), y)] + \lambda L(domain\_classification)</math>.
* We optimize <math>\min_{\phi, w} loss</math>
* We can use Wasserstein distance:
** <math>loss = \frac{1}{k} \sum_{j=1}^{k} E[\ell(w \circ \phi(x), y)] + \lambda \sum_{j_1, j_2} W(P_{X|Y}^{j_1}, P_{X|Y}^{j_2})</math>
** This is solved using alternating gradient descent.
 
===Meta Learning for Domain Generalization===
[Li ''et al.'' 2017]
 
Idea: Split the training domains into meta-train and meta-test domains. 
;Meta-train
* Loss: <math>L_{\text{meta-train}}(\theta) = \frac{1}{K-K_1} \sum_{j} E[\ell(f_{\theta}(x), y)]</math> over the meta-train domains
* Take one gradient step to update the model:
** <math>\theta' = \theta - \eta \nabla L_{\text{meta-train}}(\theta)</math>
 
Overall objective:
* <math>\min_{\theta} L_{\text{meta-train}}(\theta) + \beta L_{\text{meta-test}}(\theta')</math>
 
To update <math>\theta</math>, we need the gradient of the overall objective, which involves the Hessian w.r.t. <math>\theta</math> (through <math>\theta'</math>). This can be computed using a ''Hessian-vector product'' without explicitly forming the Hessian, which could be very large.
 
===Invariant Risk Minimization (IRM)===
[Arjovsky ''et al.'' (2019)]
 
Idea: Find a feature extractor <math>\phi()</math> such that the optimal classifier on top of it is the same for every domain. 

Define: <math>R^e(\phi, w) = E[\ell (w \circ \phi(x), y)]</math>. 
Objective: <math>\min_{\phi, \hat{w}} \frac{1}{k} \sum_e R^e (\phi, \hat{w})</math> s.t. <math>\forall e</math>, <math>\hat{w} \in \operatorname{argmin}_{\beta} R^e(\phi, \beta)</math>
 
This is a bi-level optimization which is difficult to solve. The constraint depends on another optimization.
 
The paper uses a Lagrangian relaxation:
<math>\min_{\phi, \hat{w}} \frac{1}{k} \sum R^e(\phi, \hat{w}) + \lambda \Vert \nabla_{\hat{w}} R^e(\phi, \hat{w}) \Vert^2_2</math>.
 
Argument: If we can solve the optimization, such a function will use only invariant features, since non-invariant features have a different conditional distribution with the label across domains.
 
[Rosenfeld ''et al.'' Oct 2020] 
This argument does not hold in general, as IRM can fail to recover the invariant predictor.
 
===Which method for generalization works the best?===
[Gulrajani & Lopez-Paz] offer the following empirical observations:
* Model selection is critical in domain generalization.
 
* Training-domain validation set
* Leave-one-domain out validation set
* Oracle selection
 
;Data augmentation is important in domain generalization
* random crops, random horizontal flips, random color jitter
* image-to-image neural translation
 
;ERM (doing nothing) outperforms other DG methods!!
* Possible with careful model selection
* Larger models help with domain generalization.
* See DomainBed for code and data.
 
==Self-supervised Learning==
Lecture 21 (November 10, 2020)
 
Given data <math>x</math>, we want to find a good representation <math>f(x)</math>. 
We can use <math>f(x)</math> to solve the classification problem more efficiently (e.g. using linear classifiers).
 
Task 1: Learn a good <math>f(x)</math> from ''unlabeled'' samples. 
Task 2: Use <math>f(x)</math> + a few labels to solve classification problem using linear models.
 
Note that in semi-supervised learning, you have unlabeled examples and a few labelled examples but you know what the task is. 
In self-supervised learning, we use ''structure'' in unlabeled data to create artificial supervised learning problems solved via deep models. 
In this process, the learning method ''hopefully'' will create internal representations for the data <math>f(x)</math> useful for downstream tasks.
 
===Image embedding===
Surprising observation for image embedding: 
[Gidaris ''et al.'' ICLR 2018] + [Zhang et al. 2019] 
# Rotate images and use the angle of rotation as labels (e.g. <math>\theta = 0, 90, 180, 270</math>).
# Train a CNN to predict the rotation angle from images.
# Use <math>f(x)</math> with linear classification models for the true labels.
;Why should <math>f(x)</math> be a good representation for images?
This is an open question.
 
===Contrastive Learning===
[Logeswaran & Lee ICLR 2018] use a text corpus (Wikipedia) to train a deep representation <math>f(x)</math>:  
<math>x, x^{+}</math> are adjacent sentences. 
<math>x, x^{-}</math> are random sentences.
 
Optimization:
<math>\min_{f} E[\log(1 + \exp(f(x)^T f(x^-) - f(x)^T f(x^+)))] \approx E[f(x)^T f(x^-) - f(x)^T f(x^+)]</math>.
This is known as contrastive learning.
 
Sentence embeddings capture human notions of similarities: 
E.g. "the tiger rules this jungle" is similar to "a lion hunts in a forest."
 
;Can we use contrastive learning to obtain powerful data representations for images? 
We need pairs of similar images and dissimilar images.
 
;SimCLR [Chen ''et al.'' 2020]
# Create two correlated views of an image <math>x</math>: <math>\tilde{x}_i</math> and <math>\tilde{x}_j</math>.
#* Random cropping + resize
#* Random color distortion
#* Random Gaussian blur
# Use base encoder (ResNet) to map <math>\tilde{x}_i,\tilde{x}_j</math> to embeddings <math>h_i, h_j</math>.
# Train a projection head <math>g()</math> (one hidden layer MLP) to map h's to z's which maximize the agreement between z's.
# Similarity: <math>sim(z_i, z_j) = \frac{z_i^t z_j}{\Vert z_i \Vert \Vert z_j \Vert}</math>
 
Randomly select <math>N</math> samples and add their augmentations to get 2N samples. 
Compute similarity matrix <math>S \in \mathbb{R}^{2N \times 2N}</math>. 
<math>S_{ij}=\exp(sim(z_i, z_j)) =
\begin{cases}
1 & \text{if }i=j\\
high & \text{if }{j=2k\text{ and }i=2k-1}\\
low & otherwise
\end{cases}
</math>
 
Training is <math>\min_{f,g} L = \frac{1}{N} \sum_{k=1}^{N} \frac{l(2k-1,2k) + l(2k, 2k-1)}{2}</math>.
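A NumPy sketch of this contrastive (NT-Xent style) loss on a batch of <math>2N</math> embeddings, where rows <math>2k</math> and <math>2k+1</math> are the two augmented views of image <math>k</math>; the temperature and placeholder embeddings are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def nt_xent_loss(Z, temperature=0.5):
    """Z has shape (2N, dim); rows 2k and 2k+1 are two views of the same image."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # cosine similarities
    S = Z @ Z.T / temperature                          # 2N x 2N similarity matrix
    np.fill_diagonal(S, -np.inf)                       # exclude self-similarity
    twoN = Z.shape[0]
    pos = np.arange(twoN) ^ 1                          # index of each row's positive pair
    log_prob = S[np.arange(twoN), pos] - np.log(np.sum(np.exp(S), axis=1))
    return -np.mean(log_prob)

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 16))    # N = 4 images, 2 views each (placeholder embeddings)
print(nt_xent_loss(Z))
</syntaxhighlight>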
 
Practical observations:
* Composition of data augmentation is important.
* Larger batch sizes and longer training help self-supervised learning!
* Optimization over the MLP projection head <math>g</math> helps!
** f is able to keep some information useful for the classification (e.g. color or orientation)
Empirical results:
* For ImageNet, top-1 accuracy increases as the number of parameters increases.
* After learning embeddings <math>f</math>, you don't need much labeled data for supervised-training.
 
===Theory of self-supervised learning===
;[Arora ''et al.'' 2019]
 
Modeling of similar pairs:
* <math>x \sim D_{C1}(x)</math>
* <math>x^+ \sim D_{C1}(x)</math> is semantically similar
* <math>x^- \sim D_{C2}(x)</math> negative samples
Downstream task:
* pairwise classification
** nature picks two classes C1, C2
** generate samples from C1 & C2 and evaluate the classification loss
* Assume <math>m \to \infty</math> so just look at population loss
 
Notations:
* <math>L_{sup}(f)</math> is the supervised learning loss with optimal last layer
* <math>L_{un}(f)=E_{(x, x^+) \sim D_{sim}, x^- \sim D_{neg}}[\log(1+\exp(f(x)^T f(x^-) - f(x)^T f(x^+)))]</math> is the unsupervised (logistic) contrastive loss
 
Result 1:
* <math>L_{sup}^{mean}(f) \leq \frac{1}{1-\tau}(L_{un}(f)-\tau)</math> for all <math>f \in F</math>
** <math>\tau</math> is the collision probability for a pair of random classes.
 
Ideal result:
* <math>L_{sup}(f^*) \leq \alpha L_{sup}(f)</math> for all <math>f</math>.
* In general, it is impossible to get this.
 
 
;[Tosh et al. 2020]
This work connects self-supervised learning with multi-view representation theory. 
Start with (x, z, y) where x,z are two views of the data and y is the label. 
* x and z should share redundant information w.r.t. y. I.e. predicting y from x or z individually should be equivalent to predicting y from both.
* <math>e_x = E(E[Y|X] - E[Y|X,Z])^2</math> and <math>e_z = E(E[Y|Z] - E[Y|X,Z])^2</math> should be small
* Formulate contrastive learning as an artificial ''classification'' problem:
** Classify between <math>(x, z, 1)</math> and <math>(x, \tilde{z}, 0)</math>.
** <math>g^*(x,z) = \operatorname{argmin}(\text{classification loss}) = \frac{P_{X,Z}(x,z)}{P_X(x) P_Z(z)}</math>.
 
* x, z have redundant information about Y, so if we first predict z from x and then use the result to predict Y, we should do well:
<math>
\begin{aligned}
\mu(x) &= E[E[Y|Z]|X=x] \\
&= \int E[Y|Z=z] P_{Z|X}(z|x) dz\\
&= \int E[Y|Z=z] g^*(x,z) P_Z(z) dz
\end{aligned}
</math>
 
Lemma: <math>E[(\mu(x) - E[Y | x,z])^2] \leq e_x + e_z + 2\sqrt{e_x e_z}</math>
 
Landmark embedding: <math>g^*</math> is computed via contrastive learning. 
How to embed x? <math>\phi(x) = (g^*(x, z_1),...,g^*(x,z_m))</math> 
z's are randomly generated from <math>P_Z</math>. 
Each <math>g^*</math> output is a single number.
 
* <math>\phi()</math> is good embedding if a linear classifier of <math>\phi</math> is approx <math>\mu(x)</math>.
** I.e. <math>\exists w : w^t \phi(x) \approx \mu(x)</math>
** <math>w^t \phi(x) = \frac{1}{m} \sum_{i=1}^{m} E[Y|Z_i] g^*(x, z_i) \stackrel{m \to \infty}{\rightarrow} \int E[Y|Z=z] g^*(x,z)P_{Z}(z) dz = \mu(x)</math>
 
Direct embedding 
Instead of learning a bivariate <math>g(x,z)</math>, learn <math>\eta(x)^T \psi(z)</math>. 
 
Lemma: For every <math>\eta: x \to \mathbb{R}^m</math> and <math>\psi: z \to \mathbb{R}^m</math>, there exists <math>w \in \mathbb{R}^m</math> such that <math>E[(w^t \eta(x) - \mu(x))^2] \leq E[Y^2] \epsilon_{direct}(\eta, \psi)</math> where <math>\epsilon_{direct}(\eta, \psi) = E[(\eta(x)^t \psi(z) - g^*(x,z))^2]</math>. 
Can we write <math>g^*(x,z)</math> as <math>\eta(x)^T \psi(z)</math>? 
* If there is a hidden variable H s.t. X and Z are conditionally independent given H.
# Case 1: If H is a discrete variable, then <math>g^*(x, z) = \eta^*(x)^t \psi^*(z)</math> exactly.
# Case 2: There exists <math>\eta,\psi</math> such that <math>E(\eta(x)^t \psi(z) - g^*(x,z))^2 \leq o(\frac{1}{m})</math>.
 
==Meta Learning==
In many applications, we don't have a large training dataset. However, humans can adapt and learn ''on the fly''. 
The key is to use prior knowledge in performing new tasks. 
How can we train AI/ML models similarly? 
Goal of meta learning: Train a model on different learning tasks such that it can solve new tasks using only a small number of training samples.
 
Few shot classification problem: 
Inputs: <math>D_{metatrain} = \{(D_i^{train}, D_i^{test})\}_{i=1}^{n}</math>. 
<math>
\begin{cases}
D_{i}^{train} = \{(x_j^i, y_j^i) \}_{j=1}^{k}\\
D_{i}^{test} = \{(x_j^i, y_j^i) \}_{j=1}^{k'}
\end{cases}
</math>
* <math>i</math> is the index of the task
* <math>j</math> is the index of the sample
 
K-shot classification means k training samples per label. Some papers use k as the size of the whole training set. 
Q: How to use <math>D_{metatrain}</math> to solve meta-test tasks more effectively?
 
* Use <math>D_{metatrain}</math> to learn meta parameters <math>\theta</math> such that:
* Base learner <math>A</math> outputs task-specific model parameters <math>\phi_i = A(D_{i}^{train}, \theta)</math> good for <math>D_{i}^{test}</math>.
* Training procedure:
** Loss: <math>\min_{\theta} \sum_{i=1}^{n} loss(D_{i}^{test}, \phi_i)</math> where <math>\phi_i = A(D_{i}^{train}, \theta)</math>.
* Test time: given a new task <math>(D^{train}, D^{test})</math>. Apply <math>A(D^{train}, \theta^*)</math> to get <math>\phi</math>.
 
[Finn et al. 2017] 
* Idea: train a model that can be easy to fine-tune on new tasks
* one-step fine tuning: <math>\theta - \alpha \nabla L(\theta, D_{i}^{train})</math> gives us <math>\phi_{i}</math>
* Evaluate <math>\phi_i</math> on <math>D_i^{test}</math>
* <math>\min_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \min_{\theta} \sum_{i=1}^{n} L(\theta - \alpha \nabla L(\theta, D_i^{train}), D_i^{test})</math>
 
1. Model-agnostic meta learning (MAML)
* Use GD to optimize over <math>\theta</math>
** <math>\nabla_\theta = \sum_{i=1}^{n}(\nabla_{\theta} \phi_i) \nabla_{\phi} L(\phi_i, D_i^{test})</math>
** <math>(\nabla_{\theta} \phi_i)</math> involves second-order derivatives which are expensive.
* First-order MAML: just ignore <math>\nabla_{\theta} \phi_i</math> term. Replace it with the identity matrix.
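A first-order MAML sketch on toy least-squares tasks, with the inner and outer gradients written analytically so no autodiff is needed; the task distribution, step sizes <math>\alpha, \beta</math>, and model are illustrative.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, alpha, beta = 5, 0.05, 0.01

def sample_task():
    """A toy regression task: y = w_task^T x + noise, split into D_train / D_test."""
    w_task = rng.normal(size=d)
    X = rng.normal(size=(20, d))
    y = X @ w_task + 0.1 * rng.normal(size=20)
    return (X[:10], y[:10]), (X[10:], y[10:])

def grad(w, X, y):
    # gradient of the mean squared error 1/(2n) * ||Xw - y||^2
    return X.T @ (X @ w - y) / len(y)

theta = np.zeros(d)
for step in range(1000):
    (Xtr, ytr), (Xte, yte) = sample_task()
    phi = theta - alpha * grad(theta, Xtr, ytr)   # one inner adaptation step
    # First-order MAML: treat d(phi)/d(theta) as the identity and use the gradient at phi.
    theta = theta - beta * grad(phi, Xte, yte)

print(theta.round(3))   # meta-learned initialization
</syntaxhighlight>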
 
Reptile [Nichol et al. 2018] is a related first-order method.
2. <math>A</math> is a simple linear/non-parametric learning on data embeddings computed via <math>f_{\theta}</math>
 
[Lee et al. 2019]
* <math>f_{\theta}</math> is used to compute embeddings
* <math>A</math> is a linear classifier (e.g. SVM)
* Use dual form of SVM so # of optimization variables = # training samples * # classes
 
3. <math>A</math> is a non-parametric learning
* Embedding: <math>\tilde{x} = f_{\theta}(x)</math>
[Snell et al. 2017]
* Define prototypes (cluster centers)
** <math>c_k = \frac{1}{|S_k|} \sum_{(x,y) \in S_k} f_{\theta}(x)</math>
** <math>P_{\theta}(y=k|x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}</math>
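A NumPy sketch of prototype-based classification given precomputed embeddings <math>f_{\theta}(x)</math>; the embeddings and episode sizes are synthetic placeholders.

<syntaxhighlight lang="python">
import numpy as np

def prototype_predict(support_emb, support_labels, query_emb, num_classes):
    """Compute class prototypes from the support set, then classify queries by a
    softmax over negative squared Euclidean distances to the prototypes."""
    protos = np.stack([support_emb[support_labels == k].mean(axis=0)
                       for k in range(num_classes)])
    d2 = np.sum((query_emb[:, None, :] - protos[None, :, :]) ** 2, axis=-1)
    logits = -d2
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
emb_dim, k_shot, num_classes = 16, 5, 3
support_labels = np.repeat(np.arange(num_classes), k_shot)
support_emb = rng.normal(size=(num_classes * k_shot, emb_dim)) + support_labels[:, None]
query_emb = rng.normal(size=(6, emb_dim)) + 2.0
print(prototype_predict(support_emb, support_labels, query_emb, num_classes).round(2))
</syntaxhighlight>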
 
4. <math>A</math> is a black box (e.g. LSTM).
 
==LSTMs and transformers==
Lecture 23 (November 19, 2020)
 
===Recurrent Neural Networks (RNNs)===
Hidden state: <math>h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)</math> 
Prediction at t: <math>y_t = W_{hy} h_{t}</math>
 
;Backpropagation through time
If <math>W</math> has largest singular value < 1, then gradient vanishes. 
If <math>W</math> has largest singular value > 1, then gradient explodes. 
Typically, gradient vanishes because of initialization of <math>W</math>.
 
===Long Short Term Memory (LSTMs)===
Goal is to solve the vanishing and exploding gradient problem.
 
LSTM has several gates:
* Input gate: <math>i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)</math>
* Forget gate: <math>f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)</math>
* Output Gate: <math>o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)</math>
* Candidate cell: <math>\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)</math>
* Cell state: <math>c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t</math>
* Hidden state: <math>h_t = o_t \odot \tanh(c_t)</math>
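A single LSTM cell step in NumPy following the gate equations above, with all four gates computed from one stacked weight matrix; the shapes and random weights are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [x_t; h_prev] to the stacked [i, f, o, c~] pre-activations."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])             # input gate
    f = sigmoid(z[H:2 * H])         # forget gate
    o = sigmoid(z[2 * H:3 * H])     # output gate
    c_tilde = np.tanh(z[3 * H:])    # candidate cell state
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 8, 16
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):   # a length-5 input sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
</syntaxhighlight>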
 
 
;Bidirectional RNNs
First LSTM takes input in correct order. 
Second LSTM takes input in reverse order. 
Concatenate outputs from both LSTMs.
 
===Attention===
Goal: To help memorize long source sentences in machine translation.
 
;Encoder-decoder attention
 
;Self-Attention
 
===Transformer===
;Positional encoding
 
;Self-Attention
Have queries, keys, and values. 
Multiply queries with keys, pass through a softmax, and then multiply by the values. 
This yields the attention of every word with respect to every other word. 
The original Transformer used <math>h=8</math> heads, giving 8 sets of queries, keys, and values.
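A NumPy sketch of scaled dot-product self-attention for a single head; the projection matrices and dimensions are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # attention of each token w.r.t. the others
    return A @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 32, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (seq_len, d_k)
</syntaxhighlight>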
 
;Architecture
Stack encoders.
 
==Interpretability==
 
;Interpretability Methods
* Built-in model interpretability
* Feature level interpretability
* Instance based explanations
We will focus on feature level interpretability.
 
===Feature Level Interpretability===
These are given through saliency maps. 
* Perturbation-based: Perturb the input to get another output and compute the difference. 
* Gradient-based
 
===Gradient-based Methods===
Take the derivative of the output with respect to the input.
 
;Limitations
* Too local and sensitive to slight perturbations
* Saturated outputs lead to unintuitive gradients
* Discontinuous gradients are problematic
 
;SmoothGrad
* Add Gaussian noise to the input and average the gradients, as in the sketch below.
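A sketch of SmoothGrad given any function that returns input gradients; the gradient oracle, noise level, and sample count below are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def smoothgrad(grad_fn, x, sigma=0.1, n_samples=50, seed=0):
    """Average input gradients over Gaussian perturbations of the input."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x)
    for _ in range(n_samples):
        total += grad_fn(x + sigma * rng.normal(size=x.shape))
    return total / n_samples

# Toy model f(x) = sum(x^2); its input gradient is 2x.
grad_fn = lambda x: 2.0 * x
x = np.linspace(-1.0, 1.0, 10)
print(smoothgrad(grad_fn, x))
</syntaxhighlight>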
 
;Integrated Gradients
* Average the gradients along path from baseline to input.
 
;DeepLift
* We don't care about gradient but the slope relative to the ''reference'' state
 
;Limitations
* Models must be able to compute the gradient of the output with respect to the input
* Interpretation of neural networks is fragile
** Saliency maps are uninterpretable for adversarial examples on clean models and adversarially trained models.
* Needs white-box gradient access to the model.
 
===Evaluation of interpretability methods===
* Human evaluation
** Can humans evaluate saliency?
* Accuracy drop after removing ''salient'' features
* Sanity checks
** Model parameter randomization test - compare the output of the saliency method on a trained vs. an untrained (randomly initialized) model to make sure saliency depends on the model parameters.
* Synthetic Data
* Data randomization test
** Train on random labels and see if saliency depends on relationship between input & output.
 
;Temporal Saliency Rescaling
* If you remove this feature, how is the gradient going to change?
 
==Reinforcement Learning==
Lecture 27 (December 1, 2020)
 
;Goal: Sequential decision making problems
* Current decisions can affect future decisions
Examples: games, robotics, self-driving, finance
 
Agent takes an action <math>a_t</math> in the environment which leads to a new state <math>s_{t+1}</math>. 
 
;Definitions
* <math>S</math> is the state space. This can be discrete or continuous.
* <math>O</math> are the observations.
* <math>A</math> is the set of actions which can be discrete or continuous.
* <math>P</math> are the transition probabilities which model the environment.
 
Example:
* <math>|S|</math> = 2
* <math>|A|</math> = 2
In general, <math>S_t \sim P(S_t=s | s_0, a_0, s_1, a_1, ..., a_{t-1})</math>. 
The given variables are the trajectory <math>T_{t-1}</math>. 
Modeling this can be very complex for large <math>T</math>. 
Thus, we apply a Markov assumption:
* <math>S_t \sim P(S_t=s | a_{t-1}, s_{t-1})</math>
 
Objective: To maximize the rewards
* <math>R_t \stackrel{\triangle}{=} R(s_t, a_t)</math>
Two regimes for rewards:
* Finite horizon: T is finite
* Infinite horizon: T goes to infinity
* A discount factor <math>\gamma</math> is used to keep the infinite-horizon sum finite
 
===Classical RL===
Finite Markov Decision Process (MDP)
* S: Finite number of states
* A: Finite number of actions
* <math>R_t = R(a_t, S_t)</math>
* <math>\gamma</math>: discount factor
 
Goal: Choose actions to maximize the total rewards. 
* Policy function <math>\pi</math> determines how actions should be taken.
* The policy function could yield a dist on actions or can be deterministic. We will focus on deterministic actions.
* Assume this is time invariant.
* Value function determines the expected reward if we start at state s and follow policy <math>\pi</math>.
* Q function (action-value function)
** <math>Q_\pi (s, a) = E[R_0 + \gamma R_1+... | A_0=a, S_0=s, \pi]</math>
* If we have <math>\pi</math>, can we evaluate <math>V_{\pi}(s)</math>?
** <math>V_{\pi}(s) = E[R_0 + \gamma R_1 + ... | S_0 = s, \pi]</math>
* Use Bellman's equation
** <math>V_{\pi}(s) = R(s) + \gamma \sum_{s'} P_{\pi}(s' | s) V_{\pi}(s')</math>
** <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1}R</math>
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric so eigenvalues can be complex. 
All eigenvalues satisfy <math>| \lambda_i | \leq 1</math>, with at least one eigenvalue of norm exactly 1. 
This implies <math>I - \gamma P_{\pi}</math> is always invertible. 
However, computing the inverse is computationally expensive since it scales with <math>|S|^3</math>.
 
===Value-Iteration (Bellman Recursion)===
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math> 
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math> 
<math>v_{\pi} = L_{\pi} v_{\pi}</math>, so <math>v_{\pi}</math> is a fixed point of <math>L_{\pi}</math>. 
 
;Claim: <math>L_{\pi}</math> is a contraction.
Proof: Take <math>v_1, v_2 \in \mathbb{R}^d</math>. 
<math>
\begin{aligned}
L_{\pi}v_1 - L_{\pi}v_2 &= (R + \gamma P_{\pi}v_1) - (R + \gamma P_{\pi}v_2) \\
&= \gamma P_{\pi}(v_1 - v_2)
\end{aligned}
</math> 
<math>
\begin{aligned}
\Vert L_{\pi}v_1 - L_{\pi}v_2 \Vert &= \Vert \gamma P_{\pi}(v_1 - v_2) \Vert\\
& \leq \gamma \Vert P_{\pi} \Vert_{\infty} \Vert v_1 - v_2 \Vert \\
& \leq \gamma \Vert v_1 - v_2 \Vert
\end{aligned}
</math> 
By Banach's Fixed Point Theorem, we can converge to the fixed point iteratively by repeatedly applying <math>L_{\pi}</math>.
 
===Optimal Policy===
Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>.
 
Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>. 
The optimal policy may not be unique, but all optimal policies share the same value function.
 
Intermediate questions:
* Given <math>v^*</math>, can we compute <math>Q^*</math>?
<math>
\begin{aligned}
Q^*(s, a) &= \max_{\pi} Q_{\pi}(s,a)\\
&= \max_{\pi} \left[ R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s') \right]\\
&= R(s) + \gamma \sum_{s'} P(s' | s, a) \max_{\pi} v_{\pi}(s')\\
\end{aligned}
</math>
 
* Given <math>Q^*</math>, can we compute <math>v^*</math>?
Bellman's optimality states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>. 
<math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>
 
;How to compute optimal policies?
First approach: Value Iteration
<math>V^*(s) = \max_{a} Q^*(s, a) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^*(s')]</math> 
Define an operator: <math>L(v) = \max_{\pi}[R + \gamma P_{\pi} v]</math> 
<math>v^*</math> is a fixed point of <math>L(v)</math>. 
Claim: <math>L</math> is also a contraction. 
Thus <math>v^*</math> can be computed by repeated application of <math>L</math>. 
 
===Value Iteration===
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{r}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math>
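A NumPy sketch of value iteration on a small randomly generated MDP; the transition model and rewards are synthetic.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 5, 3, 0.9, 1e-8
P = rng.random(size=(A, S, S))
P /= P.sum(axis=2, keepdims=True)    # P[a, s, s'] = p(s' | s, a)
R = rng.random(size=S)               # reward R(s)

v = rng.random(size=S)               # random v^(0)
while True:
    Q = R[None, :] + gamma * (P @ v) # Q[a, s] = R(s) + gamma * sum_s' p(s'|s,a) v(s')
    v_new = Q.max(axis=0)
    if np.linalg.norm(v_new - v) <= eps:
        break
    v = v_new

pi_star = Q.argmax(axis=0)           # greedy policy w.r.t. the converged values
print(v_new.round(3), pi_star)
</syntaxhighlight>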
 
===Policy Iteration===
Directly optimize policy.
 
* Start with a random policy <math>\pi^{(0)}</math>
* Evaluate the policy <math>v_{\pi}</math> using closed form or value iteration
* Evaluate the Q-Function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s'|s,a) V_{\pi}(s')</math>
* Find the best action to be taken at state <math>s</math>:
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math>
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math>
* Repeat until convergence.
 
Policy iteration is guaranteed to converge to an optimal policy. 
It often converges faster than value iteration.
 
===Deep Reinforcement Learning===
 
;Relaxing some unrealistic assumptions
# Evaluate <math>v_{\pi}(s)</math>
#* <math>Q_{\pi}(s, a) = R(s, a) + \gamma E_{s' \sim P(s'|s,a)} v_{\pi}(s')</math>
# Improve the policy
#* <math>\operatorname{argmax}_{a_t} Q_{\pi}(s_t, a_t)</math>
#* Assumption <math>|S|=d</math>
# How to represent <math>V(s)</math>?
#* Can we use a neural network to represent <math>V(s)</math>? Yes
 
;How to train <math>v_{\phi}</math>?
* Start with an old <math>v_{\phi}</math>, compute <math>Q_{\pi}(s,a)</math>. 
** <math>Q_{\pi}(s,a) = R(s,a) + \gamma E[v_{\phi}^{old}(s')]</math> 
* Fit <math>v_{\phi}</math> to <math>\max_{a}Q_{\pi}(s,a)</math> using a quadratic loss: 
** <math>L(\phi) = \frac{1}{2} \Vert V_{\phi}(s) - \max_{a} Q_{\pi}(s,a) \Vert^2</math> 
* Iterate
 
;Similarly we can parameterize the Q function
Compute the target <math>y_i \leftarrow R(s_i, a_i) + \gamma E[v_{\phi}(s_i')]</math> 
We need to know transition probabilities <math>P(s'|s,a)</math> to compute the expectation (model-based RL).
 
With model-free RL: 
We can approximate the expectation with a single sample: <math>E[v(s_i')] \approx v(s_i') = \max_{a'} Q(s_i', a')</math> 
This is called ''Q-Learning''.
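A sketch of the corresponding tabular Q-learning update with epsilon-greedy exploration; the toy environment, learning rate, and exploration rate are illustrative.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, lr, eps = 5, 3, 0.9, 0.1, 0.2
Q = np.zeros((S, A))

# A toy random environment standing in for the unknown dynamics.
P = rng.random(size=(S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(size=(S, A))

s = 0
for step in range(5000):
    a = rng.integers(A) if rng.random() < eps else Q[s].argmax()   # epsilon-greedy action
    s_next = rng.choice(S, p=P[s, a])                              # sampled transition
    target = R[s, a] + gamma * Q[s_next].max()                     # max_a' Q(s', a')
    Q[s, a] += lr * (target - Q[s, a])                             # model-free update
    s = s_next

print(Q.argmax(axis=1))   # greedy policy from the learned Q
</syntaxhighlight>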
 
;What if we have continuous actions:
* Approach 1: Use a function class such that <math>\max_{a}Q(s,a)</math> is easy to solve
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer: <math>\operatorname{argmax}_{a'} Q(s,a')</math>
 
===Policy Gradient Method===
Lecture 29 (Dec 8, 2020) 
 
Probability of observing a trajectory: 
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math> 
Reward for a trajectory: 
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + ... + R(s_T, a_T)</math> 
The average reward is: 
<math>J(\theta) = E[R(\tau)] = \sum_t E[R(s_t, a_t)]</math>
 
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>.
 
Gradient of the average reward: 
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau \\
&= \int P_{\theta}(\tau) \nabla_{\theta} \log P_\theta(\tau) R(\tau) d\tau\\
&= E[\nabla_{\theta} \log P_\theta(\tau) R(\tau)]
\end{aligned}
</math>
 
<math>
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math> 
This implies 
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]\\
&\approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})\right) R(\tau^{(i)})
\end{aligned}
</math>
 
;Summary
* Sample trajectories
* Approximate <math>\nabla_{\theta} J(\theta)</math>
* <math>\theta \leftarrow \theta + \alpha \nabla J(\theta)</math>
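A sketch of this procedure (REINFORCE) for a state-independent softmax policy on a toy problem, where <math>\nabla_{\theta} \log \pi_{\theta}(a)</math> has a closed form; the rewards, horizon, and step size are illustrative.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
A, T, N, alpha = 3, 10, 64, 0.05
theta = np.zeros(A)                        # logits of a state-independent softmax policy
true_reward = np.array([1.0, 0.2, -0.5])   # toy expected reward per action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for it in range(200):
    grad = np.zeros(A)
    for _ in range(N):                     # sample N trajectories
        pi = softmax(theta)
        actions = rng.choice(A, size=T, p=pi)
        rewards = true_reward[actions] + 0.1 * rng.normal(size=T)
        R_tau = rewards.sum()
        for a_t in actions:                # grad log pi(a_t) = e_{a_t} - pi
            g = -pi.copy(); g[a_t] += 1.0
            grad += g * R_tau
    theta += alpha * grad / N              # theta <- theta + alpha * grad J(theta)

print(softmax(theta).round(2))             # mass should concentrate on the best action
</syntaxhighlight>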
 
;Intuition
<math>E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math> 
Formalizing ''trial & error''. 
[Finn & Levine, ICML]
 
===Issues with Policy Gradient===
;High variance of gradient estimation
 
;Solutions
* Subtract a baseline
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
\end{aligned}
</math>
 
;Some parameters can change <math>\pi_{\theta}</math> more than others so it's hard to choose a fixed learning rate.
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math>F</math> is the Fisher information matrix.
 
===Actor-critic algorithms===
Have an actor <math>\pi_{\theta}</math>. 
Have a critic <math>V_{\phi}/Q</math>
 
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
 
===Other topics in RL===
* Inverse RL
* Multi-agent RL
* Model-based RL
 
==Summary of Course==
;What we covered
* Supervised DL
* Unsupervised DL (GANs, VAEs)
* Self-supervised DL
* Meta-Learning
* Learning with Attention (Transformers)
* Deep RL
 
* Optimization
* Generalization
* Robustness
* Interpretability
 
;What we didn't cover
* Fairness
* Privacy & Ethics
* Bayesian DL
* Federated Learning
* Graph NNs


==Misc==
;Things which may be on the final
* Transformers
* Wasserstein distance
* Kernel methods


==Resources==