Deep Learning

* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
* [https://www.youtube.com/user/soheilfeiz/videos Lecture Videos]
==Basics==
A refresher of [[Machine Learning]] and [[Supervised Learning]].
===Empirical risk minimization (ERM)===
\begin{aligned}
\frac{1}{2}\Vert \nabla f(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)\\
This shows that Rademacher complexity and VC-dimension are not useful for explaining generalization for neural networks.
===Universal Approximation Theorem===
There exists a two-layer NN with ReLU activations and <math>2n+d</math> parameters that can represent any function on a sample of size <math>n</math> in <math>d</math> dimensions.
Belkin ''et al.''<ref name="belkin2019reconciling"></ref> observe that as models get more over-parameterized in the interpolation regime, test error will begin decreasing with the number of parameters. This is called ''double descent''.
;Intuition: In the over-parameterized regime, there are infinitely many solutions in the manifold of <math>f_{w^*}</math>.
For SGD, it is easier to find ''simple'' solutions (e.g. functions with small norms). This leads to better generalization.
===Can we analyze the double descent curve for some simple distributions or models?=== | |||
Setup: | |||
Our features are <math>x = (x_1,..., x_d)</math> where each <math>x_i</math> is drawn i.i.d. from a standard normal distribution.
Our labels are <math>y = x^t \beta</math>. This is the noise-free case. | |||
Our training set: <math>\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}</math> written as <math>X = | |||
\begin{pmatrix} | |||
(x^{(1)})^t\\ | |||
\vdots\\ | |||
(x^{(n)})^t | |||
\end{pmatrix} | |||
</math> | |||
Learning: We select <math>p</math> features: <math>T \subseteq [d]</math>, <math>|T| = p</math>, and fit a linear model <math>\beta^{*}_{T} \in \mathbb{R}^{p}</math>, <math>\beta^*_{T^c} =0</math>. Here <math>T^c</math> is the set of features we are not using.
Define <math>X_{T} = | |||
\begin{pmatrix} | |||
(x^{(1)}_T)^t\\ | |||
\vdots\\ | |||
(x^{(n)}_T)^t | |||
\end{pmatrix} \in \mathbb{R}^{n \times p}
</math> | |||
Quadratic loss: <math>\min_{\beta_T} \Vert X_T \beta_T - y \Vert_{2}^{2}</math>.
The optimal solution is <math>\beta_{T}^* = (X_{T}^t X_{T})^{-1} X_{T}^t y = X_{T}^{+} y</math> where <math>X_{T}^{+}</math> is the ''Moore–Penrose pseudo-inverse'' (in the over-parameterized case <math>p > n</math>, this gives the minimum-norm solution).
Since we know <math>P_{X, Y}</math>, we can compute the generalization error exactly. | |||
<math> | |||
\begin{aligned} | |||
E_{X,Y} \left[(y - x^t \beta^*)^2 \right] &= E \left[(x^t(\beta - \beta^*))^2\right]\\ | |||
&= E \left[(\beta - \beta^*)^t x x^t (\beta - \beta^*)\right]\\
&= (\beta - \beta^*)^t E \left[ x x^t \right] (\beta - \beta^*)\\
&= \Vert \beta - \beta^* \Vert^2\\
&= \Vert \beta_{T^c} \Vert^2 + \Vert \beta_{T} - \beta_{T}^* \Vert^2
\end{aligned} | |||
</math> | |||
;Theorem: Assume <math>\beta_{T^c} \neq 0</math>.
<math> | |||
E \left[(y - x^t \beta^*)^2 \right] = | |||
\begin{cases} | |||
\Vert \beta_{T^c} \Vert^2 (1 + \frac{p}{n-p-1}) & p \leq n-2\\
+\infty & n-1 \leq p \leq n+1\\
\Vert \beta_{T} \Vert ^2 (1 - \frac{n}{p}) + \Vert \beta_{T^c} \Vert^2 (1 + \frac{n}{p-n-1}) & p \geq n+2
\end{cases} | |||
</math> | |||
With ''prescient'' feature selection, we include features in <math>T</math> in decreasing order of <math>\beta_j^2</math> (e.g. <math>\beta_j^2 = \frac{1}{j^2}</math>). From this we get a double-descent behavior in the test error as <math>p</math> grows.
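The following is a small numpy sketch of this setup (the dimensions, sample sizes, and the <math>\beta_j^2 = 1/j^2</math> coefficient ordering are illustrative assumptions, not from the lecture); the test error spikes near the interpolation threshold <math>p \approx n</math> and decreases again in the over-parameterized regime.
<syntaxhighlight lang="python">
# Minimal double-descent experiment for the noise-free linear model above.
import numpy as np

d, n, n_test = 100, 40, 2000
rng = np.random.default_rng(0)
beta = 1.0 / np.arange(1, d + 1)           # beta_j = 1/j (prescient ordering)

X, Xte = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yte = X @ beta, Xte @ beta

for p in [10, 30, 39, 41, 60, 100]:
    T = np.arange(p)                        # keep the first p features
    b_T = np.linalg.pinv(X[:, T]) @ y       # Moore-Penrose least-squares fit
    pred = Xte[:, T] @ b_T
    print(f"p={p:3d}  test MSE = {np.mean((yte - pred) ** 2):.3f}")
</syntaxhighlight>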
===Related Works=== | |||
Jiang ''et al.''<ref name="jiang2019generalization"></ref> provide some empirical evaluations of different generalization bounds such as:
* Sharpness-based bounds (PAC-Bayesian) | |||
* Norm-based bounds | |||
==Neural Tangent Kernels (NTKs)== | |||
Beginning of Lecture 7 (Sept. 22, 2020) | |||
===Linear Regression=== | |||
Assume we have a dataset:<br> | |||
<math>\{(x_i, y_i)\}_{i=1}^{n}</math> | |||
<math>y_i \in \mathbb{R}</math><br> | |||
<math>x_i \in \mathbb{R}^d</math><br> | |||
<math>f(w, x) = w^t x</math> | |||
<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br> | |||
<math>\min_{W} L(w)</math><br> | |||
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w_t)</math> where the gradient is:<br>
<math>\nabla L(w_t) = -\sum_{i=1}^{n}(y_i - f(w_t, x_i)) \nabla_{w} f(w_t, x_i) = -\sum_{i=1}^{n}(y_i - f(w_t, x_i)) x_i</math>
===Kernel Method=== | |||
<math>x_i \in \mathbb{R}^d \to \phi(x_i) \in \mathbb{R}^{D}</math> with <math>D >> d</math> | |||
Suppose <math>d=3</math> and <math>x = | |||
\begin{bmatrix} | |||
x_1\\x_2\\x_3 | |||
\end{bmatrix} | |||
\to | |||
\phi(x)= | |||
\begin{bmatrix} | |||
x_1\\ | |||
x_2\\ | |||
x_3\\ | |||
x_1 x_2\\ | |||
x_1 x_3\\ | |||
x_2 x_3 | |||
\end{bmatrix} | |||
</math> | |||
<math>f(w, x) = w^t \phi(x)</math><br> | |||
Is this model linear in w? Yes!<br> | |||
Is this model linear in x? No!<br> | |||
<math>\min \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math><br> | |||
Apply GD or convex optimization. | |||
Q: What is the issue here? | |||
* <math>\phi</math> is fixed! | |||
* <math>D = O(d^k)</math> | |||
** For ImageNet, <math display="inline">d</math> is approx <math display="inline">10^5</math> so <math display="inline">D=O(10^{15})</math> | |||
;Kernel Trick: | |||
We may have a closed form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.<br> | |||
This is called the kernel function <math>K(x_i, x_j)</math> or kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.<br> | |||
<math display="inline">K</math> is a PSD matrix. | |||
Idea: In many cases without "explicit" comp of <math>\phi(x_i)</math>, we can compute <math>K(x_i, x_j)</math>. | |||
;Polynomial Kernels | |||
<math>K(x_i, x_j) = (1 + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math><br>
Here <math>D=O(d^k)</math> but <math>K(x_i, x_j)</math> is <math>O(d)</math>. | |||
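A minimal sketch of the kernel trick, assuming a kernel ridge regression setup with made-up data: the polynomial kernel is evaluated directly, so the <math>O(d^k)</math>-dimensional feature map is never formed.
<syntaxhighlight lang="python">
# Kernel ridge regression with the polynomial kernel K(x, x') = (1 + x^T x')^k.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, lam = 200, 5, 3, 1e-3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]            # some nonlinear target

K = (1.0 + X @ X.T) ** k                            # n x n kernel matrix, O(n^2 d)
alpha = np.linalg.solve(K + lam * np.eye(n), y)     # dual coefficients

x_new = rng.standard_normal(d)
k_new = (1.0 + X @ x_new) ** k                      # kernel between x_new and training set
print("prediction:", k_new @ alpha)
</syntaxhighlight>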
Many classical techniques can be ''kernelized'': | |||
* SVM to Kernel SVM | |||
* Ridge regression to Kernel ridge regression | |||
* PCA to Kernel PCA | |||
===Neural Networks=== | |||
Consider a two-layer neural network.<br> | |||
We can write the output as:<br> | |||
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br> | |||
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br> | |||
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w, x_i) - y_i) \nabla_w f(w_t, x_i)</math> | |||
# Init N(0,1) | |||
# Our weights update along a trajectory: w(0), w(1), ... | |||
# Each <math>w</math> is a weight matrix. | |||
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of the gradient descent is ''almost'' static. | |||
This is called ''lazy'' training. | |||
* Not always the case! Especially for small <math>m</math>. | |||
Since the change in the model weights are not large, we can write the first-order taylor approximation:<br> | |||
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0) + ...</math><br>
This model is linear in <math>w</math>.<br> | |||
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br> | |||
The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK). | |||
These features will not change during the optimization process because we use <math display="inline">w_0</math> | |||
Go back to our 2-layer NN:<br> | |||
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum b_i \sigma(a_i^t x)</math><br> | |||
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br> | |||
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math> | |||
<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br> | |||
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^tx) \sigma'(a_i^tx') (x x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math> | |||
* <math>a_i</math> and <math>b_i</math> are independent samples at initialization. | |||
Based on law of large numbers, as m goes to infinity,<br> | |||
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x x') \right]</math><br> | |||
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math> | |||
<math>K^{(a)}(x, x') = \frac{(x x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta(x, x')) \cos(\theta) + \sin \theta)</math> | |||
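A quick numpy check (with arbitrary vectors and widths, an illustrative assumption) that the empirical kernel <math>K_m^{(b)}</math> of a randomly initialized two-layer ReLU network approaches the arc-cosine expression above as the width <math>m</math> grows.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d = 10
x, xp = rng.standard_normal(d), rng.standard_normal(d)

cos_t = x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))
theta = np.arccos(np.clip(cos_t, -1, 1))
# Analytic limit with a ~ N(0, I_d), so E[||a||^2] = d cancels the 1/d factor.
K_exact = (np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi)
           * ((np.pi - theta) * np.cos(theta) + np.sin(theta)))

for m in [100, 10_000, 1_000_000]:
    A = rng.standard_normal((m, d))                  # a_i ~ N(0, I_d)
    K_m = np.mean(np.maximum(A @ x, 0) * np.maximum(A @ xp, 0))
    print(f"m={m:8d}  K_m={K_m:.4f}  (exact {K_exact:.4f})")
</syntaxhighlight>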
;Q: When is this taylor approximation good?<br> | |||
If the Hessian has bounded eigenvalues. (Hessian Control) | |||
;Analyze GD: | |||
<math>\eta \to 0</math> Gradient-flow<br> | |||
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br> | |||
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math> | |||
<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math><br> | |||
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w} \hat{y}(w(t))^T \frac{dw(t)}{dt} \\
&= - \nabla_{w} \hat{y}(w(t))^T \nabla_{w} \hat{y}(w(t)) (\hat{y}(w(t)) - y)\\
&\approx -K(w_0) (\hat{y}(w(t))-y)
\end{aligned}
</math>
If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>.<br>
This ODE implies <math>u(t) = u(0)\exp(-K(w_0)t)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0 </math> (positive definite). | |||
* SOTA NNs often outperform kernel methods (even based on NTKs) | |||
==Adversarial Robustness== | |||
Beginning of Lecture 8 (Sept. 24 2020) | |||
In standard ERM, we have a dataset: | |||
<math>\{(x_i, y_i)\}_{i=1}^{n}</math> where <math>x_i \in \mathbb{R}^d</math> and <math>y_i \in [c]</math> | |||
We optimize <math>\min_{\theta} E \left[ l(f_{\theta}(x), y) \right] = \frac{1}{n} \sum l(f_{\theta}(x_i), y_i) </math>. | |||
;What is an ''adversarial example''? | |||
<math>x'</math> is an adversarial example for <math>x</math> under a model <math>f_{\theta}()</math> if | |||
* <math>f_{\theta}(x) \neq f_{\theta}(x')</math> and | |||
* <math>f_{human}(x) = f_{human}(x')</math>
One key challenge is that we don't have a precise definition of human perception. | |||
===Formulation=== | |||
Threat Models: <math>x' \in T(x, \epsilon)</math> | |||
Example: Lp model <math>\{x' | \Vert x' - x \Vert_{lp} \leq \epsilon \}</math> | |||
Non-Lp threat models: | |||
One problem with Lp threat models is that two perceptually identical images can have a large Lp distance. For example, slightly shifting or rotating an image barely changes it perceptually but can produce a large L2 difference.
An example of a non-Lp threat model is using Wasserstein difference. | |||
===Attacks=== | |||
The goal: given <math>x</math> and <math>f_{\theta}()</math>, compute <math>x'</math> such that
<math>c^* = f_{\theta}(x) \neq f_{\theta}(x')</math>.
* Targeted: Given a target label <math>t \neq c^*</math>, <math>f_{\theta}(x') = t</math>. | |||
* Untargeted: <math>c^*</math> is not given | |||
We mostly focus on targeted attacks. | |||
* White-box: Adversary knows everything | |||
* Black-box: Adversary has limited knowledge | |||
We mostly focus on white-box attacks. | |||
* Inference-time (evasion) attacks | |||
* Training-time (poison) attacks | |||
We focus on Inference-time attacks. | |||
Attack Optimization: | |||
<math>\max_{\delta} l_{cls}(f_{\theta}(x'), y)</math> such that <math>\Vert x - x' \Vert < \epsilon</math> | |||
How to solve this optimization? | |||
====Fast Gradient Sign Method (FGSM)==== | |||
For <math>p=\infty</math>: | |||
<math> | |||
\begin{aligned} | |||
\delta_{FGSM} &= \arg\max_{\Vert \delta \Vert_{\infty} \leq \epsilon} \langle \nabla_x l(f_{\theta}(x), y), \delta \rangle\\
&= \epsilon \operatorname{sign}(\nabla_x l(f_{\theta}(x), y))
\end{aligned} | |||
</math> | |||
Pros: | |||
* Fast because it is a one-step attack. | |||
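A hedged PyTorch sketch of FGSM; the toy model and random inputs are assumptions for illustration only.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step l_inf attack: x' = x + eps * sign(grad_x loss)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

# Example usage with a toy model on random "images":
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x, y = torch.rand(4, 1, 28, 28), torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y, eps=0.1)
</syntaxhighlight>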
====Projected Gradient Descent (PGD)==== | |||
This is an iterative attack using gradient descent. | |||
# Initialize <math>x^{(0)} = x</math>.
# At each iteration, take a gradient step <math>\tilde{x}^{(t+1)} = x^{(t)} + \eta g^{(t)}</math>, where <math>g^{(t)}</math> is the (sign of the) gradient of the loss at <math>x^{(t)}</math>.
# Then project back into the ball: <math>x^{(t+1)} = \pi(\tilde{x}^{(t+1)})</math>.
#* For <math>l_{\infty}</math>, the projection is an element-wise clamp: <math>x^{(t+1)} = clip(\tilde{x}^{(t+1)}, x - \epsilon, x + \epsilon)</math>
In general, PGD is a stronger attack than FGSM.
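A hedged PyTorch sketch of <math>l_\infty</math> PGD; the step size, iteration count, and the assumption that inputs live in <math>[0,1]</math> are illustrative choices.
<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps=0.1, eta=0.02, steps=20):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # gradient ascent step followed by projection onto the l_inf ball
        x_adv = x_adv.detach() + eta * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv
</syntaxhighlight>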
===Alternative formulation=== | |||
<math>\min \Vert \delta \Vert</math> such that <math>f_{\theta}(x + \delta) = t \neq c^*</math> | |||
Lagrange dual form: | |||
<math>\min \Vert \delta \Vert - \lambda l_{cls}(f_{\theta}(x + \delta), y=c^*)</math>
===Adversarial Training=== | |||
<math>\min_{\theta} E \left[ \max_{x'} l_{cls}(f_{\theta}(x'), y) \right]</math> | |||
How to solve it? | |||
* Inner max = attack problem (PGD) | |||
* Outer min = gradient descent
===Other heuristic/empirical defenses===
Beginning of Lecture 9, Sept. 29 2020 | |||
ICLR 2018: 6-7 empirical defenses | |||
Athalye et al (ICML 2018) show that obfuscated/masked gradients provide a ''false'' sense of security. | |||
Such defenses can be ''easily'' broken using ''adaptive'' attacks. | |||
Obfuscated gradients provide shattered gradients (e.g. non-differentiable op) or stochastic gradients (e.g. randomized network). | |||
;How to identify obfuscated or masked gradients? | |||
Check the following: | |||
* One step attack > iterative attack. | |||
* Black-box attack > white-box attack. | |||
* Unbounded attack (i.e. <math>\rho</math> can go to infinity) does not yield 100% success rate. | |||
* Increasing attack budget doesn't increase the success rate. | |||
If any of the above are true then your defense may be based on obfuscated or masked gradients. | |||
;How to attack defenses using gradient masking? | |||
* Defense based on adding a ''non-differentiable'' operator: | |||
Example 1: | |||
<math>\hat{f}(x) = f(g(x))</math> with <math>g</math> non-differentiable and non-smooth.
In the attack, treat <math>g</math> as the identity on the backward pass and differentiate <math>f</math> at <math>g(x)</math> (BPDA).
Example 2: | |||
Defense uses ''randomized'' classifier or stochastic gradients | |||
Just take the expectation over the randomization. | |||
==Are adversarial examples inevitable?== | |||
====Notations==== | |||
<math>S^{d-1} = \{x \in \mathbb{R}^d \mid \Vert x \Vert = 1\}</math>
Let the geodesic distance be denoted by <math>d_{g}</math>. | |||
This is the length of the shortest path on the sphere. | |||
On sphere: <math>d_{\infty}(x, x') \leq d_{2}(x, x') \leq d_{g}(x, x')</math>. | |||
Classification problem: <math>\{1,..., C\} = [c]</math> labels. | |||
Each class has a density function <math>\rho_{c}</math> which is bounded. | |||
Let <math>U_c = \sup_{x} \rho_c(x)</math> be the largest density we can get. | |||
The <math>\epsilon</math>-expansion of A: | |||
<math>A(\epsilon, d) = \{x \mid d(x,z)\leq \epsilon \text{ for some } z \in A\}</math>. | |||
====Isoperimetric Inequality==== | |||
Of all closed surfaces that enclose a unit volume, the sphere has the smallest surface area.
Very intuitive but difficult to prove. (Osserman et al 1976)
;Lemma (Lévy & Pellegrino 1951, simplified by Talagrand 1995)
Consider a subset <math>A \subset S^{d-1} \subset \mathbb{R}^d</math> with normalized measure <math>\mu_1(A) \geq 1/2</math>.
Using the geodesic metric, the <math>\epsilon</math>-expansion <math>A(\epsilon)</math> is at least as large as the <math>\epsilon</math>-expansion of a half sphere. | |||
;Lemma (Milman & Schechtman 1986) | |||
<math>\mu_1(A(\epsilon)) \geq 1 - (\frac{\pi}{8})^{1/2} \exp(-\frac{d-1}{2} \epsilon^2)</math>
As d goes to infinity, the right hand side goes to 1. | |||
In a high-dimensional space, an epsilon expansion will cover almost the entire area. | |||
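A quick numeric illustration (with an arbitrary <math>\epsilon</math> and a few dimensions) of how fast this lower bound approaches 1:
<syntaxhighlight lang="python">
import numpy as np

eps = 0.1
for d in [10, 100, 1000, 10000]:
    bound = 1 - np.sqrt(np.pi / 8) * np.exp(-(d - 1) / 2 * eps ** 2)
    print(f"d={d:6d}  mu(A(eps)) >= {bound:.4f}")
</syntaxhighlight>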
====Theorem==== | |||
c-class classification | |||
<math>\rho_c, u_c</math> | |||
<math>v_c=\delta_{n-1} * u_c</math> is the max density relative to uniform density. | |||
<math>f_c = \mu_1 \{x \mid C(x)=c\}</math> is the area where the classifier <math>C</math> classifies as class c. | |||
Pick a class such that <math>f_c \leq \frac{1}{2}</math>. | |||
Sample a random point x from the true density <math>\rho_c</math>. | |||
With high probability, either: | |||
* x is misclassified or, | |||
* x has an <math>\epsilon</math>-close adversarial example. | |||
One of these will happen with probability <math>1-v_c (\frac{\pi}{8})^{1/2} \exp(-\frac{d-1}{2} \epsilon^2)</math>.
Proof: | |||
Consider the region with the correct classification: <math>R=\{x \mid C(x)=c\}</math>.
Here <math>\mu_1(R) = f_c \leq 1/2</math>.
Consider the complement <math>R^c</math>.
The area of the complement is <math>\mu_1(R^c) \geq \frac{1}{2}</math>.
The area of the epsilon expansion is <math>\mu_1(R^c(\epsilon)) \geq 1 - (\pi/8)^{1/2} \exp(-\frac{d-1}{2}\epsilon^2)</math>. Thus the ''safe zone'' is very small in high dimensions.
====Cubes==== | |||
Geometric isoperimetric inequalities do not exist for cubes.
However, algebraic inequalities exist. | |||
;Lemma
<math>A \subset [0, 1]^d</math> with <math>vol(A) \geq \frac{1}{2}</math>. Let <math>p^* = \min(p, 2)</math>. Then
<math>vol(A(\epsilon; d_p)) \geq 1 - \frac{\exp(-2 \pi d^{1-2/p^*} \epsilon^2)}{2 \pi \epsilon d^{1/2 - 1/p^*}}</math>
For <math>p \geq 2</math>: <math>vol(A(\epsilon; d_p)) \geq 1 - \frac{\exp(-2 \pi \epsilon^2)}{2 \pi \epsilon}</math>
For p=2, the diameter of the hypercube is <math>O(\sqrt{d})</math> so we should use <math>\epsilon \approx O(\sqrt{d})</math>. | |||
For p=infinity, the diameter of the cube is 1 so we should pick a constant <math>\epsilon</math>. | |||
This shows if you pick a random sample, there is a high probability of it being misclassified or there being an adversarial example within epsilon. | |||
====Are adversarial examples inevitable in practice?==== | |||
This is an ill-posed question. | |||
It depends on the data distribution, threat model, and hypothesis class. | |||
==Provable Defenses== | |||
There are 3 types of Lp defenses: | |||
* Curvature-based defenses | |||
* IBP and Convex defenses | |||
* Randomized smoothing
For non-Lp threat models:
* Patch Threat | |||
* Sparse Threat | |||
* Wasserstein Threat | |||
===Randomized Smoothing=== | |||
A smoothed classifier: <math>\bar{f}(x) = E_{\epsilon}[f(x+\epsilon)]</math>. | |||
The idea is that the decision boundary becomes smoother. | |||
Gaussian Smoothing for L2 attacks: | |||
;Theorem (Cohen et al., 2019) | |||
No adversarial example exists within the radius: | |||
<math>\frac{\sigma}{2}\left(\Phi^{-1}(p_1(x))-\Phi^{-1}(p_2(x))\right)</math> | |||
The proof is based on Neyman & Pearson lemma. | |||
;Theorem (Levine, Singla, Feizi 2019; Salman et al. 2019)
<math>\Phi^{-1}(\bar{f}(x))</math> is Lipschitz with constant <math>1/\sigma</math> | |||
The worst g is a stepwise function. Then <math>\Phi^{-1}(\bar{g})</math> is a linear function. | |||
For L2 attacks, you can use Gaussian noise. For L1 attacks, you can use Laplace noise. | |||
;Theorem (KLGF, ICML 2020) | |||
Using any symmetric i.i.d. smoothing, | |||
<math>r_{p}^* \leq \frac{\sigma}{2 \sqrt{2} d^{1/2 - 1/p}}\left(\frac{1}{\sqrt{1-p_1(x)}} + \frac{1}{\sqrt{p_2(x)}}\right)</math> | |||
If we use Gaussian smoothing against Lp attacks, we get: | |||
<math>r_p = \frac{\sigma}{2d^{1/2 - 1/p}}\left( \Phi^{-1}(p_1(x)) - \Phi^{-1}(p_2(x)) \right)</math>
This shows that Gaussian smoothing is optimal (up to a constant) within i.i.d. smoothing distributions against Lp attacks. | |||
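A hedged sketch of Gaussian randomized smoothing with Monte-Carlo estimation of the top-two class probabilities and the certified <math>l_2</math> radius from the theorem above; the model, <math>\sigma</math>, and sample count are illustrative assumptions.
<syntaxhighlight lang="python">
import torch
from scipy.stats import norm

def certify(model, x, sigma=0.25, n_samples=1000, num_classes=10):
    with torch.no_grad():
        noise = sigma * torch.randn(n_samples, *x.shape)
        preds = model(x.unsqueeze(0) + noise).argmax(dim=1)
    counts = torch.bincount(preds, minlength=num_classes).float() / n_samples
    p1, p2 = counts.topk(2).values.tolist()
    p1 = min(p1, 1 - 1e-6)                 # avoid an infinite radius when p1 = 1
    p2 = max(p2, 1e-6)
    radius = sigma / 2 * (norm.ppf(p1) - norm.ppf(p2))
    return counts.argmax().item(), radius
</syntaxhighlight>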
===Sparse Threat=== | |||
Here the adversary can change up to <math>\rho</math> pixels in the image. | |||
The idea is to classify each example based on only k random pixels in the image. This is performed several times and a voting scheme determines the final label.
;Theorem (Levine, F. AAAI 2020) | |||
For inputs <math>x</math> and <math>x'</math> with <math>\Vert x - x' \Vert_{l_0} \leq \rho</math>, for all i | |||
<math>\vert p_i(x) - p_i(x')\vert \leq \delta</math> where <math>\delta = 1 - \frac{\binom{d-\rho}{k}}{\binom{d}{k}}</math>. | |||
Robustness vs Accuracy Trade-off:
Increasing <math>k</math> boosts classification accuracy but also increases <math>\delta</math>.
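A small illustration of the bound: <math>\delta</math> as a function of <math>k</math>, for an assumed input size and perturbation budget.
<syntaxhighlight lang="python">
# delta = 1 - C(d-rho, k)/C(d, k): the probability that a random k-pixel
# subset hits at least one of the rho perturbed pixels.
from math import comb

d, rho = 784, 10                      # e.g. an MNIST-sized input, 10 perturbed pixels
for k in [5, 20, 50, 100]:
    delta = 1 - comb(d - rho, k) / comb(d, k)
    print(f"k={k:3d}  delta={delta:.3f}")
</syntaxhighlight>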
===Relationship between Threat Models=== | |||
Use a neural perceptual threat model to approximate the true perceptual distance. | |||
Use LPIPS as <math>d_{neural}(x, x') = \Vert \phi(x) - \phi(x') \Vert</math> where <math>\phi</math> are normalized feature maps. | |||
Our attack optimization is now: | |||
<math> | |||
\begin{aligned} | |||
\max_{x'} &l_{cls}(f(x'), y)\\ | |||
& d_{neural}(x, x') \leq \rho | |||
\end{aligned} | |||
</math> | |||
From this, we get Perceptual Projected Gradient Descent (PPGD) and Lagrangian Perceptual Attacks (LPA). | |||
We also get Perceptual Adversarial Training (PAT). | |||
==Poisoning Attacks and Defenses== | |||
Another type of adversarial robustness. | |||
So far, we train on training data using SGD and do adversarial attacks at inference time. | |||
However, deep learning models require a large amount of training data which makes it hard to manually verify or trust training samples. | |||
In this case, an adversary can do ''data poisoning'' by perturbing some of the training samples. | |||
;Question: What is the goal of data poisoning? | |||
* To reduce the test time accuracy? | |||
** Simple to mitigate by looking at the performance for validation set. | |||
* Targeted misclassification: to cause one or more target samples to be misclassified as another class. | |||
*: '''This is what we focus on''' | |||
Given clean/base images: <math>\{(x_i, y_i)\}_{i=1}^{n}</math>, create poison images <math>\{(x_i^{(P)},y_i^{(P)})\}_{i=1}^{J}</math>.
Our new training set is: <math>\{(x_i, y_i)\}_{i=1}^{n} \cup \{(x_i^{(P)},y_i^{(P)})\}_{i=1}^{J}</math>.
We train using SGD on some model <math>f</math>. | |||
The goal is to make <math>f(x_t)</math> produce the wrong label. | |||
===Naive attack=== | |||
<math>x_t</math> is a cat. Our goal is that <math>f(x_t)</math> predicts dog.
One way is to add multiple copies of <math>x_t</math> with the wrong label, dog.
For a sufficiently large model, it will then predict dog for <math>x_t</math>.
This is called ''flooding'' the training set. | |||
This is not too concerning because a simple filtering or outlier detection can identify poison samples. | |||
Two types of attacks | |||
* backdoor attacks | |||
* triggerless attacks | |||
===Backdoor attacks=== | |||
The idea is to add a ''trigger'' or watermark to the image to make the model misclassify it.
Gu ''et al'' (2017) <ref name="gu2017badnets"></ref> randomly select a small portion of training set, apply a backdoor trigger, and ''change the label to the target label''. | |||
However this is not a clean attack because you need to change the labels. | |||
Turner ''et al.'' craft clean-label backdoor attacks.
Here they take examples <math>x_j^{(b)}</math> (e.g. airplane) and apply an adversarial perturbation to get <math>\tilde{x}_j^{(b)}</math>. | |||
The adversarial perturbation is obtained by training a network <math>g</math>. | |||
By the ''transferability'' of adversarial attacks, a new network <math>f</math> is likely to output a wrong label. | |||
Then they add a trigger to the image. | |||
Pros: You can use the trigger to poison several examples. | |||
Cons: There is a trigger. | |||
===Triggerless poison attacks=== | |||
Shafahi ''et al.'' introduce feature-collision attacks.
Suppose <math>g</math> is the feature extractor of a pretrained network and <math>f</math> is the model that will be trained on the poisoned data.
Suppose <math>x_t</math> is a cat and the network classifies cats and airplanes.
The idea is to apply adversarial perturbation to some base image to be close to the target in the feature space. | |||
If we train the model on the poisoned samples, the decision boundary is going to fold. | |||
<math> | |||
x_j^{(p)} = \operatorname{argmin}_{x} \Vert g(x) - g(x_t) \Vert_{l2}^{2} + \beta \Vert x - x_j^{(b)} \Vert _{l2}^2
</math> | |||
;Do these attacks actually work? | |||
Schwarzschild ''et al.'' test on multiple datasets and find that these attacks do not really work.
The attacks are heavily dependent on the particular setup. | |||
* Feature Collision attack: they assume <math>g</math> is trained with Adam, but the success rate drops when the victim trains with SGD.
* Data augmentation: success rates fall for different models and augmentations.
* For black-box attacks, success rate reduces | |||
* Success rate also depends on the size of the dataset. | |||
===Provable defenses against general poison attacks=== | |||
Levine and Feizi (2020) | |||
Consider a ''general'' poisoning attack where the attacker can insert or remove samples from the training set. | |||
We measure the '''attack magnitude''' by the symmetric difference between the clean and poisoned sets. | |||
Symmetric difference is defined as <math>A \ominus B = (A \setminus B) \cup (B \setminus A)</math>. | |||
Last lecture, we had provable defenses against ''sparse'' inference time attacks using randomized ablation. | |||
Deep Partition Aggregation (DPA): | |||
# Partition the training set into <math>k</math> partitions. | |||
#* Use a hash function <math>h</math> to deterministically define partition assignments for samples. The hash should only depend on <math>x</math> and not the labels <math>y</math>. | |||
# Train a classifier for each partition: <math>f_1,...,f_k</math>. | |||
# At test time, run <math>x_t</math> through every classifier and take the majority class. | |||
Let <math>K_1</math> be the number of base classifiers returning the plurality class <math>C</math>.
Let <math>K_2</math> be the number returning the runner-up class <math>C'</math>.
The gap is <math>\Delta = K_1 - K_2</math>.
To change the plurality from C to C', the adversary needs to change the output of at least <math>\Delta/2</math> base classifiers.
This is provably robust against any insertion/deletion of at most <math>\Delta/2</math> samples.
Both DPA and SS-DPA (semi-supervised DPA) are state-of-the-art against label flipping attacks. | |||
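A hedged sketch of the DPA procedure; the hash function, the base learner <code>fit_fn</code>, and the data layout are placeholder assumptions.
<syntaxhighlight lang="python">
import numpy as np
from collections import Counter

def dpa_train(X, y, k, fit_fn):
    # Deterministic, label-independent partition assignment.
    parts = np.array([abs(hash(x.tobytes())) % k for x in X])
    return [fit_fn(X[parts == j], y[parts == j]) for j in range(k)]

def dpa_predict(classifiers, x):
    votes = Counter(int(clf(x)) for clf in classifiers)
    (c, k1), (_, k2) = (votes.most_common(2) + [(None, 0)])[:2]
    # Return the plurality label and a certificate of roughly Delta/2 poisons.
    return c, (k1 - k2) // 2
</syntaxhighlight>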
==Variational Autoencoders (VAE)== | |||
Deep generative models: | |||
Given training data <math>\{x_i\}_{i=1}^{n}</math>. | |||
The goal is to ''generate'' realistic fake samples. | |||
Given a good generative model, we can do denoising, inpainting, domain transfer, etc. | |||
Probabilistic Model: | |||
Suppose our dataset is <math>\{x_i\}_{1}^{n}</math> with <math>x_i \in \mathbb{R}^d</math> | |||
# Generate latent variables <math>z_1,...,z_n \in \mathbb{R}^r</math> where <math>r \ll d</math>.
# Assume <math>X=x_i | Z = z_i \sim N \left( g_{\theta}(z_i), \sigma^2 I \right)</math>.
#* Here <math>g_\theta</math> is called the ''generator'' or ''decoder'' function. | |||
Q: How can we pick good model parameters <math>\theta</math>? | |||
Using maximum likelihood: | |||
<math> | |||
\begin{align*} | |||
\max_{\theta} \prod_{i=1}^{n} P(x_i; \theta) &= \max_{\theta} \sum_{i=1}^{n} \log P_{\theta}(x_i)\\
&= \max_{\theta} \sum_{i=1}^{n} \log \left( \int_{z} P(z) P_{\theta}(x_i|z) dz \right)\\
\end{align*} | |||
</math> | |||
This is hard to compute. | |||
Instead we calculate a lower bound and maximize the lower bound: | |||
<math> \max_{\theta} l(\theta) \geq \max_{\theta, \phi} J(\theta, \phi)</math> | |||
===ELBO / Variational lower bound=== | |||
<math> | |||
\begin{aligned} | |||
&P(x_i | z) = \frac{P(z | x_i) P(x_i)}{P(z)}\\ | |||
\implies& \log P(x_i | z) + \log P(z) = \log P(z | x_i) + \log P(x_i)\\ | |||
\implies& E_z[\log P(x_i)] = E[ \log P(x_i | z) + \log P(z) - \log P(z | x_i)] \\ | |||
\implies& \log P(x_i) = E_{z \sim q_i}[\log P_{\theta}(x_i | z)] + E[\log P(z)] - E[\log P(z|x_i)] + (E[\log q_i(z)] - E[\log q_i(z)])\\ | |||
\implies& \log P(x_i) = E_{z \sim q_i}[\log P_{\theta}(x_i | z)] + (E_q[\log q_i(z)]- E_q[\log P(z|x_i)]) - (E_q[\log q_i(z)] - E_q[\log P(z)])\\
\implies& \log P(x_i) = E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] + KL \left(q_i \Vert P(z|x_i) \right) - KL \left(q_i \Vert P(z) \right)\\ | |||
\end{aligned} | |||
</math> | |||
The second term, <math>KL \left(q_i \Vert P(z|x_i) \right)</math>, is hard to compute, so we drop it. Since a KL divergence is non-negative, this gives a lower bound.
Thus: | |||
<math>\log P(x_i) \geq E_{z \sim q_i} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)</math> | |||
Optimization: | |||
<math>\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)</math> | |||
<math>q(z|x) \sim N\left( f_{\phi}(x), \sigma^2 I \right)</math> | |||
Here, <math>f_{\phi}(x)</math> is called the encoder. | |||
The claim is that <math>KL \left(q_i \Vert P(z) \right)</math> is easier to compute: | |||
<math> | |||
\begin{align*} | |||
&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q} \left[\log P_{\theta}(x_i | z) \right] - KL \left(q_i \Vert P(z) \right)\\ | |||
=&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q} \left[ \log \exp(-\Vert x_i - g_{\theta}(z) \Vert^2 /(2\sigma^2)) - \log \exp(-\Vert z - f_{\phi}(x_i) \Vert^2 /(2\sigma^2)) \right]\\
=&\max_{\theta, \phi} \sum_{i=1}^{n} E_{z \sim q} \left[ -\Vert x_i - g_{\theta}(z) \Vert^2 /(2\sigma^2) + \Vert z - f_{\phi}(x_i) \Vert^2 /(2\sigma^2) \right]\\
\end{align*} | |||
</math> | |||
We use SGD to optimize <math>\theta, \phi</math>. | |||
Using the reparameterization trick, <math>z = \mu + \Sigma^{1/2}\epsilon</math> for <math>\epsilon \sim N(0, I)</math>. | |||
;ELBO | |||
<math>\max_{\theta, \phi} E_{z \sim q}[\log P(x|z)] - KL(q(z|x) \Vert P(z))</math> | |||
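A hedged PyTorch sketch of the negative ELBO with a Gaussian encoder/decoder and the reparameterization trick; the architecture sizes are illustrative assumptions.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d=784, r=20, h=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 2 * r))
        self.dec = nn.Sequential(nn.Linear(r, h), nn.ReLU(), nn.Linear(h, d))

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # z = mu + Sigma^{1/2} eps
        x_hat = self.dec(z)
        recon = ((x - x_hat) ** 2).sum(dim=-1)                     # E_q[-log p(x|z)] up to constants
        kl = 0.5 * (mu ** 2 + log_var.exp() - 1 - log_var).sum(dim=-1)  # KL(q(z|x) || N(0,I))
        return (recon + kl).mean()                                 # negative ELBO to minimize
</syntaxhighlight>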
Issue: Posterior collapse. | |||
In practice, sometimes the posterior <math>q</math> does not depend on x: <math>q(z|x) \approx q(z)</math>. | |||
===β-VAE=== | |||
VAEs have many design choices: | |||
* Prior distribution <math>P(z)</math> chosen to be normal. | |||
* Posterior distribution <math>q(z|x)</math> chosen to be <math>N(f(x), \sigma^2 I)</math>. | |||
However this often leads to blurry images. | |||
One way to address this is to increase the expressiveness of the prior and posterior distributions. | |||
This leads to Hierarchical VAEs. | |||
The idea is that latent variables are partitioned to disjoint groups: | |||
<math>z = \{z_1, ..., z_L\}</math> | |||
<math>P(z) = \prod_{l}P(z_l | z_{<l})</math> | |||
<math>q(z|x) = \prod_{l}q(z_l | z_{<l}, x)</math> | |||
Vahdat ''et al.'' create NVAE, which is a hierarchical VAE plus some tricks.
VQ-VAE (vector-quantized VAE) performs quantization of the latent space.
The quantization is non-differentiable, but gradients are copied through it (a straight-through estimator).
==Generative Adversarial Networks (GANs)== | |||
Given data <math>\{y_1,...,y_n\}</math>. | |||
The goal of the generator is to take random noise <math>\{x_1,...,x_n\}</math> and generate fake data <math>\{\hat{y}_1,...,\hat{y}_n\}</math>.
Then there is a discriminator which takes in <math>\{y_i\}</math> and <math>\{\hat{y}_i\}</math> and guides the generator.
In practice, both use deep neural networks. | |||
The optimization is <math>\min_{G} \max_{D} f(G, D)</math>. | |||
GAN training is challenging. | |||
Oftentimes, there are convergence issues. | |||
There can also be mode collapsing issues. | |||
Generalization can be poor and performance evaluation is subjective. | |||
A common approach for training GANs is using alternating gradient descent. | |||
However, this usually does not converge to <math>G^*</math>. | |||
===Reducing unsupervised to supervised=== | |||
;Formulating GANs | |||
Given <math>\{y_i\}</math> and <math>\{x_i\}</math>. | |||
We need to find a generator <math>G</math> s.t. <math>G(X) \stackrel{\text{dist}}{\approx} Y</math>. | |||
Given some data <math>\{y_i\}</math>, generate some randomness <math>\{x_i\}</math>. | |||
Create a ''coupling'' <math>\pi()</math> to create paired examples <math>\{(x_{\pi(i)}, y_i)\}</math>. | |||
Then we have: | |||
<math>\min_{\pi} \min_{G} \frac{1}{n} \sum_{i=1}^{n} l(\mathbf{y}_i, G(\mathbf{x}_{\pi(i)}))</math> | |||
We can replace the coupling with a joint distribution: | |||
<math>\min_{\mathbb{P}_{X, Y}} \min_{G} E_{\mathbb{P}_{X,Y}}[ l(Y, G(X))]</math>.
By switching the min and substituting <math>\hat{Y} = G(X)</math>: | |||
<math>\min_{G} \min_{\mathbb{P}} E_{\mathbb{P}_{X,Y}}[l(Y, \hat{Y})]</math>. | |||
The inner minimization is the optimal transport distance. | |||
====Optimal Transport (Earth-Mover)==== | |||
Non-parametric distances between probability measures. | |||
This is well-defined. | |||
Cost of ''transporting'' yellow to red points: | |||
<math>\min_{\pi} \frac{1}{n} \sum_{i=1}^{n} l(y_i, \hat{y}_{\pi(i)})</math>. | |||
If using l2, then <math>dist(P_{Y},P_{\hat{Y}}) = W(P_{Y}, P_{\hat{Y}})</math>. | |||
;Optimization | |||
The primal is <math>dist(P_Y, P_{\hat{Y}}) = \min_{\mathbb{P}} E[l(Y, \hat{Y})]</math>.
;WGAN Formulation | |||
The dual of <math>\min_{G} W_1(P_Y, P_{\hat{Y}})</math> is <math>\min_{G} \max_{D \text{ 1-Lipschitz}} \left[ E[D(Y)] - E[D(\hat{Y})] \right]</math>.
The Lipschitz constraint on the discriminator can be enforced by weight clipping.
===How to evaluate GANs?=== | |||
;Inception Score | |||
Use a pre-trained network (Inception-v3) to map a generated image to its probabilities. | |||
<math>IS(G) = \exp \left( E_{x \sim P_{\hat{X}}} KL( p(y|x) \Vert p(y) ) \right)</math> | |||
Mutual Information interpretation: | |||
<math>\log(IS(G)) = I(G(Z);y) = H(y) - H(y|G(z))</math> | |||
* The first term <math>H(y)</math> represents diverse labels. | |||
* The second term represents high confidence (low entropy of <math>p(y|x)</math>).
IS is misleading if it only generates one image per class. | |||
;FID Score | |||
Use a pre-trained network (Inception) to extract features from an intermediate layer. | |||
Then model the data distribution using multivariate Gaussian with mean <math>\mu</math> and covariance <math>\Sigma</math>. | |||
FID is Frechet Inception Distance. | |||
<math>FID(x, g) = \Vert \mu_{x} - \mu_{g} \Vert_2^2 + Tr(\Sigma_{x} + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2})</math>
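A hedged numpy sketch of the FID formula, assuming the Inception features of real and generated samples have already been extracted.
<syntaxhighlight lang="python">
import numpy as np
from scipy import linalg

def fid(feat_real, feat_gen):
    mu_x, mu_g = feat_real.mean(0), feat_gen.mean(0)
    sig_x = np.cov(feat_real, rowvar=False)
    sig_g = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(sig_x @ sig_g).real        # matrix square root
    return np.sum((mu_x - mu_g) ** 2) + np.trace(sig_x + sig_g - 2 * covmean)
</syntaxhighlight>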
===A Statistical Approach to GANs=== | |||
GANs do not have explicit probability models. | |||
This is in contrast to maximum-likelihood models like VAEs. | |||
GANs focus on minimizing distance between distributions. | |||
This yields high-quality samples but an inability to evaluate sample likelihoods.
VAEs maximize lower bound on likelihood. However, you get blurry samples. | |||
The key idea is to have an explicit model for the data: | |||
<math>f_{Y}(y|X=x) \propto \exp(-\ell(y, G(x))/\lambda)</math>
;Theorem (BHCF 2019) | |||
... | |||
Entropic GANs meet VAEs.
===Distributionally Robust Wasserstein=== | |||
Robust Wasserstein: | |||
<math> | |||
\begin{aligned} | |||
\min_{P_{\tilde{X}}, P_{\tilde{Y}}} | |||
\end{aligned} | |||
</math> | |||
==Min-max Optimization== | |||
Beginning of Lecture 14 (Oct. 15, 2020) | |||
<math> | |||
\DeclareMathOperator{\Tr}{Tr} | |||
\DeclareMathOperator{\VCdim}{VCdim} | |||
\DeclareMathOperator{\sign}{sign} | |||
\DeclareMathOperator{\rank}{rank} | |||
\DeclareMathOperator{\argmin}{argmin} | |||
\DeclareMathOperator{\argmax}{argmax} | |||
</math> | |||
;Problem | |||
<math>\min_{x \in X} \max_{y \in Y} f(x,y)</math>. | |||
* This is a zero-sum game. | |||
* Assume <math>f</math> is smooth and differentiable. | |||
;Goal | |||
Find <math>(x^*, y^*)</math> which is a global solution, saddle point, or equilibrium | |||
* <math>y^* \in \argmax f(x^*,y)</math> | |||
* <math>x^* \in \argmin f(x, y^*)</math> | |||
We know: | |||
<math>f(x^*,y) \leq f(x^*, y^*) \leq f(x, y^*)</math> | |||
===Simultaneous Gradient Descent (GDA)=== | |||
<math> | |||
\begin{cases} | |||
x_{t+1} = x_t - \eta \nabla_{x} f(x_t, y_t)\\ | |||
y_{t+1} = y_t + \eta \nabla_{y} f(x_t, y_t) | |||
\end{cases} | |||
</math> | |||
;Notes | |||
* <math>x, y \in \mathbb{R}^d</math> | |||
* In the constrained case, we need to project back onto <math>S</math>. | |||
Define <math>\theta = [x, y]</math>. | |||
Then we can write the above update equation as <math>\theta_{t+1} = \theta_{t} + \eta \overrightarrow{g}(\theta_t)</math> where | |||
<math>\overrightarrow{g}(\theta_t) = | |||
\begin{bmatrix} | |||
- \nabla_x f(x_t, y_t)\\ | |||
\nabla_y f(x_t, y_t) | |||
\end{bmatrix} | |||
</math> | |||
Or in other words, <math>\theta_{t+1} = F(\theta_t)</math>. | |||
We want repeated application of <math>F</math> to lead to <math>\theta^*</math> with <math>\theta^* = F(\theta^*)</math>.
Here, <math>\theta^*</math> is a fixed point of <math>F</math>. | |||
{{hidden | Example | | |||
Consider <math>\min_x \max_y f(x,y)</math> with <math>f(x,y) = xy</math>. | |||
This has a global saddle point at <math>(0,0)</math>.
Our update rule is: | |||
<math> | |||
\begin{pmatrix} | |||
x_{t+1} \\ y_{t+1} | |||
\end{pmatrix} | |||
= | |||
\begin{pmatrix} | |||
1 & -\eta\\ | |||
\eta & 1 | |||
\end{pmatrix} | |||
\begin{pmatrix} | |||
x_t \\ y_t | |||
\end{pmatrix} | |||
</math> | |||
Does this converge? | |||
We can check the magnitudes: | |||
<math> | |||
\begin{cases} | |||
\Vert x_{t+1} \Vert^2 = \Vert x_t \Vert^2 + \eta^2 \Vert y_t \Vert^2 - 2 \eta x_t y_t\\ | |||
\Vert y_{t+1} \Vert^2 = \Vert y_t \Vert^2 + \eta^2 \Vert x_t \Vert^2 + 2 \eta x_t y_t
\end{cases} | |||
</math> | |||
<math> | |||
\Vert x_{t+1} \Vert^2 + \Vert y_{t+1} \Vert^2 = (1 + \eta^2) (\Vert x_t \Vert^2 + \Vert y_t \Vert^2) | |||
</math> | |||
Thus, each update gets further and further from the origin. | |||
}} | |||
===Convex-concave min-max=== | |||
The min-max theorem by [Von Neumann (1928)]. | |||
* Suppose <math>X,Y</math> be compact/convex. | |||
* Suppose <math>f</math> is continuous and convex-concave. | |||
Then: | |||
* <math>\min_{x \in X} \max_{y \in Y} f(x,y) = \max_{y \in Y} \min_{x \in X} f(x,y)</math> | |||
* min-max optimal point is unique if f is strictly convex-concave otherwise a convex-set of solutions exists. | |||
;Notes | |||
* If <math>f</math> is non-convex concave, it doesn't hold. | |||
* bilinear core: <math>f(x,y) = x^t A y + b^t x + c^t y</math> | |||
[Von Neumann-Dantzig 1947] show that there is a strong connection between the min-max theorem and strong LP duality.
[Freund & Schapire 1999] show the convergence of Sim GDA in an ''average sense'':
Assume: | |||
* <math>f</math> convex in <math>x</math>, concave in <math>y</math>
* <math>S</math> convex, compact, and bounded
* <math>\eta \approx \frac{1}{\sqrt{T}}</math>
Then for Sim GDA: <math>f(\bar{x}, \bar{y}) \to f(x^*, y^*)</math> with rate <math>O(1/\sqrt{T})</math>, where <math>(\bar{x}, \bar{y})</math> are the averaged iterates.
No guarantees exist for the last iterate.
===Optimistic GDA (OGDA)=== | |||
Gradient descent with negative momentum: | |||
<math>x_{t+1} = x_t - \eta \nabla f(x_t) + \frac{\eta}{2} \nabla f(x_{t-1})</math> | |||
This technique by [Popov 1980] helps the convergence stability of GD. | |||
;OGDA | |||
<math> | |||
\begin{cases} | |||
x_{t+1} = x_t - \eta \nabla_x f(x_t, y_t) + \frac{\eta}{2} \nabla_x f(x_{t-1},y_{t-1})\\ | |||
y_{t+1} = y_t + \eta \nabla_y f(x_t, y_t) - \frac{\eta}{2} \nabla_y f(x_{t-1},y_{t-1}) | |||
\end{cases} | |||
</math> | |||
{{hidden | Example | | |||
<math>f(x, y) = xy</math> | |||
Using OGDA | |||
<math>
\begin{cases}
x_{t+1} = x_t - \eta y_t + \frac{\eta}{2} y_{t-1}\\
y_{t+1} = y_t + \eta x_t - \frac{\eta}{2} x_{t-1}
\end{cases}
</math>
This small change allows GDA to converge to <math>(0,0)</math>.
}} | |||
OGDA has ''last iter'' convergence and linear rates for unconstrained bilinear min-max. | |||
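A small numeric illustration (arbitrary starting point and step size, an assumption for demonstration) of Sim GDA diverging on <math>f(x,y)=xy</math> while OGDA converges to <math>(0,0)</math>.
<syntaxhighlight lang="python">
eta = 0.1
x, y = 1.0, 1.0                       # Sim GDA state
xo, yo, xp, yp = 1.0, 1.0, 1.0, 1.0   # OGDA state and previous iterate
for t in range(200):
    x, y = x - eta * y, y + eta * x
    xo_new = xo - eta * yo + 0.5 * eta * yp
    yo_new = yo + eta * xo - 0.5 * eta * xp
    xp, yp, xo, yo = xo, yo, xo_new, yo_new
print("Sim GDA |(x,y)|^2 :", x ** 2 + y ** 2)    # grows with every iteration
print("OGDA    |(x,y)|^2 :", xo ** 2 + yo ** 2)  # shrinks toward 0
</syntaxhighlight>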
===Nonconvex-nonconcave min-max opt=== | |||
The goal is to find a local saddle point. | |||
====Stability==== | |||
If we drift away from <math>(x^*,y^*)</math> then the optimization is unstable. | |||
If we remain close, the optimization is stable even if we never converge. | |||
;Asymptotic Stability | |||
If dynamics start close enough to <math>\theta^*</math> it remains close. | |||
If the dynamics converge to <math>\theta^*</math>, then <math>\theta^*</math> is locally asymptotically stable.
Recall <math>\theta_{t+1} = F(\theta_t) = \theta_t + \eta \overrightarrow{g}(\theta_t)</math>. | |||
Jacobian of <math>F</math>: <math>J(\theta) = I + \eta H(\theta)</math>,
where the Hessian is <math>H(\theta) = | |||
\begin{pmatrix} | |||
- \nabla_{xx}^2 f & -\nabla_{xy}^2 f\\ | |||
\nabla_{xy}^2 f & \nabla_{yy}^2 f\\ | |||
\end{pmatrix} | |||
</math> | |||
(Linear) stability: a fixed point <math>\theta^*</math> is stable if | |||
<math>| \lambda_{\max}(J(\theta^*)) | = \rho(J(\theta^*)) \leq 1</math>. | |||
Lemma: If <math>\theta^*</math> is linearly stable with <math>\rho(J(\theta^*)) < 1</math>, then it is asymptotically stable.
====Strongly local min-max==== | |||
Definition: | |||
<math> | |||
\begin{cases} | |||
\lambda_{min}(\nabla^2_{xx} f) > 0\\ | |||
\lambda_{max}(\nabla^2_{yy} f) < 0 | |||
\end{cases} | |||
</math> | |||
Simultaneous GDA: | |||
<math>H = | |||
\begin{pmatrix} | |||
- \nabla_{xx}^2 f & -\nabla_{xy}^2 f\\ | |||
\nabla_{xy}^2 f & \nabla_{yy}^2 f\\ | |||
\end{pmatrix} | |||
</math> | |||
Suppose <math>\theta^*</math> is a strongly local min-max. Then both diagonal blocks (<math>-\nabla^2_{xx} f</math> and <math>\nabla^2_{yy} f</math>) are negative definite.
Lemma: | |||
Eigenvalues of the hessian matrix will not have a positive real part: <math>Re(\lambda(H)) < 0</math>. | |||
Why? | |||
<math> | |||
\begin{pmatrix} | |||
A & B\\ | |||
-B^T & C | |||
\end{pmatrix} | |||
\begin{pmatrix} | |||
v \\ u | |||
\end{pmatrix} | |||
= | |||
\lambda | |||
\begin{pmatrix} | |||
v \\ u | |||
\end{pmatrix} | |||
</math> | |||
Multiplying the first block row by <math>v^H</math>, the second by <math>u^H</math>, and summing gives:
<math> | |||
\begin{aligned} | |||
&(v^H A v + u^H C u) + (v^H B u - u^H B^T v) = \lambda (\Vert v \Vert^2 + \Vert u \Vert^2)\\ | |||
\implies &Re(v^H A v + u^H C u) = Re(\lambda)(\Vert v \Vert^2 + \Vert u \Vert^2) < 0\\ | |||
\implies &Re(\lambda) < 0 | |||
\end{aligned} | |||
</math> | |||
For asymptotic stability, we need <math>| \lambda_{\max} (J) | < 1</math>. | |||
<math>J = I + \eta H</math> with learning rate <math>\eta > 0</math>.
;Lemma | |||
If H has eigenvalues with negative real parts, then the eigenvalues of J lie in the unit ball iff <math>\eta < \frac{1}{|Re(\lambda)|} \frac{2}{1+ (Im(\lambda)/Re(\lambda))^2}</math> for all eigenvalues <math>\lambda</math>.
The convergence rate of Sim GDA is proportional to the spectral radius <math>O(\rho(J)^t)</math>. | |||
Thus we want the spectral radius to be small (e.g. 1/2) for a fast convergence rate. | |||
Suppose we have a really large negative real part in the eigenvalue. | |||
Then <math>\eta</math> will need to be really small which we don't want. | |||
Similarly if <math>Im(\lambda)/Re(\lambda)</math> is large then we will have slow convergence. | |||
{{ hidden | Proof | | |||
Suppose <math>\lambda</math> is an eigenvalue of H. | |||
<math>\lambda = -a + ib</math> with <math>a > 0</math>. | |||
<math>J = I + \eta H</math>
<math> | |||
\begin{aligned} | |||
|1 + \eta \lambda|^2 &= | 1 + \eta (-a + ib)| ^2\\ | |||
&= |(1 - \eta a) + i(\eta b)|^2\\ | |||
&= 1 + \eta^2 a^2 - 2 a \eta + \eta^2 b^2 < 1\\ | |||
\implies & \eta(a^2+b^2) - 2a < 0\\ | |||
\implies & \eta < \frac{2a}{a^2 + b^2} = \frac{1}{a} \frac{2}{1+(b/a)^2} | |||
\end{aligned} | |||
</math> | |||
}} | |||
===Regularized GDA=== | |||
<math>\theta_{t+1} = \theta_t + \eta R(\theta_t) g(\theta_t)</math>
* <math>R(\theta)</math> is chosen so as not to change the fixed points of the original problem.
* Suppose <math>R(\theta)</math> is invertible. <math>\theta^*</math> is a fixed point iff <math>g(\theta^*) = 0</math>. | |||
Proof: | |||
(->) If <math>\theta^*</math> is a fixed point, then <math>0 = \eta R(\theta^*) g(\theta^*) \implies g(\theta^*) = 0</math>.
(<-) If <math>g(\theta^*)=0</math>, then <math>\theta_{t+1} = \theta^* + \eta R(\theta^*) \cdot 0 = \theta^*</math>, so <math>\theta^*</math> is a fixed point.
At such a point, <math>\tilde{J}(\theta^*) = I + \eta R(\theta^*) H(\theta^*)</math>, since the term involving <math>g(\theta^*)</math> vanishes.
<math>R(\theta) = I - \gamma H^T</math> | |||
<math>R(\theta) H(\theta) = (I-\gamma H^T)H = H - \gamma H^T H</math> | |||
<math>J = I + \eta H - \eta \gamma H^T H</math>
Thus this regularization pushes the real parts of the eigenvalues to be more negative so that the training will be more stable. | |||
This reduces <math>Im(\lambda)/Re(\lambda)</math> but may not necessarily help because it increases <math>Re(\lambda)</math>. | |||
===Approximate Local minmax=== | |||
Under continuity and differentiability assumptions, the set of fixed points is guaranteed to be non-empty (by Brouwer's fixed-point theorem); however, the set of local min-max points can be empty.
[Daskalakis & Panageas 2018, Adolphs et al. 2018]
In the unconstrained case, the strong local min-max set is a subset of the stable fixed points of GDA, which is a subset of the stable points of OGDA.
;Definition | |||
<math>(x^*, y^*)</math> is an <math>(\epsilon, \delta)</math> approximate local min-max if: | |||
<math>f(x^*, y) - \epsilon \leq f(x^*, y^*) \leq f(x, y^*) + \epsilon</math> | |||
<math>\forall (x,y) \in B_{\delta}(x^*, y^*)</math> | |||
Suppose <math>f</math> is G-Lipschitz and L-smooth.
For <math>\delta < \epsilon/G</math>, every point is a local min-max.
For <math>\delta > \sqrt{2\epsilon/L}</math>, a local min-max may not exist and is NP-hard to compute.
For the region in between, a local min-max is guaranteed to exist, and computing it is PPAD (conjectured super-polynomial) but not NP-hard.
;So what? | |||
For ERM non-convex minimization, we use GD to solve it. Over-parameterization helped us in terms of global convergence.
Would over-parameterization help in solving non-convex concave min-max? | |||
Yes. An ICLR 2021 paper by anonymous authors ''Understanding the role of over-parameterization in GANs''. | |||
;Theorem | |||
Consider a WGAN problem. | |||
Suppose the generator is one-hidden layer and the discriminator is linear. | |||
<math>\min_{G} \max_{D} f(G,D)</math> is non-convex concave optimization. | |||
If the generator is sufficiently over-parameterized then Sim GDA converges to a global min-max solution. | |||
==Flow-based Generative Models== | |||
Lecture 16 (October 22, 2020) | |||
Suppose we have a dataset <math>\{x_1,..., x_n\} \subset \mathbb{R}^d</math>. | |||
Our probabilistic model is: | |||
<math>z \sim N(0,I)</math> and <math>x=g_{\theta}(z)</math> where <math>g_{\theta}</math> is bijective (i.e. invertible).
We write <math>f = g_{\theta}^{-1}</math> for the inverse and assume both are differentiable.
Generation or sampling goes from <math>z</math> to <math>x</math>. | |||
Inference goes from <math>x</math> to <math>z</math>. | |||
Change of variables in 1d: <math>P(z)|dz| = P(x)|dx| \implies P(x) = P(z) \left|\frac{dz}{dx}\right|</math>.
In high dimensions: <math>P(x) = P(z) \left| \det\left(\frac{dz}{dx}\right) \right|</math>.
;Maximum Likelihood | |||
<math>P_{\theta}(x) = P(z) | det(\frac{dz}{dx})|</math>. | |||
<math> | |||
\begin{aligned} | |||
\max_{\theta} \frac{1}{n} \sum_{i=1}^{n} \log P_{\theta}(x_i) \\ | |||
= \min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \left[ -\log P_{\theta}(x_i) \right]\\ | |||
= \min_{\theta} \frac{-1}{n} \sum_{i=1}^{n} \left[ \log P(z_i) + \log | det(\frac{dz}{dx})| \right] | |||
\end{aligned} | |||
</math> | |||
;Issues | |||
* How to design a bijective function? | |||
* <math>det(J)</math> computation can be very expensive. | |||
;Idea | |||
* Come up with J (i.e. f/g mappings) such that det(J) is easy to compute. | |||
{{hidden | warm-up | | |||
If <math>x=Az+b</math> then <math>z=A^{-1}(x-b)</math>.
Here <math>J = A^{-1}</math>, and computing <math>\det(J)</math> is expensive (<math>O(d^3)</math> in general).
}} | |||
If <math>J</math> is a diagonal matrix then <math>det(J) = \prod_i J_{ii}</math>. | |||
An <math>f</math> function which is element-wise would have a diagonal jacobian. This is not very expressive. | |||
===Upper-triangular Jacobian=== | |||
RealNVP by [Dinh et al.] considers an upper-triangular Jacobian.
In this case, the determinant is just the product of the diagonal entries.
What does f/g look like? | |||
<math> | |||
\begin{cases} | |||
z_1 = x_1\\ | |||
z_2 = s_\theta(x_1) \odot x_2 + t_\theta(x_1)
\end{cases} | |||
</math> | |||
Now our jacobian is: | |||
<math> | |||
J = \begin{pmatrix} | |||
I & 0\\ | |||
\frac{dz_2}{dx_1} & diag(S_\theta) | |||
\end{pmatrix} | |||
</math> | |||
and <math>det(J) = \prod (S_\theta)_i</math>. | |||
Is this expressive enough? No. | |||
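A hedged PyTorch sketch of a RealNVP-style affine coupling layer (assuming an even input dimension and an arbitrary hidden width); the log-determinant is just the sum of the predicted log-scales.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, d, h=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d // 2, h), nn.ReLU(),
                                 nn.Linear(h, d))      # outputs (log s, t)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        z2 = x2 * torch.exp(log_s) + t                 # z1 = x1, z2 = s*x2 + t
        log_det = log_s.sum(dim=-1)                    # log|det J| = sum of log-scales
        return torch.cat([x1, z2], dim=-1), log_det

    def inverse(self, z):
        z1, z2 = z.chunk(2, dim=-1)
        log_s, t = self.net(z1).chunk(2, dim=-1)
        return torch.cat([z1, (z2 - t) * torch.exp(-log_s)], dim=-1)
</syntaxhighlight>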
===Composition of transformations=== | |||
<math>f = f_1 \circ f_2 \circ ... \circ f_k</math> | |||
A sequence of invertible transformations is called a normalizing flow.
Here, we have the following properties | |||
* <math>(f_1 \circ f_2)^{-1} = f_2^{-1} \circ f_1^{-1}</math> | |||
* <math>\nabla(f_2 \circ f_1)(x) = \nabla f_2(f_1(x)) \nabla f_1(x)</math> | |||
* <math>det(J_1 J_2) = det(J_1) det(J_2)</math> | |||
RealNVP uses two types of partitioning | |||
* Checkerboard | |||
* Channel partitioning | |||
In Glow [Kingma et al.], they use invertible 1x1 convolutions, which are equivalent to matrix multiplication.
They use the decomposition <math>W = P L (U+diag(s))</math>, so that <math>\log | \det(W)| = \operatorname{sum}(\log |s|)</math>.
===Dequantization=== | |||
Fitting a ''density'' function to discrete data can have crazy peaks. | |||
For dequantization, add a uniform to the data to get a more stable density function. | |||
<math> | |||
\begin{aligned} | |||
\log \int p(x_i +\delta) p(\delta) d\delta &= \log E_\delta [p(x_i + \delta)]\\
&\geq E_{\delta}[\log p(x_i + \delta)]\\
&\approx \log p(x_i + \delta)
\end{aligned} | |||
</math> | |||
* We have exact likelihood estimations. | |||
** We can use out-of-distribution anomaly detection. | |||
However, in practice after training on CIFAR, the likelihood of MNIST is higher.
This behavior is not specific to flow-based models. | |||
Suppose <math>P_{\theta}</math> is <math>N(0, I_d)</math>. | |||
Our typical example has <math>\Vert x_i \Vert^2 = O(d)</math>. | |||
Consider <math>x^{test} = 0</math> then <math>P_{\theta}(x^{test}) > P_{\theta}(x_1)</math>. | |||
==Domain Adaptation== | |||
So far, we have a training set <math>\{(x_i^{(train)}, y_i^{(train)})\}</math> from distribution <math>Q_{X,Y}</math>. | |||
We learn optimal parameters <math>\theta^*</math> via ERM. | |||
Then at test time, our test samples come from the same distribution <math>Q_{X,Y}</math>. | |||
However in practice, the training distribution can be different from the test distribution. | |||
The training distribution is the source domain. The test distribution is the target domain. | |||
;Examples | |||
Q may be synthetic samples and P may be real samples. | |||
Q contains samples with white background but P has samples with real backgrounds. | |||
In training: | |||
For the source domain, we have labeled samples <math>\{(x_i^S, y_i^S)\}_{i=1}^{m_S} \sim Q_{X,Y}</math>. | |||
For the target domain, we only have unlabeled samples <math>\{x_i^t\} \sim P_{X}</math>. | |||
This is ''unsupervised'' domain adaptation. | |||
In ''semi-supervised'' domain adaptation, your target samples are mostly unlabeled but contain a few labeled samples. | |||
If no target samples are available during training, the scenario is called ''domain generalization'' or ''out-of-distribution'' generalization. | |||
===Unsupervised domain adaptation=== | |||
Given <math>m_s = m_t = m</math>. | |||
* m labeled samples from source domain Q | |||
* m unlabeled samples from target P | |||
* This problem is ill-defined | |||
;Practical assumptions. | |||
# Covariate shift: P and Q satisfy the covariate shift assumption if the conditional label distribution doesn't change between source and target.
#* I.e. <math>P(y|x) = Q(y|x)</math> | |||
# Similarity of source and target marginal distributions. | |||
# If I had labeled target samples, the joint error (target + source samples) should be small. | |||
[Ben-David et al.] consider binary classification. | |||
;H-divergence | |||
<math>2 \sup_{h \in H}|Pr_{x \sim Q_{X}}(h(x)=1) - Pr_{x \sim P_{X}}(h(x)=1)| = d_{H}(Q_X, P_X)</math> | |||
;Lemma | |||
<math>d_{H}(Q_X, P_X)</math> can be estimated by m samples from source and target domains. | |||
If <math>VC(H)=d</math>, then with probability <math>1-\delta</math>:
<math>d_{H}(Q_X, P_X) \leq d_{H}(Q_{X}^{(m)}, P_{X}^{(m)}) + 4 \sqrt{\frac{d \log(2m) + \log(2/\delta)}{m}}</math> | |||
# Label source examples as 1 and label target samples as 0. | |||
# Train a classifier to classify source and target. | |||
# If loss is small then divergence is large. <math>\frac{1}{2} d_{H}(Q_X^{(m)}, P_X^{(m)}) = 1-loss_{class}</math> | |||
===Recap=== | |||
Beginning of Lecture 18 (Oct. 29, 2020) | |||
Given labeled examples from the source domain: <math>Q_{X,Y} = \{(x_i^S, y_i^S)\}_{i=1}^{m_s}</math>. | |||
Target domain: <math>P_{X} = \{x_i^t\}_{i=1}^{m_t}</math>. | |||
Learn a function <math>h \in H</math>. | |||
<math>\epsilon^T(h) = E_{(x,y) \sim P_{X,Y}}[ l(h(x), y) ]</math>. | |||
H-divergence: | |||
<math>d_H(Q_X, P_X) = 2\sup_{h \in H} | P_{Q}(h(x)=1) - P_{P}(h(x)=1)| = 2(1-loss_{classifier})</math>. | |||
This can be estimated by training a classifier to distinguish between train and test samples. | |||
Def: | |||
For the hypothesis class H, the ''symmetric difference hypothesis space'' <math>H \triangle H</math> is the set of disagreements between any two hypothesis in H: | |||
<math>H \triangle H = \{g(x) = h(x) \oplus h'(x) \mid \forall h, h' \in H\}</math>. | |||
;Main Result | |||
<math>H</math> is a hypothesis class with <math>VC(H)=d</math>. | |||
With probability <math>1-\delta</math>, <math>\forall h \in H</math>: | |||
<math>\epsilon_{T}(h) \leq \epsilon_{S}(h) + \frac{1}{2} d_{H \triangle H}(Q_X^{(m)}, P_X^{(m)}) + \epsilon_{joint}</math>. | |||
Target error ≤ source error + divergence + joint error.
===Practical Domain Adaptation Methods=== | |||
;Classical DA methods (pre deep learning) | |||
* Metric Learning | |||
* Subspace alignment | |||
* MMD-based distribution matching | |||
* Sample reweighting & selection | |||
;Modern DA methods (i.e. deep domain adaption) | |||
[https://arxiv.org/abs/1409.7495 Ganin & Lempitsky] | |||
The idea is to train an embedding function using an adversarial domain classifier to extract common features from the source and target domains. | |||
* Input <math>x</math> goes to an embedding function <math>F</math> to get features. | |||
* Features go to a classification network <math>C_1</math> to get labels. | |||
* Features also go to a domain classifier <math>C_2</math>. | |||
* Training: <math>\min_{F, C_1, C_2} E_{(x, y) \sim Q_{X,Y}^{(m)}} [\ell(C_1 \circ F(x), y)] - \lambda L(C_2)</math>.
* In general, we want to find a mapping (embedding) <math>F</math> such that <math>F(Q_X) \approx F(P_X)</math>. | |||
*: The domain classifier penalizes the distance between <math>F(Q_X)</math> and <math>F(P_X)</math>. | |||
Example 1: MMD distance (Maximum mean discrepancy) | |||
Define <math>\tilde{x}_i = F(x_i)</math>. | |||
<math>D_{MMD}(Q^{(m)}_{\tilde{x}}, P^{(m)}_{\tilde{x}}) \stackrel{\triangle}{=} \Vert \frac{1}{m}\sum \phi(\tilde{x}_i^S) - \frac{1}{m}\sum \phi(\tilde{x}_i^T) \Vert</math> | |||
Here <math>\phi: \mathbb{R}^r \to \mathbb{R}^D</math> is a fixed kernel function. | |||
We square D to apply the kernel trick: | |||
<math>D^2_{MMD}(Q^{(m)}_{\tilde{x}}, P^{(m)}_{\tilde{x}}) = \frac{1}{m^2}\left( \sum_{i,j=1}^{m}K(x_i^s, x_j^s) + \sum_{i,j=1}^{m}K(x_i^t, x_j^t) - 2 \sum_{i,j=1}^{m}K(x_i^s, x_j^t) \right)</math>
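A hedged numpy sketch of the squared MMD above, using an RBF kernel on already-extracted features; the kernel choice and bandwidth are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-gamma * d2)

def mmd2(feat_s, feat_t, gamma=1.0):
    # Assumes the same number m of source and target features, as in the formula.
    m = len(feat_s)
    return (rbf(feat_s, feat_s, gamma).sum()
            + rbf(feat_t, feat_t, gamma).sum()
            - 2 * rbf(feat_s, feat_t, gamma).sum()) / m ** 2
</syntaxhighlight>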
MMD-based DA (Tzeng ''et al.'' 2014): | |||
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda D^2_{MMD}(F(x^s), F(x^t))</math>
Example 2: Wasserstein distance | |||
<math>\min L_{cls}(C_1 \circ F(x^s), y^s) + \lambda W(F(x^s), F(x^t))</math>
The Wasserstein distance is computed using Kantorovich duality.
This is also called an IPM (integral probability metric) distance.
* We can also use improved & robust version of Wasserstein in DA. | |||
** Robust Wasserstein [Balaji ''et al.'' Neurips 20] | |||
** Normalized Wasserstein [....ICCV] | |||
===CycleGAN=== | |||
[Zhu et al 2017] | |||
Another approach for image-to-image translation. | |||
Source: <math>(x^s, y^s)</math> | |||
Target: <math>x^t</math> | |||
Train two functions: <math>G_{S \to T}</math> and <math>G_{T \to S}</math>. | |||
Losses: | |||
* <math>L_{GAN}(x^s, x^t, G_{S\to T}, D^T) = E_{x^t}\left[\log D^T(x^t)\right] + E_{x^s}\left[\log(1-D^T(G_{S\to T}(x^s))) \right]</math>.
* <math>L_{GAN}(x^s, x^t, G_{T \to S}, D^S)</math> is defined analogously.
* Cycle consistency: <math>L_{cyc} = E\left[ \Vert G_{T\to S}(G_{S \to T}(x^s)) - x^s \Vert \right] + E \left[ \Vert G_{S \to T}(G_{T \to S}(x^t)) - x^t \Vert \right]</math> (a minimal sketch of this term follows below)
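A minimal PyTorch-style sketch of the cycle-consistency term only (the adversarial terms additionally require the discriminators; the networks <code>G_st</code>, <code>G_ts</code> and the input batches are assumed to be given):

<syntaxhighlight lang="python">
# Cycle-consistency loss: translate S->T->S and T->S->T and penalize reconstruction error.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_st, G_ts, x_s, x_t):
    recon_s = G_ts(G_st(x_s))   # S -> T -> S
    recon_t = G_st(G_ts(x_t))   # T -> S -> T
    return F.l1_loss(recon_s, x_s) + F.l1_loss(recon_t, x_t)
</syntaxhighlight>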
Other tricks: | |||
* Domain-specific batch norms | |||
* Entropy based regularization | |||
===Are assumptions necessary?=== | |||
Assumptions: | |||
* Covariate shift | |||
* <math>d_H(Q_x, P_x)</math> is small | |||
* <math>\epsilon_{joint}</math> small | |||
See [Ben-David ''et al.'']. | |||
* Covariate shift assumption is not sufficient. | |||
* Necessity of small <math>d_{H}(P,Q)</math> for DA. | |||
* Necessity of small joint training error. | |||
===Domain Generalization=== | |||
Also known as out-of-distribution (OOD) generalization.
Training: <math>|E|</math> training domains (environments), with samples <math>\{(x_i^e, y_i^e)\}_{i=1}^{m_e} \sim P^{(e)}</math> for <math>1 \leq e \leq |E|</math>.
Goal: Find <math>h \in H</math> that performs well on an unseen domain (domain <math>k+1</math>, where <math>k = |E|</math>).
At test time, we receive samples from the new domain: <math>\{(x_i^{(k+1)}, y_i^{(k+1)})\}_{i=1}^{m_{k+1}} \sim P^{(k+1)}</math>.
<math>R^{(k+1)}(h) = E_{(x,y) \sim P^{(k+1)}}[\ell(h(x), y)]</math>. | |||
;Example datasets | |||
* [http://ai.bu.edu/M3SDA/ DomainNet] | |||
** 6 domains: painting, photo, sketches, drawings, clipart, infograph | |||
* PACS | |||
** 4 domains: art, cartoon, photo, sketch | |||
* Some Simpler datasets | |||
** Rotated MNIST | |||
** Color MNIST | |||
One baseline method is to do nothing special: just pool the training domains and train normally with ERM.
===Domain-adversarial neural networks (DANN)=== | |||
* Train a feature extractor <math>\phi</math> and a classifier <math>w</math> to yield <math>f=w \circ \phi</math>. | |||
* Domain classifier <math>c</math>. | |||
* <math>loss = \frac{1}{k} \sum_{j=1}^{k} E[\ell(w \circ \phi(x), y)] + \lambda L(domain\_classification)</math>. | |||
* We optimize <math>\min_{\phi, w} loss</math> | |||
* We can use Wasserstein distance: | |||
** <math>loss = \frac{1}{k} \sum_{j=1}^{k} E[\ell(w \circ \phi(x), y)] + \lambda \sum_{j_1, j_2} W(P_{X|Y}^{j_1}, P_{X|Y}^{j_2})</math>
** This is solved using alternating gradient descent. | |||
===Meta Learning for Domain Generalization=== | |||
[Li ''et al.'' 2017] | |||
Idea: Split the training domains into meta-train and meta-test domains to simulate domain shift during training.
;Meta-train | |||
* Loss on the meta-train domains: <math>L_{\text{meta-train}}(\theta) = \frac{1}{K-K_1} \sum_{j} E[\ell(f_{\theta}(x), y)]</math>
* Take one gradient step to update the model:
** <math>\theta' = \theta - \eta \nabla L_{\text{meta-train}}(\theta)</math>
Overall objective: | |||
* <math>\min_{\theta} L_{\text{meta-train}}(\theta) + \beta L_{\text{meta-test}}(\theta')</math> | |||
To update <math>\theta</math>, we need the gradient of the overall objective, which (through <math>\theta'</math>) depends on the Hessian of <math>L_{\text{meta-train}}</math> with respect to <math>\theta</math>. This can be handled using a ''Hessian-vector product'' without ever forming the full Hessian, which could be very large.
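A minimal PyTorch sketch of this idea for a single parameter tensor: differentiating through the one-step update keeps the graph, so autograd computes the required Hessian-vector product implicitly. The loss callables and step sizes are assumptions for illustration:

<syntaxhighlight lang="python">
# theta: a leaf tensor with requires_grad=True.
# loss_train(theta), loss_test(theta): scalar losses on meta-train / meta-test domains.
import torch

def mldg_step(theta, loss_train, loss_test, eta=0.1, beta=1.0, lr=0.01):
    g = torch.autograd.grad(loss_train(theta), theta, create_graph=True)[0]
    theta_prime = theta - eta * g                         # meta-train update, kept in the graph
    objective = loss_train(theta) + beta * loss_test(theta_prime)
    meta_grad = torch.autograd.grad(objective, theta)[0]  # includes the Hessian-vector term
    return (theta - lr * meta_grad).detach().requires_grad_(True)
</syntaxhighlight>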
===Invariant Risk Minimization (IRM)=== | |||
[Arjovsky ''et al.'' (2019)] | |||
Idea: Find a feature extractor <math>\phi()</math> such that optimal classifier is the same for every domain. | |||
Define: <math>R^e(\phi, w) = E[\ell (w \circ \phi(x), y)]</math>.
Objective: <math>\min_{\phi, \hat{w}} \frac{1}{k} \sum_{e} R^e (\phi, \hat{w})</math> s.t. <math>\forall e</math>, <math>\hat{w} \in \operatorname{argmin}_{\beta} R^e(\phi, \beta)</math>
This is a bi-level optimization which is difficult to solve. The constraint depends on another optimization. | |||
The paper uses a Lagrangian relaxation:
<math>\min_{\phi, \hat{w}} \frac{1}{k} \sum_{e} \left[ R^e(\phi, \hat{w}) + \lambda \Vert \nabla_{\hat{w}} R^e(\phi, \hat{w}) \Vert^2_2 \right]</math>.
Argument: If we can solve this optimization, the resulting function uses only invariant features, since non-invariant features have different conditional distributions given the label across environments.
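A sketch of the penalty term in this relaxation, in the style of the IRMv1 penalty from the paper: the gradient of the per-environment risk is taken with respect to a fixed scalar "dummy" classifier <math>w = 1</math>. PyTorch is assumed, and binary classification is assumed for illustration:

<syntaxhighlight lang="python">
# IRMv1-style gradient penalty for one environment.
# logits: featurizer outputs phi(x) for that environment; y: binary labels.
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    w = torch.tensor(1.0, requires_grad=True)                        # dummy classifier
    risk = F.binary_cross_entropy_with_logits(logits * w, y.float())
    grad = torch.autograd.grad(risk, w, create_graph=True)[0]
    return grad.pow(2)                                               # squared gradient norm

# Total objective: mean of per-environment risks + lambda * mean of per-environment penalties.
</syntaxhighlight>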
[Rosenfeld ''et al.'' Oct 2020] | |||
This is not a valid argument in general, as IRM can fail to recover the invariant predictor.
===Which method for generalization works the best?=== | |||
[Gulrajani & Lopez-Paz] offer the following empirical observations:
* Model selection is critical in domain generalization. | |||
** Training-domain validation set
** Leave-one-domain-out validation set
** Oracle selection
;Data augmentation is important in domain generalization | |||
* random crops, random horizontal flips, random color jitter | |||
* image-to-image neural translation | |||
;ERM (doing nothing) outperforms other DG methods!! | |||
* Possible with careful model selection | |||
* Larger models help with domain generalization. | |||
* See DomainBed for code and data.
==Self-supervised Learning== | |||
Lecture 21 (November 10, 2020) | |||
Given data <math>x</math>, we want to find a good representation <math>f(x)</math>. | |||
We can use <math>f(x)</math> to solve the classification problem more efficiently (e.g. using linear classifiers).
Task 1: Learn a good <math>f(x)</math> from ''unlabeled'' samples. | |||
Task 2: Use <math>f(x)</math> + a few labels to solve classification problem using linear models. | |||
Note that in semi-supervised learning, you have unlabeled examples and a few labelled examples but you know what the task is. | |||
In self-supervised learning, we use ''structure'' in unlabelled data to create artificial supervised learning problems, which are solved with deep models.
In this process, the learning method ''hopefully'' will create internal representations for the data <math>f(x)</math> useful for downstream tasks. | |||
===Image embedding=== | |||
Surprising observation for image embedding: | |||
[Gidaris ''et al.'' ICLR 2018] + [Zhang et al. 2019] | |||
# Rotate images and use the angle of rotation as labels (e.g. <math>\theta = 0, 90, 180, 270</math>). | |||
# Train a CNN to predict the rotation angle from images. | |||
# Use <math>f(x)</math> with linear classification models for the true labels. | |||
;Why should <math>f(x)</math> be a good representation for images? | |||
This is an open question. | |||
===Contrastive Learning=== | |||
[Logeswaran & Lee ICLR 2018] use a text corpus (Wikipedia) to train a deep representation <math>f(x)</math>:
<math>x, x^{+}</math> are adjacent sentences. | |||
<math>x, x^{-}</math> are random sentences. | |||
Optimization: | |||
<math>\min_{f} E[\log(1 + \exp(f(x)^T f(x^-) - f(x)^T f(x^+)))] \approx E[f(x)^T f(x^-) - f(x)^T f(x^+)]</math>.
This is known as contrastive learning. | |||
Sentence embeddings capture human notions of similarities: | |||
E.g. "the tiger rules this jungle" is similar to "a lion hunts in a forest".
;Can we use contrastive learning to obtain powerful data representations for images?
We need pairs of similar images and dissimilar images. | |||
;SimCLR [Chen ''et al.'' 2020] | |||
# Create two correlated views of an image <math>x</math>: <math>\tilde{x}_i</math> and <math>\tilde{x}_j</math>. | |||
#* Random cropping + resize | |||
#* Random color distortion | |||
#* Random Gaussian blur | |||
# Use base encoder (ResNet) to map <math>\tilde{x}_i,\tilde{x}_j</math> to embeddings <math>h_i, h_j</math>. | |||
# Train a projection head <math>g()</math> (a one-hidden-layer MLP) to map the h's to z's which maximize the agreement between the z's.
# The loss is based on the cosine similarity <math>sim(z_i, z_j) = \frac{z_i^t z_j}{\Vert z_i \Vert \Vert z_j \Vert}</math>
Randomly select <math>N</math> samples and add their augmentations to get 2N samples. | |||
Compute similarity matrix <math>S \in \mathbb{R}^{2N \times 2N}</math>. | |||
<math>S_{ij}=\exp(sim(z_i, z_j)) \text{ is }
\begin{cases}
\exp(1) & \text{if }i=j\\
\text{high} & \text{if }j=2k\text{ and }i=2k-1 \text{ (a positive pair)}\\
\text{low} & \text{otherwise}
\end{cases}
</math>
Training is <math>\min_{f,g} L = \frac{1}{N} \sum_{k=1}^{N} \frac{l(2k-1,2k) + l(2k, 2k-1)}{2}</math>. | |||
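A minimal numpy sketch of this loss (the NT-Xent loss) for a batch of <math>2N</math> embeddings, where rows <math>2k</math> and <math>2k+1</math> (0-indexed) are the two views of image <math>k</math>; the temperature <code>tau</code> is an assumption for illustration:

<syntaxhighlight lang="python">
import numpy as np

def nt_xent(z, tau=0.5):
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via normalization
    s = z @ z.T / tau                                   # (2N, 2N) similarity matrix
    np.fill_diagonal(s, -np.inf)                        # exclude i = j from the denominator
    n2 = len(z)
    pos = np.arange(n2) ^ 1                             # partner index: 0<->1, 2<->3, ...
    log_prob = s[np.arange(n2), pos] - np.log(np.exp(s).sum(axis=1))
    return -log_prob.mean()                             # averages l(2k-1,2k) and l(2k,2k-1)
</syntaxhighlight>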
Practical observations: | |||
* Composition of data augmentation is important. | |||
* Larger batch sizes and longer training helps self-supervised learning! | |||
* Optimization over the MLP projection head <math>g</math> helps! | |||
** f is able to keep some information useful for the classification (e.g. color or orientation) | |||
Empirical results: | |||
* For ImageNet, top-1 accuracy increases with the number of parameters.
* After learning the embedding <math>f</math>, you don't need much labeled data for supervised training.
===Theory of self-supervised learning=== | |||
;[Arora ''et al.'' 2019] | |||
Modeling of similar pairs: | |||
* <math>x \sim D_{C1}(x)</math> | |||
* <math>x^+ \sim D_{C1}(x)</math> is semantically similar | |||
* <math>x^- \sim D_{C2}(x)</math> negative samples | |||
Downstream task: | |||
* pairwise classification | |||
** nature picks two classes C1, C2 | |||
** generate samples from C1 & C2 and evaluate the classification loss | |||
* Assume <math>m \to \infty</math> so just look at population loss | |||
Notations: | |||
* <math>L_{sup}(f)</math> is the supervised learning loss with optimal last layer | |||
* <math>L_{un}(f)=E_{(x, x^+) \sim D_{sim}, x^- \sim D_{neg}}[\log(1+\exp(f(x)^T f(x^-) - f(x)^T f(x^+)))]</math> is the unsupervised (contrastive) logistic loss
Result 1: | |||
* <math>L_{sup}^{mean}(f) \leq \frac{1}{1-\tau}(L_{un}(f)-\tau)</math> for all <math>f \in F</math> | |||
** <math>\tau</math> is the collision probability for pair of random classes. | |||
Ideal result:
* <math>L_{sup}(f^*) \leq \alpha L_{sup}(f)</math> for all <math>f</math>.
* In general it is impossible to get this. | |||
;[Tosh et al. 2020] | |||
This work connects self-supervised learning with multi-view representation theory. | |||
Start with (x, z, y) where x,z are two views of the data and y is the label. | |||
* x and z should share redundant information w.r.t. y. I.e. predicting y from x or z individually should be equivalent to predicting y from both. | |||
* <math>e_x = E(E[Y|X] - E[Y|X,Z])^2</math> and <math>e_z = E(E[Y|Z] - E[Y|X,Z])^2</math> should be small | |||
* Formulate contrastive learning as an artificial ''classification'' problem: | |||
** Classify between <math>(x, z, 1)</math> and <math>(x, \tilde{z}, 0)</math>. | |||
** <math>g^*(x,z) = \operatorname{argmin}(\text{classification loss}) = \frac{P_{X,Z}(x,z)}{P_X(x) P_Z(z)}</math>. | |||
* x and z carry redundant information about Y, so if we first predict z from x and then use that prediction to predict Y, we should do almost as well as predicting Y from both.
<math> | |||
\begin{aligned} | |||
\mu(x) &= E[E[Y|Z]|X=x] \\ | |||
&= \int E[Y|Z=z] P_{Z|X}(z|x) dz\\ | |||
&= \int E[Y|Z=z] g^*(x,z) P_Z(z) dz | |||
\end{aligned} | |||
</math> | |||
Lemma: <math>E[(\mu(x) - E[Y | x,z])^2] \leq e_x + e_z + 2\sqrt{e_x e_z}</math> | |||
Landmark embedding: <math>g^*</math> is computed via contrastive learning. | |||
How to embed x? <math>\phi(x) = (g^*(x, z_1),...,g^*(x,z_m))</math> | |||
z's are randomly generated from <math>P_Z</math>. | |||
Each <math>g^*</math> output is a single number. | |||
* <math>\phi()</math> is good embedding if a linear classifier of <math>\phi</math> is approx <math>\mu(x)</math>. | |||
** I.e. <math>\exists w : w^t \phi(x) \approx \mu(x)</math> | |||
** <math>w^t \phi(x) = \frac{1}{m} \sum_{i=1}^{m} E[Y|Z=z_i] g^*(x, z_i) \stackrel{m \to \infty}{\rightarrow} \int E[Y|Z=z] g^*(x,z) P_{Z}(z) dz = \mu(x)</math>
Direct embedding | |||
Instead of learning a bivariate <math>g(x,z)</math>, learn <math>\eta(x)^T \psi(z)</math>. | |||
Lemma: For every <math>\eta: x \to \mathbb{R}^m</math> and <math>\psi: z \to \mathbb{R}^m</math>, there exists <math>w \in \mathbb{R}^m</math> such that <math>E[(w^t \eta(x) - \mu(x))^2] \leq E[Y^2] \epsilon_{direct}(\eta, \psi)</math>, where <math>\epsilon_{direct}(\eta, \psi) = E[(\eta(x)^t \psi(z) - g^*(x,z))^2]</math>.
Can we write <math>g^*(x,z)</math> as <math>\eta(x)^T \psi(z)</math>? | |||
* If there is a hidden variable H s.t. X and Z are conditionally independent given H. | |||
# Case 1: H is a discrete variable; then <math>g^*(x, z) = \eta^*(x)^t \psi^*(z)</math>
# Case 2: There exists <math>\eta,\psi</math> such that <math>E(\eta(x)^t \psi(z) - g^*(x,z))^2 \leq o(\frac{1}{m})</math>. | |||
==Meta Learning== | |||
In many applications, we don't have a large training dataset. However, humans can adapt and learn ''on the fly''. | |||
The key is to use prior knowledge in performing new tasks. | |||
How can we train AI/ML models similarly? | |||
Goal of meta learning: Train a model on different learning tasks such that it can solve new tasks using only a small number of training samples. | |||
Few shot classification problem: | |||
Inputs: <math>D_{metatrain} = \{(D_i^{train}, D_i^{test})\}_{i=1}^{n}</math>. | |||
<math> | |||
\begin{cases} | |||
D_{i}^{train} = \{(x_j^i, y_j^i) \}_{j=1}^{k}\\ | |||
D_{i}^{test} = \{(x_j^i, y_j^i) \}_{j=1}^{k'} | |||
\end{cases} | |||
</math> | |||
* <math>i</math> is the index of the task | |||
* <math>j</math> is the index of the sample | |||
K-shot classification means k training samples per label. Some papers use k as the size of the whole training set. | |||
Q: How to use <math>D_{metatrain}</math> to solve meta-test tasks more effectively? | |||
* Use <math>D_{metatrain}</math> to learn meta parameters <math>\theta</math> such that: | |||
* Base learner <math>A</math> outputs task-specific model parameters <math>\phi_i = A(D_{i}^{train}, \theta)</math> good for <math>D_{i}^{test}</math>. | |||
* Training procedure: | |||
** Loss: <math>\min_{\theta} \sum_{i=1}^{n} loss(D_{i}^{test}, \phi_i)</math> where <math>\phi_i = A(D_{i}^{train}, \theta)</math>.
* Test time: given a new task <math>(D^{train}, D^{test})</math>. Apply <math>A(D^{train}, \theta^*)</math> to get <math>\phi</math>. | |||
[Finn ''et al.'' 2017]
* Idea: train a model that can be easy to fine-tune on new tasks | |||
* one-step fine tuning: <math>\theta - \alpha \nabla L(\theta, D_{i}^{train})</math> gives us <math>\phi_{i}</math> | |||
* Evaluate <math>\phi_i</math> on <math>D_i^{test}</math> | |||
* <math>\min_{\theta} \sum_{i=1}^{n} L(\phi_i, D_i^{test}) = \sum_{i=1}^{n} L(\theta - \alpha \nabla L(\theta, D_i^{train}), D_i^{test})</math>
1. Model-agnostic meta learning (MAML) | |||
* Use GD to optimize over <math>\theta</math> | |||
** <math>\nabla_{\theta} \sum_{i} L(\phi_i, D_i^{test}) = \sum_{i=1}^{n}(\nabla_{\theta} \phi_i) \nabla_{\phi} L(\phi_i, D_i^{test})</math>
** <math>(\nabla_{\theta} \phi_i)</math> involves second-order derivatives which are expensive. | |||
* First-order MAML: ignore the <math>\nabla_{\theta} \phi_i</math> term, i.e. replace it with the identity matrix (see the sketch below).
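A minimal PyTorch sketch contrasting full MAML with its first-order approximation for a single parameter tensor; the per-task loss callables are assumptions for illustration:

<syntaxhighlight lang="python">
# theta: leaf tensor with requires_grad=True.
# tasks: list of (loss_train_i, loss_test_i) callables, each mapping parameters to a scalar loss.
import torch

def maml_meta_gradient(theta, tasks, alpha=0.01, first_order=False):
    meta_grad = torch.zeros_like(theta)
    for loss_train_i, loss_test_i in tasks:
        g = torch.autograd.grad(loss_train_i(theta), theta,
                                create_graph=not first_order)[0]
        phi_i = theta - alpha * g                        # one inner gradient step
        if first_order:
            phi_i = phi_i.detach().requires_grad_(True)  # drop d(phi_i)/d(theta)
            meta_grad += torch.autograd.grad(loss_test_i(phi_i), phi_i)[0]
        else:
            meta_grad += torch.autograd.grad(loss_test_i(phi_i), theta)[0]
    return meta_grad / len(tasks)
</syntaxhighlight>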
Reptile (Nichol ''et al.'' 2018)
2. <math>A</math> is a simple linear/non-parametric learning on data embeddings computed via <math>f_{\theta}</math> | |||
[Lee et al. 2019] | |||
* <math>f_{\theta}</math> is used to compute embeddings | |||
* <math>A</math> is a linear classifier (e.g. SVM) | |||
* Use dual form of SVM so # of optimization variables = # training samples * # classes | |||
3. <math>A</math> is a non-parametric learning | |||
* Embedding: <math>\tilde{x} = f_{\theta}(x)</math> | |||
[Snell ''et al.'' 2017]
* Define prototypes (cluster centers) | |||
** <math>c_k = \frac{1}{|S_k|} \sum_{(x,y) \in S_k} f_{\theta}(x)</math>, where <math>S_k</math> is the set of support examples with label <math>k</math>
** <math>P_{\theta}(y=k|x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}</math> (see the sketch after this list)
4. <math>A</math> is a black box (e.g. an LSTM).
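A minimal numpy sketch of the prototype classifier from approach 3 above; the embeddings are assumed to have been computed by <math>f_{\theta}</math> already:

<syntaxhighlight lang="python">
import numpy as np

def proto_probs(emb_support, y_support, emb_query):
    # emb_support: (n_support, d) embeddings, y_support: class ids, emb_query: (n_query, d).
    classes = np.unique(y_support)
    protos = np.stack([emb_support[y_support == c].mean(axis=0) for c in classes])
    d2 = ((emb_query[:, None, :] - protos[None, :, :]) ** 2).sum(-1)   # squared distances to prototypes
    logits = -d2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # P(y = k | x), columns ordered by `classes`
</syntaxhighlight>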
==LSTMs and transformers== | |||
Lecture 23 (November 19, 2020) | |||
===Recurrent Neural Networks (RNNs)=== | |||
Hidden state: <math>h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)</math> | |||
Prediction at t: <math>y_t = W_{hy} h_{t}</math> | |||
;Backpropagation through time | |||
If <math>W_{hh}</math> has largest singular value < 1, then the gradient vanishes.
If <math>W_{hh}</math> has largest singular value > 1, then the gradient explodes.
Typically, the gradient vanishes because of the initialization of <math>W_{hh}</math>.
===Long Short Term Memory (LSTMs)=== | |||
Goal is to solve the vanishing and exploding gradient problem. | |||
LSTM has several gates: | |||
* Input gate: <math>i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + b_i)</math> | |||
* Forget gate: <math>f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + b_f)</math> | |||
* Output Gate: <math>o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + b_o)</math> | |||
* Candidate cell state: <math>\tilde{c}_t = \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)</math>
* Cell state: <math>c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t</math>
* Hidden state: <math>h_t = o_t \odot \tanh(c_t)</math> (one step of this recurrence is sketched below)
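A minimal numpy sketch of one LSTM step following the gate equations above; the weight and bias dictionaries are assumptions for illustration:

<syntaxhighlight lang="python">
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W: dict of weight matrices keyed by gate name; b: dict of bias vectors.
    i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])          # input gate
    f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])          # forget gate
    o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])          # output gate
    c_tilde = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])    # candidate cell state
    c_t = f * c_prev + i * c_tilde
    h_t = o * np.tanh(c_t)
    return h_t, c_t
</syntaxhighlight>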
;Bidirectional RNNs | |||
First LSTM takes input in correct order. | |||
Second LSTM takes input in reverse order. | |||
Concatenate outputs from both LSTMs. | |||
===Attention=== | |||
Goal: To help memorize long source sentences in machine translation. | |||
;Encoder-decoder attention | |||
;Self-Attention | |||
===Transformer=== | |||
;Positional encoding | |||
;Self-Attention | |||
Have queries, keys, and values. | |||
Multiply queries with keys, scale, and pass the result through a softmax; then multiply by the values.
This yields the attention of every word with respect to every other word.
The original Transformer used <math>h=8</math> attention heads, giving 8 sets of queries, keys, and values.
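A minimal numpy sketch of single-head scaled dot-product attention as described above; the <math>1/\sqrt{d_k}</math> scaling follows the original Transformer paper:

<syntaxhighlight lang="python">
import numpy as np

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of values
</syntaxhighlight>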
;Architecture | |||
Stack encoders. | |||
==Interpretability== | |||
;Interpretability Methods | |||
* Built-in model interpretability | |||
* Feature level interpretability | |||
* Instance based explanations | |||
We will focus on feature level interpretability. | |||
===Feature Level Interpretability=== | |||
These are given through saliency maps. | |||
* Perturbation-based: Perturb the input to get another output and compute the difference. | |||
* Gradient-based | |||
===Gradient-based Methods=== | |||
Take the derivative of the output with respect to the input. | |||
;Limitations | |||
* Too local and sensitive to slight perturbations | |||
* Saturated outputs lead to unintuitive gradients | |||
* Discontinuous gradients are problematic | |||
;SmoothGrad | |||
* Add gaussian noise to input and average the gradient. | |||
;Integrated Gradients | |||
* Average the gradients along the path from a baseline to the input (a minimal sketch follows below).
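A minimal numpy sketch of integrated gradients; the gradient oracle <code>grad_fn</code> (e.g. obtained from autograd) and the number of steps are assumptions for illustration:

<syntaxhighlight lang="python">
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=50):
    # grad_fn(p) returns d(output)/d(input) evaluated at point p, same shape as x.
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)   # attribution per input feature
</syntaxhighlight>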
;DeepLift | |||
* We don't care about gradient but the slope relative to the ''reference'' state | |||
;Limitations | |||
* Models must be able to compute the gradient of the output with respect to the input | |||
* Interpretation of neural networks is fragile | |||
** Saliency maps are uninterpretable for adversarial examples on clean models and adversarially trained models. | |||
* Needs white-box gradient access to the model. | |||
===Evaluation of interpretability methods=== | |||
* Human evaluation | |||
** Can humans evaluate saliency? | |||
* Accuracy drop after removing ''salient'' features | |||
* Sanity checks | |||
** Model parameter randomization test - compare the output of the saliency method on a trained vs. an untrained model to make sure the saliency depends on the model parameters.
* Synthetic Data | |||
* Data randomization test | |||
** Train on random labels and see if saliency depends on relationship between input & output. | |||
;Temporal Saliency Rescaling
* If you remove this feature, how is the gradient going to change?
==Reinforcement Learning== | |||
Lecture 27 (December 1, 2020) | |||
;Goal: Sequential decision making problems | |||
* Current decisions can affect future decisions | |||
Examples: games, robotics, self-driving, finance | |||
Agent takes an action <math>a_t</math> in the environment which leads to a new state <math>s_{t+1}</math>.
;Definitions | |||
* <math>S</math> is the state space. This can be discrete or continuous. | |||
* <math>O</math> are the observations. | |||
* <math>A</math> is the set of actions which can be discrete or continuous. | |||
* <math>P</math> are the transition probabilities which model the environment. | |||
Example: | |||
* <math>|S|</math> = 2 | |||
* <math>|A|</math> = 2 | |||
In general, <math>S_t \sim P(S_t=s | s_0, a_0, s_1, a_1, ..., a_{t-1})</math>. | |||
The given variables are the trajectory <math>T_{t-1}</math>. | |||
Modeling this can be very complex for large <math>T</math>. | |||
Thus, we apply a Markov assumption:
* <math>S_t \sim P(S_t=s | a_{t-1}, s_{t-1})</math> | |||
Objective: To maximize the rewards | |||
* <math>R_t \stackrel{\triangle}{=} R(s_t, a_t)</math> | |||
Two regimes for rewards:
* Finite horizon: <math>T</math> is finite.
* Infinite horizon: <math>T</math> goes to infinity; a discount factor <math>\gamma < 1</math> keeps the total reward <math>\sum_t \gamma^t R_t</math> finite.
===Classical RL=== | |||
Finite Markov Decision Process (MDP) | |||
* S: Finite number of states | |||
* A: Finite number of actions | |||
* <math>R_t = R(a_t, S_t)</math> | |||
* <math>\gamma</math>: discount factor
Goal: Choose actions to maximize the total rewards. | |||
* Policy function <math>\pi</math> determines how actions should be taken. | |||
* The policy could be a distribution over actions or deterministic. We will focus on deterministic policies.
* Assume this is time invariant. | |||
* Value function determines the expected reward if we start at state s and follow policy <math>\pi</math>. | |||
* Q function (action-value function) | |||
** <math>Q_\pi (s, a) = E[R_0 + \gamma R_1+... | A_0=a, S_0=s, \pi]</math> | |||
* If we have <math>\pi</math>, can we evaluate <math>V_{\pi}(s)</math>? | |||
** <math>V_{\pi}(s) = E[R_0 + \gamma R_1 + ... | S_0 = s, \pi]</math> | |||
* Use Bellman's equation | |||
** <math>V_{\pi}(s) = R(s) + \gamma \sum_{s'} P_{\pi}(s' | s) V_{\pi}(s')</math>
** <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math> | |||
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1}R</math> | |||
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric so eigenvalues can be complex. | |||
All eigenvalues satisfy <math>|\lambda_i| \leq 1</math>, with one eigenvalue of norm exactly 1.
Since <math>\gamma < 1</math>, this implies <math>I - \gamma P_{\pi}</math> is always invertible.
However, computing the inverse is computationally expensive since it scales with <math>|S|^3</math>. | |||
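A minimal numpy sketch of this closed-form policy evaluation, feasible only for small state spaces:

<syntaxhighlight lang="python">
import numpy as np

def evaluate_policy(P_pi, R, gamma=0.9):
    # P_pi: (|S|, |S|) row-stochastic transition matrix under policy pi; R: (|S|,) rewards.
    n = len(R)
    # Solve (I - gamma * P_pi) V = R directly, avoiding an explicit matrix inverse.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
</syntaxhighlight>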
===Value-Iteration (Bellman Recursion)=== | |||
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math> | |||
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math> | |||
<math>v_{\pi} = L_{\pi} v_{\pi}</math>, so <math>v_{\pi}</math> is a fixed point of <math>L_{\pi}</math>.
;Claim: <math>L_{\pi}</math> is a contraction. | |||
Proof: Take <math>v_1, v_2 \in \mathbb{R}^d</math>. | |||
<math> | |||
\begin{aligned} | |||
L_{\pi}v_1 - L_{\pi}v_2 &= (R + \gamma P_{\pi}v_1) - (R + \gamma P_{\pi}v_2) \\ | |||
&= \gamma P_{\pi}(v_1 - v_2) | |||
\end{aligned} | |||
</math> | |||
<math> | |||
\begin{aligned} | |||
\Vert L_{\pi}v_1 - L_{\pi}v_2 \Vert &= \Vert \gamma P_{\pi}(v_1 - v_2) \Vert\\ | |||
& \leq \gamma \Vert P_{\pi} \Vert_{\infty} \Vert v_1 - v_2 \Vert \\
& = \gamma \Vert v_1 - v_2 \Vert
\end{aligned} | |||
</math> | |||
Since <math>\Vert P_{\pi} \Vert_{\infty} = 1</math> (the rows of <math>P_{\pi}</math> sum to 1) and <math>\gamma < 1</math>, <math>L_{\pi}</math> is a contraction in the <math>\infty</math>-norm. By Banach's fixed point theorem, we can converge to the fixed point by repeatedly applying <math>L_{\pi}</math>.
===Optimal Policy=== | |||
Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>. | |||
Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>. | |||
The optimal policy may not be unique, but all optimal policies have the same value function.
Intermediate questions: | |||
* Given <math>v^*</math>, can we compute <math>Q^*</math>? | |||
<math> | |||
\begin{aligned} | |||
Q^*(s, a) &= \max_{\pi} Q(s,a)\\ | |||
&= \max_{\pi} R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s')\\ | |||
&= R(s) + \gamma \sum_{s'} P(s' | s, a) \max_{\pi}(v_{\pi}(s'))\\ | |||
\end{aligned} | |||
</math> | |||
* Given <math>Q^*</math>, can we compute <math>v^*</math>. | |||
Bellman's optimality states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>. | |||
<math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math> | |||
;How to compute optimal policies?
First approach: Value Iteration | |||
<math>V^*(s) = \max_{a} Q^*(s, a) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^*(s')]</math> | |||
Define an operator: <math>L(v) = \max_{\pi}[R + \gamma P_{\pi} v]</math> | |||
<math>v^*</math> is a fixed point of <math>L(v)</math>. | |||
Claim: <math>L</math> is also a contraction. | |||
Thus <math>v^*</math> can be computed by repeated application of <math>L</math>. | |||
===Value Iteration=== | |||
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math> | |||
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{(r)}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math> (a minimal sketch follows below)
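A minimal numpy sketch of value iteration, assuming the <math>R(s)</math> reward convention used above and a transition tensor indexed as <code>P[a, s, s']</code>:

<syntaxhighlight lang="python">
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    # P: (|A|, |S|, |S|) transition probabilities; R: (|S|,) rewards.
    v = np.zeros(P.shape[1])
    while True:
        v_new = (R + gamma * (P @ v)).max(axis=0)   # max over actions of R(s) + gamma * E[v(s')]
        if np.linalg.norm(v_new - v) <= eps:
            return v_new
        v = v_new
</syntaxhighlight>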
===Policy Iteration=== | |||
Directly optimize policy. | |||
* Start with a random policy <math>\pi^{(0)}</math> | |||
* Evaluate the policy <math>v_{\pi}</math> using closed form or value iteration | |||
* Evaluate the Q-Function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s'|s,a) V_{\pi}(s')</math> | |||
* Find the best action to be taken at state <math>s</math>: | |||
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math> | |||
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math> | |||
* Repeat until convergence. | |||
Policy iteration is guaranteed to converge to an optimal policy, and it often converges faster than value iteration.
===Deep Reinforcement Learning=== | |||
;Relaxing some unrealistic assumptions | |||
# Evaluate <math>v_{\pi}(s)</math> | |||
#* <math>Q_{\pi}(s, a) = R(s, a) + \gamma E_{s' \sim P(s'|s,a)} v_{\pi}(s')</math> | |||
# Improve the policy | |||
#* <math>\operatorname{argmax}_{a_t} Q_{\pi}(s_t, a_t)</math> | |||
#* Assumption <math>|S|=d</math> | |||
# How to represent <math>V(s)</math>? | |||
#* Can we use a neural network to represent <math>V(s)</math>? Yes | |||
;How to train <math>v_{\phi}</math>? | |||
* Start with an old <math>v_{\phi}</math>, compute <math>Q_{\pi}(s,a)</math>. | |||
** <math>Q_{\pi}(s,a) = R(s,a) + \gamma E[v_{\phi}^{old}(s')]</math> | |||
* Fit <math>v_{\phi}</math> to <math>\max_{a}Q_{\pi}(s,a)</math> using a quadratic loss: | |||
** <math>L(\phi) = \frac{1}{2} \Vert V_{\phi}(s) - \max_{a} Q_{\pi}(s,a) \Vert^2</math> | |||
* Iterate | |||
;Similarly we can parameterize the Q function | |||
Compute the target <math>y_i \leftarrow R(s_i, a_i) + \gamma E[v_{\phi}(s_i')]</math>
We need to know transition probabilities <math>P(s'|s,a)</math> to compute the expectation (model-based RL). | |||
With model-free RL: | |||
We can approximate the expectation with a single sampled next state: <math>E[v(s_i')] \approx v(s_i') = \max_{a'} Q(s_i', a')</math>
This is called ''Q-Learning''. | |||
;What if we have continuous actions: | |||
* Approach 1: Use a function class such that <math>\max_{a}Q(s,a)</math> is easy to solve | |||
** [Gu ''et al.'' 2016] use quadratic functions. | |||
* Approach 2: Learn another network to approximate the maximizer of <math>\max_{a'} Q(s,a')</math>
===Policy Gradient Method=== | |||
Lecture 29 (Dec 8, 2020) | |||
Probability of observing a trajectory: | |||
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
Reward for a trajectory: | |||
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + ... + R(s_T, a_T)</math>
The average reward is: | |||
<math>J(\theta) = E[R(\tau)] = \sum E[R(s_t, a_t)]</math> | |||
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>. | |||
Gradient of the average reward: | |||
<math> | |||
\begin{aligned} | |||
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\ | |||
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\ | |||
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau \\ | |||
&= \int P_{\theta}(\tau) \nabla_{\theta} \log P_\theta(\tau) R(\tau) d\tau\\ | |||
&= E[\nabla_{\theta} \log P_\theta(\tau) R(\tau)]
\end{aligned} | |||
</math> | |||
<math> | |||
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math> | |||
Implies, | |||
<math> | |||
\begin{aligned} | |||
\nabla_{\theta} J(\theta) &= E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]\\
&\approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})\right) \left(\sum_{t=1}^{T} R(s_t^{(i)}, a_t^{(i)})\right)
\end{aligned} | |||
</math> | |||
;Summary | |||
* Sample trajectories | |||
* Approximate <math>\nabla_{\theta} J(\theta)</math> | |||
* <math>\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)</math> (a minimal sketch follows below)
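A minimal PyTorch sketch of one policy-gradient update for a categorical policy, with the reward-to-go variant discussed later in this section as an option; the <code>policy</code> network, optimizer, and trajectory format are assumptions for illustration:

<syntaxhighlight lang="python">
import torch

def policy_gradient_step(policy, optimizer, trajectories, reward_to_go=True):
    # trajectories: list of (states, actions, rewards) tensors for sampled episodes.
    optimizer.zero_grad()
    loss = 0.0
    for states, actions, rewards in trajectories:
        logp = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
        if reward_to_go:
            returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
        else:
            returns = rewards.sum() * torch.ones_like(rewards)
        loss = loss - (logp * returns).sum()   # minus sign: minimizing -J ascends the reward
    (loss / len(trajectories)).backward()
    optimizer.step()
</syntaxhighlight>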
;Intuition | |||
<math>E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math> | |||
Formalizing ''trial & error''. | |||
[Finn & Levine, ICML]
===Issues with Policy Gradient=== | |||
;High variance of gradient estimation | |||
;Solutions | |||
* Subtract a baseline | |||
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math> | |||
* Reward-to-go | |||
<math> | |||
\begin{aligned} | |||
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\\ | |||
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
\end{aligned} | |||
</math> | |||
;Some parameters can change <math>\pi_{\theta}</math> more than others so it's hard to choose a fixed learning rate. | |||
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math>F</math> is the Fisher information matrix.
===Actor-critic algorithms=== | |||
Have an actor <math>\pi_{\theta}</math>. | |||
Have a critic <math>V_{\phi}/Q</math> | |||
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
===Other topics in RL=== | |||
* Inverse RL | |||
* Multi-agent RL | |||
* Model-based RL | |||
==Summary of Course== | |||
;What we covered | |||
* Supervised DL | |||
* Unsupervised DL (GANs, VAEs) | |||
* Self-supervised DL | |||
* Meta-Learning | |||
* Learning with Attention (Transformers) | |||
* Deep RL | |||
* Optimization | |||
* Generalization | |||
* Robustness | |||
* Interpretability | |||
;What we didn't cover | |||
* Fairness | |||
* Privacy & Ethics | |||
* Bayesian DL | |||
* Federated Learning | |||
* Graph NNs | |||
;Things which may be on the final | |||
* Transformers | |||
* Wasserstein distance | |||
* Kernel methods | |||
==Resources==
<ref name="zhang2017understanding">Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals (2017) Understanding deep learning requires rethinking generalization (ICLR 2017) [https://arxiv.org/abs/1611.03530 https://arxiv.org/abs/1611.03530]</ref> | <ref name="zhang2017understanding">Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals (2017) Understanding deep learning requires rethinking generalization (ICLR 2017) [https://arxiv.org/abs/1611.03530 https://arxiv.org/abs/1611.03530]</ref> | ||
<ref name="belkin2019reconciling">Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal (2019) Reconciling modern machine learning practice and the bias-variance trade-off (PNAS 2019) [https://arxiv.org/abs/1812.11118 https://arxiv.org/abs/1812.11118]</ref> | <ref name="belkin2019reconciling">Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal (2019) Reconciling modern machine learning practice and the bias-variance trade-off (PNAS 2019) [https://arxiv.org/abs/1812.11118 https://arxiv.org/abs/1812.11118]</ref> | ||
<ref name="jiang2019generalization">Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio (2019) Fantastic Generalization Measures and Where to Find Them [https://arxiv.org/abs/1912.02178 https://arxiv.org/abs/1912.02178]</ref> | |||
<ref name="gu2017badnets">Tianyu Gu, Brendan Dolan-Gavitt, Siddharth Garg (2017) BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain [https://arxiv.org/abs/1708.06733 https://arxiv.org/abs/1708.06733]</ref> | |||
}} | }} |