Deep Learning

* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
* [https://www.youtube.com/user/soheilfeiz/videos Lecture Videos]


==Basics==
A refresher of [[Machine Learning]] and [[Supervised Learning]].


===Empirical risk minimization (ERM)===


===Linear Regression===
Assume we have a dataset:<br>
<math>\{(x_i, y_i)\}_{i=1}^{n}</math><br>
<math>y_i \in \mathbb{R}</math><br>
<math>x_i \in \mathbb{R}^d</math><br>
<math>f(w, x) = w^t x</math>


<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br>
<math>\min_{w} L(w)</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w_t)</math> where our gradient is:<br>
<math>\nabla L(w) = -\sum_{i=1}^{n}(y_i - f(w, x_i)) \nabla_{w} f(w, x_i) = -\sum_{i=1}^{n}(y_i - f(w, x_i)) x_i</math>
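A minimal NumPy sketch of this update on synthetic data (the data sizes, step size, and iteration count are illustrative choices, not from the notes):
<syntaxhighlight lang="python">
import numpy as np

# Synthetic data: n samples in d dimensions with a planted weight vector.
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))          # rows are the x_i
w_true = rng.normal(size=d)
y = X @ w_true

# grad L(w) = -sum_i (y_i - w^t x_i) x_i = -X^t (y - X w)
w = np.zeros(d)
eta = 0.005
for t in range(1000):
    w += eta * (X.T @ (y - X @ w))   # w(t+1) = w(t) - eta * grad L(w(t))

print(np.linalg.norm(w - w_true))    # should be near 0
</syntaxhighlight>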




<math>f(w, x) = w^t \phi(x)</math><br>
Is this model linear in w? Yes!<br>
Is this model linear in x? No!<br>

<math>\min \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math><br>
Apply GD or convex optimization.
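For example, with the illustrative degree-2 feature map <math display="inline">\phi(x) = (1, x, x^2)^t</math> in one dimension, the model is linear in <math display="inline">w</math>, so ordinary least squares recovers a quadratic target exactly:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = 2.0 - 3.0 * x + 0.5 * x**2        # a quadratic target

# phi(x) = (1, x, x^2): w^t phi(x) is linear in w, nonlinear in x.
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)                              # approx [2.0, -3.0, 0.5]
</syntaxhighlight>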


* <math>\phi</math> is fixed!
* <math>D = O(d^k)</math>
** For ImageNet, <math display="inline">d</math> is approx <math display="inline">10^5</math> so <math display="inline">D=O(10^{15})</math>


;Kernel Trick:
We may have a closed form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.<br>
This is called the kernel function <math>K(x_i, x_j)</math> or kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.<br>
<math display="inline">K</math> is a PSD matrix.

Idea: In many cases, we can compute <math>K(x_i, x_j)</math> without "explicit" computation of <math>\phi(x_i)</math>.


;Polynomial Kernels
<math>K(x_i, x_j) = (1 + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math><br>
Here <math>D=O(d^k)</math>, but computing <math>K(x_i, x_j)</math> directly is only <math>O(d)</math>.
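A sketch of this cost gap (the constant 1 in the kernel and the sizes are illustrative choices):
<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations_with_replacement
from math import factorial

rng = np.random.default_rng(2)
d, k = 4, 3
xi, xj = rng.normal(size=d), rng.normal(size=d)

# Kernel trick: one inner product, O(d) work.
K = (1.0 + xi @ xj) ** k

def phi(x, k):
    # Explicit feature map via the lift z = (x, 1):
    # (z^t z')^k = sum_{|alpha|=k} multinomial(k; alpha) z^alpha z'^alpha,
    # so phi_alpha(z) = sqrt(multinomial(k; alpha)) z^alpha.
    z = np.append(x, 1.0)
    feats = []
    for idx in combinations_with_replacement(range(len(z)), k):
        counts = np.bincount(idx, minlength=len(z))
        coef = factorial(k) / np.prod([factorial(c) for c in counts])
        feats.append(np.sqrt(coef) * np.prod(z ** counts))
    return np.array(feats)

# Same value, but phi lives in D = O(d^k) dimensions.
print(np.isclose(K, phi(xi, k) @ phi(xj, k)))   # True
</syntaxhighlight>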


Many classical techniques can be ''kernelized'':
* SVM to Kernel SVM
* Ridge regression to Kernel ridge regression
* PCA to Kernel PCA


===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w_t, x_i) - y_i) \nabla_w f(w_t, x_i)</math>


# Initialize weights from <math display="inline">N(0,1)</math>.
# Our weights update along a trajectory: w(0), w(1), ...
# Each <math>w</math> is a weight matrix.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training.


* Not always the case! Especially for small <math>m</math>.
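A hedged sketch reproducing the lazy-training observation with a ReLU network (all sizes and the step size are illustrative, not from the lecture):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 20, 5, 5000
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Init N(0,1), as in the notes; the a_i are rows of A.
A = rng.normal(size=(m, d))
b = rng.normal(size=m)
A0 = A.copy()

eta = 0.01
for t in range(1000):
    H = np.maximum(X @ A.T, 0.0)        # sigma(a_i^t x_j), shape (n, m)
    S = (X @ A.T > 0).astype(float)     # sigma'(a_i^t x_j)
    r = H @ b / np.sqrt(m) - y          # residuals f(w, x_j) - y_j
    gA = ((S * r[:, None]).T * b[:, None]) @ X / np.sqrt(m)
    gb = H.T @ r / np.sqrt(m)
    A -= eta * gA
    b -= eta * gb

# Lazy training: for large m the weights barely move from initialization.
print(np.linalg.norm(A - A0) / np.linalg.norm(A0))
</syntaxhighlight>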


Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0)</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).


These features will not change during optimization because they are evaluated at <math display="inline">w_0</math>.

Go back to our 2-layer NN:<br>
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>


<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^t x) \sigma'(a_i^t x') (x^t x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>
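A short sketch of these finite-width sums at a random ReLU initialization (sizes are illustrative):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 100_000
a = rng.normal(size=(m, d))   # rows a_i ~ N(0, I_d)
b = rng.normal(size=m)        # b_i ~ N(0, 1)
x, xp = rng.normal(size=d), rng.normal(size=d)

# sigma = ReLU, so sigma'(z) = 1[z > 0].
Ka_m = np.mean(b**2 * (a @ x > 0) * (a @ xp > 0)) * (x @ xp)
Kb_m = np.mean(np.maximum(a @ x, 0) * np.maximum(a @ xp, 0))
print(Ka_m + Kb_m)            # the empirical NTK K_m(x, x')
</syntaxhighlight>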


* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
By the law of large numbers, as <math display="inline">m \to \infty</math>,<br>
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x^t x') \right]</math><br>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>


<math>K^{(a)}(x, x') = \frac{(x^t x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta(x, x')) \cos(\theta) + \sin \theta)</math>
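The finite-width sums from the sketch above can be checked against these closed forms, which correspond to ReLU <math display="inline">\sigma</math>; with <math display="inline">b \sim N(0,1)</math> and <math display="inline">a \sim N(0, I_d)</math> we have <math display="inline">E[b^2]=1</math> and <math display="inline">E[\Vert a \Vert^2]=d</math>:
<syntaxhighlight lang="python">
import numpy as np

# Same setup (and seed) as the sketch above: a ~ N(0, I_d), b ~ N(0, 1).
rng = np.random.default_rng(4)
d, m = 5, 100_000
a, b = rng.normal(size=(m, d)), rng.normal(size=m)
x, xp = rng.normal(size=d), rng.normal(size=d)
Ka_m = np.mean(b**2 * (a @ x > 0) * (a @ xp > 0)) * (x @ xp)
Kb_m = np.mean(np.maximum(a @ x, 0) * np.maximum(a @ xp, 0))

# Closed forms in the m -> infinity limit (ReLU sigma).
theta = np.arccos((x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp)))
Ka = (x @ xp) / (2 * np.pi) * (np.pi - theta)
Kb = (np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi)) \
     * ((np.pi - theta) * np.cos(theta) + np.sin(theta))
print(abs(Ka_m - Ka), abs(Kb_m - Kb))   # both shrink as m grows
</syntaxhighlight>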


;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues. (Hessian Control)


;Analyze GD:
As <math display="inline">\eta \to 0</math>, GD becomes gradient flow:<br>
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br>
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>


<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math><br>
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w} \hat{y}(w)^t \frac{dw(t)}{dt}\\
&= -\nabla_{w} \hat{y}(w)^t \nabla_{w} \hat{y}(w) (\hat{y}(w) - y)\\
&= -K(w) (\hat{y}(w) - y)
\end{aligned}
</math>


If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>.<br>
This ODE implies <math>u(t) = \exp(-K(w_0)t) u(0)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0 </math> (positive definite).
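A numerical check of this ODE, using a random positive-definite matrix in place of an actual NTK (all values illustrative):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.normal(size=(n, n))
K = M @ M.T + 0.1 * np.eye(n)       # a fixed positive-definite kernel matrix
u0 = rng.normal(size=n)             # initial residual u(0) = y_hat - y

# Forward-Euler integration of du/dt = -K u.
dt, steps = 1e-3, 5000
u = u0.copy()
for _ in range(steps):
    u = u - dt * (K @ u)

# Closed form u(t) = exp(-K t) u(0) via the eigendecomposition of K.
t = steps * dt
evals, evecs = np.linalg.eigh(K)
u_exact = evecs @ (np.exp(-evals * t) * (evecs.T @ u0))
print(np.linalg.norm(u - u_exact))  # small: Euler matches the exponential decay
</syntaxhighlight>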




;Intuition
<math>E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math><br>
Formalizing ''trial & error''. [Finn & Levine, ICML]
===Issues with Policy Gradient===
;High variance of gradient estimation
;Solutions
* Subtract a baseline
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau^{(i)}) R(\tau^{(i)})\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
\end{aligned}
</math>
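A hedged NumPy sketch of the final estimator, combining reward-to-go with the baseline; the array shapes and the helper name are assumptions for illustration:
<syntaxhighlight lang="python">
import numpy as np

def policy_gradient(grad_logp, rewards):
    """Monte-Carlo policy gradient with reward-to-go and a baseline.

    grad_logp: (N, T, P) array of grad_theta log pi(a_t | s_t) per step.
    rewards:   (N, T) array of rewards R(s_t, a_t).
    Shapes and names are illustrative, not from the lecture.
    """
    # Reward-to-go: sum_{t'=t}^{T} R(s_t', a_t') for each step t.
    rtg = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    # Baseline b = average total return over the N trajectories.
    b = rewards.sum(axis=1).mean()
    # Average over trajectories of sum_t grad log pi * (reward-to-go - b).
    return (grad_logp * (rtg - b)[:, :, None]).sum(axis=1).mean(axis=0)
</syntaxhighlight>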
;Some parameters can change <math>\pi_{\theta}</math> more than others, so it's hard to choose a fixed learning rate.
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math display="inline">F</math> is the Fisher information matrix.
===Actor-critic algorithms===
Have an actor <math>\pi_{\theta}</math> and a critic <math>V_{\phi}</math> or <math>Q_{\phi}</math>.<br>
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
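In the same illustrative setup as the sketch above, the critic's advantage <math display="inline">Q - V</math> replaces the Monte-Carlo return:
<syntaxhighlight lang="python">
import numpy as np

def actor_critic_gradient(grad_logp, Q, V):
    """Advantage-weighted policy gradient, with Q and V from the critic.

    grad_logp: (N, T, P) per-step grad_theta log pi(a_t | s_t);
    Q, V: (N, T) critic estimates. Names are illustrative.
    """
    adv = Q - V                     # advantage A(s_t, a_t) = Q - V
    return (grad_logp * adv[:, :, None]).sum(axis=1).mean(axis=0)
</syntaxhighlight>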
===Other topics in RL===
* Inverse RL
* Multi-agent RL
* Model-based RL
==Summary of Course==
;What we covered
* Supervised DL
* Unsupervised DL (GANs, VAEs)
* Self-supervised DL
* Meta-Learning
* Learning with Attention (Transformers)
* Deep RL
* Optimization
* Generalization
* Robustness
* Interpretability
;What we didn't cover
* Fairness
* Privacy & Ethics
* Bayesian DL
* Federated Learning
* Graph NNs


==Misc==
;Things which may be on the final
* Transformers
* Wasserstein distance
* Kernel methods


==Resources==