Deep Learning
* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
* [https://www.youtube.com/user/soheilfeiz/videos Lecture Videos]
==Basics==
A refresher of [[Machine Learning]] and [[Supervised Learning]].
===Empirical risk minimization (ERM)===
\begin{aligned}
\frac{1}{2}\Vert \nabla f(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)
\end{aligned}
</math>
===Linear Regression===
Assume we have a dataset:<br>
<math>\{(x_i, y_i)\}_{i=1}^{n}</math><br>
<math>y_i \in \mathbb{R}</math><br>
<math>x_i \in \mathbb{R}^d</math><br>
<math>f(w, x) = w^t x</math><br>
<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br>
<math>\min_{w} L(w)</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w_t)</math> where our gradient is:<br>
<math>\nabla L(w_t) = \sum_{i=1}^{n}(f(w_t, x_i) - y_i) \nabla_{w} f(w_t, x_i) = \sum_{i=1}^{n}(f(w_t, x_i) - y_i) x_i</math>
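To make this concrete, here is a minimal NumPy sketch of the gradient-descent loop above on synthetic data (the data, step size, and iteration count are illustrative choices, not from the lecture):
<syntaxhighlight lang="python">
import numpy as np

# Synthetic data: y_i = w_true^t x_i + noise (purely illustrative)
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                # rows are the x_i
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# Gradient descent on L(w) = 1/2 * sum_i (y_i - w^t x_i)^2
w = np.zeros(d)
eta = 1.0 / np.linalg.norm(X, ord=2) ** 2  # step size below 2 / lambda_max(X^t X)
for t in range(500):
    grad = X.T @ (X @ w - y)               # sum_i (f(w, x_i) - y_i) x_i
    w -= eta * grad

print(np.linalg.norm(w - w_true))          # close to zero
</syntaxhighlight>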
<math>f(w, x) = w^t \phi(x)</math><br>
Is this model linear in w? Yes!<br>
Is this model linear in x? No!<br>
<math>\min \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math><br>
Apply GD or convex optimization.
* <math>\phi</math> is fixed!
* <math>D = O(d^k)</math>
** For ImageNet, <math display="inline">d</math> is approx <math display="inline">10^5</math> so <math display="inline">D=O(10^{15})</math>
;Kernel Trick:
We may have a closed-form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.<br>
This is called the kernel function <math>K(x_i, x_j)</math> or kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.<br>
<math display="inline">K</math> is a PSD matrix.
Idea: In many cases we can compute <math>K(x_i, x_j)</math> without "explicitly" computing <math>\phi(x_i)</math>.
;Polynomial Kernels
<math>K(x_i, x_j) = (1 + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math><br>
Here <math>D=O(d^k)</math>, but evaluating <math>K(x_i, x_j)</math> only costs <math>O(d)</math>.
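As a small illustration of the kernel trick (my own toy example, not from the notes), the degree-2 polynomial kernel equals the inner product of explicit quadratic features, but never materializes the <math display="inline">O(d^k)</math>-dimensional feature vector:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2                                # toy sizes, illustrative only
x, z = rng.normal(size=d), rng.normal(size=d)

def phi(x):
    # Explicit degree-2 feature map: with u = [1, x], phi(x) = (u u^t).ravel(),
    # so <phi(x), phi(z)> = (u^t v)^2 = (1 + x^t z)^2.  Its dimension grows as O(d^k).
    u = np.concatenate([[1.0], x])
    return np.outer(u, u).ravel()

explicit = phi(x) @ phi(z)                 # O(d^k) work and memory
kernel   = (1.0 + x @ z) ** k              # kernel trick: O(d) work, same value

print(explicit, kernel)
</syntaxhighlight>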
Many classical techniques can be ''kernelized'':
* SVM to Kernel SVM
* Ridge regression to Kernel ridge regression
* PCA to Kernel PCA
===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w_t, x_i) - y_i) \nabla_w f(w_t, x_i)</math>
# Initialize the weights i.i.d. from <math>N(0,1)</math>.
# The weights update along a trajectory: <math>w(0), w(1), \ldots</math>
# Each <math>w(t)</math> collects all of the network's weights.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training.
* Not always the case! Especially for small <math>m</math>.
Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0) + ...</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).
These features do not change during training because they are evaluated at the initialization <math display="inline">w_0</math>.

Go back to our 2-layer NN:<br>
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>
<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^t x) \sigma'(a_i^t x') (x x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>
* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
Based on the law of large numbers, as <math>m</math> goes to infinity,<br>
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x x') \right]</math><br>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>
For ReLU activations these expectations have closed forms:<br>
<math>K^{(a)}(x, x') = \frac{(x x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta(x, x')) \cos(\theta) + \sin \theta)</math>
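As a sanity check (a sketch of my own, assuming ReLU activations and standard normal initialization), one can estimate <math display="inline">K_m = K_m^{(a)} + K_m^{(b)}</math> by Monte Carlo and watch it stabilize as the width <math display="inline">m</math> grows, in line with the law-of-large-numbers argument above:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d = 10                                          # input dimension (illustrative)
x, xp = rng.normal(size=d), rng.normal(size=d)

def ntk_estimate(m):
    """Estimate K_m(x, x') for a 2-layer ReLU net at random initialization."""
    A = rng.normal(size=(m, d))                 # rows a_i ~ N(0, I_d)
    b = rng.normal(size=m)                      # b_i ~ N(0, 1)
    relu_grad = lambda u: (u > 0).astype(float) # sigma'(u) for ReLU
    K_a = np.mean(b**2 * relu_grad(A @ x) * relu_grad(A @ xp)) * (x @ xp)
    K_b = np.mean(np.maximum(A @ x, 0.0) * np.maximum(A @ xp, 0.0))
    return K_a + K_b

for m in [10, 100, 10_000, 100_000]:
    print(m, ntk_estimate(m))                   # estimates concentrate as m grows
</syntaxhighlight>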
;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues (''Hessian control'').
;Analyze GD:
As <math>\eta \to 0</math>, GD becomes ''gradient flow'':<br>
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br>
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>
For the quadratic loss this reads:<br>
<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math>
Applying the chain rule to the predictions <math display="inline">\hat{y}(w(t))</math>:
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w}\hat{y}(w)^t \frac{dw(t)}{dt}\\
&= -\nabla_{w}\hat{y}(w)^t \nabla_{w}\hat{y}(w) (\hat{y}(w) - y)\\
&= -K(w)(\hat{y}(w) - y)
\end{aligned}
</math>
If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>.<br>
This ODE implies <math>u(t) = u(0)\exp(-K(w_0)t)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0</math> (positive definite), so the residual <math>u(t) \to 0</math> and the training loss converges to zero.
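A small numerical illustration of this linear ODE (with an arbitrary positive-definite matrix standing in for the NTK, purely for illustration): the residual <math display="inline">u(t) = \exp(-Kt)u(0)</math> decays at rates given by the eigenvalues of <math display="inline">K</math>:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.normal(size=(n, n))
K = B @ B.T + 0.1 * np.eye(n)      # arbitrary positive-definite stand-in for the NTK
u0 = rng.normal(size=n)            # initial residual u(0) = y_hat(0) - y

# Diagonalize K = V diag(lambda) V^t and evaluate u(t) = V exp(-lambda t) V^t u(0)
evals, V = np.linalg.eigh(K)
for t in [0.0, 0.5, 1.0, 5.0]:
    u_t = V @ (np.exp(-evals * t) * (V.T @ u0))
    print(t, np.linalg.norm(u_t))  # the norm decays to zero since K > 0
</syntaxhighlight>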
** <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1}R</math>
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric, so its eigenvalues can be complex.
** All eigenvalues satisfy <math>|\lambda_i| \leq 1</math>, with one eigenvalue exactly equal to 1. Since <math>\gamma < 1</math>, <math>I - \gamma P_{\pi}</math> is always invertible.
** However, computing the inverse is computationally expensive since it scales as <math>O(|S|^3)</math> (see the sketch of the direct solve below).
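Here is a minimal NumPy sketch of closed-form policy evaluation on a made-up MDP (the transition matrix and rewards are invented for illustration); in practice one uses a linear solve rather than forming the inverse:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 4, 0.9

# Made-up MDP under a fixed policy: row-stochastic P_pi and reward vector R
P_pi = rng.random((S, S)); P_pi /= P_pi.sum(axis=1, keepdims=True)
R = rng.random(S)

# V_pi = (I - gamma * P_pi)^{-1} R, via a linear solve instead of an explicit inverse
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
print(V_pi)
</syntaxhighlight>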
===Value-Iteration (Bellman Recursion)===
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math><br>
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math>.<br>
<math>v_{\pi} = L_{\pi} v_{\pi}</math>, so <math>v_{\pi}</math> is a fixed point of <math>L_{\pi}</math>.
;Claim: <math>L_{\pi}</math> is a contraction.
Proof: Take <math>v_1, v_2 \in \mathbb{R}^d</math>.
<math>
\begin{aligned}
L_{\pi}v_1 - L_{\pi}v_2 &= (R + \gamma P_{\pi}v_1) - (R + \gamma P_{\pi}v_2) \\
&= \gamma P_{\pi}(v_1 - v_2)
\end{aligned}
</math>
<math>
\begin{aligned}
\Vert L_{\pi}v_1 - L_{\pi}v_2 \Vert_{\infty} &= \Vert \gamma P_{\pi}(v_1 - v_2) \Vert_{\infty}\\
& \leq \gamma \Vert P_{\pi} \Vert_{\infty} \Vert v_1 - v_2 \Vert_{\infty} \\
& \leq \gamma \Vert v_1 - v_2 \Vert_{\infty}
\end{aligned}
</math>
Here <math>\Vert P_{\pi} \Vert_{\infty} = 1</math> because <math>P_{\pi}</math> is row-stochastic, and <math>\gamma < 1</math>, so <math>L_{\pi}</math> is a <math>\gamma</math>-contraction.
By Banach's Fixed Point Theorem, we can converge to the fixed point by repeatedly applying <math>L_{\pi}</math> (a sketch is given below).
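The same fixed point can be reached by repeatedly applying <math display="inline">L_{\pi}</math>, as the contraction argument guarantees; a sketch on the same kind of made-up MDP:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 4, 0.9
P_pi = rng.random((S, S)); P_pi /= P_pi.sum(axis=1, keepdims=True)  # made-up MDP
R = rng.random(S)

# Repeatedly apply L_pi(v) = R + gamma * P_pi v until the fixed point is reached
v = np.zeros(S)
for _ in range(1000):
    v_next = R + gamma * P_pi @ v
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next

print(v)                                                  # agrees with the closed form
print(np.linalg.solve(np.eye(S) - gamma * P_pi, R))
</syntaxhighlight>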
===Optimal Policy===
Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>.
Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>.
The optimal policy may not be unique, but all optimal policies share the same optimal value function.
Intermediate questions:
* Given <math>v^*</math>, can we compute <math>Q^*</math>?
<math>
\begin{aligned}
Q^*(s, a) &= \max_{\pi} Q_{\pi}(s,a)\\
&= \max_{\pi} R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s')\\
&= R(s) + \gamma \sum_{s'} p(s' | s, a) \max_{\pi} v_{\pi}(s')\\
&= R(s) + \gamma \sum_{s'} p(s' | s, a) v^*(s')
\end{aligned}
</math>
* Given <math>Q^*</math>, can we compute <math>v^*</math>?
Bellman's optimality condition states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>.
<math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>
;How to compute optimal policies?
First approach: Value Iteration.
<math>v^*(s) = \max_{a} Q^*(s, a) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^*(s')]</math>
Define an operator: <math>L(v) = \max_{\pi}[R + \gamma P_{\pi} v]</math>
<math>v^*</math> is a fixed point of <math>L(v)</math>.
Claim: <math>L</math> is also a contraction.
Thus <math>v^*</math> can be computed by repeated application of <math>L</math>.
===Value Iteration===
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>.
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{(r)}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math> (a sketch of this loop follows).
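A minimal sketch of the value-iteration loop on a small made-up MDP (the transition tensor <math display="inline">p(s'|s,a)</math> and rewards are invented for illustration):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 4, 2, 0.9, 1e-8

# Made-up MDP: P[a, s, s'] = p(s'|s,a), reward R(s)
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(S)

v = np.zeros(S)                                   # v^(0)
while True:
    # v^(r+1)(s) = max_a [ R(s) + gamma * sum_s' p(s'|s,a) v^(r)(s') ]
    v_next = np.max(R + gamma * P @ v, axis=0)
    if np.linalg.norm(v_next - v) <= eps:
        break
    v = v_next

pi_star = np.argmax(R + gamma * P @ v, axis=0)    # greedy policy extracted from v*
print(v, pi_star)
</syntaxhighlight>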
===Policy Iteration===
Directly optimize the policy.
* Start with a random policy <math>\pi^{(0)}</math>.
* Evaluate the policy <math>v_{\pi}</math> using the closed form or value iteration.
* Evaluate the Q-function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s'|s,a) v_{\pi}(s')</math>.
* Find the best action to be taken at state <math>s</math>:
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math>
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math>.
* Repeat until convergence (a sketch of this loop is given below).
Policy iteration is guaranteed to converge to an optimal policy.
Oftentimes it converges faster than value iteration.
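A matching sketch of policy iteration on the same kind of made-up MDP, evaluating each policy in closed form and then acting greedily with respect to its Q-function:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9

# Made-up MDP: P[a, s, s'] = p(s'|s,a), reward R(s)
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(S)

pi = rng.integers(A, size=S)                      # random initial policy pi^(0)
while True:
    P_pi = P[pi, np.arange(S)]                    # (S, S) rows: p(.|s, pi(s))
    v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R)  # policy evaluation (closed form)
    Q_pi = R + gamma * P @ v_pi                   # Q_pi[a, s] = R(s) + gamma * sum_s' p(s'|s,a) v_pi(s')
    pi_next = np.argmax(Q_pi, axis=0)             # greedy improvement a*(s)
    if np.array_equal(pi_next, pi):
        break
    pi = pi_next

print(pi, v_pi)
</syntaxhighlight>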
===Deep Reinforcement Learning===
;Relaxing some unrealistic assumptions
# Evaluate <math>v_{\pi}(s)</math>
#* <math>Q_{\pi}(s, a) = R(s, a) + \gamma E_{s' \sim P(s'|s,a)}[v_{\pi}(s')]</math>
# Improve the policy
#* <math>\operatorname{argmax}_{a_t} Q_{\pi}(s_t, a_t)</math>
#* Assumption: <math>|S|=d</math>
# How to represent <math>V(s)</math>?
#* Can we use a neural network to represent <math>V(s)</math>? Yes.
;How to train <math>v_{\phi}</math>?
* Start with an old <math>v_{\phi}</math>, compute <math>Q_{\pi}(s,a)</math>.
** <math>Q_{\pi}(s,a) = R(s,a) + \gamma E[v_{\phi}^{old}(s')]</math>
* Fit <math>v_{\phi}</math> to <math>\max_{a}Q_{\pi}(s,a)</math> using a quadratic loss:
** <math>L(\phi) = \frac{1}{2} \Vert v_{\phi}(s) - \max_{a} Q_{\pi}(s,a) \Vert^2</math>
* Iterate.
;Similarly, we can parameterize the Q-function
Compute the target <math>y_i \leftarrow R(s_i, a_i) + \gamma E[v_{\phi}(s_i')]</math>.
We need to know the transition probabilities <math>P(s'|s,a)</math> to compute the expectation (model-based RL).
With model-free RL, we can approximate the expectation with the sampled next state: <math>E[v(s_i')] \approx v(s_i') = \max_{a'} Q(s_i', a')</math>.
This is called ''Q-Learning''.
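A small tabular sketch of Q-learning on a made-up environment; the expectation over <math display="inline">s'</math> is replaced by the single sampled next state, and <math display="inline">\alpha</math> is a step size (all values here are illustrative):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha, eps = 4, 2, 0.9, 0.1, 0.1

# Made-up environment: P[s, a, s'] = p(s'|s,a), reward R(s, a)
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

Q = np.zeros((S, A))
s = 0
for _ in range(50_000):
    a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))  # eps-greedy
    s_next = rng.choice(S, p=P[s, a])             # sample s' rather than computing E[.]
    y = R[s, a] + gamma * np.max(Q[s_next])       # target: R(s,a) + gamma * max_a' Q(s',a')
    Q[s, a] += alpha * (y - Q[s, a])              # move Q(s,a) toward the target
    s = s_next

print(Q)
</syntaxhighlight>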
;What if we have continuous actions?
* Approach 1: Use a function class such that <math>\max_{a}Q(s,a)</math> is easy to solve.
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer <math>\operatorname{argmax}_{a'} Q(s,a')</math>.
===Policy Gradient Method===
Lecture 29 (Dec 8, 2020)
Probability of observing a trajectory:
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
Reward for a trajectory:
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + \cdots + R(s_T, a_T)</math>
The average reward is:
<math>J(\theta) = E[R(\tau)] = \sum_{t=1}^{T} E[R(s_t, a_t)]</math>
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>.
Gradient of the average reward:
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau \\
&= \int P_{\theta}(\tau) \nabla_{\theta} \log P_\theta(\tau) R(\tau) d\tau\\
&= E[\nabla_{\theta} \log P_\theta(\tau) R(\tau)]
\end{aligned}
</math>
Since the transition probabilities do not depend on <math>\theta</math>,
<math>
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math>
This implies
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= E\left[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\right]\\
&\approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})\right) R(\tau^{(i)})
\end{aligned}
</math>
;Summary
* Sample trajectories <math>\tau^{(1)}, \ldots, \tau^{(N)}</math>.
* Approximate <math>\nabla_{\theta} J(\theta)</math> as above.
* <math>\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)</math> (this loop is sketched below).
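A tabular sketch of this loop, assuming a softmax policy <math display="inline">\pi_{\theta}(a|s) \propto e^{\theta_{s,a}}</math> on a made-up MDP, for which the score function <math display="inline">\nabla_{\theta}\log\pi_{\theta}(a|s)</math> has the closed form <math display="inline">e_a - \pi_{\theta}(\cdot|s)</math> on row <math display="inline">s</math> of <math display="inline">\theta</math>:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, T, N = 4, 2, 10, 50          # states, actions, horizon, trajectories per update
alpha, n_updates = 0.1, 200        # step size and number of updates (illustrative)

# Made-up MDP: P[s, a, s'] = p(s'|s,a), reward R(s, a)
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

theta = np.zeros((S, A))           # tabular softmax policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(n_updates):
    grad = np.zeros_like(theta)
    for _ in range(N):                       # sample N trajectories
        s, ret, score = 0, 0.0, np.zeros_like(theta)
        for t in range(T):
            p = softmax(theta[s])
            a = rng.choice(A, p=p)
            score[s] += np.eye(A)[a] - p     # grad_theta log pi_theta(a|s)
            ret += R[s, a]                   # accumulate R(tau)
            s = rng.choice(S, p=P[s, a])
        grad += score * ret / N              # (sum_t grad log pi) * R(tau), averaged
    theta += alpha * grad                    # gradient ascent: theta <- theta + alpha * grad J

print(softmax(theta[0]))                     # learned action preferences in state 0
</syntaxhighlight>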
;Intuition
<math>E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math> pushes up the probability of high-reward trajectories and pushes down the probability of low-reward ones.
Formalizing ''trial & error''.
[Finn & Levine, ICML]
===Issues with Policy Gradient===
;High variance of gradient estimation
;Solutions
* Subtract a baseline
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau^{(i)}) R(\tau^{(i)})\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)
\end{aligned}
</math>
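The reward-to-go terms <math display="inline">\sum_{t'=t}^{T} R(s_{t'}, a_{t'})</math> are just a reversed cumulative sum of the per-step rewards; a small helper of my own:
<syntaxhighlight lang="python">
import numpy as np

def reward_to_go(rewards):
    """rewards[t] = R(s_t, a_t); returns the vector of sums over t' >= t."""
    return np.cumsum(rewards[::-1])[::-1]

print(reward_to_go(np.array([1.0, 0.0, 2.0, 3.0])))   # example values: [6. 5. 5. 3.]
</syntaxhighlight>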
;Some parameters can change <math>\pi_{\theta}</math> more than others, so it's hard to choose a fixed learning rate.
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math>F</math> is the Fisher information matrix.
===Actor-critic algorithms===
Have an actor <math>\pi_{\theta}</math>.
Have a critic <math>V_{\phi}</math> (or <math>Q_{\phi}</math>).
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
===Other topics in RL===
* Inverse RL
* Multi-agent RL
* Model-based RL
==Summary of Course==
;What we covered
* Supervised DL
* Unsupervised DL (GANs, VAEs)
* Self-supervised DL
* Meta-Learning
* Learning with Attention (Transformers)
* Deep RL
* Optimization
* Generalization
* Robustness
* Interpretability
;What we didn't cover
* Fairness
* Privacy & Ethics
* Bayesian DL
* Federated Learning
* Graph NNs
;Things which may be on the final
* Transformers
* Wasserstein distance
* Kernel methods
==Resources==