Deep Learning
* [http://www.cs.umd.edu/class/fall2020/cmsc828W/ Course Website]
* [https://www.youtube.com/user/soheilfeiz/videos Lecture Videos]
==Basics==
A refresher of [[Machine Learning]] and [[Supervised Learning]].
===Empirical risk minimization (ERM)===
\begin{aligned}
\frac{1}{2}\Vert \nabla f(w) \Vert^2 &= \frac{1}{2}\Vert (F(w)-y)^T \nabla F(w)\Vert^2\\
&=\frac{1}{2}(F(w)-y)^T \nabla F(w) \nabla F(w)^T (F(w)-y)\\
&\geq \frac{1}{2} \lambda_{\min}(K(w)) \Vert F(w)-y\Vert ^2\\
&= \lambda_{\min}(K(w)) L(w)
\end{aligned}
</math>
===Linear Regression===
Assume we have a dataset:<br>
<math>\{(x_i, y_i)\}_{i=1}^{n}</math><br>
<math>y_i \in \mathbb{R}</math><br>
<math>x_i \in \mathbb{R}^d</math><br>
<math>f(w, x) = w^t x</math><br>
<math>L(w) = \frac{1}{2} \sum_{i=1}^{n}(y_i - f(w, x_i))^2</math><br>
<math>\min_{w} L(w)</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \nabla L(w_t)</math> where our gradient is:<br>
<math>\nabla L(w_t) = \sum_{i=1}^{n}(f(w_t, x_i) - y_i) \nabla_{w} f(w_t, x_i) = \sum_{i=1}^{n}(f(w_t, x_i) - y_i) x_i</math>
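To make this concrete, here is a minimal NumPy sketch of the gradient-descent loop above on synthetic data (the data, step size, and iteration count are illustrative choices, not from the lecture):
<syntaxhighlight lang="python">
import numpy as np

# Synthetic data: y_i = w_true^t x_i + noise (purely illustrative)
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))                # rows are the x_i
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# Gradient descent on L(w) = 1/2 * sum_i (y_i - w^t x_i)^2
w = np.zeros(d)
eta = 1.0 / np.linalg.norm(X, ord=2) ** 2  # step size below 2 / lambda_max(X^t X)
for t in range(500):
    grad = X.T @ (X @ w - y)               # sum_i (f(w, x_i) - y_i) x_i
    w -= eta * grad

print(np.linalg.norm(w - w_true))          # close to zero
</syntaxhighlight>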
<math>f(w, x) = w^t \phi(x)</math><br>
Is this model linear in w? Yes!<br>
Is this model linear in x? No!<br>
<math>\min \frac{1}{2} \sum_{i=1}^{n} (y_i - w^t \phi(x_i))^2</math><br>
Apply GD or convex optimization.
* <math>\phi</math> is fixed!
* <math>D = O(d^k)</math>
** For ImageNet, <math display="inline">d</math> is approx <math display="inline">10^5</math> so <math display="inline">D=O(10^{15})</math>
;Kernel Trick:
We may have a closed-form solution for <math>\langle \phi(x_i), \phi(x_j) \rangle</math>.<br>
This is called the kernel function <math>K(x_i, x_j)</math> or kernel matrix <math>K \in \mathbb{R}^{n \times n}</math>.<br>
<math display="inline">K</math> is a PSD matrix.
Idea: In many cases we can compute <math>K(x_i, x_j)</math> without "explicitly" computing <math>\phi(x_i)</math>.
;Polynomial Kernels
<math>K(x_i, x_j) = (1 + x_i^t x_j)^k</math> with <math>\phi(x_i) \in \mathbb{R}^D</math><br>
Here <math>D=O(d^k)</math>, but evaluating <math>K(x_i, x_j)</math> only costs <math>O(d)</math>.
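As a small illustration of the kernel trick (my own toy example, not from the notes), the degree-2 polynomial kernel equals the inner product of explicit quadratic features, but never materializes the <math display="inline">O(d^k)</math>-dimensional feature vector:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 2                                # toy sizes, illustrative only
x, z = rng.normal(size=d), rng.normal(size=d)

def phi(x):
    # Explicit degree-2 feature map: with u = [1, x], phi(x) = (u u^t).ravel(),
    # so <phi(x), phi(z)> = (u^t v)^2 = (1 + x^t z)^2.  Its dimension grows as O(d^k).
    u = np.concatenate([[1.0], x])
    return np.outer(u, u).ravel()

explicit = phi(x) @ phi(z)                 # O(d^k) work and memory
kernel   = (1.0 + x @ z) ** k              # kernel trick: O(d) work, same value

print(explicit, kernel)
</syntaxhighlight>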
Many classical techniques can be ''kernelized'':
* SVM to Kernel SVM
* Ridge regression to Kernel ridge regression
* PCA to Kernel PCA
===Neural Networks===
Consider a two-layer neural network.<br>
We can write the output as:<br>
<math>y = f(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
We use quadratic loss: <math>L(w) = \frac{1}{2} \sum_{i=1}^{n} (f(w, x_i) - y_i)^2</math><br>
GD: <math>w(t+1) = w(t) - \eta_{t} \sum_{i=1}^{n} (f(w_t, x_i) - y_i) \nabla_w f(w_t, x_i)</math>
# Initialize the weights i.i.d. from <math>N(0,1)</math>.
# The weights update along a trajectory: <math>w(0), w(1), \ldots</math>
# Each <math>w(t)</math> collects all of the network's weights.
Empirical Observation: When the width of the network <math>m</math> is large, the trajectory of gradient descent is ''almost'' static.
This is called ''lazy'' training.
* Not always the case! Especially for small <math>m</math>.
Since the change in the model weights is not large, we can write the first-order Taylor approximation:<br>
<math>f(w, x) \approx f(w_0, x) + \nabla_{w} f(w_0, x)^t (w - w_0) + ...</math><br>
This model is linear in <math>w</math>.<br>
<math>\phi(x) = \nabla_{w} f(w_0, x)</math><br>
The kernel <math>K = \langle \phi(x_i), \phi(x_j) \rangle</math> is called the ''Neural Tangent Kernel'' (NTK).
These features do not change during training because they are evaluated at the initialization <math display="inline">w_0</math>.

Go back to our 2-layer NN:<br>
<math>f_m(w, x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \sigma(a_i^t x)</math><br>
<math>\nabla_{a_i} f_m(w, x) = \frac{1}{\sqrt{m}} b_i \sigma'(a_i^t x) x</math><br>
<math>\nabla_{b_i} f_m(w, x) = \frac{1}{\sqrt{m}} \sigma(a_i^t x)</math>
<math>K_{m}(x, x') = K_{m}^{(a)}(x, x') + K_{m}^{(b)}(x, x')</math><br>
<math>K_{m}^{(a)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} b_i^2 \sigma'(a_i^t x) \sigma'(a_i^t x') (x x')</math><br>
<math>K_{m}^{(b)}(x, x') = \frac{1}{m} \sum_{i=1}^{m} \sigma(a_i^t x) \sigma(a_i^t x')</math>
* <math>a_i</math> and <math>b_i</math> are independent samples at initialization.
Based on the law of large numbers, as <math>m</math> goes to infinity,<br>
<math>K_{m}^{(a)}(x, x') \to K^{(a)}(x, x') = E \left[ b^2 \sigma'(a^t x) \sigma'(a^t x') (x x') \right]</math><br>
<math>K_{m}^{(b)}(x, x') \to K^{(b)}(x, x') = E \left[ \sigma(a^t x) \sigma(a^t x') \right]</math>
For ReLU activations these expectations have closed forms:<br>
<math>K^{(a)}(x, x') = \frac{(x x') E[b^2]}{2 \pi} (\pi - \theta(x, x'))</math><br>
<math>K^{(b)}(x, x') = \frac{\Vert x \Vert \Vert x' \Vert E[\Vert a \Vert^2]}{2 \pi d} ((\pi - \theta(x, x')) \cos(\theta) + \sin \theta)</math>
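As a sanity check (a sketch of my own, assuming ReLU activations and standard normal initialization), one can estimate <math display="inline">K_m = K_m^{(a)} + K_m^{(b)}</math> by Monte Carlo and watch it stabilize as the width <math display="inline">m</math> grows, in line with the law-of-large-numbers argument above:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d = 10                                          # input dimension (illustrative)
x, xp = rng.normal(size=d), rng.normal(size=d)

def ntk_estimate(m):
    """Estimate K_m(x, x') for a 2-layer ReLU net at random initialization."""
    A = rng.normal(size=(m, d))                 # rows a_i ~ N(0, I_d)
    b = rng.normal(size=m)                      # b_i ~ N(0, 1)
    relu_grad = lambda u: (u > 0).astype(float) # sigma'(u) for ReLU
    K_a = np.mean(b**2 * relu_grad(A @ x) * relu_grad(A @ xp)) * (x @ xp)
    K_b = np.mean(np.maximum(A @ x, 0.0) * np.maximum(A @ xp, 0.0))
    return K_a + K_b

for m in [10, 100, 10_000, 100_000]:
    print(m, ntk_estimate(m))                   # estimates concentrate as m grows
</syntaxhighlight>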
;Q: When is this Taylor approximation good?
If the Hessian has bounded eigenvalues (''Hessian control'').
;Analyze GD:
As <math>\eta \to 0</math>, GD becomes ''gradient flow'':<br>
<math>w(t+1) = w(t) - \eta \nabla_{w} L(w(t)) \implies \frac{w(t+1) - w(t)}{\eta} = - \nabla_{w} L(w(t))</math><br>
<math>\to \frac{dw(t)}{dt} = -\nabla_{w} L(w(t))</math>
For the quadratic loss this reads:<br>
<math>\frac{dw(t)}{dt} = -\nabla_{w} \hat{y}(w) (\hat{y}(w) - y)</math>
Applying the chain rule to the predictions <math display="inline">\hat{y}(w(t))</math>:
<math>
\begin{aligned}
\frac{d\hat{y}(w(t))}{dt} &= \nabla_{w}\hat{y}(w)^t \frac{dw(t)}{dt}\\
&= -\nabla_{w}\hat{y}(w)^t \nabla_{w}\hat{y}(w) (\hat{y}(w) - y)\\
&= -K(w)(\hat{y}(w) - y)
\end{aligned}
</math>
If we let <math>u = \hat{y} - y</math>, then <math>\frac{du}{dt} \approx -K(w_0) u</math>.<br>
This ODE implies <math>u(t) = u(0)\exp(-K(w_0)t)</math>.<br>
In the over-parameterized case, <math>K(w_0) > 0</math> (positive definite), so the residual <math>u(t) \to 0</math> and the training loss converges to zero.
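A small numerical illustration of this linear ODE (with an arbitrary positive-definite matrix standing in for the NTK, purely for illustration): the residual <math display="inline">u(t) = \exp(-Kt)u(0)</math> decays at rates given by the eigenvalues of <math display="inline">K</math>:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.normal(size=(n, n))
K = B @ B.T + 0.1 * np.eye(n)      # arbitrary positive-definite stand-in for the NTK
u0 = rng.normal(size=n)            # initial residual u(0) = y_hat(0) - y

# Diagonalize K = V diag(lambda) V^t and evaluate u(t) = V exp(-lambda t) V^t u(0)
evals, V = np.linalg.eigh(K)
for t in [0.0, 0.5, 1.0, 5.0]:
    u_t = V @ (np.exp(-evals * t) * (V.T @ u0))
    print(t, np.linalg.norm(u_t))  # the norm decays to zero since K > 0
</syntaxhighlight>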
** <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1}R</math>
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric, so its eigenvalues can be complex.
** All eigenvalues satisfy <math>|\lambda_i| \leq 1</math>, with one eigenvalue exactly equal to 1. Since <math>\gamma < 1</math>, <math>I - \gamma P_{\pi}</math> is always invertible.
** However, computing the inverse is computationally expensive since it scales as <math>O(|S|^3)</math> (see the sketch of the direct solve below).
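Here is a minimal NumPy sketch of closed-form policy evaluation on a made-up MDP (the transition matrix and rewards are invented for illustration); in practice one uses a linear solve rather than forming the inverse:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 4, 0.9

# Made-up MDP under a fixed policy: row-stochastic P_pi and reward vector R
P_pi = rng.random((S, S)); P_pi /= P_pi.sum(axis=1, keepdims=True)
R = rng.random(S)

# V_pi = (I - gamma * P_pi)^{-1} R, via a linear solve instead of an explicit inverse
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R)
print(V_pi)
</syntaxhighlight>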
===Value-Iteration (Bellman Recursion)===
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math><br>
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math>.<br>
<math>v_{\pi} = L_{\pi} v_{\pi}</math>, so <math>v_{\pi}</math> is a fixed point of <math>L_{\pi}</math>.
;Claim: <math>L_{\pi}</math> is a contraction.
Proof: Take <math>v_1, v_2 \in \mathbb{R}^d</math>.
<math>
\begin{aligned}
L_{\pi}v_1 - L_{\pi}v_2 &= (R + \gamma P_{\pi}v_1) - (R + \gamma P_{\pi}v_2) \\
&= \gamma P_{\pi}(v_1 - v_2)
\end{aligned}
</math>
<math>
\begin{aligned}
\Vert L_{\pi}v_1 - L_{\pi}v_2 \Vert_{\infty} &= \Vert \gamma P_{\pi}(v_1 - v_2) \Vert_{\infty}\\
& \leq \gamma \Vert P_{\pi} \Vert_{\infty} \Vert v_1 - v_2 \Vert_{\infty} \\
& \leq \gamma \Vert v_1 - v_2 \Vert_{\infty}
\end{aligned}
</math>
Here <math>\Vert P_{\pi} \Vert_{\infty} = 1</math> because <math>P_{\pi}</math> is row-stochastic, and <math>\gamma < 1</math>, so <math>L_{\pi}</math> is a <math>\gamma</math>-contraction.
By Banach's Fixed Point Theorem, we can converge to the fixed point by repeatedly applying <math>L_{\pi}</math> (a sketch is given below).
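The same fixed point can be reached by repeatedly applying <math display="inline">L_{\pi}</math>, as the contraction argument guarantees; a sketch on the same kind of made-up MDP:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 4, 0.9
P_pi = rng.random((S, S)); P_pi /= P_pi.sum(axis=1, keepdims=True)  # made-up MDP
R = rng.random(S)

# Repeatedly apply L_pi(v) = R + gamma * P_pi v until the fixed point is reached
v = np.zeros(S)
for _ in range(1000):
    v_next = R + gamma * P_pi @ v
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next

print(v)                                                  # agrees with the closed form
print(np.linalg.solve(np.eye(S) - gamma * P_pi, R))
</syntaxhighlight>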
===Optimal Policy===
Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>.
Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>.
The optimal policy may not be unique, but all optimal policies share the same optimal value function.
Intermediate questions:
* Given <math>v^*</math>, can we compute <math>Q^*</math>?
<math>
\begin{aligned}
Q^*(s, a) &= \max_{\pi} Q_{\pi}(s,a)\\
&= \max_{\pi} R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s')\\
&= R(s) + \gamma \sum_{s'} p(s' | s, a) \max_{\pi} v_{\pi}(s')\\
&= R(s) + \gamma \sum_{s'} p(s' | s, a) v^*(s')
\end{aligned}
</math>
* Given <math>Q^*</math>, can we compute <math>v^*</math>?
Bellman's optimality condition states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>.
<math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>
;How to compute optimal policies?
First approach: Value Iteration.
<math>v^*(s) = \max_{a} Q^*(s, a) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^*(s')]</math>
Define an operator: <math>L(v) = \max_{\pi}[R + \gamma P_{\pi} v]</math>
<math>v^*</math> is a fixed point of <math>L(v)</math>.
Claim: <math>L</math> is also a contraction.
Thus <math>v^*</math> can be computed by repeated application of <math>L</math>.
===Value Iteration===
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>.
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{(r)}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math> (a sketch of this loop follows).
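A minimal sketch of the value-iteration loop on a small made-up MDP (the transition tensor <math display="inline">p(s'|s,a)</math> and rewards are invented for illustration):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 4, 2, 0.9, 1e-8

# Made-up MDP: P[a, s, s'] = p(s'|s,a), reward R(s)
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(S)

v = np.zeros(S)                                   # v^(0)
while True:
    # v^(r+1)(s) = max_a [ R(s) + gamma * sum_s' p(s'|s,a) v^(r)(s') ]
    v_next = np.max(R + gamma * P @ v, axis=0)
    if np.linalg.norm(v_next - v) <= eps:
        break
    v = v_next

pi_star = np.argmax(R + gamma * P @ v, axis=0)    # greedy policy extracted from v*
print(v, pi_star)
</syntaxhighlight>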
===Policy Iteration===
Directly optimize the policy.
* Start with a random policy <math>\pi^{(0)}</math>.
* Evaluate the policy <math>v_{\pi}</math> using the closed form or value iteration.
* Evaluate the Q-function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} P(s'|s,a) v_{\pi}(s')</math>.
* Find the best action to be taken at state <math>s</math>:
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math>
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math>.
* Repeat until convergence (a sketch of this loop is given below).
Policy iteration is guaranteed to converge to an optimal policy.
Oftentimes it converges faster than value iteration.
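A matching sketch of policy iteration on the same kind of made-up MDP, evaluating each policy in closed form and then acting greedily with respect to its Q-function:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9

# Made-up MDP: P[a, s, s'] = p(s'|s,a), reward R(s)
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random(S)

pi = rng.integers(A, size=S)                      # random initial policy pi^(0)
while True:
    P_pi = P[pi, np.arange(S)]                    # (S, S) rows: p(.|s, pi(s))
    v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R)  # policy evaluation (closed form)
    Q_pi = R + gamma * P @ v_pi                   # Q_pi[a, s] = R(s) + gamma * sum_s' p(s'|s,a) v_pi(s')
    pi_next = np.argmax(Q_pi, axis=0)             # greedy improvement a*(s)
    if np.array_equal(pi_next, pi):
        break
    pi = pi_next

print(pi, v_pi)
</syntaxhighlight>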
===Deep Reinforcement Learning===
;Relaxing some unrealistic assumptions
# Evaluate <math>v_{\pi}(s)</math>
#* <math>Q_{\pi}(s, a) = R(s, a) + \gamma E_{s' \sim P(s'|s,a)}[v_{\pi}(s')]</math>
# Improve the policy
#* <math>\operatorname{argmax}_{a_t} Q_{\pi}(s_t, a_t)</math>
#* Assumption: <math>|S|=d</math>
# How to represent <math>V(s)</math>?
#* Can we use a neural network to represent <math>V(s)</math>? Yes.
;How to train <math>v_{\phi}</math>?
* Start with an old <math>v_{\phi}</math>, compute <math>Q_{\pi}(s,a)</math>.
** <math>Q_{\pi}(s,a) = R(s,a) + \gamma E[v_{\phi}^{old}(s')]</math>
* Fit <math>v_{\phi}</math> to <math>\max_{a}Q_{\pi}(s,a)</math> using a quadratic loss:
** <math>L(\phi) = \frac{1}{2} \Vert v_{\phi}(s) - \max_{a} Q_{\pi}(s,a) \Vert^2</math>
* Iterate.
;Similarly, we can parameterize the Q-function
Compute the target <math>y_i \leftarrow R(s_i, a_i) + \gamma E[v_{\phi}(s_i')]</math>.
We need to know the transition probabilities <math>P(s'|s,a)</math> to compute the expectation (model-based RL).
With model-free RL, we can approximate the expectation with the sampled next state: <math>E[v(s_i')] \approx v(s_i') = \max_{a'} Q(s_i', a')</math>.
This is called ''Q-Learning''.
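A small tabular sketch of Q-learning on a made-up environment; the expectation over <math display="inline">s'</math> is replaced by the single sampled next state, and <math display="inline">\alpha</math> is a step size (all values here are illustrative):
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, alpha, eps = 4, 2, 0.9, 0.1, 0.1

# Made-up environment: P[s, a, s'] = p(s'|s,a), reward R(s, a)
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

Q = np.zeros((S, A))
s = 0
for _ in range(50_000):
    a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))  # eps-greedy
    s_next = rng.choice(S, p=P[s, a])             # sample s' rather than computing E[.]
    y = R[s, a] + gamma * np.max(Q[s_next])       # target: R(s,a) + gamma * max_a' Q(s',a')
    Q[s, a] += alpha * (y - Q[s, a])              # move Q(s,a) toward the target
    s = s_next

print(Q)
</syntaxhighlight>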
;What if we have continuous actions?
* Approach 1: Use a function class such that <math>\max_{a}Q(s,a)</math> is easy to solve.
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer <math>\operatorname{argmax}_{a'} Q(s,a')</math>.
===Policy Gradient Method===
Lecture 29 (Dec 8, 2020)
Probability of observing a trajectory:
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
Reward for a trajectory:
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + \cdots + R(s_T, a_T)</math>
The average reward is:
<math>J(\theta) = E[R(\tau)] = \sum_{t=1}^{T} E[R(s_t, a_t)]</math>
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>.
Gradient of the average reward:
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau \\
&= \int P_{\theta}(\tau) \nabla_{\theta} \log P_\theta(\tau) R(\tau) d\tau\\
&= E[\nabla_{\theta} \log P_\theta(\tau) R(\tau)]
\end{aligned}
</math>
Since the transition probabilities do not depend on <math>\theta</math>,
<math>
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math>
This implies
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= E\left[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\right]\\
&\approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})\right) R(\tau^{(i)})
\end{aligned}
</math>
;Summary
* Sample trajectories <math>\tau^{(1)}, \ldots, \tau^{(N)}</math>.
* Approximate <math>\nabla_{\theta} J(\theta)</math> as above.
* <math>\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)</math> (this loop is sketched below).
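A tabular sketch of this loop, assuming a softmax policy <math display="inline">\pi_{\theta}(a|s) \propto e^{\theta_{s,a}}</math> on a made-up MDP, for which the score function <math display="inline">\nabla_{\theta}\log\pi_{\theta}(a|s)</math> has the closed form <math display="inline">e_a - \pi_{\theta}(\cdot|s)</math> on row <math display="inline">s</math> of <math display="inline">\theta</math>:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
S, A, T, N = 4, 2, 10, 50          # states, actions, horizon, trajectories per update
alpha, n_updates = 0.1, 200        # step size and number of updates (illustrative)

# Made-up MDP: P[s, a, s'] = p(s'|s,a), reward R(s, a)
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

theta = np.zeros((S, A))           # tabular softmax policy parameters

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(n_updates):
    grad = np.zeros_like(theta)
    for _ in range(N):                       # sample N trajectories
        s, ret, score = 0, 0.0, np.zeros_like(theta)
        for t in range(T):
            p = softmax(theta[s])
            a = rng.choice(A, p=p)
            score[s] += np.eye(A)[a] - p     # grad_theta log pi_theta(a|s)
            ret += R[s, a]                   # accumulate R(tau)
            s = rng.choice(S, p=P[s, a])
        grad += score * ret / N              # (sum_t grad log pi) * R(tau), averaged
    theta += alpha * grad                    # gradient ascent: theta <- theta + alpha * grad J

print(softmax(theta[0]))                     # learned action preferences in state 0
</syntaxhighlight>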
;Intuition
<math>E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math> pushes up the probability of high-reward trajectories and pushes down the probability of low-reward ones.
Formalizing ''trial & error''.
[Finn & Levine, ICML]
===Issues with Policy Gradient===
;High variance of gradient estimation
;Solutions
* Subtract a baseline
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau^{(i)}) R(\tau^{(i)})\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)
\end{aligned}
</math>
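The reward-to-go terms <math display="inline">\sum_{t'=t}^{T} R(s_{t'}, a_{t'})</math> are just a reversed cumulative sum of the per-step rewards; a small helper of my own:
<syntaxhighlight lang="python">
import numpy as np

def reward_to_go(rewards):
    """rewards[t] = R(s_t, a_t); returns the vector of sums over t' >= t."""
    return np.cumsum(rewards[::-1])[::-1]

print(reward_to_go(np.array([1.0, 0.0, 2.0, 3.0])))   # example values: [6. 5. 5. 3.]
</syntaxhighlight>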
;Some parameters can change <math>\pi_{\theta}</math> more than others, so it's hard to choose a fixed learning rate.
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math>F</math> is the Fisher information matrix.
===Actor-critic algorithms===
Have an actor <math>\pi_{\theta}</math>.
Have a critic <math>V_{\phi}</math> (or <math>Q_{\phi}</math>).
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
===Other topics in RL===
* Inverse RL
* Multi-agent RL
* Model-based RL
==Summary of Course==
;What we covered
* Supervised DL
* Unsupervised DL (GANs, VAEs)
* Self-supervised DL
* Meta-Learning
* Learning with Attention (Transformers)
* Deep RL
* Optimization
* Generalization
* Robustness
* Interpretability
;What we didn't cover
* Fairness
* Privacy & Ethics
* Bayesian DL
* Federated Learning
* Graph NNs
;Things which may be on the final
* Transformers
* Wasserstein distance
* Kernel methods
==Resources==