Deep Learning
* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s, a')</math> (a short sketch of this idea follows below).
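One way such a maximizer network can be trained (a minimal sketch, not necessarily the lecture's exact construction) is by gradient ascent on <math>Q(s, \mu_{\phi}(s))</math> with respect to the actor parameters <math>\phi</math>, holding the critic <math>Q</math> fixed. The dimensions and networks below are placeholders:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2  # placeholder dimensions

# Actor mu(s) trained to output (approximately) argmax_{a'} Q(s, a').
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
# Critic Q(s, a); in practice this would already be (partially) trained.
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)                 # a batch of states (placeholder data)
actions = actor(states)                             # proposed maximizing actions
q_values = critic(torch.cat([states, actions], dim=-1))

# Gradient ascent on Q(s, mu(s)): maximize Q by minimizing its negative.
loss = -q_values.mean()
opt.zero_grad()
loss.backward()
opt.step()
</syntaxhighlight>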
===Training using Gradient Descent/Ascent===
Lecture 29 (Dec 8, 2020)
Probability of observing a trajectory:
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
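In practice one rarely evaluates <math>P(s_1)</math> or <math>P(s_{t+1} | s_t, a_t)</math> explicitly; trajectories are obtained by rolling out the policy in the environment, which implicitly samples from those distributions. A minimal sketch, assuming a Gym-style environment (newer <code>reset()</code>/<code>step()</code> API) and a <code>policy</code> function that samples <math>a_t \sim \pi_{\theta}(a_t | s_t)</math>:

<syntaxhighlight lang="python">
def sample_trajectory(env, policy, T):
    """Roll out one trajectory tau = (s_1, a_1, r_1, ..., s_T, a_T, r_T)."""
    states, actions, rewards = [], [], []
    s, _ = env.reset()                                  # s_1 ~ P(s_1)
    for _ in range(T):
        a = policy(s)                                   # a_t ~ pi_theta(a_t | s_t)
        s_next, r, terminated, truncated, _ = env.step(a)  # s_{t+1} ~ P(. | s_t, a_t)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if terminated or truncated:
            break
    return states, actions, rewards
</syntaxhighlight>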
<math>J(\theta) = E[R(\tau)] = \sum_t E[R(s_t, a_t)]</math>
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>.
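Since <math>J(\theta)</math> is an expectation over trajectories, it can be estimated by Monte Carlo: sample several trajectories with the current policy and average their total rewards. A short sketch, building on the hypothetical <code>sample_trajectory</code> helper above:

<syntaxhighlight lang="python">
def estimate_J(env, policy, T, num_trajectories=100):
    """Monte Carlo estimate of J(theta) = E[R(tau)]: average the total
    reward R(tau) = sum_t R(s_t, a_t) over sampled trajectories."""
    returns = []
    for _ in range(num_trajectories):
        _, _, rewards = sample_trajectory(env, policy, T)
        returns.append(sum(rewards))
    return sum(returns) / len(returns)
</syntaxhighlight>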
Gradient of the average reward:
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau
\end{aligned}
</math>
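The last integral is usually made computable with the identity <math>\nabla_{\theta} P_{\theta}(\tau) = P_{\theta}(\tau) \nabla_{\theta} \log P_{\theta}(\tau)</math>, which gives <math>\nabla_{\theta} J(\theta) = E[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math>; since the dynamics terms in <math>P_{\theta}(\tau)</math> do not depend on <math>\theta</math>, only <math>\sum_t \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)</math> contributes. A minimal PyTorch-style sketch of the resulting score-function (REINFORCE) estimator, assuming a hypothetical <code>policy_net</code> that maps states to logits over a discrete action set:

<syntaxhighlight lang="python">
import torch

def reinforce_surrogate(policy_net, states, actions, rewards):
    """Surrogate loss whose gradient is the score-function estimate of
    grad_theta J(theta) for one sampled trajectory.
    states: (T, state_dim) float tensor; actions: (T,) long tensor;
    rewards: (T,) float tensor."""
    logits = policy_net(states)                                    # (T, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t | s_t)
    R = rewards.sum()                                              # total reward R(tau)
    # Minimizing the negative surrogate performs gradient ascent on J(theta).
    return -(log_pi.sum() * R)
</syntaxhighlight>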
==Misc==