* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s,a')</math>
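A minimal numpy sketch of Approach 2, under illustrative assumptions: the critic is a known toy function <math>Q(s,a) = -(a-2s)^2</math> (so the true maximizer is <math>a^* = 2s</math>), and the "network" is a one-parameter linear actor <math>\mu_{\theta}(s) = \theta s</math> trained by gradient ascent on <math>Q(s, \mu_{\theta}(s))</math>. All names and the toy critic are hypothetical, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy critic (assumed known here, purely for illustration):
# Q(s, a) = -(a - 2s)^2, so argmax_a Q(s, a) = 2s.
def Q(s, a):
    return -(a - 2.0 * s) ** 2

theta = 0.0                       # actor parameter: mu_theta(s) = theta * s
alpha = 0.05                      # step size

for _ in range(500):
    s = rng.uniform(-1.0, 1.0)    # sample a state
    a = theta * s                 # actor's proposed action
    dQ_da = -2.0 * (a - 2.0 * s)  # gradient of Q with respect to the action
    theta += alpha * dQ_da * s    # chain rule: dQ/dtheta = (dQ/da) * (dmu/dtheta)

print(theta)  # should approach 2.0, the slope of the true maximizer
```

The same chain-rule update is what deterministic actor-critic methods use, except that there the critic Q is itself a learned network rather than a known formula.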
===Policy Gradient Method===
Lecture 29 (Dec 8, 2020)

Probability of observing a trajectory:
\end{aligned}
</math>
<math>
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math>
It follows that
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \mathbb{E}_{\tau \sim P_{\theta}}\left[\nabla_{\theta} \log P_{\theta}(\tau) \, R(\tau)\right]\\
&\approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})\right)\left(\sum_{t=1}^{T} r(s_t^{(i)}, a_t^{(i)})\right)
\end{aligned}
</math>
;Summary | |||
* Sample N trajectories <math>\tau^{(i)}</math> by running the current policy
* Approximate <math>\nabla_{\theta} J(\theta)</math> | |||
* <math>\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)</math>
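The summary above can be sketched end to end in numpy. This is a hedged toy instance, not the lecture's code: the "MDP" is a single-state, one-step problem with two actions (a bandit), the policy is a softmax over two logits, and the gradient estimate is the sampled score-function sum from the approximation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step problem: a single state, two actions.
# Action 1 yields reward 1, action 0 yields reward 0.
rewards = np.array([0.0, 1.0])

theta = np.zeros(2)  # policy parameters (softmax logits)

def pi(theta):
    """Softmax policy pi_theta(a)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

alpha, N = 0.1, 64   # step size, trajectories per gradient estimate
for _ in range(200):
    p = pi(theta)
    grad = np.zeros(2)
    for _ in range(N):
        a = rng.choice(2, p=p)      # sample a trajectory (here: one action)
        score = -p.copy()           # grad of log pi(a) for a softmax
        score[a] += 1.0             # ... is one_hot(a) - p
        grad += score * rewards[a]  # sum_t grad log pi(a_t|s_t) * R(tau)
    theta += alpha * grad / N       # theta <- theta + alpha * grad J(theta)

print(pi(theta))  # probability mass should concentrate on action 1
```

After training, the policy should put nearly all its probability on the rewarding action, which is exactly what the update rule in the summary is driving toward.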
;Intuition
* Each gradient step increases the probability of actions taken in high-reward trajectories and decreases the probability of actions taken in low-reward ones — trial and error, but with a formal gradient justification.
==Misc==