Deep Learning: Difference between revisions
;Intuition
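<math>\nabla_{\theta} J(\theta) = E_{\tau \sim P_{\theta}}\left[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\right]</math>
The policy gradient formalizes ''trial and error'': trajectories with a high return <math>R(\tau)</math> are made more likely, and trajectories with a low return are made less likely.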
[Finn & Levine, ICML]
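The expectation above can be estimated by sampling. The following is a minimal sketch for a one-step problem, assuming a softmax policy over a small discrete action set; the function names, the two-action setup, and the reward values are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def score_function_gradient(theta, rewards, n_samples, rng):
    """Monte Carlo estimate of E[grad_theta log pi_theta(a) * R(a)].

    Assumes a one-step problem where pi_theta = softmax(theta) and
    rewards[a] is the (hypothetical) return for choosing action a.
    """
    probs = softmax(theta)
    actions = rng.choice(len(theta), size=n_samples, p=probs)
    grad = np.zeros_like(theta)
    for a in actions:
        # Gradient of log softmax w.r.t. the logits: onehot(a) - probs.
        grad_log_pi = -probs.copy()
        grad_log_pi[a] += 1.0
        grad += grad_log_pi * rewards[a]
    return grad / n_samples

rng = np.random.default_rng(0)
theta = np.zeros(2)               # uniform policy over two actions
rewards = np.array([0.0, 1.0])    # only action 1 is rewarded
g = score_function_gradient(theta, rewards, n_samples=20000, rng=rng)
print(g)  # the component for action 1 is positive, so its logit is pushed up
```

For this setup the exact gradient is <math>(-0.25, 0.25)</math>, so the sampled estimate should land close to that: the rewarded action becomes more likely.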
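;Issues with policy gradient
* High variance of the gradient estimate
;Solutions
* Subtract a baseline, e.g. the average return of the sampled trajectories, and weight each trajectory by <math>R(\tau^{(i)}) - b</math>:
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go: weight the action at time <math>t</math> only by the rewards from <math>t</math> onward, since an action cannot affect rewards received before it: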
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau^{(i)}) R(\tau^{(i)})\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)
\end{aligned}
</math>
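The two variance-reduction weightings can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: rewards are assumed to be stored as an array of shape <code>(N, T)</code>, the per-step gradients <math>\nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)})</math> are assumed to be precomputed in an array of shape <code>(N, T, D)</code>, and all function names are hypothetical.

```python
import numpy as np

def baselined_returns(rewards):
    """Return R(tau_i) - b for each trajectory, with b the mean total return.

    rewards: array of shape (N, T). Output has shape (N,).
    """
    totals = rewards.sum(axis=1)
    b = totals.mean()
    return totals - b

def reward_to_go(rewards):
    """Entry [i, t] is the sum of rewards[i, t'] for t' >= t.

    Computed as a cumulative sum over the time axis in reverse order.
    """
    return np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]

def policy_gradient_estimate(grad_log_probs, weights):
    """(1/N) * sum_i sum_t grad_log_probs[i, t, :] * weights[i, t].

    grad_log_probs: shape (N, T, D), assumed precomputed per-step score terms.
    weights: shape (N, T), e.g. the output of reward_to_go.
    """
    n = grad_log_probs.shape[0]
    return np.einsum("itd,it->d", grad_log_probs, weights) / n

# Two trajectories of length three with made-up rewards.
rewards = np.array([[1.0, 2.0, 3.0],
                    [0.0, 1.0, 1.0]])
centered = baselined_returns(rewards)   # total returns 6 and 2, baseline 4
rtg = reward_to_go(rewards)

# Placeholder score terms (all ones, D = 1), only to exercise the shapes.
grad_log_probs = np.ones((2, 3, 1))
estimate = policy_gradient_estimate(grad_log_probs, rtg)
print(centered, rtg, estimate)
```

To use the baselined full returns instead of reward-to-go, the per-trajectory weights can be repeated along the time axis, e.g. <code>np.repeat(centered[:, None], rewards.shape[1], axis=1)</code>.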
==Misc==