[Finn & Levin, ICML]
===Issues with Policy Gradient===
;High variance of gradient estimation
;Solutions
\end{aligned}
</math>
;Some parameters can change <math>\pi_{\theta}</math> more than others, so it is hard to choose a single fixed learning rate.
Use the natural policy gradient: <math>\theta' \leftarrow \theta - \eta F^{-1}\nabla L(\theta)</math>, where <math>F</math> is the Fisher information matrix of <math>\pi_{\theta}</math>.
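A minimal sketch of one natural gradient step under the assumption that the Fisher matrix <math>F</math> has already been estimated (the function name and toy 2-parameter example below are illustrative, not from the source):

```python
import numpy as np

def natural_gradient_step(theta, grad_L, fisher, lr=0.1):
    """One natural policy gradient update: theta' = theta - lr * F^{-1} grad_L.

    fisher is an estimate of the Fisher information matrix of pi_theta;
    solving F x = grad_L avoids forming an explicit inverse.
    """
    step = np.linalg.solve(fisher, grad_L)  # x = F^{-1} grad_L
    return theta - lr * step

# Toy example: with an identity Fisher matrix this reduces to
# ordinary gradient descent.
theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
theta_new = natural_gradient_step(theta, grad, np.eye(2), lr=0.1)
```

Solving the linear system rather than inverting <math>F</math> is the standard numerical choice; in practice <math>F</math> is itself estimated from samples.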
===Actor-critic algorithms===
Have an actor <math>\pi_{\theta}</math> and a critic <math>V_{\phi}</math> (or <math>Q</math>).
<math>\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \left(Q(s_t^{(i)}, a_t^{(i)}) - V(s_t^{(i)})\right)</math>
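A sketch of this estimator for a simplified setting: a state-independent softmax policy over a few discrete actions, with `Q` and `V` passed in as stand-ins for learned critics. All names here are hypothetical illustrations, not an implementation from the source.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, a):
    """grad_theta log pi_theta(a) for a softmax policy with one logit
    per action: the j-th component is 1{j == a} - pi_j."""
    pi = softmax(theta)
    g = -pi
    g[a] += 1.0
    return g

def actor_critic_gradient(theta, trajectories, Q, V):
    """Sample estimate of grad J(theta), weighting each grad-log-prob
    by the advantage Q(s, a) - V(s) from the critic.

    trajectories: list of [(s, a), ...] pairs; Q, V: callables giving
    critic estimates (stand-ins for learned networks).
    """
    grad = np.zeros_like(theta)
    for traj in trajectories:          # outer sum over i = 1..N
        for s, a in traj:              # inner sum over t = 1..T
            grad += grad_log_softmax(theta, a) * (Q(s, a) - V(s))
    return grad / len(trajectories)    # the 1/N factor

# Toy usage: 3 actions, constant critic values
theta = np.zeros(3)
trajs = [[(0, 1), (0, 2)]]
g = actor_critic_gradient(theta, trajs, Q=lambda s, a: 1.0, V=lambda s: 0.0)
```

Subtracting <math>V(s)</math> as a baseline is what reduces the variance relative to using <math>Q</math> (or raw returns) alone; it leaves the expected gradient unchanged.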
==Misc==