Deep Learning



;Intuition
<math>\nabla_{\theta} J(\theta) = E_{\tau \sim P_{\theta}}[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math>
Formalizing ''trial & error''. 
[Finn & Levine, ICML]
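This expectation follows from the log-derivative (score-function) trick; a short derivation, assuming trajectories <math>\tau</math> are sampled from <math>P_{\theta}</math>:
<math>
\nabla_{\theta} J(\theta) = \nabla_{\theta} \int P_{\theta}(\tau) R(\tau)\, d\tau = \int P_{\theta}(\tau)\, \nabla_{\theta} \log P_{\theta}(\tau)\, R(\tau)\, d\tau = E_{\tau \sim P_{\theta}}[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]
</math>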
;Issues with policy gradient
* High variance of the gradient estimate
;Solutions
* Subtract a baseline
<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>
* Reward-to-go
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau^{(i)}) R(\tau^{(i)})\\
&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)
\end{aligned}
</math>
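A minimal NumPy sketch of this estimator, combining the reward-to-go weighting with the constant baseline above. The array names <code>grad_log_pi</code> and <code>rewards</code> are assumptions: they stand for per-step scores <math>\nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}|s_t^{(i)})</math> and per-step rewards collected from N sampled trajectories of length T.
<syntaxhighlight lang="python">
import numpy as np

def policy_gradient_estimate(grad_log_pi, rewards):
    """Monte Carlo policy gradient with reward-to-go and a constant baseline.

    grad_log_pi: shape (N, T, D) -- per-step score grad_theta log pi_theta(a_t | s_t)
                 for each of the N sampled trajectories (D = number of parameters).
    rewards:     shape (N, T)    -- per-step rewards R(s_t, a_t).
    Returns an estimate of grad_theta J(theta) with shape (D,).
    """
    N, T, D = grad_log_pi.shape

    # Reward-to-go: sum of rewards from step t to T (reversed cumulative sum along time).
    reward_to_go = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)  # (N, T)

    # Baseline b = (1/N) * sum_i R(tau^(i)): average total return over the N trajectories.
    baseline = rewards.sum(axis=1).mean()

    # Weight each score term by (reward-to-go - baseline), sum over time,
    # then average over trajectories.
    weights = reward_to_go - baseline                                   # (N, T)
    grad = (grad_log_pi * weights[..., None]).sum(axis=1).mean(axis=0)  # (D,)
    return grad
</syntaxhighlight>
Both fixes leave the estimator unbiased (the baseline does not depend on the actions) while reducing its variance; a learned, state-dependent baseline (a value function) is the usual next step.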


==Misc==