Deep Learning: Difference between revisions


@@ Line 2,165: / Line 2,165: @@
 ;Intuition
+<math>\nabla_{\theta} J(\theta) = E_{\tau \sim P_{\theta}}[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)]</math>
+Formalizing ''trial & error''.
+[Finn & Levine, ICML]
+;Issues with policy gradient
+* The gradient estimate has high variance
+;Solutions
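+* Subtract a baseline <math>b</math> from the return, so each trajectory is weighted by <math>R(\tau^{(i)}) - b</math>. A common choice is the average return over the sampled trajectories:
+<math>b = \frac{1}{N} \sum_{i=1}^{N} R(\tau^{(i)})</math>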
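+* Reward-to-go: the action at time <math>t</math> cannot affect rewards received before <math>t</math>, so each step is weighted only by the rewards from <math>t</math> onward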
+<math>
+\begin{aligned}
+\nabla_{\theta} J(\theta) &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} \log P_{\theta}(\tau^{(i)}) R(\tau^{(i)})\\
+&= \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \right) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
+&= \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=1}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)\\
+&\approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)}| s_t^{(i)}) \left(\sum_{t'=t}^T R(s_{t'}^{(i)}, a_{t'}^{(i)})\right)
+\end{aligned}
+</math>
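The two variance-reduction steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it assumes the rewards of N sampled trajectories are stored as an N × T array, and that the per-step score vectors (the gradients of log π with respect to θ) are supplied by the caller as an N × T × D array. The function names <code>reward_to_go</code>, <code>centered_returns</code> and <code>estimate_gradient</code> are hypothetical and do not come from the source.

```python
import numpy as np


def reward_to_go(rewards):
    """For each trajectory i and step t, return sum_{t'=t}^{T} r_{t'}^{(i)}.

    rewards: array-like of shape (N, T), one row per sampled trajectory.
    """
    rewards = np.asarray(rewards, dtype=float)
    # A cumulative sum over the reversed time axis, reversed back,
    # gives the sum of rewards from step t to the end.
    return np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]


def centered_returns(rewards):
    """Return R(tau^(i)) - b, where b is the average return over trajectories."""
    rewards = np.asarray(rewards, dtype=float)
    returns = rewards.sum(axis=1)   # R(tau^(i)) for each trajectory
    baseline = returns.mean()       # b = (1/N) * sum_i R(tau^(i))
    return returns - baseline


def estimate_gradient(grad_log_probs, weights):
    """Monte Carlo estimate (1/N) * sum_i sum_t grad_log_pi[i, t] * weights[i, t].

    grad_log_probs: shape (N, T, D), assumed to be computed elsewhere
                    (for example by automatic differentiation).
    weights:        shape (N, T), for example the output of reward_to_go.
    """
    grad_log_probs = np.asarray(grad_log_probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n_trajectories = grad_log_probs.shape[0]
    return np.einsum("itd,it->d", grad_log_probs, weights) / n_trajectories
```

Passing <code>reward_to_go(rewards)</code> as <code>weights</code> corresponds to the last line of the derivation. To use the full return with a baseline instead, broadcast <code>centered_returns(rewards)</code> across the time axis, for example with <code>centered_returns(rewards)[:, None] * np.ones_like(rewards, dtype=float)</code>.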
 ==Misc==