* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s,a')</math> (see the sketch below)
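A minimal sketch of this approach in the spirit of DDPG; the actor network, layer sizes, and dimensions below are illustrative assumptions, not from the lecture. An actor <math>\mu_{\phi}(s)</math> is trained so that <math>\mu_{\phi}(s) \approx \operatorname{argmax}_{a'} Q(s,a')</math> by ascending <math>Q(s, \mu_{\phi}(s))</math> with the Q-network held fixed.
<syntaxhighlight lang="python">
# Sketch (assumed architecture, not from the lecture): train an "actor"
# network mu_phi(s) to approximate argmax_{a'} Q(s, a') by maximizing
# Q(s, mu_phi(s)) while keeping the Q-network's weights fixed.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # assumed toy dimensions

q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                      nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(state_dim, 64),
                      nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)  # only the actor is updated

states = torch.randn(32, state_dim)               # a batch of sampled states
q_in = torch.cat([states, actor(states)], dim=1)  # (s, mu_phi(s)) pairs
loss = -q_net(q_in).mean()                        # maximize Q <=> minimize -Q
opt.zero_grad()
loss.backward()   # gradients flow back through the actor's action output
opt.step()        # q_net's parameters are not in the optimizer, so they stay fixed
</syntaxhighlight>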


===Policy Gradient Method===
Lecture 29 (Dec 8, 2020)


Probability of observing a trajectory:
<math>
\begin{aligned}
P_{\theta}(\tau) &= P_{\theta}(s_1, a_1, \dots, s_T, a_T)\\
&= P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)
\end{aligned}
</math>
<math>
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math>
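This follows from expanding the log of the factorization above; the initial-state distribution and the transition dynamics do not depend on <math>\theta</math>, so their terms vanish under <math>\nabla_{\theta}</math>:
<math>
\log P_{\theta}(\tau) = \log P(s_1) + \sum_{t=1}^{T} \log \pi_{\theta}(a_t | s_t) + \sum_{t=1}^{T} \log P(s_{t+1} | s_t, a_t)
</math>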
This implies
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= E_{\tau \sim P_{\theta}}\left[ \nabla_{\theta} \log P_{\theta}(\tau) R(\tau) \right]\\
&\approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)}) \right) R(\tau^{(i)})
\end{aligned}
</math>
where <math>R(\tau)</math> is the total reward of trajectory <math>\tau</math>.
;Summary
* Sample trajectories
* Approximate <math>\nabla_{\theta} J(\theta)</math>
* <math>\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)</math>
;Intuition
The update increases the log-probability of actions taken in high-reward trajectories and decreases it for actions taken in low-reward ones.
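A minimal sketch of the full loop, assuming a small discrete-action environment with the Gymnasium API; the network sizes, learning rate, and trajectory counts are illustrative assumptions, not from the lecture.
<syntaxhighlight lang="python">
# Minimal REINFORCE sketch: sample trajectories, form the gradient estimate
# above via a surrogate loss, and take a gradient-ascent step on J(theta).
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2  # assumed, e.g. a CartPole-like task
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def sample_trajectory(env, max_steps=200):
    """Roll out pi_theta once; return log-probs of chosen actions and total reward."""
    obs, _ = env.reset()
    log_probs, total_reward = [], 0.0
    for _ in range(max_steps):
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        total_reward += reward
        if terminated or truncated:
            break
    return torch.stack(log_probs), total_reward

def reinforce_step(env, n_traj=10):
    """One update: theta <- theta + alpha * grad J(theta), via a negated loss."""
    loss = 0.0
    for _ in range(n_traj):
        log_probs, R = sample_trajectory(env)
        loss = loss - log_probs.sum() * R  # -(sum_t log pi(a_t|s_t)) * R(tau)
    opt.zero_grad()
    (loss / n_traj).backward()  # averages the per-trajectory gradient estimates
    opt.step()
</syntaxhighlight>
In practice a baseline is usually subtracted from <math>R(\tau)</math> to reduce the variance of this estimator, but the raw form above matches the summary.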


==Misc==