* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s,a')</math>
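A minimal numpy sketch of Approach 2, under illustrative assumptions: the critic is a known toy function <math>Q(s,a) = -(a-2s)^2</math> (so the true maximizer is <math>a^* = 2s</math>), and the "network" is a one-parameter linear actor <math>\mu_{\theta}(s) = \theta s</math> trained by gradient ascent on <math>Q(s, \mu_{\theta}(s))</math>. All names and the toy critic are hypothetical, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy critic (assumed known here, purely for illustration):
# Q(s, a) = -(a - 2s)^2, so argmax_a Q(s, a) = 2s.
def Q(s, a):
    return -(a - 2.0 * s) ** 2

theta = 0.0                       # actor parameter: mu_theta(s) = theta * s
alpha = 0.05                      # step size

for _ in range(500):
    s = rng.uniform(-1.0, 1.0)    # sample a state
    a = theta * s                 # actor's proposed action
    dQ_da = -2.0 * (a - 2.0 * s)  # gradient of Q with respect to the action
    theta += alpha * dQ_da * s    # chain rule: dQ/dtheta = (dQ/da) * (dmu/dtheta)

print(theta)  # should approach 2.0, the slope of the true maximizer
```

The same chain-rule update is what deterministic actor-critic methods use, except that there the critic Q is itself a learned network rather than a known formula.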
===Policy Gradient Method===
Lecture 29 (Dec 8, 2020)

Probability of observing a trajectory:
\end{aligned}
</math>
<math>
\nabla_{\theta} \log P_{\theta}(\tau) = \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)
</math>
It follows that
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \mathbb{E}_{\tau \sim P_{\theta}}\left[\nabla_{\theta} \log P_{\theta}(\tau) \, R(\tau)\right]\\
&\approx \frac{1}{N} \sum_{i=1}^{N}\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t^{(i)} | s_t^{(i)})\right)\left(\sum_{t=1}^{T} r(s_t^{(i)}, a_t^{(i)})\right)
\end{aligned}
</math>
;Summary | |||
* Sample N trajectories <math>\tau^{(i)}</math> by running the current policy
* Approximate <math>\nabla_{\theta} J(\theta)</math> | |||
* <math>\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)</math>
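The summary above can be sketched end to end in numpy. This is a hedged toy instance, not the lecture's code: the "MDP" is a single-state, one-step problem with two actions (a bandit), the policy is a softmax over two logits, and the gradient estimate is the sampled score-function sum from the approximation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step problem: a single state, two actions.
# Action 1 yields reward 1, action 0 yields reward 0.
rewards = np.array([0.0, 1.0])

theta = np.zeros(2)  # policy parameters (softmax logits)

def pi(theta):
    """Softmax policy pi_theta(a)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

alpha, N = 0.1, 64   # step size, trajectories per gradient estimate
for _ in range(200):
    p = pi(theta)
    grad = np.zeros(2)
    for _ in range(N):
        a = rng.choice(2, p=p)      # sample a trajectory (here: one action)
        score = -p.copy()           # grad of log pi(a) for a softmax
        score[a] += 1.0             # ... is one_hot(a) - p
        grad += score * rewards[a]  # sum_t grad log pi(a_t|s_t) * R(tau)
    theta += alpha * grad / N       # theta <- theta + alpha * grad J(theta)

print(pi(theta))  # probability mass should concentrate on action 1
```

After training, the policy should put nearly all its probability on the rewarding action, which is exactly what the update rule in the summary is driving toward.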
;Intuition
* Each gradient step increases the probability of actions taken in high-reward trajectories and decreases the probability of actions taken in low-reward ones — trial and error, but with a formal gradient justification.
==Misc==