* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s,a')</math> (see the sketch below).
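A minimal sketch of this approach (in the spirit of DDPG) in PyTorch, assuming a continuous action space and an already-trained, differentiable critic Q(s, a); the network sizes and names are illustrative only:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Illustrative sizes; assumes a trained critic Q(states, actions) -> scalar values.
state_dim, action_dim = 8, 2

# The "maximizer" network: pi(s) is trained so that pi(s) ~ argmax_{a'} Q(s, a').
actor = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(),
    nn.Linear(64, action_dim), nn.Tanh(),
)
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def actor_step(Q, states):
    """One gradient-ascent step pushing the actor's outputs toward maximizing Q."""
    actions = actor(states)
    loss = -Q(states, actions).mean()  # ascending Q is descending -Q
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
</syntaxhighlight>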


===Training using Gradient Descent/Ascent===
Lecture 29 (Dec 8, 2020)
 
Probability of observing a trajectory <math>\tau = (s_1, a_1, \ldots, s_T, a_T)</math>:
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
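As a concrete reading of this factorization, a small numpy sketch (the per-step log-probabilities are assumed to be given, e.g. computed from the policy and a known transition model):

<syntaxhighlight lang="python">
import numpy as np

def trajectory_log_prob(log_p_s1, log_pi, log_p_trans):
    """log P_theta(tau) = log P(s_1) + sum_t [log pi_theta(a_t|s_t) + log P(s_{t+1}|s_t,a_t)]

    log_pi and log_p_trans are length-T arrays of per-step log-probabilities.
    """
    return log_p_s1 + np.sum(log_pi) + np.sum(log_p_trans)

# Toy usage with made-up numbers for a T = 3 step trajectory.
print(trajectory_log_prob(-0.5, np.array([-0.1, -0.3, -0.2]), np.array([-0.4, -0.6, -0.5])))
</syntaxhighlight>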
Average reward of the policy:
<math>J(\theta) = E[R(\tau)] = \sum_t E[R(s_t, a_t)]</math>


Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>.
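In practice <math>J(\theta)</math> can be estimated by averaging returns over sampled trajectories; a minimal Monte Carlo sketch (the <code>sample_trajectory</code> rollout helper is hypothetical):

<syntaxhighlight lang="python">
import numpy as np

def estimate_J(sample_trajectory, num_trajectories=1000):
    """Monte Carlo estimate of J(theta) = E[R(tau)].

    sample_trajectory() is a hypothetical helper that rolls out the current
    policy pi_theta once and returns the list of per-step rewards R(s_t, a_t).
    """
    returns = [np.sum(sample_trajectory()) for _ in range(num_trajectories)]
    return float(np.mean(returns))
</syntaxhighlight>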
 
Gradient of the average reward: 
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau
\end{aligned}
</math>
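Applying the log-derivative trick <math>\nabla_{\theta} P_{\theta}(\tau) = P_{\theta}(\tau) \nabla_{\theta} \log P_{\theta}(\tau)</math> turns this back into an expectation that can be estimated from sampled trajectories; the initial-state and transition terms drop out of the gradient because they do not depend on <math>\theta</math>:

<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \int P_{\theta}(\tau) \nabla_{\theta} \log P_{\theta}(\tau) R(\tau) d\tau \\
&= E\left[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\right] \\
&= E\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\right) R(\tau)\right]
\end{aligned}
</math>

A minimal single-trajectory sketch of this estimator (a REINFORCE-style update) in PyTorch; the discrete-action policy network, its sizes, and the rollout format are illustrative assumptions:

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Illustrative sizes for a discrete-action policy pi_theta(a|s).
state_dim, num_actions = 4, 2
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_step(rollout):
    """One gradient-ascent step from a single sampled trajectory.

    rollout: list of (state, action, reward) tuples collected with the current policy.
    Implements grad J(theta) ~ (sum_t grad log pi_theta(a_t|s_t)) * R(tau).
    """
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _ in rollout])
    actions = torch.tensor([a for _, a, _ in rollout])
    R = sum(r for _, _, r in rollout)                    # total return R(tau)
    log_pi = torch.log_softmax(policy(states), dim=-1)   # log pi_theta(.|s_t)
    log_pi_a = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(log_pi_a.sum() * R)                         # ascend J <=> descend -J
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
</syntaxhighlight>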


==Misc==