Deep Learning
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s,a')</math> (a sketch follows below).
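A minimal sketch of Approach 2 in a DDPG-style setup: a second network <code>mu_net</code> is trained so that its output approximately maximizes <math>Q(s,a)</math> over continuous actions. The dimensions, network sizes, and optimizer settings are illustrative assumptions, not from the lecture.

<syntaxhighlight lang="python">
# Hypothetical sketch of Approach 2 (DDPG-style): a second network mu_net is trained
# so that mu_net(s) approximates argmax_a Q(s, a). Dimensions and hyperparameters
# below are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # assumed toy dimensions

# Critic: Q(s, a) -> scalar
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# Actor: mu(s) -> a, meant to approximate the maximizing action
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
actor_opt = torch.optim.Adam(mu_net.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)  # a batch of sampled states

# One actor update: push mu_net toward actions with higher Q(s, mu(s)),
# holding the critic fixed (its own TD update is omitted here).
q_values = q_net(torch.cat([states, mu_net(states)], dim=1))
actor_loss = -q_values.mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
</syntaxhighlight>

In a complete algorithm the critic would be trained jointly with its own TD loss; only the actor update that approximates the maximizer is shown here.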
===Lecture (Dec 8)===
Probability of observing a trajectory:
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
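As a concrete check of this factorization, the sketch below evaluates <math>P_{\theta}(\tau)</math> for a toy discrete MDP with a tabular softmax policy; the transition tables and the example trajectory are made-up assumptions.

<syntaxhighlight lang="python">
# Numerical sketch of the factorization above for a toy discrete MDP with a
# tabular softmax policy. The tables and the example trajectory are made up.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

P0 = np.array([0.5, 0.3, 0.2])                                   # P(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)) # P(s'|s,a)
theta = rng.normal(size=(n_states, n_actions))                   # policy parameters

def pi(theta, s):
    """Softmax policy pi_theta(a|s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def trajectory_prob(states, actions):
    """P_theta(tau) = P(s_1) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t,a_t)."""
    p = P0[states[0]]
    for t in range(len(actions)):
        p *= pi(theta, states[t])[actions[t]]
        if t + 1 < len(states):          # no transition term after the final action
            p *= P[states[t], actions[t], states[t + 1]]
    return p

print(trajectory_prob(states=[0, 2, 1], actions=[1, 0, 1]))
</syntaxhighlight>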
Reward for a trajectory:
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + \cdots + R(s_T, a_T)</math>
The average reward is:
<math>J(\theta) = E[R(\tau)] = \sum_{t=1}^{T} E[R(s_t, a_t)]</math>
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>
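Since <math>J(\theta)</math> is an expectation over trajectories, it can be estimated by sampling: roll out trajectories under <math>\pi_{\theta}</math> and average their total rewards. The sketch below does this for an assumed toy discrete MDP; the reward table, horizon, and dynamics are illustrative.

<syntaxhighlight lang="python">
# Monte Carlo sketch of J(theta) = E[R(tau)]: roll out trajectories under the
# softmax policy and average the summed rewards. The MDP tables, reward table,
# and horizon are illustrative assumptions.
import numpy as np

n_states, n_actions, horizon = 3, 2, 5
rng = np.random.default_rng(1)

P0 = np.array([0.5, 0.3, 0.2])                                   # initial-state distribution
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)) # P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                       # reward table R(s, a)
theta = rng.normal(size=(n_states, n_actions))                   # policy parameters

def pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_return():
    """Sample one trajectory and return R(tau) = sum_t R(s_t, a_t)."""
    s = rng.choice(n_states, p=P0)
    total = 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi(theta, s))
        total += R[s, a]
        s = rng.choice(n_states, p=P[s, a])
    return total

# J(theta) is approximated by the average return over many sampled trajectories.
print(np.mean([sample_return() for _ in range(10_000)]))
</syntaxhighlight>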
==Misc==