Deep Learning
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s,a')</math> (a sketch follows below).
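A minimal sketch of Approach 2 in a DDPG-style setup: a second network <code>mu_net</code> is trained so that its output approximately maximizes <math>Q(s,a)</math> over continuous actions. The dimensions, network sizes, and optimizer settings are illustrative assumptions, not from the lecture.

<syntaxhighlight lang="python">
# Hypothetical sketch of Approach 2 (DDPG-style): a second network mu_net is trained
# so that mu_net(s) approximates argmax_a Q(s, a). Dimensions and hyperparameters
# below are illustrative assumptions.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2  # assumed toy dimensions

# Critic: Q(s, a) -> scalar
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# Actor: mu(s) -> a, meant to approximate the maximizing action
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
actor_opt = torch.optim.Adam(mu_net.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)  # a batch of sampled states

# One actor update: push mu_net toward actions with higher Q(s, mu(s)),
# holding the critic fixed (its own TD update is omitted here).
q_values = q_net(torch.cat([states, mu_net(states)], dim=1))
actor_loss = -q_values.mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
</syntaxhighlight>

In a complete algorithm the critic would be trained jointly with its own TD loss; only the actor update that approximates the maximizer is shown here.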
===Lecture (Dec 8)===
Probability of observing a trajectory:
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
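As a concrete check of this factorization, the sketch below evaluates <math>P_{\theta}(\tau)</math> for a toy discrete MDP with a tabular softmax policy; the transition tables and the example trajectory are made-up assumptions.

<syntaxhighlight lang="python">
# Numerical sketch of the factorization above for a toy discrete MDP with a
# tabular softmax policy. The tables and the example trajectory are made up.
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

P0 = np.array([0.5, 0.3, 0.2])                                   # P(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)) # P(s'|s,a)
theta = rng.normal(size=(n_states, n_actions))                   # policy parameters

def pi(theta, s):
    """Softmax policy pi_theta(a|s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def trajectory_prob(states, actions):
    """P_theta(tau) = P(s_1) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t,a_t)."""
    p = P0[states[0]]
    for t in range(len(actions)):
        p *= pi(theta, states[t])[actions[t]]
        if t + 1 < len(states):          # no transition term after the final action
            p *= P[states[t], actions[t], states[t + 1]]
    return p

print(trajectory_prob(states=[0, 2, 1], actions=[1, 0, 1]))
</syntaxhighlight>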
Reward for a trajectory:
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + \cdots + R(s_T, a_T)</math>
The average reward is:
<math>J(\theta) = E[R(\tau)] = \sum_{t=1}^{T} E[R(s_t, a_t)]</math>
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>
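Since <math>J(\theta)</math> is an expectation over trajectories, it can be estimated by sampling: roll out trajectories under <math>\pi_{\theta}</math> and average their total rewards. The sketch below does this for an assumed toy discrete MDP; the reward table, horizon, and dynamics are illustrative.

<syntaxhighlight lang="python">
# Monte Carlo sketch of J(theta) = E[R(tau)]: roll out trajectories under the
# softmax policy and average the summed rewards. The MDP tables, reward table,
# and horizon are illustrative assumptions.
import numpy as np

n_states, n_actions, horizon = 3, 2, 5
rng = np.random.default_rng(1)

P0 = np.array([0.5, 0.3, 0.2])                                   # initial-state distribution
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions)) # P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                       # reward table R(s, a)
theta = rng.normal(size=(n_states, n_actions))                   # policy parameters

def pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_return():
    """Sample one trajectory and return R(tau) = sum_t R(s_t, a_t)."""
    s = rng.choice(n_states, p=P0)
    total = 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi(theta, s))
        total += R[s, a]
        s = rng.choice(n_states, p=P[s, a])
    return total

# J(theta) is approximated by the average return over many sampled trajectories.
print(np.mean([sample_return() for _ in range(10_000)]))
</syntaxhighlight>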
==Misc==