** [Gu ''et al.'' 2016] use functions that are quadratic in the action, so the maximizing action can be computed in closed form.
* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s, a')</math> (see the sketch below).
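
A minimal sketch of Approach 2 (assuming PyTorch; all layer sizes and variable names are illustrative assumptions, not from the lecture): a separate "actor" network <math>\mu(s)</math> is trained so that <math>\mu(s) \approx \arg\max_{a'} Q(s, a')</math> by gradient ascent on <math>Q(s, \mu(s))</math> with the critic held fixed, as in DDPG-style continuous-control methods.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2   # hypothetical sizes

# Critic Q(s, a): takes a concatenated (state, action) pair, returns a scalar value.
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

# Actor mu(s): outputs the action intended to (approximately) maximize Q(s, .).
actor = nn.Sequential(
    nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, state_dim)   # a batch of sampled states (placeholder data)

# Maximize Q(s, mu(s)) over the actor's parameters, i.e. minimize its negative.
q_values = critic(torch.cat([states, actor(states)], dim=1))
loss = -q_values.mean()
actor_opt.zero_grad()
loss.backward()
actor_opt.step()
</syntaxhighlight>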
===Lecture (Dec 8)===
Probability of observing a trajectory: 
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
Reward for a trajectory: 
<math>R(\tau) = R(s_1, a_1) + R(s_2, a_2) + \dots + R(s_T, a_T) = \sum_{t=1}^{T} R(s_t, a_t)</math>
The average reward is: 
<math>J(\theta) = E_{\tau \sim P_{\theta}}[R(\tau)] = \sum_{t=1}^{T} E[R(s_t, a_t)]</math>
Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>
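
As a hedged illustration (not necessarily the estimator used in the lecture), <math>J(\theta)</math> can be estimated by sampling trajectories and averaging their returns, and maximized with the REINFORCE gradient <math>\nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i} R(\tau_i) \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)</math>. In the sketch below the environment dynamics and rewards are random stand-ins, and all sizes and names are assumptions.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

state_dim, n_actions, horizon, n_traj = 4, 3, 10, 16   # hypothetical sizes
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

def rollout():
    """Sample one trajectory; dynamics and rewards here are random placeholders."""
    log_probs, rewards = [], []
    s = torch.randn(state_dim)
    for _ in range(horizon):
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        rewards.append(torch.randn(()))     # placeholder for R(s_t, a_t)
        s = torch.randn(state_dim)          # placeholder for P(s_{t+1} | s_t, a_t)
    return torch.stack(log_probs).sum(), torch.stack(rewards).sum()

# Monte Carlo estimate of J(theta) = E[R(tau)] over sampled trajectories.
log_prob_sums, returns = zip(*(rollout() for _ in range(n_traj)))
returns = torch.stack(returns)
J_estimate = returns.mean()

# One gradient ascent step on J(theta) via the REINFORCE estimator.
loss = -(torch.stack(log_prob_sums) * returns.detach()).mean()
opt.zero_grad()
loss.backward()
opt.step()
</syntaxhighlight>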


==Misc==