Deep Learning: Difference between revisions

* Approach 2: Learn another network to approximate the maximizer: <math>\max_{a'} Q(s,a')</math> (see the sketch below)
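A minimal sketch of this idea, assuming a continuous action space and PyTorch (the names <code>q_net</code>, <code>mu_net</code>, and the dimensions are illustrative, not from the lecture): a separate network <math>\mu(s)</math> is trained by gradient ascent on <math>Q(s, \mu(s))</math> so that it approximates the maximizing action.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2  # assumed dimensions, for illustration only

# Q-network takes (state, action) and outputs a scalar value.
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# "Maximizer" network mu(s) proposes the action that should maximize Q(s, .).
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))

opt = torch.optim.Adam(mu_net.parameters(), lr=1e-3)

def maximizer_update(states):
    # Gradient ascent on Q(s, mu(s)); only mu_net's parameters are updated.
    actions = mu_net(states)
    q_vals = q_net(torch.cat([states, actions], dim=-1))
    loss = -q_vals.mean()   # minimizing -Q is maximizing Q
    opt.zero_grad()
    loss.backward()         # grads also reach q_net, but opt only steps mu_net
    opt.step()

maximizer_update(torch.randn(32, state_dim))  # example batch of states
</syntaxhighlight>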


===Training using Gradient Descent/Ascent===
Lecture 29 (Dec 8, 2020)
 
Probability of observing a trajectory:
<math>P_{\theta}(\tau) = P(s_1) \prod_{t=1}^{T} \pi_{\theta}(a_t | s_t) P(s_{t+1} | s_t, a_t)</math>
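As a small numerical illustration (a sketch only; the per-step log-probabilities below are made-up values, not from the lecture), the log of this probability decomposes into a sum of per-step terms:

<syntaxhighlight lang="python">
import numpy as np

# log P_theta(tau) = log P(s_1) + sum_t [ log pi_theta(a_t|s_t) + log P(s_{t+1}|s_t,a_t) ]
log_p_s1 = np.log(0.5)                    # log P(s_1), assumed value
log_pi   = np.array([-0.2, -0.9, -0.4])   # log pi_theta(a_t | s_t), t = 1..T (assumed)
log_dyn  = np.array([-0.1, -0.3, -0.2])   # log P(s_{t+1} | s_t, a_t), t = 1..T (assumed)

log_p_tau = log_p_s1 + np.sum(log_pi + log_dyn)
print(np.exp(log_p_tau))                  # P_theta(tau) itself
</syntaxhighlight>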
<math>J(\theta) = E[R(\tau)] = \sum_t E[R(s_t, a_t)]</math>


Our goal is to maximize the average reward: <math>\max_{\theta} J(\theta)</math>.
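In practice <math>J(\theta)</math> is estimated by Monte Carlo: sample trajectories under the current policy and average their total rewards. A minimal sketch, assuming a Gymnasium-style environment and a <code>sample_action</code> helper, neither of which is defined in these notes:

<syntaxhighlight lang="python">
import numpy as np

def estimate_J(env, sample_action, num_trajectories=100, horizon=200):
    """Monte Carlo estimate of J(theta) = E[R(tau)].

    Assumes `env` follows the Gymnasium reset/step API and that
    `sample_action(obs)` samples a_t ~ pi_theta(. | s_t).
    """
    returns = []
    for _ in range(num_trajectories):
        obs, _ = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            action = sample_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        returns.append(total_reward)
    return np.mean(returns)  # estimate of E[R(tau)]
</syntaxhighlight>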
 
Gradient of the average reward: 
<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \nabla_{\theta} E[R(\tau)] \\
&= \nabla_{\theta} \int P_{\theta}(\tau) R(\tau) d\tau \\
&= \int \nabla_{\theta} P_{\theta}(\tau) R(\tau) d\tau
\end{aligned}
</math>
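The usual next step (the standard score-function/REINFORCE manipulation) is to use <math>\nabla_{\theta} P_{\theta}(\tau) = P_{\theta}(\tau) \nabla_{\theta} \log P_{\theta}(\tau)</math>, which turns the integral back into an expectation that can be estimated from sampled trajectories:

<math>
\begin{aligned}
\nabla_{\theta} J(\theta) &= \int P_{\theta}(\tau) \nabla_{\theta} \log P_{\theta}(\tau) R(\tau) d\tau \\
&= E\left[\nabla_{\theta} \log P_{\theta}(\tau) R(\tau)\right] \\
&= E\left[\left(\sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\right) R(\tau)\right]
\end{aligned}
</math>

The initial-state and dynamics terms of <math>\log P_{\theta}(\tau)</math> drop out of the gradient because they do not depend on <math>\theta</math>.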


==Misc==