Elementwise max: <math>v^*(s) = \max_{\pi} v_{\pi}(s)</math>; the maximizing policy is the optimal policy <math>\pi^*(s)</math>.


Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and satisfies <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>. The optimal policy may not be unique, but any two optimal policies have the same value function <math>v^*</math>.
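
A minimal numeric sketch of this fact, on a hypothetical two-state, two-action MDP (all transition probabilities and rewards below are invented for illustration): enumerating every deterministic policy and evaluating each one exactly via the linear system <math>v_\pi = R + \gamma P_\pi v_\pi</math> shows that a single deterministic policy dominates all the others at every state.

<syntaxhighlight lang="python">
import itertools
import numpy as np

# Hypothetical toy MDP (all numbers invented for illustration):
# P[a, s, s'] = p(s' | s, a), R[s] = reward in state s, discount gamma.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],   # action 0
              [[0.1, 0.9],
               [0.7, 0.3]]])  # action 1
R = np.array([1.0, 0.0])
gamma = 0.9

def evaluate(policy):
    """Exact policy evaluation: solve v_pi = R + gamma * P_pi v_pi for a deterministic policy."""
    P_pi = np.array([P[policy[s], s] for s in range(len(R))])   # row s = p(. | s, policy(s))
    return np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)

# Enumerate every deterministic policy (one action per state) and evaluate it exactly.
values = {pi: evaluate(pi) for pi in itertools.product(range(P.shape[0]), repeat=len(R))}

# One deterministic policy dominates all others at every state, as Bellman's result asserts.
best = max(values, key=lambda pi: values[pi].sum())
assert all(np.all(values[best] >= v - 1e-9) for v in values.values())
print("optimal deterministic policy:", best, "value function:", values[best])
</syntaxhighlight>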


Intermediate questions:
* Given <math>v^*</math>, can we compute <math>Q^*</math>?
<math>
\begin{aligned}
Q^*(s, a) &= \max_{\pi} Q_{\pi}(s,a)\\
&= \max_{\pi} \left( R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s') \right)\\
&= R(s) + \gamma \sum_{s'} p(s' | s, a) \max_{\pi} v_{\pi}(s')\\
&= R(s) + \gamma \sum_{s'} p(s' | s, a) v^*(s')
\end{aligned}
</math>
* Given <math>Q^*</math>, can we compute <math>v^*</math>?
Bellman's optimality equation states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>, and the corresponding optimal policy is <math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>. A numeric sketch of both conversions is given below.
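
A minimal sketch of both conversions, assuming the same hypothetical tabular MDP layout as above (the arrays <code>P</code>, <code>R</code> and the discount <code>gamma</code> are invented for illustration, and the helper names <code>q_star_from_v_star</code> and <code>v_star_and_pi_star_from_q_star</code> are hypothetical); a crude value iteration is used only to obtain <math>v^*</math> for the example.

<syntaxhighlight lang="python">
import numpy as np

# Assumed tabular MDP (numbers invented for illustration):
# P[a, s, s'] = p(s' | s, a), R[s] = reward in state s, discount gamma.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],   # action 0
              [[0.1, 0.9],
               [0.7, 0.3]]])  # action 1
R = np.array([1.0, 0.0])
gamma = 0.9

def q_star_from_v_star(P, R, gamma, v_star):
    """Q*(s, a) = R(s) + gamma * sum_{s'} p(s' | s, a) v*(s')  (one-step lookahead)."""
    lookahead = np.einsum('ast,t->sa', P, v_star)   # shape (n_states, n_actions)
    return R[:, None] + gamma * lookahead

def v_star_and_pi_star_from_q_star(Q_star):
    """v*(s) = max_a Q*(s, a)  and  pi*(s) = argmax_a Q*(s, a)."""
    return Q_star.max(axis=1), Q_star.argmax(axis=1)

# Crude value iteration, used here only to obtain v* for the example.
v = np.zeros(P.shape[1])
for _ in range(1000):
    v = (R[:, None] + gamma * np.einsum('ast,t->sa', P, v)).max(axis=1)

Q = q_star_from_v_star(P, R, gamma, v)
v_check, pi_star = v_star_and_pi_star_from_q_star(Q)
assert np.allclose(v, v_check)   # checks Bellman optimality: v*(s) = max_a Q*(s, a)
print("pi*:", pi_star, "v*:", v)
</syntaxhighlight>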


==Misc==