5,337
edits
Line 2,050: | Line 2,050: | ||
Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>. | Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>. | ||
Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>. The policy may not be unique but if two policies are equal, they have the same value function. | Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>. | ||
The policy may not be unique but if two policies are equal, they have the same value function. | |||
Intermediate questions: | Intermediate questions: | ||
* Given <math>v^*</math>, can we compute <math>Q^*</math>? | * Given <math>v^*</math>, can we compute <math>Q^*</math>? | ||
<math> | |||
\begin{aligned} | |||
Q^*(s, a) &= \max_{\pi} Q(s,a)\\ | |||
&= \max_{\pi} R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s')\\ | |||
&= R(s) + \gamma \sum_{s'} P(s' | s, a) \max_{\pi}(v_{\pi}(s'))\\ | |||
\end{aligned} | |||
</math> | |||
* Given <math>Q^*</math>, can we compute <math>v^*</math>. | |||
Bellman's optimality states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>. | |||
<math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math> | |||
==Misc== | ==Misc== |