However, computing the inverse is computationally expensive, since the cost scales as <math>O(|S|^3)</math> in the number of states.
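A minimal NumPy sketch of the direct solve, assuming <code>P_pi</code> is the <math>|S| \times |S|</math> transition matrix under the policy and <code>R</code> is the reward vector (both names are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def evaluate_policy_direct(P_pi, R, gamma):
    """Solve v = R + gamma * P_pi @ v exactly via a linear solve.

    Equivalent to v = (I - gamma * P_pi)^(-1) R; the solve costs
    O(|S|^3), which motivates the iterative method below.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
</syntaxhighlight>

Using <code>np.linalg.solve</code> rather than forming the inverse explicitly is numerically preferable, though both are cubic in <math>|S|</math>.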
===Value-Iteration (Bellman Recursion)===
<math>v_{\pi} = R + \gamma P_{\pi} v_{\pi}</math>
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math>.
The operator <math>L_{\pi}</math> is a <math>\gamma</math>-contraction in the sup-norm, since the rows of <math>P_{\pi}</math> are probability distributions:
<math>
\|L_{\pi}u - L_{\pi}v\|_{\infty} = \gamma \|P_{\pi}(u - v)\|_{\infty} \leq \gamma \|u - v\|_{\infty}
</math>
By Banach's Fixed Point Theorem, <math>L_{\pi}</math> has a unique fixed point, namely <math>v_{\pi}</math>, and we converge to it by repeatedly applying <math>L_{\pi}</math> from any starting point.
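A hedged sketch of this iteration in NumPy (same illustrative names as above; the stopping tolerance is an assumption):

<syntaxhighlight lang="python">
import numpy as np

def evaluate_policy_iterative(P_pi, R, gamma, tol=1e-8):
    """Iterate v <- L_pi(v) = R + gamma * P_pi @ v to the fixed point.

    The contraction property guarantees convergence from any start;
    each sweep costs O(|S|^2) instead of the O(|S|^3) direct solve.
    """
    v = np.zeros_like(R, dtype=float)
    while True:
        v_new = R + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
</syntaxhighlight>

Each application of the operator shrinks the sup-norm error by a factor of <math>\gamma</math>, so the number of sweeps needed grows only logarithmically in the desired accuracy.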
===Optimal Policy===
Taking the elementwise maximum over policies, <math>v^*(s) = \max_{\pi} v_{\pi}(s)</math>, defines the optimal value function; a policy attaining this maximum is an optimal policy <math>\pi^*(s)</math>.
Bellman (1957) showed that every MDP admits an optimal policy that is deterministic and satisfies <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s</math> and all <math>\pi</math>. The optimal policy may not be unique, but all optimal policies share the same value function.
Intermediate questions: | |||
* Given <math>v^*</math>, can we compute <math>Q^*</math>? (See the sketch below.)
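One standard answer to this question is a single one-step backup of <math>v^*</math> through the model: <math>Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v^*(s')</math>. A hedged NumPy sketch, assuming the full reward array and transition tensor are available (all names illustrative):

<syntaxhighlight lang="python">
import numpy as np

def q_from_v(P, R, v_star, gamma):
    """One-step backup: Q*(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) v*(s').

    Shapes: P is (|S|, |A|, |S|), R is (|S|, |A|), v_star is (|S|,).
    The matmul contracts over the trailing s' axis, giving a (|S|, |A|) array.
    """
    return R + gamma * P @ v_star
</syntaxhighlight>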
==Misc==