
However, computing the inverse is computationally expensive since it scales with <math>|S|^3</math>.
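As a minimal sketch (assuming the reward vector <math>R</math> and transition matrix <math>P_{\pi}</math> are available as NumPy arrays; the function name below is illustrative), the closed-form solution <math>V_{\pi} = (I - \gamma P_{\pi})^{-1} R</math> can be computed with a linear solve, which still costs <math>O(|S|^3)</math>:
<syntaxhighlight lang="python">
import numpy as np

def policy_evaluation_direct(R, P_pi, gamma=0.9):
    """Solve V = R + gamma * P_pi @ V exactly, i.e. V = (I - gamma * P_pi)^{-1} R.

    R:    reward vector, shape (|S|,)
    P_pi: state-transition matrix under policy pi, shape (|S|, |S|)
    """
    n = len(R)
    # np.linalg.solve avoids forming the explicit inverse, but the cost is still O(|S|^3).
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
</syntaxhighlight>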


===Value-Iteration (Bellman Recursion)===
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>   
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math>.   
<math>L_{\pi}</math> is a <math>\gamma</math>-contraction in the sup norm:   
<math>
\Vert L_{\pi}u - L_{\pi}v \Vert_{\infty} = \gamma \Vert P_{\pi}(u - v) \Vert_{\infty} \leq \gamma \Vert u - v \Vert_{\infty}
</math>   
By Banach's Fixed Point Theorem, repeatedly applying <math>L_{\pi}</math> therefore converges to the unique fixed point <math>V_{\pi}</math>.
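A minimal sketch of the resulting iteration (the function name and stopping tolerance below are illustrative; <math>R</math> and <math>P_{\pi}</math> are assumed to be NumPy arrays):
<syntaxhighlight lang="python">
import numpy as np

def policy_evaluation_iterative(R, P_pi, gamma=0.9, tol=1e-8):
    """Repeatedly apply L_pi(v) = R + gamma * P_pi @ v until the values stop changing.

    Because L_pi is a gamma-contraction, the iterates converge to the unique
    fixed point V_pi (Banach's Fixed Point Theorem).
    """
    v = np.zeros_like(R, dtype=float)
    while True:
        v_next = R + gamma * P_pi @ v   # one application of L_pi
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
</syntaxhighlight>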
===Optimal Policy===
Taking the elementwise max over policies, <math>v^*(s) = \max_{\pi} v_{\pi}(s)</math>, gives the optimal value function and a corresponding optimal policy <math>\pi^*(s)</math>.
Bellman (1957) showed that every MDP admits an optimal policy that is deterministic and satisfies <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all states <math>s</math> and all policies <math>\pi</math>. The optimal policy need not be unique, but all optimal policies share the same value function.
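As an illustrative sketch of reading off such a deterministic policy by acting greedily with respect to <math>v^*</math>, i.e. <math>\pi^*(s) = \arg\max_a \big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \, v^*(s') \big]</math> (the state-action reward array <code>R_sa</code> and transition tensor <code>P</code> below are assumptions, not defined in these notes):
<syntaxhighlight lang="python">
import numpy as np

def greedy_policy(v_star, R_sa, P, gamma=0.9):
    """Extract a deterministic policy that is greedy with respect to v*.

    v_star: optimal state values, shape (|S|,)
    R_sa:   state-action rewards, shape (|S|, |A|)
    P:      transition probabilities, shape (|S|, |A|, |S|)
    """
    # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * v*(s')
    Q = R_sa + gamma * P @ v_star   # shape (|S|, |A|)
    return np.argmax(Q, axis=1)     # one greedy action per state
</syntaxhighlight>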
Intermediate questions:
* Given <math>v^*</math>, can we compute <math>Q^*</math>?


==Misc==