
However, computing the inverse is computationally expensive since it scales with <math>|S|^3</math>.
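As a minimal sketch (assuming the reward vector <math>R</math> and transition matrix <math>P_{\pi}</math> are available as NumPy arrays; the function name below is illustrative), the closed-form solution <math>V_{\pi} = (I - \gamma P_{\pi})^{-1} R</math> can be computed with a linear solve, which still costs <math>O(|S|^3)</math>:
<syntaxhighlight lang="python">
import numpy as np

def policy_evaluation_direct(R, P_pi, gamma=0.9):
    """Solve V = R + gamma * P_pi @ V exactly, i.e. V = (I - gamma * P_pi)^{-1} R.

    R:    reward vector, shape (|S|,)
    P_pi: state-transition matrix under policy pi, shape (|S|, |S|)
    """
    n = len(R)
    # np.linalg.solve avoids forming the explicit inverse, but the cost is still O(|S|^3).
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
</syntaxhighlight>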


===Value-Iteration (Bellman Recursion)===
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>   
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math>.   
<math>L_{\pi}</math> is a <math>\gamma</math>-contraction in the sup norm:   
<math>
\Vert L_{\pi}u - L_{\pi}v \Vert_{\infty} = \gamma \Vert P_{\pi}(u - v) \Vert_{\infty} \leq \gamma \Vert u - v \Vert_{\infty}
</math>   
By Banach's Fixed Point Theorem, repeatedly applying <math>L_{\pi}</math> therefore converges to the unique fixed point <math>V_{\pi}</math>.
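A minimal sketch of the resulting iteration (the function name and stopping tolerance below are illustrative; <math>R</math> and <math>P_{\pi}</math> are assumed to be NumPy arrays):
<syntaxhighlight lang="python">
import numpy as np

def policy_evaluation_iterative(R, P_pi, gamma=0.9, tol=1e-8):
    """Repeatedly apply L_pi(v) = R + gamma * P_pi @ v until the values stop changing.

    Because L_pi is a gamma-contraction, the iterates converge to the unique
    fixed point V_pi (Banach's Fixed Point Theorem).
    """
    v = np.zeros_like(R, dtype=float)
    while True:
        v_next = R + gamma * P_pi @ v   # one application of L_pi
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
</syntaxhighlight>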
===Optimal Policy===
Taking the elementwise max over policies, <math>v^*(s) = \max_{\pi} v_{\pi}(s)</math>, gives the optimal value function and a corresponding optimal policy <math>\pi^*(s)</math>.
Bellman (1957) showed that every MDP admits an optimal policy that is deterministic and satisfies <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all states <math>s</math> and all policies <math>\pi</math>. The optimal policy need not be unique, but all optimal policies share the same value function.
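As an illustrative sketch of reading off such a deterministic policy by acting greedily with respect to <math>v^*</math>, i.e. <math>\pi^*(s) = \arg\max_a \big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) \, v^*(s') \big]</math> (the state-action reward array <code>R_sa</code> and transition tensor <code>P</code> below are assumptions, not defined in these notes):
<syntaxhighlight lang="python">
import numpy as np

def greedy_policy(v_star, R_sa, P, gamma=0.9):
    """Extract a deterministic policy that is greedy with respect to v*.

    v_star: optimal state values, shape (|S|,)
    R_sa:   state-action rewards, shape (|S|, |A|)
    P:      transition probabilities, shape (|S|, |A|, |S|)
    """
    # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * v*(s')
    Q = R_sa + gamma * P @ v_star   # shape (|S|, |A|)
    return np.argmax(Q, axis=1)     # one greedy action per state
</syntaxhighlight>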
Intermediate questions:
* Given <math>v^*</math>, can we compute <math>Q^*</math>?


==Misc==