Deep Learning: Difference between revisions

Deep Learning (view source)

408 bytes added , 3 December 2020

5,337

edits

@@ Line 2,050: / Line 2,050: @@
 Elementwise max: <math>\max_{\pi} v_{\pi}(s)</math> leads to the optimal policy <math>\pi^*(s)</math>.
-Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>. The policy may not be unique but if two policies are equal, they have the same value function.
+Bellman (1957) showed that for an MDP, there exists an optimal policy that is deterministic and such that <math>v_{\pi^*}(s) \geq v_{\pi}(s)</math> for all <math>s, \pi</math>.
+The policy may not be unique but if two policies are equal, they have the same value function.
 Intermediate questions:
 * Given <math>v^*</math>, can we compute <math>Q^*</math>?
+<math>
+\begin{aligned}
+Q^*(s, a) &= \max_{\pi} Q(s,a)\\
+&= \max_{\pi} R(s) + \gamma \sum_{s'} p(s' | s,a) v_{\pi}(s')\\
+&= R(s) + \gamma \sum_{s'} P(s' | s, a) \max_{\pi}(v_{\pi}(s'))\\
+\end{aligned}
+</math>
+* Given <math>Q^*</math>, can we compute <math>v^*</math>.
+Bellman's optimality states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>.
+<math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>
 ==Misc==