Thus <math>v^*</math> can be computed by repeated application of <math>L</math>.
===Value Iteration===
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{(r)}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math> (see the sketch after this list)
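A minimal NumPy sketch of this loop, assuming the MDP is represented by a (hypothetical) reward vector <code>R</code> of shape <code>(S,)</code> and a transition tensor <code>P</code> of shape <code>(A, S, S)</code> with <code>P[a, s, s'] = p(s'|s,a)</code>; these names are illustrative, not from the notes:

<syntaxhighlight lang="python">
import numpy as np

def value_iteration(R, P, gamma=0.9, eps=1e-6):
    # R: reward per state, shape (S,); P: transitions, shape (A, S, S)
    # with P[a, s, s'] = p(s'|s,a). Assumed representation, not from the notes.
    num_actions, num_states, _ = P.shape
    v = np.random.rand(num_states)             # random v^(0)
    while True:
        # Q[a, s] = R(s) + gamma * sum_{s'} p(s'|s,a) v^(r)(s')
        Q = R[None, :] + gamma * (P @ v)       # shape (A, S)
        v_next = Q.max(axis=0)                 # maximize over actions
        if np.linalg.norm(v_next - v) <= eps:  # ||v^(r+1) - v^(r)|| <= eps
            return v_next
        v = v_next
</syntaxhighlight>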
===Policy Iteration===
Directly optimize the policy.
* Start with a random policy <math>\pi^{(0)}</math>
* Evaluate the policy: compute <math>v_{\pi}</math> using the closed form <math>v_{\pi} = (I - \gamma P_{\pi})^{-1} R</math>, where <math>P_{\pi}(s, s') = p(s'|s, \pi(s))</math>, or using value iteration
* Compute the Q-function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} p(s'|s,a) v_{\pi}(s')</math>
* Find the best action to be taken at state <math>s</math>:
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math>
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math>
* Repeat until convergence.
Policy iteration is guaranteed to converge to an optimal policy. It often converges faster than value iteration; a minimal Python sketch follows below.
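Under the same assumed representation as the value-iteration sketch, policy evaluation here uses the closed form <math>v_{\pi} = (I - \gamma P_{\pi})^{-1} R</math>, which is well defined since <math>\gamma < 1</math>:

<syntaxhighlight lang="python">
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    # R: reward per state, shape (S,); P: transitions, shape (A, S, S).
    # Names are illustrative assumptions, as in the value-iteration sketch.
    num_actions, num_states, _ = P.shape
    pi = np.random.randint(num_actions, size=num_states)  # random pi^(0)
    while True:
        # Policy evaluation (closed form): v_pi = (I - gamma * P_pi)^{-1} R
        P_pi = P[pi, np.arange(num_states)]                # P_pi[s, s'] = p(s'|s, pi(s))
        v_pi = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R)
        # Policy improvement: Q_pi(s, a) = R(s) + gamma * sum_{s'} p(s'|s,a) v_pi(s')
        Q = R[None, :] + gamma * (P @ v_pi)                # shape (A, S)
        pi_next = Q.argmax(axis=0)                         # a*(s)
        if np.array_equal(pi_next, pi):                    # converged: policy unchanged
            return pi, v_pi
        pi = pi_next
</syntaxhighlight>

The loop terminates because there are finitely many deterministic policies and each improvement step can only increase <math>v_{\pi}</math>.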
==Misc==