Thus <math>v^*</math> can be computed by repeated application of <math>L</math>.
===Value Iteration===
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{(r)}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math> (see the sketch after this list)
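A minimal NumPy sketch of this loop, assuming the MDP is represented by a (hypothetical) reward vector <code>R</code> of shape <code>(S,)</code> and a transition tensor <code>P</code> of shape <code>(A, S, S)</code> with <code>P[a, s, s'] = p(s'|s,a)</code>; these names are illustrative, not from the notes:

<syntaxhighlight lang="python">
import numpy as np

def value_iteration(R, P, gamma=0.9, eps=1e-6):
    # R: reward per state, shape (S,); P: transitions, shape (A, S, S)
    # with P[a, s, s'] = p(s'|s,a). Assumed representation, not from the notes.
    num_actions, num_states, _ = P.shape
    v = np.random.rand(num_states)             # random v^(0)
    while True:
        # Q[a, s] = R(s) + gamma * sum_{s'} p(s'|s,a) v^(r)(s')
        Q = R[None, :] + gamma * (P @ v)       # shape (A, S)
        v_next = Q.max(axis=0)                 # maximize over actions
        if np.linalg.norm(v_next - v) <= eps:  # ||v^(r+1) - v^(r)|| <= eps
            return v_next
        v = v_next
</syntaxhighlight>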
===Policy Iteration===
Directly optimize the policy.
* Start with a random policy <math>\pi^{(0)}</math>
* Evaluate the policy: compute <math>v_{\pi}</math> using the closed form <math>v_{\pi} = (I - \gamma P_{\pi})^{-1} R</math>, where <math>P_{\pi}(s, s') = p(s'|s, \pi(s))</math>, or using value iteration
* Compute the Q-function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} p(s'|s,a) v_{\pi}(s')</math>
* Find the best action to be taken at state <math>s</math>:
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math>
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math>
* Repeat until convergence.
Policy iteration is guaranteed to converge to an optimal policy. It often converges faster than value iteration; a minimal Python sketch follows below.
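Under the same assumed representation as the value-iteration sketch, policy evaluation here uses the closed form <math>v_{\pi} = (I - \gamma P_{\pi})^{-1} R</math>, which is well defined since <math>\gamma < 1</math>:

<syntaxhighlight lang="python">
import numpy as np

def policy_iteration(R, P, gamma=0.9):
    # R: reward per state, shape (S,); P: transitions, shape (A, S, S).
    # Names are illustrative assumptions, as in the value-iteration sketch.
    num_actions, num_states, _ = P.shape
    pi = np.random.randint(num_actions, size=num_states)  # random pi^(0)
    while True:
        # Policy evaluation (closed form): v_pi = (I - gamma * P_pi)^{-1} R
        P_pi = P[pi, np.arange(num_states)]                # P_pi[s, s'] = p(s'|s, pi(s))
        v_pi = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R)
        # Policy improvement: Q_pi(s, a) = R(s) + gamma * sum_{s'} p(s'|s,a) v_pi(s')
        Q = R[None, :] + gamma * (P @ v_pi)                # shape (A, S)
        pi_next = Q.argmax(axis=0)                         # a*(s)
        if np.array_equal(pi_next, pi):                    # converged: policy unchanged
            return pi, v_pi
        pi = pi_next
</syntaxhighlight>

The loop terminates because there are finitely many deterministic policies and each improvement step can only increase <math>v_{\pi}</math>.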
==Misc==