Bellman's optimality principle states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>.
The optimal policy is then <math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>.
;How to compute the optimal policy?
First approach: Value Iteration
<math>v^*(s) = \max_{a} Q^*(s, a) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^*(s')]</math>
Define an operator: <math>L(v) = \max_{\pi}[R + \gamma P_{\pi} v]</math>
<math>v^*</math> is a fixed point of <math>L</math>.
Claim: <math>L</math> is also a contraction (with factor <math>\gamma</math> in the sup norm).
Thus <math>v^*</math> can be computed by repeated application of <math>L</math>: by the Banach fixed-point theorem, the iterates converge to the unique fixed point.
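A one-line sketch of the contraction claim, using the elementary fact that <math>|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|</math>:

<math>\Vert L(u) - L(v) \Vert_{\infty} \leq \max_{s,a}\, \gamma \sum_{s'} p(s'|s,a)\, |u(s') - v(s')| \leq \gamma \Vert u - v \Vert_{\infty}</math>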
Value Iteration:
* Start with an arbitrary <math>v^{(0)} \in \mathbb{R}^{|S|}</math> (one entry per state)
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{(r)}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert_{\infty} \leq \epsilon</math>
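The iteration above can be sketched as follows. This is a minimal illustration on a hypothetical two-state, two-action MDP (the rewards <code>R</code>, transitions <code>P</code>, and <code>gamma</code> below are made-up assumptions, not from the notes); the greedy policy extraction implements <math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>.

```python
# Value iteration sketch on a hypothetical 2-state, 2-action MDP.
# R, P, and GAMMA are illustrative assumptions.

GAMMA = 0.9
EPS = 1e-8

# State-based reward R(s), as in the notes.
R = [0.0, 1.0]
# P[s][a][s2] = p(s2 | s, a)
P = [
    [[0.8, 0.2], [0.1, 0.9]],  # from state 0, under actions 0 and 1
    [[0.5, 0.5], [0.0, 1.0]],  # from state 1, under actions 0 and 1
]

def q_value(v, s, a, R, P, gamma):
    # Q(s, a) = R(s) + gamma * sum_{s2} p(s2|s,a) v(s2)
    return R[s] + gamma * sum(P[s][a][s2] * v[s2] for s2 in range(len(R)))

def value_iteration(R, P, gamma=GAMMA, eps=EPS):
    n_states, n_actions = len(R), len(P[0])
    v = [0.0] * n_states  # v^(0); any starting point converges
    while True:
        v_next = [
            max(q_value(v, s, a, R, P, gamma) for a in range(n_actions))
            for s in range(n_states)
        ]
        # Stop once the sup-norm change is at most eps.
        if max(abs(v_next[s] - v[s]) for s in range(n_states)) <= eps:
            return v_next
        v = v_next

def greedy_policy(v, R, P, gamma=GAMMA):
    # pi(s) = argmax_a Q(s, a)
    n_actions = len(P[0])
    return [
        max(range(n_actions), key=lambda a: q_value(v, s, a, R, P, gamma))
        for s in range(len(R))
    ]

v_star = value_iteration(R, P)
pi_star = greedy_policy(v_star, R, P)
print(v_star, pi_star)
```

In this toy MDP both states prefer action 1, which steers toward the rewarding state 1; note that the stopping rule only guarantees <math>v^{(r)}</math> is within <math>\gamma \epsilon / (1 - \gamma)</math> of <math>v^*</math>, so <math>\epsilon</math> should be chosen accordingly.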
==Misc==