Bellman's optimality states that <math>v^*(s) = \max_{a} Q^*(s, a)</math>.
The optimal policy is greedy with respect to <math>Q^*</math>: <math>\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)</math>.
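Here <math>Q^*</math> denotes the optimal action-value function; expanding it one step in terms of <math>v^*</math> (the form used in the value-iteration update below) gives
<math>Q^*(s, a) = R(s) + \gamma \sum_{s'} p(s'|s,a)\, v^*(s')</math>.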
;How to compute the optimal policy?
First approach: Value Iteration
<math>v^*(s) = \max_{a} Q^*(s, a) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^*(s')]</math>
Define an operator: <math>L(v) = \max_{\pi}[R + \gamma P_{\pi} v]</math> 
<math>v^*</math> is a fixed point of <math>L(v)</math>. 
Claim: <math>L</math> is also a contraction, with factor <math>\gamma</math> in the sup norm. 
Thus, by the Banach fixed-point theorem, <math>v^*</math> is the unique fixed point and can be computed by repeated application of <math>L</math> from any starting point. 
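A sketch of the standard argument for the contraction claim, using the fact that the pointwise maximum is non-expansive: for any <math>u, v</math>,
<math>\Vert L(u) - L(v) \Vert_\infty \leq \max_{s,a}\, \gamma \sum_{s'} p(s'|s,a)\, |u(s') - v(s')| \leq \gamma \Vert u - v \Vert_\infty.</math>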
Value Iteration:
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{(r)}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math>
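
A minimal NumPy sketch of the procedure above for a tabular MDP. The array layout, function name, and tolerance are illustrative assumptions, not fixed by the notes: <code>P[a, s, s2]</code> stores <math>p(s'|s,a)</math> and <code>R[s]</code> stores the state reward <math>R(s)</math>.

<syntaxhighlight lang="python">
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Tabular value iteration (illustrative sketch).

    P: array of shape (num_actions, num_states, num_states), P[a, s, s2] = p(s2 | s, a).
    R: array of shape (num_states,) with state rewards R(s).
    Returns the estimated v* and the greedy policy argmax_a Q*(s, a).
    """
    num_actions, num_states, _ = P.shape
    rng = np.random.default_rng(0)
    v = rng.random(num_states)                 # random v^(0)
    while True:
        # Q[a, s] = R(s) + gamma * sum_{s'} p(s'|s,a) v(s')
        Q = R[None, :] + gamma * (P @ v)       # shape (num_actions, num_states)
        v_next = Q.max(axis=0)                 # apply the operator L
        if np.max(np.abs(v_next - v)) <= eps:  # sup-norm stopping test
            v = v_next
            break
        v = v_next
    policy = Q.argmax(axis=0)                  # greedy policy pi*(s)
    return v, policy

# Usage on a small hypothetical 2-state, 2-action MDP:
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([0.0, 1.0])
v_star, pi_star = value_iteration(P, R)
</syntaxhighlight>

The stopping test uses the sup norm, which is the norm in which <math>L</math> is a <math>\gamma</math>-contraction, so the distance to <math>v^*</math> shrinks by a factor of <math>\gamma</math> per iteration.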


==Misc==