Thus <math>v^*</math> can be computed by repeated application of <math>L</math>.   


===Value Iteration===
* Start with a random <math>v^{(0)} \in \mathbb{R}^d</math>
* <math>v^{(r+1)}(s) = \max_{a} [R(s) + \gamma \sum_{s'} p(s'|s,a) v^{(r)}(s')]</math>
* Repeat until <math>\Vert v^{(r+1)} - v^{(r)} \Vert \leq \epsilon</math>
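
Below is a minimal NumPy sketch of this update, assuming the finite MDP is given as a transition tensor <code>P[a, s, s']</code> <math>= p(s'|s,a)</math> and a state-only reward vector <code>R[s]</code>; the function name and array layout are illustrative, not from the notes.

<syntaxhighlight lang="python">
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Value iteration on a finite MDP.

    P: array of shape (A, S, S), P[a, s, s'] = p(s' | s, a)
    R: array of shape (S,), state-only reward R(s)
    Returns an approximation of the optimal value function v*.
    """
    num_actions, num_states, _ = P.shape
    v = np.random.rand(num_states)        # random v^(0)
    while True:
        # Bellman optimality backup:
        # v^(r+1)(s) = max_a [ R(s) + gamma * sum_s' p(s'|s,a) v^(r)(s') ]
        q = R[None, :] + gamma * (P @ v)  # shape (A, S)
        v_next = q.max(axis=0)
        if np.linalg.norm(v_next - v) <= eps:
            return v_next
        v = v_next
</syntaxhighlight>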
===Policy Iteration===
Directly optimize the policy.
* Start with a random policy <math>\pi^{(0)}</math>
* Evaluate the policy, i.e. compute its value function <math>v_{\pi}</math>, using the closed form or iterative evaluation
* Evaluate the Q-function <math>Q_{\pi}(s, a) = R(s) + \gamma \sum_{s'} p(s'|s,a) v_{\pi}(s')</math>
* Find the best action to be taken at state <math>s</math>:
** <math>a^*(s) = \operatorname{argmax}_{a} Q_{\pi}(s, a)</math>
* Update <math>\pi</math> using <math>\pi(s) = a^*(s)</math>
* Repeat until convergence.
Policy iteration is guaranteed to converge to an optimal policy, and it often converges faster than value iteration.
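
A corresponding NumPy sketch of policy iteration, under the same assumed <code>P</code>/<code>R</code> layout as above; the evaluation step uses the standard closed form <math>v_{\pi} = (I - \gamma P_{\pi})^{-1} R</math>.

<syntaxhighlight lang="python">
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration on a finite MDP.

    P: array of shape (A, S, S), P[a, s, s'] = p(s' | s, a)
    R: array of shape (S,), state-only reward R(s)
    Returns an optimal deterministic policy and its value function.
    """
    num_actions, num_states, _ = P.shape
    rng = np.random.default_rng(0)
    pi = rng.integers(num_actions, size=num_states)   # random pi^(0)
    while True:
        # Policy evaluation (closed form): v_pi = (I - gamma * P_pi)^(-1) R
        P_pi = P[pi, np.arange(num_states)]            # shape (S, S)
        v_pi = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R)
        # Q-function: Q_pi(s, a) = R(s) + gamma * sum_s' p(s'|s,a) v_pi(s')
        q = R[None, :] + gamma * (P @ v_pi)            # shape (A, S)
        # Policy improvement: pi(s) = argmax_a Q_pi(s, a)
        pi_next = q.argmax(axis=0)
        if np.array_equal(pi_next, pi):                # converged
            return pi, v_pi
        pi = pi_next
</syntaxhighlight>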


==Misc==