Policy iteration is guaranteed to converge to an optimal policy.
It often converges faster than value iteration (a tabular sketch follows).
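As a rough illustration (not from the notes), here is a minimal tabular policy-iteration sketch in Python; the transition tensor <code>P</code> and reward matrix <code>R</code> are assumed to be given, and their names are only illustrative.
<syntaxhighlight lang="python">
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration.

    P: transition tensor of shape (S, A, S), P[s, a, s'] = P(s'|s, a)
    R: reward matrix of shape (S, A)
    Returns a deterministic policy (length-S array of actions) and its value.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(S), policy]          # (S, S)
        r_pi = R[np.arange(S), policy]          # (S,)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy w.r.t. Q(s, a) = R(s, a) + gamma * E[v(s')].
        Q = R + gamma * P @ v                   # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):  # policy stable -> optimal
            return policy, v
        policy = new_policy
</syntaxhighlight>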
===Deep Reinforcement Learning===
;Relaxing some unrealistic assumptions
# Evaluate <math>v_{\pi}(s)</math>
#* <math>Q_{\pi}(s, a) = R(s, a) + \gamma E_{s' \sim P(s'|s,a)}[v_{\pi}(s')]</math>
# Improve the policy
#* <math>\operatorname{argmax}_{a_t} Q_{\pi}(s_t, a_t)</math>
#* Assumption: the state space is finite, <math>|S|=d</math> (i.e. a tabular representation).
# How to represent <math>V(s)</math>?
#* Can we use a neural network to represent <math>V(s)</math>? Yes (see the sketch below).
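As a minimal sketch (not from the lecture), a value network <math>v_{\phi}</math> can simply be a small MLP that maps a state vector to a scalar; the PyTorch layer sizes below are arbitrary assumptions.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Minimal value network v_phi: maps a d-dimensional state to a scalar value.
class ValueNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)   # shape (batch,)
</syntaxhighlight>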
;How to train <math>v_{\phi}</math>?
* Start with an old <math>v_{\phi}</math>, compute <math>Q_{\pi}(s,a)</math>.
** <math>Q_{\pi}(s,a) = R(s,a) + \gamma E[v_{\phi}^{old}(s')]</math>
* Fit <math>v_{\phi}</math> to <math>\max_{a}Q_{\pi}(s,a)</math> using a quadratic loss:
** <math>L(\phi) = \frac{1}{2} \Vert v_{\phi}(s) - \max_{a} Q_{\pi}(s,a) \Vert^2</math>
* Iterate. (A sketch of this loop is given below.)
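A minimal sketch of this fitted value iteration loop, assuming a small random MDP with known <math>P</math> and <math>R</math> (model-based) and one-hot state features; all sizes and the network architecture are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np
import torch
import torch.nn as nn

# Fitted value iteration on a small random MDP (a sketch with assumed P and R).
S, A, gamma = 10, 3, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))           # P[s, a, s'] = P(s'|s, a)
R = rng.normal(size=(S, A))                          # R[s, a]

states = torch.eye(S)                                # one-hot state features
P_t = torch.tensor(P, dtype=torch.float32)
R_t = torch.tensor(R, dtype=torch.float32)

v_phi = nn.Sequential(nn.Linear(S, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(v_phi.parameters(), lr=1e-2)

for it in range(200):
    # Q(s, a) = R(s, a) + gamma * E_{s'}[v_phi_old(s')], with the old v_phi frozen.
    with torch.no_grad():
        v_old = v_phi(states).squeeze(-1)            # (S,)
        Q = R_t + gamma * P_t @ v_old                # (S, A)
        target = Q.max(dim=1).values                 # max_a Q(s, a)
    # Fit v_phi to the target with a quadratic loss, then iterate.
    loss = 0.5 * ((v_phi(states).squeeze(-1) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
</syntaxhighlight>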
;Similarly, we can parameterize the Q function
Compute the target <math>y_i \leftarrow R(s_i, a_i) + \gamma E[v_{\phi}(s_i')]</math>.
We need to know the transition probabilities <math>P(s'|s,a)</math> to compute the expectation (model-based RL).
With model-free RL:
We can approximate the expectation with the single sampled next state: <math>E[v(s_i')] \approx v(s_i') = \max_{a'} Q(s_i', a')</math>.
This is called ''Q-Learning''. (See the sketch below.)
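A rough model-free sketch of this update (assumptions: discrete actions, a PyTorch Q-network with illustrative sizes, and a random batch standing in for collected transitions):
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# Q-learning target y_i = r_i + gamma * max_{a'} Q(s_i', a'), fit with a quadratic loss.
state_dim, num_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_step(s, a, r, s_next):
    """One gradient step on a batch of transitions (s, a, r, s')."""
    with torch.no_grad():
        # Model-free target: E[v(s')] replaced by max_{a'} Q(s', a') at the sampled s'.
        y = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_i, a_i)
    loss = 0.5 * ((q_sa - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example call with a random batch of 32 transitions.
s = torch.randn(32, state_dim)
a = torch.randint(num_actions, (32,))
r = torch.randn(32)
s_next = torch.randn(32, state_dim)
q_learning_step(s, a, r, s_next)
</syntaxhighlight>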
;What if we have continuous actions?
* Approach 1: Use a function class such that <math>\max_{a}Q(s,a)</math> is easy to solve.
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer <math>\operatorname{argmax}_{a} Q(s,a)</math>. (A sketch follows this list.)
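A minimal sketch of Approach 2 (assuming continuous actions and illustrative network sizes): a second network <math>\mu_{\theta}(s)</math> is pushed toward <math>\operatorname{argmax}_{a} Q(s,a)</math> by gradient ascent on <math>Q(s, \mu_{\theta}(s))</math>, in the spirit of a DDPG-style actor.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

# mu_net approximates the maximizer of Q over continuous actions (names are illustrative).
state_dim, action_dim = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
mu_opt = torch.optim.Adam(mu_net.parameters(), lr=1e-3)

def maximizer_step(s):
    """Push mu_theta(s) toward argmax_a Q(s, a) by gradient ascent on Q(s, mu_theta(s))."""
    a = mu_net(s)
    q = q_net(torch.cat([s, a], dim=1))
    loss = -q.mean()          # maximizing Q; only mu_net's optimizer steps, so q_net is unchanged
    mu_opt.zero_grad()
    loss.backward()
    mu_opt.step()
    return loss.item()

maximizer_step(torch.randn(32, state_dim))
</syntaxhighlight>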
==Misc==