Policy iteration is guaranteed to converge to an optimal policy.   
It often converges faster than value iteration.
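A minimal tabular sketch of policy iteration, assuming the transition probabilities <math>P</math> and rewards <math>R</math> are given as arrays (hypothetical inputs, not from the notes); it alternates exact policy evaluation with greedy improvement and stops when the policy no longer changes.
<syntaxhighlight lang="python">
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration.

    P: transition probabilities, shape (S, A, S); R: rewards, shape (S, A).
    Returns a deterministic optimal policy and its value function.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)              # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = R_pi exactly.
        P_pi = P[np.arange(S), policy]           # (S, S) rows for the chosen actions
        R_pi = R[np.arange(S), policy]           # (S,)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q(s, a).
        Q = R + gamma * P @ v                    # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # fixed point: policy is optimal
            return policy, v
        policy = new_policy
</syntaxhighlight>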
===Deep Reinforcement Learning===
;Relaxing some unrealistic assumptions
# Evaluate <math>v_{\pi}(s)</math>
#* <math>Q_{\pi}(s, a) = R(s, a) + \gamma E_{s' \sim P(s'|s,a)}[v_{\pi}(s')]</math>
# Improve the policy
#* <math>\operatorname{argmax}_{a_t} Q_{\pi}(s_t, a_t)</math>
#* The tabular setting assumes a finite state space with <math>|S| = d</math> states.
# How to represent <math>V(s)</math>?
#* Can we use a neural network to represent <math>V(s)</math>? Yes: parameterize it as <math>v_{\phi}(s)</math> with weights <math>\phi</math> (see the sketch after this list).
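A minimal PyTorch sketch of representing <math>V(s)</math> with a small neural network; the layer sizes and activations are illustrative choices, not prescribed by the notes.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Small MLP v_phi(s): maps a state vector to a scalar value estimate."""
    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, s):
        # Returns one value per state in the batch, shape (batch,).
        return self.net(s).squeeze(-1)
</syntaxhighlight>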
;How to train <math>v_{\phi}</math>?
* Start with an old <math>v_{\phi}</math>, compute <math>Q_{\pi}(s,a)</math>. 
** <math>Q_{\pi}(s,a) = R(s,a) + \gamma E[v_{\phi}^{old}(s')]</math> 
* Fit <math>v_{\phi}</math> to <math>\max_{a}Q_{\pi}(s,a)</math> using a quadratic loss: 
** <math>L(\phi) = \frac{1}{2} \Vert v_{\phi}(s) - \max_{a} Q_{\pi}(s,a) \Vert^2</math> 
* Iterate (a sketch of one such step follows below).
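A sketch of one fit-and-iterate step, under the simplifying assumption that for each sampled state the rewards <math>R(s,a)</math> and one successor state per action are available (standing in for the expectation over <math>P(s'|s,a)</math>); the tensor names are hypothetical, and <code>v_phi</code> / <code>v_phi_old</code> can be instances of the <code>ValueNetwork</code> above.
<syntaxhighlight lang="python">
import torch

def fitted_value_iteration_step(v_phi, v_phi_old, optimizer,
                                states, rewards, next_states, gamma=0.99):
    """One fitting step for v_phi against the old value network.

    states:      (B, D)     batch of states s
    rewards:     (B, A)     R(s, a) for every action (assumed known here)
    next_states: (B, A, D)  one successor s' per (s, a) pair, standing in for
                            the expectation over P(s'|s,a)
    """
    B, A, D = next_states.shape
    with torch.no_grad():
        # Q(s, a) = R(s, a) + gamma * v_old(s'), then the target max_a Q(s, a).
        v_next = v_phi_old(next_states.reshape(B * A, D)).reshape(B, A)
        target = (rewards + gamma * v_next).max(dim=1).values

    # Quadratic loss 1/2 || v_phi(s) - max_a Q(s, a) ||^2, averaged over the batch.
    loss = 0.5 * (v_phi(states) - target).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</syntaxhighlight>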
;Similarly, we can parameterize the Q function
Compute the target <math>y_i \leftarrow R(s_i, a_i) + \gamma E[v_{\phi}(s_i')]</math>.
We need to know the transition probabilities <math>P(s'|s,a)</math> to compute the expectation (model-based RL).
With model-free RL: 
We approximate the expectation using the single observed next state: <math>E[v(s_i')] \approx v(s_i') = \max_{a'} Q(s_i', a')</math>.
This is called ''Q-Learning''.
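A sketch of the model-free Q-learning target under these assumptions: transitions are sampled directly from the environment, and <code>q_net_old</code> (a hypothetical network) returns a vector of Q-values over actions for each next state.
<syntaxhighlight lang="python">
import torch

def q_learning_target(q_net_old, rewards, next_states, gamma=0.99):
    """Model-free TD target y_i = r_i + gamma * max_{a'} Q_old(s_i', a').

    The expectation over s' is replaced by the single observed next state,
    so no transition model P(s'|s,a) is needed.  q_net_old(next_states) is
    assumed to return Q-values of shape (batch, num_actions).
    """
    with torch.no_grad():
        return rewards + gamma * q_net_old(next_states).max(dim=1).values
</syntaxhighlight>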
;What if we have continuous actions?
* Approach 1: Use a function class such that <math>\max_{a}Q(s,a)</math> is easy to solve
** [Gu ''et al.'' 2016] use quadratic functions.
* Approach 2: Learn another network to approximate the maximizer <math>\operatorname{argmax}_{a} Q(s,a)</math> (see the sketch below).
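A sketch of Approach 2 in the style of deterministic actor-critic methods such as DDPG: a second network <math>\mu(s)</math> is trained by gradient ascent on <math>Q(s, \mu(s))</math> so that it approximates the maximizer. The network shapes, the bounded-action assumption, and the assumption that the critic takes <math>(s, a)</math> pairs are all illustrative.
<syntaxhighlight lang="python">
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s): a network trained to approximate argmax_a Q(s, a) for continuous actions."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),   # actions bounded in [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

def actor_update(actor, q_net, actor_opt, states):
    """Push mu(s) toward the maximizer by ascending Q(s, mu(s)).

    q_net is assumed to take a (state, action) pair and return one value per sample.
    """
    loss = -q_net(states, actor(states)).mean()   # minimizing -Q == maximizing Q
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
</syntaxhighlight>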


==Misc==