==Reinforcement Learning==
Lecture 27 (December 1, 2020)
;Goal: Sequential decision-making problems
* Current decisions can affect future decisions.
Examples: games, robotics, self-driving, finance.
The agent takes an action <math>a_t</math> in the environment, which leads to a new state <math>s_{t+1}</math>.
;Definitions
* <math>S</math> is the state space. This can be discrete or continuous.
* <math>O</math> is the observation space.
* <math>A</math> is the action space, which can be discrete or continuous.
* <math>P</math> gives the transition probabilities, which model the environment.
Example:
* <math>|S| = 2</math>
* <math>|A| = 2</math>
In general, <math>S_t \sim P(S_t = s \mid s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1})</math>.
The conditioning variables form the trajectory <math>\tau_{t-1}</math>.
Modeling this can be very complex for large <math>t</math>.
Thus, we apply a Markov assumption:
* <math>S_t \sim P(S_t = s \mid s_{t-1}, a_{t-1})</math>
;Objective: Maximize the rewards
* <math>R_t \stackrel{\triangle}{=} R(s_t, a_t)</math>
Two regimes for rewards:
* Finite horizon: <math>T</math> is finite.
* Infinite horizon: <math>T \to \infty</math>. Here a discount factor <math>\gamma \in [0, 1)</math> is needed so that the total reward <math>\sum_t \gamma^t R_t</math> stays finite.
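As a quick illustrative sketch (not from the lecture; the reward values are made up), the discounted return for a finite reward sequence can be computed directly:

```python
# Discounted return: G = sum_t gamma^t * R_t
# Illustrative sketch; the reward values below are made up.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R_0, R_1, R_2, R_3

G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1 + 0.9^2 * 2 + 0.9^3 * 1 = 3.349
```

With <math>\gamma < 1</math>, the same sum remains bounded even as the horizon grows, which is why the infinite-horizon regime requires discounting.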
===Classical RL===
Finite Markov Decision Process (MDP):
* <math>S</math>: finite number of states
* <math>A</math>: finite number of actions
* <math>R_t = R(s_t, a_t)</math>
* <math>\gamma</math>: discount factor
Goal: Choose actions to maximize the total reward.
* The policy function <math>\pi</math> determines how actions are taken.
* The policy can yield a distribution over actions or can be deterministic. We will focus on deterministic policies.
* Assume the policy is time-invariant.
* The value function gives the expected reward if we start at state <math>s</math> and follow policy <math>\pi</math>.
* Q function (action-value function):
** <math>Q_\pi(s, a) = E[R_0 + \gamma R_1 + \cdots \mid A_0 = a, S_0 = s, \pi]</math>
* Given <math>\pi</math>, can we evaluate <math>V_{\pi}(s)</math>?
** <math>V_{\pi}(s) = E[R_0 + \gamma R_1 + \cdots \mid S_0 = s, \pi]</math>
* Use Bellman's equation:
** <math>V_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P_{\pi}(s' \mid s) V_{\pi}(s')</math>
** In matrix form: <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1} R</math>
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric, so its eigenvalues can be complex, but <math>\vert \lambda_i \vert \leq 1</math>.
Since <math>\gamma < 1</math>, every eigenvalue of <math>I - \gamma P_{\pi}</math> is nonzero, so the matrix is invertible.
This is computationally expensive since the matrix inversion scales as <math>|S|^3</math>.
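A minimal sketch of this closed-form policy evaluation, assuming a made-up 2-state MDP (a linear solve is used rather than an explicit inverse, but it solves the same system):

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * P_pi) V = R.
# The transition matrix and rewards below are made up for illustration.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])  # rows sum to 1 (stochastic matrix)
R = np.array([1.0, 0.0])       # reward in each state under pi

V = np.linalg.solve(np.eye(2) - gamma * P_pi, R)

# V satisfies the Bellman fixed-point equation V = R + gamma * P_pi @ V
assert np.allclose(V, R + gamma * P_pi @ V)
print(V)
```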
;Value Iteration (Bellman Recursion)
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
Define an operator <math>L_{\pi} v = R + \gamma P_{\pi} v</math>.
Then <math>V_{\pi} = L_{\pi} V_{\pi}</math>, so <math>V_{\pi}</math> is a fixed point of <math>L_{\pi}</math>.
;Claim: <math>L_{\pi}</math> is a contraction (with modulus <math>\gamma</math> in the sup norm), so iterating <math>v \leftarrow L_{\pi} v</math> from any starting point converges to <math>V_{\pi}</math>.
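The fixed-point iteration can be sketched as follows (the 2-state transition matrix and rewards are made up for illustration; the iterate converges to the same value the matrix-inversion formula gives):

```python
import numpy as np

# Iterative policy evaluation: repeatedly apply L_pi(v) = R + gamma * P_pi @ v.
# Transition matrix and rewards are made up for illustration.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
R = np.array([1.0, 0.0])

v = np.zeros(2)
for _ in range(500):
    v_next = R + gamma * P_pi @ v
    if np.max(np.abs(v_next - v)) < 1e-10:  # sup-norm convergence check
        v = v_next
        break
    v = v_next

# The iteration's fixed point matches the closed-form (I - gamma P_pi)^(-1) R.
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
assert np.allclose(v, V_exact, atol=1e-8)
print(v)
```

Each application of <math>L_{\pi}</math> shrinks the sup-norm error by a factor of <math>\gamma</math>, which is why the loop terminates after a modest number of iterations.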
==Misc==