==Reinforcement Learning==
Lecture 27 (December 1, 2020)
;Goal: Sequential decision-making problems
* Current decisions can affect future decisions
Examples: games, robotics, self-driving, finance
The agent takes an action <math>a_t</math> in the environment, which leads to a new state <math>s_{t+1}</math>. 
;Definitions
* <math>S</math> is the state space. This can be discrete or continuous.
* <math>O</math> are the observations.
* <math>A</math> is the set of actions which can be discrete or continuous.
* <math>P</math> are the transition probabilities which model the environment.
Example:
* <math>|S|</math> = 2
* <math>|A|</math> = 2
In general, <math>S_t \sim P(S_t=s | s_0, a_0, s_1, a_1, ..., a_{t-1})</math>. 
The conditioning variables form the trajectory <math>T_{t-1}</math>. 
Modeling this distribution can be very complex for long trajectories (large <math>t</math>). 
Thus, we apply a Markov assumption:
* <math>S_t \sim P(S_t=s | a_{t-1}, s_{t-1})</math>
;Objective: Maximize the total reward
* <math>R_t \stackrel{\triangle}{=} R(s_t, a_t)</math>
Two regimes for rewards:
* Finite horizon: <math>T</math> is finite
* Infinite horizon: <math>T \to \infty</math>
* Discount factor <math>\gamma \in (0, 1)</math>, which keeps the infinite-horizon sum <math>\sum_t \gamma^t R_t</math> finite
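The following Python sketch illustrates this setup on a toy MDP matching the <math>|S| = 2</math>, <math>|A| = 2</math> example above (the transition probabilities and rewards are hypothetical, chosen only for illustration): it rolls out a short trajectory under a uniformly random policy and accumulates the discounted reward.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, gamma = 2, 2, 0.9
# P[a, s, s'] = P(s_{t+1} = s' | s_t = s, a_t = a); each P[a] is row-stochastic.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],   # transitions under action 0
              [[0.5, 0.5],
               [0.6, 0.4]]])  # transitions under action 1
# R[s, a] is the reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def step(s, a):
    """Sample s_{t+1} ~ P(. | s_t = s, a_t = a) and return it with the reward R(s, a)."""
    s_next = rng.choice(n_states, p=P[a, s])
    return s_next, R[s, a]

# Roll out a short trajectory under a uniformly random policy and
# accumulate the discounted return sum_t gamma^t R_t.
s, total = 0, 0.0
for t in range(20):
    a = rng.integers(n_actions)
    s, r = step(s, a)
    total += gamma**t * r
print(total)
</syntaxhighlight>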
===Classical RL===
Finite Markov Decision Process (MDP)
* S: Finite number of states
* A: Finite number of actions
* <math>R_t = R(s_t, a_t)</math>
* <math>\gamma</math>: discount factor
Goal: Choose actions to maximize the total rewards. 
* Policy function <math>\pi</math> determines how actions should be taken.
* The policy function can yield a distribution over actions or be deterministic. We will focus on deterministic policies.
* Assume the policy is time-invariant.
* The value function <math>V_{\pi}(s)</math> gives the expected reward if we start at state <math>s</math> and follow policy <math>\pi</math>.
* Q function (action-value function)
** <math>Q_\pi (s, a) = E[R_0 + \gamma R_1+... | A_0=a, S_0=s, \pi]</math>
* If we have <math>\pi</math>, can we evaluate <math>V_{\pi}(s)</math>?
** <math>V_{\pi}(s) = E[R_0 + \gamma R_1 + ... | S_0 = s, \pi]</math>
* Use Bellman's equation
** <math>V_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P_{\pi}(s' | s) V_{\pi}(s')</math>
** <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1}R</math>
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric, so its eigenvalues can be complex, but they satisfy <math>|\lambda_i| \leq 1</math>.
Since <math>\gamma < 1</math>, every eigenvalue of <math>I - \gamma P_{\pi}</math> is nonzero, so <math>I - \gamma P_{\pi}</math> is invertible. 
Solving this linear system directly is computationally expensive since it scales as <math>O(|S|^3)</math>.
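As a sketch, the closed-form policy evaluation on a hypothetical 2-state MDP under some fixed deterministic policy (the numbers are illustrative) looks like:
<syntaxhighlight lang="python">
import numpy as np

gamma = 0.9
# Hypothetical transition matrix and reward vector under a fixed policy pi.
P_pi = np.array([[0.9, 0.1],   # row-stochastic: row s is P(s' | s) under pi
                 [0.6, 0.4]])
R_pi = np.array([1.0, 2.0])    # R_pi[s] = R(s, pi(s))

# Solve the linear system (I - gamma P_pi) V = R_pi instead of forming the inverse.
# Either way the cost is O(|S|^3), which is why this does not scale to large state spaces.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V_pi)  # expected discounted reward from each start state
</syntaxhighlight>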
;Value-Iteration (Bellman Recursion)
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math> 
Define an operator <math>L_{\pi}v = R + \gamma P_{\pi}v</math>. 
<math>v_{\pi} = L_{\pi} v_{\pi}</math>, so <math>v_{\pi}</math> is a fixed point of <math>L_{\pi}</math>. 
;Claim: <math>L_{\pi}</math> is a contraction
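(The claim holds because <math>P_{\pi}</math> is row-stochastic: <math>\Vert L_{\pi}v - L_{\pi}w \Vert_\infty = \gamma \Vert P_{\pi}(v - w) \Vert_\infty \leq \gamma \Vert v - w \Vert_\infty</math>.) Below is a sketch of the resulting fixed-point iteration on the same hypothetical 2-state MDP as above; it converges to the same <math>v_{\pi}</math> as the direct solve, with only <math>O(|S|^2)</math> work per iteration.
<syntaxhighlight lang="python">
import numpy as np

gamma = 0.9
P_pi = np.array([[0.9, 0.1],   # same hypothetical MDP as in the closed-form sketch
                 [0.6, 0.4]])
R_pi = np.array([1.0, 2.0])

v = np.zeros(2)
for _ in range(500):
    v_new = R_pi + gamma * P_pi @ v        # one application of the operator L_pi
    if np.max(np.abs(v_new - v)) < 1e-10:  # sup-norm stopping criterion
        v = v_new
        break
    v = v_new

# Because L_pi is a gamma-contraction in the sup norm, v approaches the
# exact fixed point (I - gamma P_pi)^{-1} R_pi.
print(v)
</syntaxhighlight>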


==Misc==