==Reinforcement Learning==
Lecture 27 (December 1, 2020)
;Goal: Sequential decision-making problems
* Current decisions can affect future decisions.
Examples: games, robotics, self-driving, finance.
The agent takes an action <math>a_t</math> in the environment, which leads to a new state <math>s_{t+1}</math>.
;Definitions
* <math>S</math> is the state space. This can be discrete or continuous.
* <math>O</math> is the observation space.
* <math>A</math> is the action space, which can be discrete or continuous.
* <math>P</math> gives the transition probabilities, which model the environment.
Example:
* <math>|S| = 2</math>
* <math>|A| = 2</math>
In general, <math>S_t \sim P(S_t = s \mid s_0, a_0, s_1, a_1, \ldots, s_{t-1}, a_{t-1})</math>.
The conditioning variables form the trajectory <math>\tau_{t-1}</math>.
Modeling this can be very complex for large <math>t</math>.
Thus, we apply a Markov assumption:
* <math>S_t \sim P(S_t = s \mid s_{t-1}, a_{t-1})</math>
;Objective: Maximize the rewards
* <math>R_t \stackrel{\triangle}{=} R(s_t, a_t)</math>
Two regimes for rewards:
* Finite horizon: <math>T</math> is finite.
* Infinite horizon: <math>T \to \infty</math>. Here a discount factor <math>\gamma \in [0, 1)</math> is needed so that the total reward <math>\sum_t \gamma^t R_t</math> stays finite.
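As a quick illustrative sketch (not from the lecture; the reward values are made up), the discounted return for a finite reward sequence can be computed directly:

```python
# Discounted return: G = sum_t gamma^t * R_t
# Illustrative sketch; the reward values below are made up.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R_0, R_1, R_2, R_3

G = sum(gamma**t * r for t, r in enumerate(rewards))
print(G)  # 1 + 0.9^2 * 2 + 0.9^3 * 1 = 3.349
```

With <math>\gamma < 1</math>, the same sum remains bounded even as the horizon grows, which is why the infinite-horizon regime requires discounting.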
===Classical RL===
Finite Markov Decision Process (MDP):
* <math>S</math>: finite number of states
* <math>A</math>: finite number of actions
* <math>R_t = R(s_t, a_t)</math>
* <math>\gamma</math>: discount factor
Goal: Choose actions to maximize the total reward.
* The policy function <math>\pi</math> determines how actions are taken.
* The policy can yield a distribution over actions or can be deterministic. We will focus on deterministic policies.
* Assume the policy is time-invariant.
* The value function gives the expected reward if we start at state <math>s</math> and follow policy <math>\pi</math>.
* Q function (action-value function):
** <math>Q_\pi(s, a) = E[R_0 + \gamma R_1 + \cdots \mid A_0 = a, S_0 = s, \pi]</math>
* Given <math>\pi</math>, can we evaluate <math>V_{\pi}(s)</math>?
** <math>V_{\pi}(s) = E[R_0 + \gamma R_1 + \cdots \mid S_0 = s, \pi]</math>
* Use Bellman's equation:
** <math>V_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P_{\pi}(s' \mid s) V_{\pi}(s')</math>
** In matrix form: <math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
** <math>V_{\pi} = (I - \gamma P_{\pi})^{-1} R</math>
** <math>P_{\pi}</math> is a stochastic matrix. It is not symmetric, so its eigenvalues can be complex, but <math>\vert \lambda_i \vert \leq 1</math>.
Since <math>\gamma < 1</math>, every eigenvalue of <math>I - \gamma P_{\pi}</math> is nonzero, so the matrix is invertible.
This is computationally expensive since the matrix inversion scales as <math>|S|^3</math>.
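A minimal sketch of this closed-form policy evaluation, assuming a made-up 2-state MDP (a linear solve is used rather than an explicit inverse, but it solves the same system):

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * P_pi) V = R.
# The transition matrix and rewards below are made up for illustration.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])  # rows sum to 1 (stochastic matrix)
R = np.array([1.0, 0.0])       # reward in each state under pi

V = np.linalg.solve(np.eye(2) - gamma * P_pi, R)

# V satisfies the Bellman fixed-point equation V = R + gamma * P_pi @ V
assert np.allclose(V, R + gamma * P_pi @ V)
print(V)
```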
;Value Iteration (Bellman Recursion)
<math>V_{\pi} = R + \gamma P_{\pi} V_{\pi}</math>
Define an operator <math>L_{\pi} v = R + \gamma P_{\pi} v</math>.
Then <math>V_{\pi} = L_{\pi} V_{\pi}</math>, so <math>V_{\pi}</math> is a fixed point of <math>L_{\pi}</math>.
;Claim: <math>L_{\pi}</math> is a contraction (with modulus <math>\gamma</math> in the sup norm), so iterating <math>v \leftarrow L_{\pi} v</math> from any starting point converges to <math>V_{\pi}</math>.
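The fixed-point iteration can be sketched as follows (the 2-state transition matrix and rewards are made up for illustration; the iterate converges to the same value the matrix-inversion formula gives):

```python
import numpy as np

# Iterative policy evaluation: repeatedly apply L_pi(v) = R + gamma * P_pi @ v.
# Transition matrix and rewards are made up for illustration.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
R = np.array([1.0, 0.0])

v = np.zeros(2)
for _ in range(500):
    v_next = R + gamma * P_pi @ v
    if np.max(np.abs(v_next - v)) < 1e-10:  # sup-norm convergence check
        v = v_next
        break
    v = v_next

# The iteration's fixed point matches the closed-form (I - gamma P_pi)^(-1) R.
V_exact = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
assert np.allclose(v, V_exact, atol=1e-8)
print(v)
```

Each application of <math>L_{\pi}</math> shrinks the sup-norm error by a factor of <math>\gamma</math>, which is why the loop terminates after a modest number of iterations.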
==Misc==