Intro
- $r$: reward function, can be a function of $r(s)$ or $r(s,a)$
- $s$: state
- $a$: action
- $h$: history
- $\gamma$: discount factor
- mathematically convenient. $\gamma = 0$: only care about immediate reward
- $\pi^*$: optimal policy
- $V^{\pi}_k(s)$: state value function at step $k$ under policy $\pi$, starting in state $s$
- $Q^\pi(s,a)$: state-action value. Take action $a$, then follow the policy $\pi$
- model: how the world changes given $s_t$ and $a_t$; can be stochastic or deterministic
- model-free: you don't know how the world changes, e.g. a two-player game
- policy: how you act given a state $s$
- value: expected discounted sum of future rewards from a state under a policy
- exploration: try new things that might be better in the future
- exploitation: choose actions that are expected to yield good reward given past experience
- Horizon: the number of actions you can take to reach termination, could be finite or infinite
- Episode: series of actions from start to end
- $G_t$: discounted sum of rewards, $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
- $V(s)$: state value function, the expected return from starting in state $s$: $V(s) = \mathbb{E}[G_t \mid s_t = s]$
- $O$: contraction operator. $|OV - OV'| \leq |V - V'|$
Lecture 2
Markov (Decision / reward) process
- Bandit: a single-state MDP; you only care about the immediate reward
State $s_t$ is Markov iff $p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$
Markov Chain: each state has a fixed distribution over the next state
Markov Reward process
- Markov chain + reward
- In an $n$-step episode, the MRP value function satisfies
$V(s) = \underbrace{R(s)}_{\text{Immediate reward}} + \underbrace{\gamma \sum_{s' \in S} p(s' \mid s) V(s')}_{\text{Discounted sum of future rewards}}$
Iterative algorithm for computing the value of an MRP
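A minimal sketch of this iterative computation, assuming the MRP is given as a NumPy transition matrix `P` (rows sum to 1) and a reward vector `R`; these names are illustrative, not from the lecture:

```python
import numpy as np

def mrp_value(P, R, gamma, tol=1e-8):
    """Iteratively apply V <- R + gamma * P V until the values stop changing."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V              # Bellman equation for an MRP
        if np.max(np.abs(V_new - V)) < tol:    # converges because the backup is a contraction for gamma < 1
            return V_new
        V = V_new
```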
Markov Decision Process
MRP + actions
P is the transition model for each action, so given a state and an action,
$P(s_{t+1} = s' \mid s_t = s, a_t = a)$
The next state is usually not deterministic given a state and an action.
Quiz
Suppose you have 7 discrete states and 2 actions, how many deterministic policies are there?
2^7
Is the optimal policy for an MDP always unique?
no
Policy Search
$A$: number of actions, $S$: number of states; try all $A^S$ possibilities
Policy Iteration
- Set $i = 0$
- Initialize $\pi_0(s)$ randomly for all states $s$
- While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ (L1-norm, measures whether the policy changed for any state):
  - $V^{\pi_i} \leftarrow$ MDP value function policy evaluation of $\pi_i$
  - $\pi_{i+1} \leftarrow$ policy improvement
  - $i = i + 1$
Q value
State-action value of a policy
$Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a) V^\pi(s')$
Policy Improvement
Compute the new policy $\pi_{i+1}$, for all $s \in S$:
$\pi_{i+1}(s) = \underset{a}{\arg\max}~ Q^{\pi_i}(s,a) \quad \forall s \in S$
note: so this updates the policy for all states, not just one state?
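Putting the evaluation step, the Q computation, and the greedy improvement together, here is a rough tabular policy-iteration sketch. The per-action transition tensor `P[a]` and reward table `R[s, a]` are hypothetical names for illustration, not from the notes:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma, tol=1e-8):
    """Iterative evaluation of V^pi for a deterministic policy (array of action indices)."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # V(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')
        V_new = np.array([R[s, policy[s]] + gamma * P[policy[s]][s] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, policy, gamma)
        # Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V^pi(s')
        Q = np.array([[R[s, a] + gamma * P[a][s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        new_policy = Q.argmax(axis=1)           # greedy improvement for every state
        if np.array_equal(new_policy, policy):  # policy unchanged -> stop
            return policy, V
        policy = new_policy
```

Note that the improvement step is indeed applied to every state at once, which is what the argmax over all $s \in S$ in the formula above expresses.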
Policy Iteration quiz
Bellman backup operator
$BV(s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} p(s' \mid s, a) V(s') \right]$
- $BV$ yields a value function over all states $s$
Value Iteration
Bellman Backup is a contraction operator
For $V(s)$ and $V'(s)$: if both currently take the optimal action $a$, then the gap between them naturally shrinks.
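A corresponding value-iteration sketch, repeatedly applying the Bellman backup $B$ above (same hypothetical `P[a]`, `R[s, a]` layout as before):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply BV(s) = max_a [R(s,a) + gamma * sum_s' p(s'|s,a) V(s')]."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = np.array([[R[s, a] + gamma * P[a][s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = Q.max(axis=1)                   # Bellman backup over all states
        if np.max(np.abs(V_new - V)) < tol:     # the backup is a gamma-contraction
            return V_new, Q.argmax(axis=1)      # values and the greedy policy
        V = V_new
```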
Policy Evaluation with DP
$V^\pi(s) \approx \mathbb{E}_\pi [r_t + \gamma V_{k-1}(s_{t+1}) \mid s_t = s]$
Lecture 3
Monte Carlo policy evaluation
No model
- the transition from $(s,a)$ need not be deterministic
- doesn't assume the state is Markov
- requires episodes to terminate
- If trajectories are all finite, sample set of trajectories & average returns
First-visit Monte Carlo vs every-visit Monte Carlo policy evaluation
first-visit: only count the return from the first time you reach the state in an episode
every-visit: account for every visit to the state; a biased estimator
This is how we estimate $V(s)$.
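A first-visit Monte Carlo sketch, assuming episodes have already been collected as lists of `(state, reward)` pairs (this input format is made up for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Estimate V(s) by averaging returns from the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:                  # episode: list of (state, reward) pairs
        # Compute returns backwards: G_t = r_t + gamma * G_{t+1}
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # Record the return only at the first visit to each state
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns[t]
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Every-visit MC would drop the `seen` check and average over every occurrence of the state, which is the biased variant mentioned above.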
Temporal Difference Learning for Estimating V (TD learning)
Update immediately rather than waiting until the end of the episode.
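A TD(0) sketch of this incremental update, $V(s_t) \leftarrow V(s_t) + \alpha (r_t + \gamma V(s_{t+1}) - V(s_t))$, applied after every step; the function below assumes a tabular value estimate stored in a dict or array `V`:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s_next)."""
    td_target = r + gamma * V[s_next]
    V[s] = V[s] + alpha * (td_target - V[s])
    return V
```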
Lecture 4
Model-free control
- Model is unknown but can be sampled
- Model is known but computationally infeasible to use directly
On/Off policy
- On policy
- Learn from following that policy
- Off policy
- learn from following a different policy
MC for on-policy Q
SARSA algorithm (TD)
- Set an initial $\epsilon$-greedy policy $\pi$ randomly, $t = 0$, initial state $s_t = s_0$
- Take $a_t \sim \pi(s_t)$ // sample from the policy
- Observe $(r_t, s_{t+1})$
- loop
  - Take action $a_{t+1} \sim \pi(s_{t+1})$
  - Observe $(r_{t+1}, s_{t+2})$
  - Update Q given $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$:
    - $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$
  - $\pi(s_t) = \underset{a}{\arg\max}\, Q(s_t, a)$ with probability $1 - \epsilon$, else random
  - $t = t + 1$
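A sketch of the SARSA loop above, assuming a tabular environment exposing `env.reset()` and `env.step(a)` returning `(next_state, reward, done)`; this interface is hypothetical, not from the notes:

```python
import numpy as np

def sarsa(env, n_states, n_actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=1000):
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        # With probability 1 - epsilon pick argmax_a Q(s, a), else a random action.
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(Q[s].argmax())

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # SARSA target uses the action the current policy actually takes next.
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next
    return Q
```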
Q learning
Useful if you want to take bad actions in the early stages and gain more later.
GLIE (Greedy in the Limit with Infinite Exploration)
You can't always satisfy GLIE, for example in the helicopter case: if you break the helicopter, you can't go back and make another decision.
Q-learning
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t))$
Note: here $a'$ is the best over all actions, whereas SARSA uses the action chosen by the current policy.
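For comparison, a Q-learning sketch under the same hypothetical `env` interface as the SARSA sketch; the only change is that the target bootstraps from $\max_{a'} Q(s_{t+1}, a')$ rather than from the action the behavior policy takes next:

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma, alpha=0.1, epsilon=0.1, n_episodes=1000):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy (off-policy: the update target ignores it)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q-learning target: bootstrap from the best action in the next state.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s, a])
            s = s_next
    return Q
```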