CS234 RL Winter

Lecture 1

Intro

  • $r$: reward function; can be a function of $r(s)$ or of $r(s, a)$
  • $s$: state
  • $a$: action
  • $h$: history
  • $\gamma$: discount factor
    • Mathematically convenient. $\gamma = 0$: only care about the immediate reward
  • $\pi^*$: optimal policy
  • $V^\pi_k(s)$: state value function at step $k$ under policy $\pi$, starting from state $s$
  • $Q^\pi(s, a)$: state-action value. Take action $a$, then follow the policy $\pi$
  • Model: how the world changes given $s_t$ and $a_t$; can be stochastic or deterministic
  • Model-free: you don't know how the world changes, e.g. a two-player game
  • Policy: how you act given a state $s$
  • Value: the expected discounted sum of future rewards under a policy
  • Exploration: try new things that might be better in the future
  • Exploitation: choose actions that are expected to yield good reward given past experience
  • Horizon: the number of actions you can take before termination; can be finite or infinite
  • Episode: a series of actions from start to termination
  • $G_t$: discounted sum of rewards, $G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$
  • $V(s)$: state value function, the expected return from starting in state $s$: $V(s) = \mathbb{E}[G_t \mid s_t = s]$
  • $O$: contraction operator, $\|OV - OV'\| \leq \|V - V'\|$
Lecture 2

Markov (Decision / reward) process

  • Bandit: a single-state MDP; actions do not affect which state comes next, so only the immediate reward matters

State $s_t$ is Markov iff
$p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$

Markov Chain: each state has a fixed distribution over the next state
Markov Reward process

  • Markov chain + reward
  • The MRP value function satisfies the Bellman equation
    $V(s) = \underbrace{R(s)}_{\text{immediate reward}} + \underbrace{\gamma \sum_{s' \in S} p(s' \mid s)\, V(s')}_{\text{discounted sum of future rewards}}$

Iterative algorithm for computing the value of an MRP

Initialize $V_0(s) = 0$ for all $s$. For $k = 1, 2, \dots$ until convergence, for all $s \in S$:
$V_k(s) = R(s) + \gamma \sum_{s' \in S} p(s' \mid s)\, V_{k-1}(s')$
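
A minimal sketch of this iteration in Python (the array names, shapes, and the tiny example numbers are my own assumptions, not from the lecture):

```python
import numpy as np

def mrp_value(R, P, gamma=0.9, tol=1e-8):
    """Iteratively compute V for an MRP.

    R: (S,) immediate reward per state; P: (S, S) transition matrix,
    P[s, s'] = p(s' | s). Layout is an illustrative assumption.
    """
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V            # Bellman backup for an MRP
        if np.max(np.abs(V_new - V)) < tol:  # stop when the update barely changes V
            return V_new
        V = V_new

# Tiny 2-state example (made-up numbers)
R = np.array([1.0, 0.0])
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
print(mrp_value(R, P))
```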

Markov Decision Process

MRP + actions
$P$ is the transition model for each action: given a state and an action, $P(s_{t+1} = s' \mid s_t = s, a_t = a)$.
The next state is usually not deterministic given a state and an action.


Quiz

Suppose you have 7 discrete states and 2 actions; how many deterministic policies are there?

2^7

Is the optimal policy for an MDP always unique?

No; multiple policies can achieve the same optimal value (ties are possible).

Policy Search

$A$: number of actions, $S$: number of states; brute-force policy search tries all $A^S$ deterministic policies.
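
A tiny sketch (with made-up sizes) of why the count is $A^S$: a deterministic policy assigns one action to each state, so enumerating policies is a Cartesian product:

```python
from itertools import product

n_states, n_actions = 3, 2           # small on purpose; 7 states * 2 actions -> 2**7 policies
policies = list(product(range(n_actions), repeat=n_states))
print(len(policies))                 # n_actions ** n_states = 8
print(policies[:3])                  # each tuple maps state index -> action
```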

Policy Iteration

  • Set $i = 0$
  • Initialize $\pi_0(s)$ randomly for all states $s$
  • While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ (L1 norm: measures whether the policy
    changed for any state):
    $V^{\pi_i}$ ← MDP value-function policy evaluation of $\pi_i$
    $\pi_{i+1}$ ← policy improvement
    $i = i + 1$
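
A compact sketch of the loop above, assuming tabular arrays R[s, a] and P[a, s, s'] (that array layout and all names are my own convention, not the lecture's code):

```python
import numpy as np

def policy_iteration(R, P, gamma=0.9, tol=1e-8):
    """R: (S, A) rewards, P: (A, S, S) transitions with P[a, s, s'] = p(s'|s,a)."""
    S, A = R.shape
    pi = np.zeros(S, dtype=int)                # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: iterate V(s) = R(s, pi(s)) + gamma * sum_s' p(s'|s,pi(s)) V(s')
        V = np.zeros(S)
        while True:
            V_new = R[np.arange(S), pi] + gamma * P[pi, np.arange(S)] @ V
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Policy improvement: act greedily with respect to Q^pi
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):          # policy unchanged for every state -> done
            return pi, V
        pi = pi_new
```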

Q value

State-action value of a policy
$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s')$
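
A one-line sketch of this formula under the same assumed array layout as the policy iteration sketch above (R of shape (S, A), P of shape (A, S, S)):

```python
import numpy as np

def q_from_v(R, P, V, gamma=0.9):
    """Q[s, a] = R[s, a] + gamma * sum_s' P[a, s, s'] * V[s']."""
    return R + gamma * np.einsum('ast,t->sa', P, V)
```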

Policy Improvement

Compute the new policy $\pi_{i+1}$ for all $s \in S$:
$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a) \quad \forall s \in S$
Note: policy improvement updates the policy at every state, not just one state.

Policy Iteration quiz

(Quiz figure from the lecture slides; not reproduced here.)

Bellman backup operator

$BV(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \Big]$

  • $BV$ yields a new value function over all states $s$

Value Iteration

Bellman Backup is a contraction operator

Intuition: for two value functions $V(s)$ and $V'(s)$, if both are backed up using the best action, the gap between them naturally shrinks (by a factor of at most $\gamma$).
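
A minimal sketch of value iteration under the same assumed R[s, a] / P[a, s, s'] layout as above; it repeatedly applies the Bellman backup $B$ until $V$ stops changing:

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-8):
    """R: (S, A), P: (A, S, S) with P[a, s, s'] = p(s'|s,a)."""
    S, A = R.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # one-step lookahead for every (s, a)
        V_new = Q.max(axis=1)                          # Bellman backup: BV(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)             # also return the greedy policy
        V = V_new
```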

Policy Evaluation with DP

$V^\pi(s) \approx \mathbb{E}_\pi\big[r_t + \gamma V_{k-1}(s_{t+1}) \mid s_t = s\big]$

Lecture 3

Monte Carlo policy evaluation

No model

  • The outcome of taking $(s, a)$ need not be deterministic
  • Does not assume the state is Markov
  • Requires episodes to terminate
  • If trajectories are all finite, sample a set of trajectories and average the returns

First-visit Monte Carlo vs Every-visit Monte Carlo policy evaluation

First-visit: only use the return from the first time the state is reached in each episode; an unbiased estimator.
Every-visit: use the return from every visit to the state; a biased (but consistent) estimator.

First-visit MC estimate of $V^\pi(s)$: after each episode, find the first time step $t$ at which $s$ was visited, compute the return $G_t$ from that point on, and average these returns across episodes; this gives the estimate of $V(s)$.
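
A sketch of first-visit MC evaluation of $V^\pi$ from logged episodes; the episode format (a list of (state, reward) pairs collected under $\pi$) is an assumption of this sketch:

```python
from collections import defaultdict

def first_visit_mc_v(episodes, gamma=0.9):
    """episodes: list of trajectories, each a list of (state, reward) pairs under pi."""
    returns = defaultdict(list)
    for episode in episodes:
        # Discounted return from each time step, computed backwards
        G, G_t = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            G_t[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                 # first visit to s in this episode only
                seen.add(s)
                returns[s].append(G_t[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```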

Temporal Difference Learning for Estimating V (TD learning)

Update immediately after each step rather than waiting until the end of the episode:
$V(s_t) \leftarrow V(s_t) + \alpha \big(\underbrace{r_t + \gamma V(s_{t+1})}_{\text{TD target}} - V(s_t)\big)$
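
A sketch of the TD(0) update applied to a stream of transitions; the table layout and the made-up transitions are illustrative assumptions:

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(s) toward the TD target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = defaultdict(float)
for s, r, s_next in [(0, 1.0, 1), (1, 0.0, 2), (2, 5.0, 0)]:   # made-up transitions
    td0_update(V, s, r, s_next)
print(dict(V))
```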

Lecture 4

Model-free control

  • Model is unknown but can be sampled
  • Model is known but computationally infeasible to compute with directly

On/Off policy

  • On-policy
    • Learn from experience gathered by following that same policy
  • Off-policy
    • Learn from experience gathered by following a different policy

MC for on policy Q

Same as MC policy evaluation for $V$, but average the returns observed after each $(s, a)$ pair to estimate $Q^\pi(s, a)$.
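
On-policy MC control then improves the policy $\epsilon$-greedily with respect to the estimated Q. A minimal $\epsilon$-greedy helper (the function name and signature are my own), reused by the SARSA and Q-learning sketches below:

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps, rng=np.random.default_rng()):
    """Pick argmax_a Q[s, a] with prob. 1 - eps, otherwise a uniformly random action."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))
```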

SARSA algorithm (TD)

  1. Set the initial $\epsilon$-greedy policy $\pi$ randomly, $t = 0$, initial state $s_t = s_0$
  2. Take $a_t \sim \pi(s_t)$ // sample from the policy
  3. Observe $(r_t, s_{t+1})$
  4. Loop
    1. Take action $a_{t+1} \sim \pi(s_{t+1})$
    2. Observe $(r_{t+1}, s_{t+2})$
    3. Update Q given $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$
      1. $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)$
    4. $\pi(s_t) = \arg\max_a Q(s_t, a)$ with probability $1 - \epsilon$, else a random action
    5. $t = t + 1$
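
A tabular SARSA sketch against a generic environment with env.reset() and env.step(a) returning (next_state, reward, done); that interface, the hyperparameters, and the epsilon_greedy helper above are assumptions of this sketch, not the lecture's code:

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, eps, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, eps, rng)
            # On-policy target: uses the action actually chosen at s_next
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] * (not done) - Q[s, a])
            s, a = s_next, a_next
    return Q
```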

Q learning

Motivation: sometimes you want to take actions that look bad in the early stages in order to gain more later.

GLIE

GLIE (Greedy in the Limit with Infinite Exploration) cannot always be satisfied in practice. For example, in the helicopter domain: if you crash the helicopter, you cannot go back and make another decision.

Q-learning

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big)$
Note: here the target takes the best over all actions $a'$ at $s_{t+1}$, whereas SARSA uses the action chosen by the current policy.
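
A tabular Q-learning sketch under the same assumed env interface and epsilon_greedy helper as the SARSA sketch; the only change is the max over next actions in the target:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, n_actions, eps, rng)          # behavior: eps-greedy
            s_next, r, done = env.step(a)
            # Off-policy target: max over all actions at s_next, not the action taken
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() * (not done) - Q[s, a])
            s = s_next
    return Q
```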
