Lect5_Model_free_Control

Model-Free Control

Optimise the value function of an unknown MDP

On-Policy Monte-Carlo Control

Generalised Policy Iteration

[Figure: Generalised Policy Iteration]

Monte-Carlo Policy Iteration

ONE

  • Policy evaluation: Monte-Carlo policy evaluation, $V = v_\pi$? or $Q = q_\pi$?

  • Policy improvement: Greedy policy improvement?

    1. Greedy policy improvement over $V(s)$ requires a model of the MDP (model-based):
      $\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\left(\mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a V(s')\right)$

    2. Greedy policy improvement over $Q(s,a)$ is model-free:
      $\pi'(s) = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\, Q(s,a)$

so: ⇓

TWO

  • Policy evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$

  • Policy improvement: Greedy policy improvement?
    At initialisation the value function typically starts from the same value everywhere. If one action happens to look good early on, a purely greedy policy keeps choosing it and never finds out how good the other actions are. This is the exploration problem.
    $\epsilon$-Greedy Exploration: all $m$ actions are tried with non-zero probability. With probability $1-\epsilon$ choose the greedy action; with probability $\epsilon$ choose an action at random (a minimal code sketch follows the proof discussion below).
    $\pi(a \mid s) = \begin{cases} \epsilon/m + 1-\epsilon &\text{if}\ a = \underset{a' \in \mathcal{A}}{\operatorname{argmax}}\, Q(s,a') \\ \epsilon/m &\text{otherwise} \end{cases}$

    Theorem
    For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement: $v_{\pi'}(s) \geq v_\pi(s)$.

    Proof:
    $$\begin{aligned} q_\pi(s,\pi'(s)) &= \sum_{a \in \mathcal{A}} \pi'(a\mid s)\, q_\pi(s,a) \\ &= \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + (1-\epsilon)\max_{a \in \mathcal{A}} q_\pi(s,a) \\ &= \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + (1-\epsilon) \max_{a \in \mathcal{A}} q_\pi(s,a) \sum_{a\in \mathcal{A}}\frac{\pi(a \mid s) - \frac{\epsilon}{m}}{1-\epsilon} \\ &\geq \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s,a) + (1-\epsilon)\sum_{a\in \mathcal{A}}\frac{\pi(a \mid s) - \frac{\epsilon}{m}}{1-\epsilon}\, q_\pi(s,a) \\ &= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s,a) = v_\pi(s) \end{aligned}$$
    Explanation of the third and fourth lines: in the third line, by the definition of the $\epsilon$-greedy policy the second summation evaluates to 1, so it is an equality. In the fourth line the terms $\frac{\pi(a \mid s) - \epsilon/m}{1-\epsilon}$ act as non-negative weights on $q_\pi(s,a)$ that sum to 1, and a weighted average of $q_\pi(s,a)$ can never exceed its maximum, hence the inequality.

    A possible misconception: one might think that $\frac{\pi(a \mid s) - \epsilon/m}{1-\epsilon} = 1$ for the action where $q_\pi(s,a)$ is maximal and $0$ for every other action, which would make the fourth line an equality. Thinking that way amounts to saying $\pi$ and $\pi'$ are identical. Note that the action selected in $\max_{a \in \mathcal{A}} q_\pi(s,a)$ is the one $\pi'$ picks according to $q_\pi(s,a)$, and it is not necessarily the same action for which $\pi(a \mid s) = \epsilon/m + 1-\epsilon$ under the old policy; only when the two coincide does the inequality become an equality.
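
As referenced above, here is a minimal Python sketch of $\epsilon$-greedy action selection over a tabular Q. The dictionary-keyed Q, the action list, and the function name are assumptions made for this illustration, not something prescribed by the lecture.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick an action ε-greedily w.r.t. Q(state, ·), where Q maps
    (state, action) pairs to value estimates.

    The greedy action ends up with probability (1 - ε) + ε/m and every
    other action with probability ε/m, matching π(a|s) above.
    """
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: uniform over all m actions
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: greedy w.r.t. current Q
```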

so: ⇓

THREE

[Figure: Monte-Carlo policy iteration with ε-greedy improvement]

  • Policy evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$
  • Policy improvement: $\epsilon$-greedy policy improvement
Pseudocode

[Figure: pseudocode for Monte-Carlo control with ε-greedy improvement]

Monte-Carlo Control

[Figure: Monte-Carlo control diagram]

Every episode

  • Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$
  • Policy improvement: $\epsilon$-greedy policy improvement

There is no need to wait for many episodes before estimating Q: update Q and improve the policy right after every single episode.

GLIE Monte-Carlo Control

Definition of GLIE:

Greedy in the Limit with Infinite Exploration (GLIE)

  • All state-action pairs are explored infinitely many times
    $\lim_{k\to \infty} N_k(s,a) = \infty$

  • The policy converges on a greedy policy
    $\lim_{k\to \infty} \pi_k(a\mid s) = 1\left(a = \underset{a' \in \mathcal{A}}{\operatorname{arg\,max}}\, Q_k(s,a') \right)$

Algorithm:

  • Sample the k-th episode using $\pi$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$

  • For each state $S_t$ and action $A_t$ in the episode
    $$\begin{aligned} N(S_t,A_t) &\leftarrow N(S_t, A_t) + 1 \\ Q(S_t,A_t) &\leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)}\left(G_t - Q(S_t,A_t)\right) \end{aligned}$$

  • Improve the policy based on the new action-value function
    $$\begin{aligned} \epsilon &\leftarrow 1/k \qquad \text{(gradually increasing the probability of picking the Q-maximising action)} \\ \pi &\leftarrow \epsilon\text{-greedy}(Q) \end{aligned}$$
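
A minimal sketch of this GLIE Monte-Carlo control loop is given below. It assumes a hypothetical episodic environment with a `reset()`/`step(action)` interface returning `(next_state, reward, done)` and a finite action list; every-visit updates are used for brevity.

```python
from collections import defaultdict
import random

def glie_mc_control(env, actions, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control sketch: MC evaluation of Q plus
    ε-greedy improvement with the schedule ε = 1/k after episode k."""
    Q = defaultdict(float)   # Q(s, a) estimates
    N = defaultdict(int)     # visit counts N(s, a)

    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                            # GLIE: ε_k → 0 as k → ∞
        # Sample one episode with the current ε-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Every-visit incremental MC update, computing G_t backwards.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return Q
```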

On-Policy Temporal-Difference Learning

Natural idea: use TD instead of MC in the control loop

  • Apply TD to $Q(S,A)$
  • Use $\epsilon$-greedy policy improvement
  • Update every time-step

Update Action-Value Functions with Sarsa

[Figure: Sarsa backup diagram and update of the action-value function]

On-Policy Control With Sarsa

[Figure: on-policy control with Sarsa diagram]

Every time-step:

  • Policy evaluation: Sarsa, $Q \approx q_\pi$
  • Policy improvement: $\epsilon$-greedy policy improvement

Sarsa Algorithm:

[Figure: Sarsa algorithm pseudocode]
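
For reference, the Sarsa loop in the figure might be sketched in Python roughly as follows, again against a hypothetical `reset()`/`step()` environment and with fixed α and ε; this is an illustrative outline, not the lecture's exact pseudocode.

```python
from collections import defaultdict
import random

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa sketch: on every time-step, update Q(S, A) towards
    R + γ Q(S', A'), where A' is the action actually taken next."""
    Q = defaultdict(float)

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state)
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```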

Sarsa($\lambda$)

n-step Sarsa

  • Consider the following n-step returns for $n = 1, 2, \dots, \infty$:

$$\begin{aligned} n=1\ \text{(Sarsa)} \quad q_t^{(1)} &= R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \\ n=2 \qquad\qquad\ \ q_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2}) \\ \vdots \\ n=\infty\ \text{(MC)} \quad q_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T \end{aligned}$$

  • Define the n-step Q-return
    $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1}R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$

  • n-step Sarsa updates Q(s,a) towards the n-step Q-return

$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left(q_t^{(n)} - Q(S_t,A_t) \right)$
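
As a quick illustration of the two formulas above, the n-step Q-return can be computed from a stored episode like this; the list layout (`rewards[t]` holding $R_{t+1}$, `states[t]`/`actions[t]` holding $S_t$/$A_t$) is an assumption made for the example.

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma=1.0):
    """q_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n-1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n}).

    rewards[k] = R_{k+1}, states[k] = S_k, actions[k] = A_k for k = 0..T-1.
    If t + n reaches the end of the episode, the bootstrap term is dropped
    and this becomes the Monte-Carlo return.
    """
    T = len(rewards)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                                  # bootstrap from Q at step t+n
        G += discount * Q[(states[t + n], actions[t + n])]
    return G
```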

Forward View Sarsa($\lambda$)
  • Combines all n-step Q-returns $q_t^{(n)}$

  • Using weight $(1-\lambda)\lambda^{n-1}$
    $q_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}q_t^{(n)}$

  • Forward-view Sarsa($\lambda$)
    $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left(q_t^\lambda - Q(S_t,A_t) \right)$
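
Combining these weights with the n-step returns, the forward-view λ-return for a finished episode can be computed as below, building on the hypothetical `n_step_q_return` helper sketched earlier. The infinite sum is truncated at the episode boundary: every n-step return with $n \geq T - t$ equals the Monte-Carlo return, so those terms collapse into a single $\lambda^{T-t-1}$ weight.

```python
def lambda_q_return(rewards, states, actions, Q, t, lam, gamma=1.0):
    """Forward-view λ-return q_t^λ = (1 - λ) Σ_{n≥1} λ^{n-1} q_t^(n),
    truncated at the episode end (the tail terms all equal the MC return)."""
    T = len(rewards)
    N = T - t                       # number of distinct n-step returns from time t
    q_lambda = 0.0
    for n in range(1, N):
        q_lambda += (1 - lam) * lam ** (n - 1) * n_step_q_return(
            rewards, states, actions, Q, t, n, gamma)
    # Tail: all n ≥ N give the Monte-Carlo return q_t^(N); their weights sum to λ^(N-1).
    q_lambda += lam ** (N - 1) * n_step_q_return(
        rewards, states, actions, Q, t, N, gamma)
    return q_lambda
```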

Backward View Sarsa($\lambda$)

See the backward-view TD($\lambda$) part of Lect4 for the full derivation; it is not repeated in detail here.

  • Just like TD($\lambda$), we use eligibility traces in an online algorithm

  • But Sarsa($\lambda$) has one eligibility trace for each state-action pair
    $$\begin{aligned} E_0(s,a) &= 0 \\ E_t(s,a) &= \gamma \lambda E_{t-1}(s,a) + 1(S_t = s, A_t = a) \end{aligned}$$

  • $Q(s,a)$ is updated for every state $s$ and action $a$

  • In proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s,a)$
    $$\begin{aligned} \delta_t &= R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \\ Q(s,a) &\leftarrow Q(s,a) + \alpha \delta_t E_t(s,a) \end{aligned}$$

Sarsa($\lambda$) Algorithm

[Figure: Sarsa(λ) algorithm pseudocode]
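
A minimal sketch of backward-view Sarsa(λ) with accumulating eligibility traces is given below, under the same hypothetical environment interface as the earlier sketches; trace-clearing variants and decaying ε are omitted for brevity.

```python
from collections import defaultdict
import random

def sarsa_lambda(env, actions, num_episodes, alpha=0.1, gamma=1.0,
                 lam=0.9, epsilon=0.1):
    """Backward-view Sarsa(λ): every visited (s, a) pair is updated in
    proportion to the TD error δ_t and its eligibility trace E_t(s, a)."""
    Q = defaultdict(float)

    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        E = defaultdict(float)                       # traces reset at episode start
        state = env.reset()
        action = eps_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = eps_greedy(next_state)     # on-policy: A_{t+1} from ε-greedy
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            delta = target - Q[(state, action)]      # δ_t
            E[(state, action)] += 1.0                # accumulating trace
            for sa in list(E):                       # update all traced pairs
                Q[sa] += alpha * delta * E[sa]
                E[sa] *= gamma * lam                 # decay traces
            state, action = next_state, next_action
    return Q
```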

Off-Policy Learning

Importance Sampling

Estimate the expectation of a different distribution:
$$\begin{aligned} \mathbb{E}_{X \sim P}[f(X)] &= \sum P(X)f(X) \\ &= \sum Q(X) \frac{P(X)}{Q(X)}f(X) \\ &= \mathbb{E}_{X \sim Q}\left[\frac{P(X)}{Q(X)}f(X) \right] \end{aligned}$$
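
The identity can be checked numerically with a tiny Monte-Carlo experiment; the distributions P and Q and the function f below are arbitrary choices made only for this example.

```python
import random

P = [0.7, 0.2, 0.1]        # target distribution over {0, 1, 2}
Q = [1/3, 1/3, 1/3]        # behaviour (sampling) distribution
f = [1.0, 5.0, 10.0]       # an arbitrary function of X

exact = sum(p * fx for p, fx in zip(P, f))                   # E_{X~P}[f(X)]

xs = random.choices(range(3), weights=Q, k=200_000)          # sample X ~ Q
estimate = sum(P[x] / Q[x] * f[x] for x in xs) / len(xs)     # E_{X~Q}[(P/Q) f]

print(exact, estimate)     # the two values should be close
```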

Importance Sampling for Off-Policy Monte-Carlo

  • Use returns generated from $\mu$ to evaluate $\pi$

  • Weight the return $G_t$ according to the similarity between the policies, multiplying importance sampling corrections along the whole episode:
    $G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)}\, G_t$

  • Update value towards corrected return
    $V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}G_t^{\pi/\mu}} - V(S_t) \right)$

  • Importance sampling can dramatically increase variance

Importance Sampling for Off-Policy TD

  • Use TD targets generated from $\mu$ to evaluate $\pi$

  • Weight the TD target $R + \gamma V(S')$ by importance sampling; only a single importance sampling correction is needed:
    $V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}{\frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)}\left(R_{t+1} + \gamma V(S_{t+1}) \right)}} - V(S_t) \right)$

  • Much lower variance than Monte-Carlo importance sampling

  • Policies only need to be similar over a single step
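
A single importance-weighted TD(0) update might look like the sketch below; `pi` and `mu` are assumed to be callables returning the probability of an action in a state, and all names are illustrative.

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) update of V towards the importance-weighted
    target (π(a|s)/μ(a|s)) · (R + γ V(S')), matching the formula above."""
    rho = pi(a, s) / mu(a, s)                 # single-step importance correction
    td_target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (td_target - V[s])
    return V
```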

Off-Policy Q-Learning

At state $S_t$, select the action from the behaviour policy, $A_t \sim \mu(\cdot \mid S_t)$, receive the corresponding reward $R_{t+1}$, and arrive at state $S_{t+1}$. Then select a successor action from the estimated (target) policy, $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$, and denote this action $A'$; it is used only in the update target, while the action actually executed next still comes from $\mu$.

Update: $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left({\color{red}{R_{t+1} + \gamma Q(S_{t+1},A')}} - Q(S_t,A_t) \right)$

Special Case
  • The target policy $\pi$ is greedy w.r.t. $Q(s,a)$
    $\pi(S_{t+1}) = \underset{a'}{\operatorname{arg\,max}}\, Q(S_{t+1},a')$

  • The behaviour policy $\mu$ is $\epsilon$-greedy w.r.t. $Q(s,a)$

The Q-learning target then simplifies:
$$\begin{aligned} R_{t+1} + \gamma Q(S_{t+1}, A') &= R_{t+1} + \gamma Q\left(S_{t+1}, \underset{a'}{\operatorname{arg\,max}}\, Q(S_{t+1},a')\right) \\ &= R_{t+1} + \gamma \max_{a'} Q(S_{t+1},a') \end{aligned}$$
Algorithm:

[Figure: Q-learning algorithm pseudocode]
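
The Q-learning control loop in the figure can be sketched as follows, with the same hypothetical environment interface used in the earlier sketches: behave ε-greedily (the behaviour policy μ), but bootstrap from the greedy action value (the target policy π).

```python
from collections import defaultdict
import random

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning sketch: the executed action comes from an ε-greedy
    behaviour policy, while the update target uses max_a' Q(S', a')."""
    Q = defaultdict(float)

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy μ: ε-greedy w.r.t. Q.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target policy π: greedy w.r.t. Q, hence the max over a'.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```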

Summary

[Figures: summary of the relationship between DP and TD methods]
