Lect4_MC_TD_Model_free_prediction

Model-Free Prediction

Estimate the value function of an unknown MDP

Monte-Carlo Learning

Features:

  • learn directly from episodes of experience
  • model-free: no knowledge of MDP transitions / rewards
  • learns from complete episodes: no bootstrapping
  • uses the simplest possible idea: value = mean return
  • can only apply MC to episodic MDPs: all episodes must terminate

Monte-Carlo Policy Evaluation

Goal: learn $v_\pi$ from episodes of experience under policy $\pi$:
$$S_1, A_1, R_2, \dots, S_k \sim \pi$$
The definition of the value function is $v_\pi(s) = \mathbb{E}\left[G_t \mid S_t = s\right]$, but Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return.

First-Visit Monte-Carlo Policy Evaluation

Algorithm (a Python sketch follows the list):

  1. To evaluate state $s$
  2. The first time-step $t$ that state $s$ is visited in an episode,
  3. Increment the counter: $N(s) \leftarrow N(s) + 1$
  4. Increment the total return: $S(s) \leftarrow S(s) + G_t$
  5. The value is estimated by the mean return: $V(s) = \frac{S(s)}{N(s)}$
  6. By the law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
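
A minimal Python sketch of first-visit MC prediction, under the assumption that each episode is a list of `(state, reward)` pairs where `reward` holds $R_{t+1}$, and that `gamma` is the discount factor (these names are illustrative, not from the lecture):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit MC prediction: V(s) = mean of the returns G_t from first visits.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, reward) pairs, where `reward` is the reward received after
    leaving `state` (i.e. R_{t+1}).
    """
    N = defaultdict(int)      # N(s): visit counts
    S = defaultdict(float)    # S(s): sum of first-visit returns
    V = {}

    for episode in episodes:
        # Compute the return G_t for every time step by scanning backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G

        # Only the first visit to each state contributes to the estimate.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]
            V[state] = S[state] / N[state]

    return V
```

Scanning each episode backwards computes all returns in a single pass instead of re-summing rewards for every time step.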

Incremental Monte-Carlo

Foundation:

The mean $\mu_1, \mu_2, \dots$ of a sequence $x_1, x_2, \dots$ can be computed incrementally:
$$\begin{aligned} \mu_k &= \frac{1}{k}\sum_{j=1}^{k} x_j \\ &= \frac{1}{k}\left(x_k + \sum_{j=1}^{k-1} x_j\right) \\ &= \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right) \\ &= \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right) \end{aligned}$$
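
A quick numerical check of the incremental-mean identity, with made-up numbers:

```python
xs = [2.0, 5.0, 1.0, 4.0]            # hypothetical sequence x_1, ..., x_4

mu = 0.0
for k, x in enumerate(xs, start=1):
    mu += (x - mu) / k               # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k

print(mu, sum(xs) / len(xs))         # 3.0 3.0 -- incremental and batch means agree
```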
Algorithm (sketched in code after the list):

  1. Update $V(s)$ incrementally after episode $S_1, A_1, R_2, \dots, S_T$
  2. For each state $S_t$ with return $G_t$:
    1. $N(S_t) \leftarrow N(S_t) + 1$
    2. $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
  3. In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
    1. $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$
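
A sketch of the incremental update for a single episode, assuming the same `(state, reward)` episode format as the first-visit sketch above; passing a constant `alpha` switches to the running-mean (forgetting) variant:

```python
def incremental_mc_update(episode, V, N, gamma=1.0, alpha=None):
    """Update the value table V in place after one episode (every-visit).

    If `alpha` is None, use the exact running mean (step size 1/N(s));
    otherwise use a constant step size, which forgets old episodes.
    """
    # Returns G_t for every time step, computed backwards.
    G = 0.0
    returns = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        _, reward = episode[t]
        G = reward + gamma * G
        returns[t] = G

    for t, (state, _) in enumerate(episode):
        N[state] = N.get(state, 0) + 1
        V.setdefault(state, 0.0)
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (returns[t] - V[state])
```

With `alpha=None` this reproduces the exact mean $V(s) = S(s)/N(s)$; with a constant `alpha` older episodes are forgotten exponentially.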

Temporal-Difference Learning

Features (compare with the MC features above):

  • learn directly from episodes of experience
  • model-free
  • learns from incomplete episodes, by bootstrapping
  • updates a guess towards a guess

MC vs. TD

  1. Incremental every-visit Monte-Carlo

    1. Update value $V(S_t)$ toward the actual return ${\color{red}G_t}$:
      $$V(S_t) \leftarrow V(S_t) + \alpha\left({\color{red}G_t} - V(S_t)\right)$$
  2. Simplest temporal-difference learning algorithm: TD(0) (sketched in code after this list)

    1. Update value $V(S_t)$ toward the estimated return ${\color{red}R_{t+1} + \gamma V(S_{t+1})}$:
      $$V(S_t) \leftarrow V(S_t) + \alpha\left({\color{red}R_{t+1} + \gamma V(S_{t+1})} - V(S_t)\right)$$
    2. $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
    3. $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
  3. MC has high variance, zero bias. Not very sensitive to initial value
    TD has low variance, some bias. More sensitive to initial value

  4. TD exploits Markov property
    MC does not exploit Markov property
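
A sketch of tabular TD(0) prediction, assuming a hypothetical `env` object with `reset()`/`step(action)` returning `(next_state, reward, done)` and a `policy(state)` function; adapt the interface to whatever environment you actually use:

```python
from collections import defaultdict

def td0(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction for a fixed policy.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done);
    these are placeholder conventions, not a specific library's API.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The TD target bootstraps from the current estimate of the next state;
            # terminal states are given value 0.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```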

Unified View

Dynamic Programming Backup

(figure: dynamic programming backup diagram)

Monte-Carlo Backup

(figure: Monte-Carlo backup diagram)

Temporal-Difference Backup

(figure: temporal-difference backup diagram)

Unified View of RL

(figure: unified view of RL backups)

TD($\lambda$)

n-step TD

Consider the following n-step returns for $n = 1, 2, \dots, \infty$:
$$\begin{aligned} n=1 \text{ (TD)}: \quad & G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \\ n=2: \quad & G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\ & \quad \vdots \\ n=\infty \text{ (MC)}: \quad & G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T \end{aligned}$$
n-step temporal-difference learning:
$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)$$
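
A sketch of computing $G_t^{(n)}$ from a stored episode, assuming `rewards[i]` holds $R_{i+1}$ and `states[i]` holds $S_i$ (an illustrative layout, not prescribed by the lecture):

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}).

    `rewards[i]` is assumed to hold R_{i+1} (the reward after leaving states[i]).
    If the episode ends before step t+n, the return truncates at the terminal
    step and no bootstrap term is added (V(terminal) = 0).
    """
    T = len(rewards)                    # episode length
    G = 0.0
    for k in range(min(n, T - t)):
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                       # bootstrap only if S_{t+n} is non-terminal
        G += (gamma ** n) * V[states[t + n]]
    return G
```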

Forward View of TD($\lambda$)

$\lambda$-return

For the same state $S_t$ there can be many different n-step returns. To make efficient use of all this information, we take their weighted average.

Using weight $(1-\lambda)\lambda^{n-1}$ for the n-step return, the $\lambda$-return is $G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$.

TD($\lambda$):
$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^\lambda - V(S_t)\right)$$
The weights sum to 1: $\sum \text{weight} = \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} = (1-\lambda)\cdot\frac{1}{1-\lambda} = 1$ (a geometric series, for $0 \le \lambda < 1$).
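
A sketch of the forward-view $\lambda$-return built on the `n_step_return` helper above; for a finite episode every $n \ge T-t$ yields the full MC return, so their total weight $\lambda^{T-t-1}$ is attached to that single term:

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) G_t^(n)
                    + lam^(T-t-1) * G_t   (remaining weight goes to the MC return)."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):           # n-step returns that still bootstrap
        G_lambda += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, states, V, t, n, gamma)
    # All n >= T - t give the full Monte-Carlo return G_t.
    G_lambda += (lam ** (T - t - 1)) * n_step_return(rewards, states, V, t, T - t, gamma)
    return G_lambda
```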

Features:
  • Update the value function towards the $\lambda$-return
  • The forward view looks into the future to compute $G_t^\lambda$
  • Like MC, it can only be computed from complete episodes

Backward View of TD($\lambda$)

Eligibility Traces

An eligibility trace records the influence of each state $s$ at time-step $t$:
$$\begin{aligned} E_0(s) &= 0 \\ E_t(s) &= \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s) \end{aligned}$$
(figure: eligibility trace)


  • Keep an eligibility trace for every state $s$
  • For each step of the episode, update value $V(s)$ for every state $s$
  • In proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s)$

$$\begin{aligned} \delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\ V(s) &\leftarrow V(s) + \alpha\,\delta_t E_t(s) \end{aligned}$$
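
A sketch of backward-view TD($\lambda$) with accumulating eligibility traces, reusing the hypothetical `env`/`policy` interface assumed in the TD(0) sketch:

```python
from collections import defaultdict

def td_lambda_backward(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda) with accumulating traces.

    Every state's value is nudged by alpha * delta_t * E_t(s) at every step.
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)          # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD error for this transition (terminal states have value 0).
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0             # E_t(s) = gamma*lam*E_{t-1}(s) + 1(S_t = s)
            for s in list(E):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam     # decay all traces
            state = next_state
    return V
```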

Relationship Between Forward and Backward TD
  • When $\lambda = 0$, only the current state $S_t$ is updated at step $t$:
    $$\begin{aligned} E_t(s) &= \mathbf{1}(s = S_t) \\ V(s) &\leftarrow V(s) + \alpha\,\delta_t E_t(s) \end{aligned}$$
    This is exactly equivalent to the TD(0) update:
    $$V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t$$

  • When $\lambda = 1$, credit is deferred until the end of the episode. Consider an episode where $s$ is visited once, at time-step $k$.
    The TD(1) eligibility trace discounts time since the visit:
    $$E_t(s) = \gamma E_{t-1}(s) + \mathbf{1}(S_t = s) = \begin{cases} 0 & \text{if } t < k \\ \gamma^{t-k} & \text{if } t \geq k \end{cases}$$
    TD(1) updates accumulate error online:
    $$\begin{aligned} \sum_{t=1}^{T-1} \alpha\,\delta_t E_t(s) &= \alpha \sum_{t=k}^{T-1} \gamma^{t-k}\delta_t \\ \sum_{t=k}^{T-1} \gamma^{t-k}\delta_t &= \delta_k + \gamma\delta_{k+1} + \gamma^2\delta_{k+2} + \dots + \gamma^{T-1-k}\delta_{T-1} \\ &= R_{k+1} + \gamma V(S_{k+1}) - V(S_k) \\ &\quad + \gamma\left(R_{k+2} + \gamma V(S_{k+2}) - V(S_{k+1})\right) \\ &\quad + \gamma^2\left(R_{k+3} + \gamma V(S_{k+3}) - V(S_{k+2})\right) \\ &\qquad \vdots \\ &\quad + \gamma^{T-1-k}\left(R_T + \gamma V(S_T) - V(S_{T-1})\right) \\ &= R_{k+1} + \gamma R_{k+2} + \gamma^2 R_{k+3} + \dots + \gamma^{T-1-k} R_T + \gamma^{T-k} V(S_T) - V(S_k) \\ &= G_k - V(S_k) \qquad (\text{since } V(S_T) = 0) \end{aligned}$$
    so the total TD(1) update for $s$ over the episode is $\alpha\left(G_k - V(S_k)\right)$.

    1. TD(1) is roughly equivalent to every-visit Monte-Carlo
    2. Error is accumulated online, step-by-step
    3. If the value function is only updated offline at the end of the episode, then the total update is exactly the same as MC
    For offline updates, the backward view and the forward view are equivalent.

    To be honest, I have not fully understood the backward view. I hope someone more knowledgeable can clarify: why is the eligibility trace introduced in the first place? What is the difference between offline and online updates, and how does it show up in the proof?
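
A small numerical check of the telescoping identity above, using made-up rewards and value estimates (the numbers carry no meaning; they only test the algebra):

```python
import random

gamma = 0.9
T, k = 6, 2                                   # episode length T, single visit to s at step k
R = [random.random() for _ in range(T + 1)]   # R[t] stands for R_t; only R[1..T] is used
V = [random.random() for _ in range(T + 1)]   # V[t] stands for V(S_t)
V[T] = 0.0                                    # terminal value is zero

# TD errors delta_t = R_{t+1} + gamma V(S_{t+1}) - V(S_t) for t = k..T-1
deltas = [R[t + 1] + gamma * V[t + 1] - V[t] for t in range(k, T)]
lhs = sum((gamma ** i) * d for i, d in enumerate(deltas))

# Monte-Carlo error G_k - V(S_k)
G_k = sum((gamma ** (t - k)) * R[t + 1] for t in range(k, T))
rhs = G_k - V[k]

print(abs(lhs - rhs) < 1e-12)                 # True: discounted TD errors telescope to G_k - V(S_k)
```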

    Algorithm

(figure: backward-view TD($\lambda$) algorithm)

Comparison of Forward and Backward Views

(figure: comparison of forward and backward views)
