Artificial Stupidity Study Notes: Reinforcement Learning (4) Temporal-Difference Methods

Article link: https://blog.csdn.net/sm9sun/article/details/79540725

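The previous two chapters covered dynamic programming (DP) and Monte Carlo (MC) methods. DP works from the state-transition model, and its state-value estimates are bootstrapped: the update of one state's value depends on the current estimates of other states' values. MC needs no environment model and its state-value estimates are independent of one another, but it only applies to episodic tasks. Temporal-difference learning was developed to get a method that needs no environment model, is not limited to episodic tasks, and can also be used on continuing tasks.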

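Temporal-difference learning (TD learning) combines dynamic programming and Monte Carlo methods and is a central idea of reinforcement learning. The name refers to learning from the difference between value estimates at successive time steps.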

TD learning of the state-value function $v_\pi$ of a policy

One-step TD learning: TD(0)

Initialize $V(s)$ arbitrarily, $\forall s \in \mathcal{S}^+$, except that $V(\text{terminal}) = 0$
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
   $A \leftarrow$ action given by $\pi$ for $S$
   Take action $A$, observe $R, S'$
   $V(S) \leftarrow V(S) + \alpha [R + \gamma V(S') - V(S)]$
   $S \leftarrow S'$
  Until $S$ is terminal
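
The TD(0) loop above can be sketched in Python. This is only an illustration under stated assumptions: the environment is a hypothetical 5-state random walk (states 1 to 5, terminals 0 and 6, reward 1 only when exiting on the right), the policy is uniformly random, and the function name and hyperparameters are made up for this example.

```python
import random

def td0_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) prediction for a uniform random policy on a 5-state random walk."""
    rng = random.Random(seed)
    left, right = 0, 6                 # terminal states; 1..5 are non-terminal
    V = [0.0] * 7                      # V(terminal) stays 0
    for _ in range(num_episodes):
        s = 3                          # Initialize S (start in the middle)
        while s not in (left, right):
            a = rng.choice((-1, 1))    # A <- action given by pi for S
            s_next = s + a             # Take action A, observe R, S'
            r = 1.0 if s_next == right else 0.0
            V[s] += alpha * (r + gamma * V[s_next] - V[s])   # TD(0) update
            s = s_next
    return V[1:6]

# For this walk the true values are 1/6, 2/6, ..., 5/6.
values = td0_random_walk()
```

Each update uses only the single transition just observed, so the estimate changes during the episode instead of waiting for it to finish.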

Multi-step (n-step) TD learning

Input: the policy $\pi$ to be evaluated
Initialize $V(s)$ arbitrarily, $\forall s \in \mathcal{S}$
Parameters: step size $\alpha \in (0, 1]$, a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n+1$

Repeat (for each episode):
  Initialize and store $S_0 \neq$ terminal
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
   If $t < T$, then:
    Take an action according to $\pi(\cdot|S_t)$
    Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
    If $S_{t+1}$ is terminal, then $T \leftarrow t + 1$
   $\tau \leftarrow t - n + 1$  ($\tau$ is the time whose state's estimate is being updated)
   If $\tau \geq 0$:
    $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
    If $\tau + n < T$, then: $G \leftarrow G + \gamma^n V(S_{\tau+n})$  ($G_\tau^{(n)}$)
    $V(S_\tau) \leftarrow V(S_\tau) + \alpha [G - V(S_\tau)]$
  Until $\tau = T - 1$

Here $V(S_0)$ is updated from the segment $S_0, S_1, \dots, S_n$ (the rewards $R_1, \dots, R_n$ plus the bootstrapped value $V(S_n)$), $V(S_1)$ from the segment $S_1, S_2, \dots, S_{n+1}$, and so on.
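
A minimal Python sketch of the n-step TD loop, again on a hypothetical 5-state random walk with a uniform random policy. As a simplification, it keeps the whole episode in Python lists instead of indexing a circular buffer; the function name and hyperparameters are assumptions for illustration.

```python
import random

def n_step_td(n=3, num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """n-step TD prediction on a 5-state random walk (terminals 0 and 6)."""
    rng = random.Random(seed)
    left, right = 0, 6
    V = [0.0] * 7                          # V(terminal) stays 0
    for _ in range(num_episodes):
        states = [3]                       # states[t] is S_t
        rewards = [0.0]                    # rewards[t] is R_t (index 0 unused)
        T = float("inf")
        t = 0
        while True:
            if t < T:
                s_next = states[t] + rng.choice((-1, 1))   # follow pi
                states.append(s_next)
                rewards.append(1.0 if s_next == right else 0.0)
                if s_next in (left, right):
                    T = t + 1
            tau = t - n + 1                # time whose state estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:            # bootstrap only from non-terminal states
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V[1:6]

values = n_step_td()
```

With `n=1` the loop reduces to TD(0); larger `n` shifts the target toward the sampled rewards and away from the bootstrapped estimate.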

On-policy TD learning of the action-value function $q_\pi$ of a policy: Sarsa

One-step TD learning

Initialize $Q(s, a)$, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal}, \cdot) = 0$
Repeat (for each episode):
  Initialize $S$
  Choose $A$ from $S$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
  Repeat (for each step of episode):
   Take action $A$, observe $R, S'$
   Choose $A'$ from $S'$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
   $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]$
   $S \leftarrow S'$; $A \leftarrow A'$
  Until $S$ is terminal
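
As a sketch of the Sarsa loop above, the following Python code uses a hypothetical corridor environment (states 0 to 4, state 4 terminal, action 0 moves left, action 1 moves right, reward -1 per step). The environment, helper names, and hyperparameters are assumptions chosen only to keep the example small.

```python
import random

GOAL = 4  # corridor with states 0..4; state 4 is terminal

def step(s, a):
    """Action 0 moves left (blocked by a wall at 0), action 1 moves right; each step costs -1."""
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

def epsilon_greedy(Q, s, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy action for state s."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

def sarsa(num_episodes=3000, alpha=0.05, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]     # Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            r, s_next = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # On-policy target: uses the action A' that will actually be taken next.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
```

Because the target uses the sampled next action, the learned values describe the $\epsilon$-greedy policy itself, exploration included.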

Multi-step (n-step) TD learning

Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$,
  small $\epsilon > 0$
  a positive integer $n$
All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n+1$

Repeat (for each episode):
  Initialize and store $S_0 \neq$ terminal
  Select and store an action $A_0 \sim \pi(\cdot|S_0)$
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
   If $t < T$, then:
    Take action $A_t$
    Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
    If $S_{t+1}$ is terminal, then:
     $T \leftarrow t + 1$
    Else:
     Select and store an action $A_{t+1} \sim \pi(\cdot|S_{t+1})$
   $\tau \leftarrow t - n + 1$  ($\tau$ is the time whose estimate is being updated)
   If $\tau \geq 0$:
    $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
    If $\tau + n < T$, then: $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$  ($G_\tau^{(n)}$)
    $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha [G - Q(S_\tau, A_\tau)]$
    If $\pi$ is being learned, then ensure that $\pi(\cdot|S_\tau)$ is $\epsilon$-greedy wrt $Q$
  Until $\tau = T - 1$
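
A hedged Python sketch of n-step Sarsa follows. It assumes a hypothetical corridor environment (states 0 to 4, state 4 terminal, reward -1 per step) and stores the whole episode in lists rather than in a circular buffer. Actions are drawn $\epsilon$-greedily from the current `Q` at selection time, which is one simple way to keep $\pi$ $\epsilon$-greedy with respect to $Q$; all names and hyperparameters are illustrative.

```python
import random

GOAL = 4  # corridor with states 0..4; state 4 is terminal

def step(s, a):
    """Action 0 moves left (blocked by a wall at 0), action 1 moves right; each step costs -1."""
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

def epsilon_greedy(Q, s, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

def n_step_sarsa(n=3, num_episodes=3000, alpha=0.05, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(num_episodes):
        states = [0]                                        # states[t] is S_t
        actions = [epsilon_greedy(Q, 0, epsilon, rng)]      # actions[t] is A_t
        rewards = [0.0]                                     # rewards[t] is R_t
        T = float("inf")
        t = 0
        while True:
            if t < T:
                r, s_next = step(states[t], actions[t])
                states.append(s_next)
                rewards.append(r)
                if s_next == GOAL:
                    T = t + 1
                else:
                    actions.append(epsilon_greedy(Q, s_next, epsilon, rng))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n]][actions[tau + n]]
                s_tau, a_tau = states[tau], actions[tau]
                Q[s_tau][a_tau] += alpha * (G - Q[s_tau][a_tau])
            if tau == T - 1:
                break
            t += 1
    return Q

Q = n_step_sarsa()
```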

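Off-policy TD learning of the action-value function $q_\pi$ of a policy: Q-learning

The Q-learning algorithm (Watkins, 1989) was a breakthrough. It learns off-policy using the following update:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$$

One-step TD learning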

Initialize $Q(s, a)$, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal}, \cdot) = 0$
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
   Choose $A$ from $S$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
   Take action $A$, observe $R, S'$
   $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]$
   $S \leftarrow S'$
  Until $S$ is terminal
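
A small Python sketch of one-step Q-learning, assuming a hypothetical deterministic corridor (states 0 to 4, state 4 terminal, reward -1 per step). The action is chosen inside the step loop, and the target takes the maximum over next actions regardless of what the behavior policy does next. Names and hyperparameters are illustrative.

```python
import random

GOAL = 4  # corridor with states 0..4; state 4 is terminal

def step(s, a):
    """Action 0 moves left (blocked by a wall at 0), action 1 moves right; each step costs -1."""
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

def epsilon_greedy(Q, s, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

def q_learning(num_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]     # Q(terminal, .) stays 0
    for _ in range(num_episodes):
        s = 0
        while s != GOAL:
            a = epsilon_greedy(Q, s, epsilon, rng)            # behavior policy
            r, s_next = step(s, a)
            # Off-policy target: greedy value of the next state.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
```

In this corridor the optimal value of moving right from state `s` is `-(GOAL - s)`, so the learned `Q[s][1]` should approach that number even though the agent keeps exploring.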

One-step TD learning with Double Q-learning

Initialize $Q_1(s, a)$ and $Q_2(s, a)$, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily
Initialize $Q_1(\text{terminal}, \cdot) = Q_2(\text{terminal}, \cdot) = 0$
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
   Choose $A$ from $S$ using policy derived from $Q_1$ and $Q_2$ (e.g. $\epsilon$-greedy)
   Take action $A$, observe $R, S'$
   With 0.5 probability:
    $Q_1(S, A) \leftarrow Q_1(S, A) + \alpha [R + \gamma Q_2(S', \arg\max_a Q_1(S', a)) - Q_1(S, A)]$
   Else:
    $Q_2(S, A) \leftarrow Q_2(S, A) + \alpha [R + \gamma Q_1(S', \arg\max_a Q_2(S', a)) - Q_2(S, A)]$
   $S \leftarrow S'$
  Until $S$ is terminal
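
The Double Q-learning loop can be sketched as follows, assuming a hypothetical deterministic corridor (states 0 to 4, state 4 terminal, reward -1 per step). Acting $\epsilon$-greedily on the sum `Q1 + Q2` is one common choice for "policy derived from $Q_1$ and $Q_2$" and is an assumption here, as are the names and hyperparameters.

```python
import random

GOAL = 4  # corridor with states 0..4; state 4 is terminal

def step(s, a):
    """Action 0 moves left (blocked by a wall at 0), action 1 moves right; each step costs -1."""
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

def double_q_learning(num_episodes=5000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q1 = [[0.0, 0.0] for _ in range(GOAL + 1)]
    Q2 = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(num_episodes):
        s = 0
        while s != GOAL:
            # Behavior: epsilon-greedy with respect to Q1 + Q2.
            q_sum = [Q1[s][a] + Q2[s][a] for a in (0, 1)]
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q_sum[0] > q_sum[1] else 1
            r, s_next = step(s, a)
            # Pick which table learns; the other one evaluates the chosen action.
            if rng.random() < 0.5:
                learner, evaluator = Q1, Q2
            else:
                learner, evaluator = Q2, Q1
            a_star = max((0, 1), key=lambda b: learner[s_next][b])
            target = r + gamma * evaluator[s_next][a_star]
            learner[s][a] += alpha * (target - learner[s][a])
            s = s_next
    return Q1, Q2

Q1, Q2 = double_q_learning()
```

Separating action selection (`learner`) from action evaluation (`evaluator`) is what distinguishes this from plain Q-learning, where the same table does both.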

Off-policy TD learning of the action-value function $q_\pi$ of a policy (by importance sampling): Sarsa

Multi-step (n-step) TD learning

Input: behavior policy $\mu$ such that $\mu(a|s) > 0$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$,
  small $\epsilon > 0$
  a positive integer $n$
All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n+1$

Repeat (for each episode):
  Initialize and store $S_0 \neq$ terminal
  Select and store an action $A_0 \sim \mu(\cdot|S_0)$
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
   If $t < T$, then:
    Take action $A_t$
    Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
    If $S_{t+1}$ is terminal, then:
     $T \leftarrow t + 1$
    Else:
     Select and store an action $A_{t+1} \sim \mu(\cdot|S_{t+1})$
   $\tau \leftarrow t - n + 1$  ($\tau$ is the time whose estimate is being updated)
   If $\tau \geq 0$:
    $\rho \leftarrow \prod_{i=\tau+1}^{\min(\tau+n, T-1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)}$  ($\rho_{\tau+n}^{(\tau+1)}$)
    $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
    If $\tau + n < T$, then: $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$  ($G_\tau^{(n)}$)
    $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \rho [G - Q(S_\tau, A_\tau)]$
    If $\pi$ is being learned, then ensure that $\pi(\cdot|S_\tau)$ is $\epsilon$-greedy wrt $Q$
  Until $\tau = T - 1$
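
A hedged Python sketch of off-policy n-step Sarsa with importance sampling. It assumes a hypothetical corridor environment (states 0 to 4, state 4 terminal, reward -1 per step), a uniform random behavior policy $\mu$, and an $\epsilon$-greedy target policy $\pi$. The importance ratio here covers actions $A_{\tau+1}$ through $A_{\tau+n}$, so the bootstrapped action is corrected as well; that range, the names, and the hyperparameters are assumptions of this sketch.

```python
import random

GOAL = 4  # corridor with states 0..4; state 4 is terminal

def step(s, a):
    """Action 0 moves left (blocked by a wall at 0), action 1 moves right; each step costs -1."""
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

def pi_prob(Q, s, a, epsilon):
    """pi(a|s) for a target policy that is epsilon-greedy with respect to Q (two actions)."""
    greedy = 0 if Q[s][0] > Q[s][1] else 1
    return 1.0 - epsilon / 2 if a == greedy else epsilon / 2

def off_policy_n_step_sarsa(n=2, num_episodes=3000, alpha=0.05, gamma=1.0,
                            epsilon=0.1, seed=0):
    rng = random.Random(seed)
    mu = 0.5                                   # behavior policy: uniform over two actions
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(num_episodes):
        states, actions, rewards = [0], [rng.randrange(2)], [0.0]
        T, t = float("inf"), 0
        while True:
            if t < T:
                r, s_next = step(states[t], actions[t])
                states.append(s_next)
                rewards.append(r)
                if s_next == GOAL:
                    T = t + 1
                else:
                    actions.append(rng.randrange(2))       # A_{t+1} ~ mu
            tau = t - n + 1
            if tau >= 0:
                rho = 1.0                                   # importance sampling ratio
                for i in range(tau + 1, min(tau + n, T - 1) + 1):
                    rho *= pi_prob(Q, states[i], actions[i], epsilon) / mu
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n]][actions[tau + n]]
                s_tau, a_tau = states[tau], actions[tau]
                Q[s_tau][a_tau] += alpha * rho * (G - Q[s_tau][a_tau])
            if tau == T - 1:
                break
            t += 1
    return Q

Q = off_policy_n_step_sarsa()
```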

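Off-policy TD learning of the action-value function $q_\pi$ of a policy (without importance sampling): Tree Backup Algorithm

The idea of the Tree Backup Algorithm is to take the expected action value at every step. Taking the expectation means that every possible action $a$ is evaluated once, weighted by its probability under $\pi$.

Multi-step (n-step) TD learning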

Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$,
  small $\epsilon > 0$
  a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n+1$

Repeat (for each episode):
  Initialize and store $S_0 \neq$ terminal
  Select and store an action $A_0 \sim \pi(\cdot|S_0)$
  $Q_0 \leftarrow Q(S_0, A_0)$
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
   If $t < T$, then:
    Take action $A_t$
    Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
    If $S_{t+1}$ is terminal, then:
     $T \leftarrow t + 1$
     $\delta_t \leftarrow R_{t+1} - Q_t$
    Else:
     $\delta_t \leftarrow R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q_t$
     Select arbitrarily and store an action as $A_{t+1}$
     $Q_{t+1} \leftarrow Q(S_{t+1}, A_{t+1})$
     $\pi_{t+1} \leftarrow \pi(A_{t+1}|S_{t+1})$
   $\tau \leftarrow t - n + 1$  ($\tau$ is the time whose estimate is being updated)
   If $\tau \geq 0$:
    $E \leftarrow 1$
    $G \leftarrow Q_\tau$
    For $k = \tau, \dots, \min(\tau+n-1, T-1)$:
     $G \leftarrow G + E \delta_k$
     $E \leftarrow \gamma E \pi_{k+1}$
    $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha [G - Q(S_\tau, A_\tau)]$
    If $\pi$ is being learned, then ensure that $\pi(a|S_\tau)$ is $\epsilon$-greedy wrt $Q(S_\tau, \cdot)$
  Until $\tau = T - 1$
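
The tree-backup loop can be sketched in Python as below. The sketch assumes a hypothetical corridor environment (states 0 to 4, state 4 terminal, reward -1 per step), picks the "arbitrary" actions uniformly at random, and uses an $\epsilon$-greedy target policy; list names such as `q_seen`, `pi_seen`, and `deltas` are made up to hold the stored $Q_t$, $\pi_t$, and $\delta_t$.

```python
import random

GOAL = 4  # corridor with states 0..4; state 4 is terminal

def step(s, a):
    """Action 0 moves left (blocked by a wall at 0), action 1 moves right; each step costs -1."""
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

def pi_prob(Q, s, a, epsilon):
    """pi(a|s) for a target policy that is epsilon-greedy with respect to Q (two actions)."""
    greedy = 0 if Q[s][0] > Q[s][1] else 1
    return 1.0 - epsilon / 2 if a == greedy else epsilon / 2

def n_step_tree_backup(n=3, num_episodes=3000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(num_episodes):
        states, actions = [0], [rng.randrange(2)]
        q_seen = [Q[0][actions[0]]]      # Q_t, stored when A_t was selected
        pi_seen = [1.0]                  # pi_t = pi(A_t|S_t); index 0 is never used
        deltas = []                      # delta_t
        T, t = float("inf"), 0
        while True:
            if t < T:
                r, s_next = step(states[t], actions[t])
                states.append(s_next)
                if s_next == GOAL:
                    T = t + 1
                    deltas.append(r - q_seen[t])
                else:
                    expected = sum(pi_prob(Q, s_next, a, epsilon) * Q[s_next][a]
                                   for a in (0, 1))
                    deltas.append(r + gamma * expected - q_seen[t])
                    a_next = rng.randrange(2)        # selected arbitrarily
                    actions.append(a_next)
                    q_seen.append(Q[s_next][a_next])
                    pi_seen.append(pi_prob(Q, s_next, a_next, epsilon))
            tau = t - n + 1
            if tau >= 0:
                E, G = 1.0, q_seen[tau]
                for k in range(tau, min(tau + n - 1, T - 1) + 1):
                    G += E * deltas[k]
                    if k + 1 < T:                    # pi_{k+1} exists only before termination
                        E *= gamma * pi_seen[k + 1]
                s_tau, a_tau = states[tau], actions[tau]
                Q[s_tau][a_tau] += alpha * (G - Q[s_tau][a_tau])
            if tau == T - 1:
                break
            t += 1
    return Q

Q = n_step_tree_backup()
```

No importance ratio appears anywhere: the probabilities $\pi_{k+1}$ weight how far each later TD error is allowed to propagate back.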

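Off-policy TD learning of the action-value function $q_\pi$ of a policy: Q(σ)

Q(σ) unifies Sarsa (with importance sampling), Expected Sarsa, and Tree Backup, and takes importance sampling into account.
When σ=1, it is the Sarsa algorithm with importance sampling.
When σ=0, it is the Tree Backup algorithm, which uses the expected action value.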
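
To make the role of σ concrete, here is a small sketch of the one-step Q(σ) TD error, $\delta = R + \gamma[\sigma Q(S', A') + (1-\sigma)\sum_a \pi(a|S') Q(S', a)] - Q(S, A)$. The function name and the toy tables `Q` and `pi` are hypothetical and only serve the illustration.

```python
def q_sigma_td_error(Q, pi, s, a, r, s_next, a_next, sigma, gamma=1.0):
    """One-step Q(sigma) TD error: sigma blends the sampled and the expected next value."""
    sampled = Q[s_next][a_next]                                    # Sarsa-style term
    expected = sum(p * q for p, q in zip(pi[s_next], Q[s_next]))   # tree-backup-style term
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected) - Q[s][a]

# Toy tables: two states, two actions each.
Q = {"s0": [1.0, 2.0], "s1": [4.0, 0.0]}
pi = {"s0": [0.5, 0.5], "s1": [0.25, 0.75]}

# Transition: S = s0, A = 0, R = 1, S' = s1, A' = 0.
sarsa_like = q_sigma_td_error(Q, pi, "s0", 0, 1.0, "s1", 0, sigma=1.0)   # 1 + 4.0 - 1 = 4.0
tree_like = q_sigma_td_error(Q, pi, "s0", 0, 1.0, "s1", 0, sigma=0.0)    # 1 + 1.0 - 1 = 1.0
halfway = q_sigma_td_error(Q, pi, "s0", 0, 1.0, "s1", 0, sigma=0.5)      # 1 + 2.5 - 1 = 2.5
```

Intermediate values of σ interpolate linearly between the two targets, and σ may be chosen separately at each step.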

Multi-step (n-step) TD learning

Input: behavior policy $\mu$ such that $\mu(a|s) > 0$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$,
  small $\epsilon > 0$
  a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n+1$

Repeat (for each episode):
  Initialize and store $S_0 \neq$ terminal
  Select and store an action $A_0 \sim \mu(\cdot|S_0)$
  $Q_0 \leftarrow Q(S_0, A_0)$
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
   If $t < T$, then:
    Take action $A_t$
    Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
    If $S_{t+1}$ is terminal, then:
     $T \leftarrow t + 1$
     $\delta_t \leftarrow R_{t+1} - Q_t$
    Else:
     Select and store an action $A_{t+1} \sim \mu(\cdot|S_{t+1})$
     Select and store $\sigma_{t+1}$
     $Q_{t+1} \leftarrow Q(S_{t+1}, A_{t+1})$
     $\delta_t \leftarrow R_{t+1} + \gamma \sigma_{t+1} Q_{t+1} + \gamma (1 - \sigma_{t+1}) \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q_t$
     $\pi_{t+1} \leftarrow \pi(A_{t+1}|S_{t+1})$
     $\rho_{t+1} \leftarrow \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}$
   $\tau \leftarrow t - n + 1$  ($\tau$ is the time whose estimate is being updated)
   If $\tau \geq 0$:
    $\rho \leftarrow 1$
    $E \leftarrow 1$
    $G \leftarrow Q_\tau$
    For $k = \tau, \dots, \min(\tau+n-1, T-1)$:
     $G \leftarrow G + E \delta_k$
     $E \leftarrow \gamma E [(1 - \sigma_{k+1}) \pi_{k+1} + \sigma_{k+1}]$
     $\rho \leftarrow \rho (1 - \sigma_k + \sigma_k \rho_k)$
    $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \rho [G - Q(S_\tau, A_\tau)]$
    If $\pi$ is being learned, then ensure that $\pi(a|S_\tau)$ is $\epsilon$-greedy wrt $Q(S_\tau, \cdot)$
  Until $\tau = T - 1$

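Summary: Monte Carlo methods simulate (or experience) a whole episode and, once it has ended, estimate state values from the returns observed along it. Temporal-difference learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value of the new state to update the estimate for the state it started from.
Monte Carlo can therefore be viewed as temporal-difference learning with the maximum number of steps.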
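
The relation between the two can be checked numerically with the n-step return: once $n$ reaches the remaining episode length, no bootstrapping term is left and the target is the full Monte Carlo return. The helper below and its toy numbers are hypothetical, chosen so the arithmetic is easy to follow.

```python
def n_step_return(rewards, values, tau, n, gamma=1.0):
    """n-step return from time tau.

    rewards[i] is R_i (rewards[0] is unused), values[i] is V(S_i),
    and the episode ends at T = len(rewards) - 1.
    """
    T = len(rewards) - 1
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:                       # bootstrap only if the episode has not ended
        G += gamma ** n * values[tau + n]
    return G

rewards = [0.0, 1.0, 0.0, 2.0, 4.0]       # R_1..R_4, episode ends at T = 4
values = [0.5, 0.25, 1.0, 3.0, 0.0]       # V(S_0)..V(S_4); the terminal value is 0
gamma = 0.5

one_step = n_step_return(rewards, values, tau=0, n=1, gamma=gamma)    # 1 + 0.5 * 0.25 = 1.125
two_step = n_step_return(rewards, values, tau=0, n=2, gamma=gamma)    # 1 + 0 + 0.25 * 1.0 = 1.25
mc_return = n_step_return(rewards, values, tau=0, n=4, gamma=gamma)   # 1 + 0 + 0.5 + 0.5 = 2.0
longer = n_step_return(rewards, values, tau=0, n=10, gamma=gamma)     # identical to mc_return
```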