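In the previous two chapters we studied dynamic programming (DP) and Monte Carlo (MC) methods. DP relies on a model of the environment's state transitions, and its state-value estimates are bootstrapped: the update of one state's value depends on the current estimates of other states' values. MC needs no environment model and estimates each state's value independently of the others, but it only works for episodic tasks, because it has to wait until an episode ends. Temporal-difference learning was developed to combine these strengths: it needs no environment model, is not restricted to episodic tasks, and can also be applied to continuing tasks.
Temporal-difference learning (TD learning) combines ideas from dynamic programming and Monte Carlo methods and is a central idea of reinforcement learning. The name refers to learning from the difference between value estimates at successive time steps.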
TD learning of the state-value function $v_\pi$ of a policy
One-step TD learning: TD(0)
Initialize $V(s)$ arbitrarily, $\forall s \in \mathcal{S}^+$
Repeat (for each episode):
    Initialize $S$
    Repeat (for each step of episode):
        $A \leftarrow$ action given by $\pi$ for $S$
        Take action $A$, observe $R, S'$
        $V(S) \leftarrow V(S) + \alpha [R + \gamma V(S') - V(S)]$
        $S \leftarrow S'$
    Until $S$ is terminal
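The TD(0) loop above can be sketched in Python. The 5-state random walk, the function name `td0_random_walk`, and all parameter values are assumptions chosen for illustration; they do not come from the original text.

```python
import random

def td0_random_walk(episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) evaluation of a uniform random policy.

    Hypothetical toy environment: states 1..5 are non-terminal, states 0
    and 6 are terminal. Entering state 6 pays reward 1, every other
    transition pays 0, so the true value of state s is s / 6.
    """
    rng = random.Random(seed)
    V = [0.0] * 7                        # V[0] and V[6] stay 0 (terminal)
    for _ in range(episodes):
        s = 3                            # each episode starts in the middle
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))     # action given by pi
            r = 1.0 if s_next == 6 else 0.0
            # TD(0): move V(S) toward the bootstrapped target R + gamma V(S')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V

values = td0_random_walk()
print([round(v, 2) for v in values[1:6]])
```

The update runs after every single step, so no episode has to finish before learning starts. With a constant step size the estimates keep fluctuating around the true values rather than settling exactly.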
Multi-step (n-step) TD learning
Input: the policy $\pi$ to be evaluated
Initialize $V(s)$ arbitrarily, $\forall s \in \mathcal{S}$
Parameters: step size $\alpha \in (0, 1]$, a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n$
Repeat (for each episode):
    Initialize and store $S_0 \neq$ terminal
    $T \leftarrow \infty$
    For $t = 0, 1, 2, \dots$:
        If $t < T$, then:
            Take an action according to $\pi(\cdot \mid S_t)$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal, then $T \leftarrow t + 1$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose state's estimate is being updated)
        If $\tau \geq 0$:
            $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
            If $\tau + n < T$, then: $G \leftarrow G + \gamma^n V(S_{\tau+n})$ (this is $G_\tau^{(n)}$)
            $V(S_\tau) \leftarrow V(S_\tau) + \alpha [G - V(S_\tau)]$
    Until $\tau = T - 1$
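The update of $V(S_0)$ uses the trajectory from $S_0$ to $S_n$: the rewards $R_1, \dots, R_n$ plus the bootstrapped estimate $V(S_n)$. Likewise, the update of $V(S_1)$ uses $R_2, \dots, R_{n+1}$ and $V(S_{n+1})$.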
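A minimal Python sketch of n-step TD prediction follows. It stores the whole trajectory in lists instead of indexing mod $n$, and it reuses a hypothetical 5-state random walk; the function name `n_step_td` and the parameter values are assumptions for illustration.

```python
import random

def n_step_td(n=3, episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """n-step TD evaluation of a uniform random policy on a toy random walk.

    Hypothetical environment: states 1..5 are non-terminal, 0 and 6 are
    terminal, and only the transition into state 6 pays reward 1.
    """
    rng = random.Random(seed)
    V = [0.0] * 7                         # terminal values stay 0
    for _ in range(episodes):
        states = [3]                      # states[t] holds S_t
        rewards = [0.0]                   # rewards[t] holds R_t (index 0 unused)
        T = float("inf")
        t = 0
        while True:
            if t < T:
                s_next = states[t] + rng.choice((-1, 1))
                rewards.append(1.0 if s_next == 6 else 0.0)
                states.append(s_next)
                if s_next in (0, 6):
                    T = t + 1
            tau = t - n + 1               # time whose estimate is updated
            if tau >= 0:
                last = min(tau + n, T)
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, last + 1))
                # If tau + n == T the bootstrap state is terminal (value 0),
                # so the bootstrap term is only needed when tau + n < T.
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

print([round(v, 2) for v in n_step_td()[1:6]])
```

Setting `n=1` reduces this to the one-step method, while an `n` larger than any episode length makes every update use the full return.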
On-policy TD learning of the action-value function $q_\pi$: Sarsa
One-step TD learning
Initialize $Q(s, a)$, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal}, \cdot) = 0$
Repeat (for each episode):
    Initialize $S$
    Choose $A$ from $S$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
    Repeat (for each step of episode):
        Take action $A$, observe $R, S'$
        Choose $A'$ from $S'$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
        $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]$
        $S \leftarrow S'$; $A \leftarrow A'$
    Until $S$ is terminal
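The one-step Sarsa loop above might look as follows in Python. The corridor environment, the helper names `step` and `epsilon_greedy`, and the parameter values are all hypothetical and serve only to make the sketch runnable.

```python
import random

GOAL = 4                          # corridor states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1

def step(s, a):
    """Hypothetical dynamics: reward -1 per move, a wall blocks moves left of 0."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    return -1.0, s_next

def epsilon_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.choice((LEFT, RIGHT))
    return max((LEFT, RIGHT), key=lambda a: Q[s][a])

def sarsa(episodes=2000, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]    # Q[GOAL] stays [0, 0]
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, eps, rng)
        while s != GOAL:
            r, s_next = step(s, a)
            a_next = epsilon_greedy(Q, s_next, eps, rng)
            # On-policy target: uses the action A' that will really be taken.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
print([("right" if q[RIGHT] > q[LEFT] else "left") for q in Q[:GOAL]])
```

Because the target uses the action actually selected by the $\epsilon$-greedy policy, the learned values include the cost of occasional exploratory moves.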
Multi-step (n-step) TD learning
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n$
Repeat (for each episode):
    Initialize and store $S_0 \neq$ terminal
    Select and store an action $A_0 \sim \pi(\cdot \mid S_0)$
    $T \leftarrow \infty$
    For $t = 0, 1, 2, \dots$:
        If $t < T$, then:
            Take action $A_t$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal, then:
                $T \leftarrow t + 1$
            Else:
                Select and store an action $A_{t+1} \sim \pi(\cdot \mid S_{t+1})$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
        If $\tau \geq 0$:
            $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
            If $\tau + n < T$, then: $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$ (this is $G_\tau^{(n)}$)
            $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha [G - Q(S_\tau, A_\tau)]$
            If $\pi$ is being learned, then ensure that $\pi(\cdot \mid S_\tau)$ is $\epsilon$-greedy wrt $Q$
    Until $\tau = T - 1$
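The core of n-step Sarsa is the return $G$ computed from the stored trajectory. A small sketch, assuming the trajectory is kept in plain lists and $Q$ in a dict keyed by `(state, action)`; the function name and data layout are illustrative assumptions:

```python
def n_step_sarsa_return(rewards, states, actions, Q, tau, n, T, gamma):
    """Return the n-step Sarsa target for time tau.

    rewards[i] holds R_i (index 0 unused), states[i] holds S_i and
    actions[i] holds A_i. Q maps (state, action) to the current estimate.
    """
    last = min(tau + n, T)
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, last + 1))
    # Bootstrap only if the n-step horizon ends before the episode does;
    # at the terminal state the action values are 0 and no action is stored.
    if tau + n < T:
        G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
    return G

# Tiny hand-made episode: S_3 is terminal, so T = 3.
rewards = [0.0, 1.0, 2.0, 3.0]
states = ["a", "b", "c", "end"]
actions = [0, 1, 0]
Q = {("a", 0): 0.0, ("b", 1): 0.0, ("c", 0): 10.0}

G0 = n_step_sarsa_return(rewards, states, actions, Q, tau=0, n=2, T=3, gamma=0.5)
G1 = n_step_sarsa_return(rewards, states, actions, Q, tau=1, n=2, T=3, gamma=0.5)
print(G0, G1)   # 4.5 (bootstraps from Q[("c", 0)]) and 3.5 (reaches the end)
```

The update itself is then `Q[(s, a)] += alpha * (G - Q[(s, a)])` for the pair stored at time `tau`.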
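Off-policy TD learning of the action-value function: Q-learning
Q-learning (Watkins, 1989) was a breakthrough algorithm. It learns off-policy with the update below: the target takes the maximum over the next actions instead of the action the behavior policy actually selects, so the learned $Q$ directly approximates the optimal action-value function $q_*$.
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$
One-step TD learning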
Initialize $Q(s, a)$, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal}, \cdot) = 0$
Repeat (for each episode):
    Initialize $S$
    Repeat (for each step of episode):
        Choose $A$ from $S$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
        Take action $A$, observe $R, S'$
        $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]$
        $S \leftarrow S'$
    Until $S$ is terminal
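A minimal Q-learning sketch in Python, assuming a hypothetical deterministic corridor in which every move costs -1. The environment, the function names, and the parameter values are illustrative assumptions. Note that the action is chosen inside the step loop, once per step.

```python
import random

GOAL = 4                          # corridor states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1

def step(s, a):
    """Hypothetical dynamics: reward -1 per move, a wall blocks moves left of 0."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    return -1.0, s_next

def q_learning(episodes=2000, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]    # Q[GOAL] stays [0, 0]
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            # Behavior policy: epsilon-greedy with respect to the current Q.
            if rng.random() < eps:
                a = rng.choice((LEFT, RIGHT))
            else:
                a = max((LEFT, RIGHT), key=lambda b: Q[s][b])
            r, s_next = step(s, a)
            # Off-policy target: greedy value of S', whatever is done next.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
print([round(q[RIGHT], 2) for q in Q[:GOAL]])
```

In this deterministic corridor the optimal value of moving right from state `s` is `-(GOAL - s)`, and the learned `Q[s][RIGHT]` should approach it even though the behavior policy keeps exploring.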
One-step TD learning with Double Q-learning
Initialize $Q_1(s, a)$ and $Q_2(s, a)$, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily
Initialize $Q_1(\text{terminal}, \cdot) = Q_2(\text{terminal}, \cdot) = 0$
Repeat (for each episode):
    Initialize $S$
    Repeat (for each step of episode):
        Choose $A$ from $S$ using policy derived from $Q_1$ and $Q_2$ (e.g. $\epsilon$-greedy)
        Take action $A$, observe $R, S'$
        With 0.5 probability:
            $Q_1(S, A) \leftarrow Q_1(S, A) + \alpha [R + \gamma Q_2(S', \arg\max_a Q_1(S', a)) - Q_1(S, A)]$
        Else:
            $Q_2(S, A) \leftarrow Q_2(S, A) + \alpha [R + \gamma Q_1(S', \arg\max_a Q_2(S', a)) - Q_2(S, A)]$
        $S \leftarrow S'$
    Until $S$ is terminal
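The distinctive step of Double Q-learning is that one table picks the greedy action while the other table evaluates it. The sketch below isolates a single update; the function name `double_q_update`, the dict-of-lists layout, and the explicit `update_first` flag (standing in for the coin flip) are assumptions made for illustration.

```python
def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, update_first):
    """Apply one Double Q-learning update in place.

    Q1 and Q2 map a state to a list of action values. update_first plays
    the role of the fair coin: True updates Q1, False updates Q2.
    """
    learner, evaluator = (Q1, Q2) if update_first else (Q2, Q1)
    # The table being updated selects the greedy next action ...
    best = max(range(len(learner[s_next])), key=lambda b: learner[s_next][b])
    # ... and the other table supplies the value of that action.
    target = r + gamma * evaluator[s_next][best]
    learner[s][a] += alpha * (target - learner[s][a])

Q1 = {"s": [0.0, 0.0], "next": [1.0, 5.0]}
Q2 = {"s": [0.0, 0.0], "next": [2.0, 3.0]}
double_q_update(Q1, Q2, "s", 0, r=1.0, s_next="next",
                alpha=0.5, gamma=0.9, update_first=True)
print(round(Q1["s"][0], 4))   # 1.85: target is 1 + 0.9 * Q2["next"][1] = 3.7
```

Plain Q-learning on `Q1` alone would have used `max(Q1["next"]) = 5.0` in the target. Evaluating the chosen action with the second table is what reduces the maximization bias. In a training loop, `update_first` would be drawn as `rng.random() < 0.5` at every step.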
Off-policy TD learning of the action-value function $q_\pi$ (by importance sampling): Sarsa
Multi-step (n-step) TD learning
Input: behavior policy $\mu$ such that $\mu(a \mid s) > 0$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n$
Repeat (for each episode):
    Initialize and store $S_0 \neq$ terminal
    Select and store an action $A_0 \sim \mu(\cdot \mid S_0)$
    $T \leftarrow \infty$
    For $t = 0, 1, 2, \dots$:
        If $t < T$, then:
            Take action $A_t$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal, then:
                $T \leftarrow t + 1$
            Else:
                Select and store an action $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
        If $\tau \geq 0$:
            $\rho \leftarrow \prod_{i=\tau+1}^{\min(\tau+n-1, T-1)} \frac{\pi(A_i \mid S_i)}{\mu(A_i \mid S_i)}$ (this is $\rho_{\tau+n}^{(\tau+1)}$)
            $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
            If $\tau + n < T$, then: $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$ (this is $G_\tau^{(n)}$)
            $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \rho [G - Q(S_\tau, A_\tau)]$
            If $\pi$ is being learned, then ensure that $\pi(\cdot \mid S_\tau)$ is $\epsilon$-greedy wrt $Q$
    Until $\tau = T - 1$
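What separates this variant from on-policy n-step Sarsa is the importance-sampling ratio $\rho$ that scales the update. A sketch of that product, assuming policies are stored as dicts from state to a list of action probabilities; the function name and data layout are illustrative assumptions:

```python
def importance_ratio(pi, mu, states, actions, tau, n, T):
    """Product of pi(A_i|S_i) / mu(A_i|S_i) for i = tau+1 .. min(tau+n-1, T-1).

    pi and mu map a state to a list of action probabilities; states[i] and
    actions[i] hold S_i and A_i. mu must give every action positive probability.
    """
    rho = 1.0
    for i in range(tau + 1, min(tau + n - 1, T - 1) + 1):
        s, a = states[i], actions[i]
        rho *= pi[s][a] / mu[s][a]
    return rho

pi = {0: [0.9, 0.1], 1: [0.2, 0.8]}     # target policy (hypothetical numbers)
mu = {0: [0.5, 0.5], 1: [0.5, 0.5]}     # uniform behavior policy
states = [0, 1, 0, 1]
actions = [0, 1, 0, 1]

rho3 = importance_ratio(pi, mu, states, actions, tau=0, n=3, T=10)
rho1 = importance_ratio(pi, mu, states, actions, tau=0, n=1, T=10)
print(round(rho3, 4), rho1)   # 2.88 = (0.8/0.5) * (0.9/0.5), and 1.0
```

For `n=1` the product is empty, so $\rho = 1$ and the update needs no correction. The ratio starts at $\tau + 1$ because the first action $A_\tau$ is the one being evaluated and is taken as given.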
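Off-policy TD learning of the action-value function $q_\pi$ (without importance sampling): the Tree Backup Algorithm
The idea of the Tree Backup Algorithm is to back up the expected action value at every step. Taking that expectation means evaluating every possible action $a$ in the state, weighted by $\pi(a \mid s)$, rather than only the action that was actually taken.
Multi-step (n-step) TD learning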
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n$
Repeat (for each episode):
    Initialize and store $S_0 \neq$ terminal
    Select and store an action $A_0 \sim \pi(\cdot \mid S_0)$
    $Q_0 \leftarrow Q(S_0, A_0)$
    $T \leftarrow \infty$
    For $t = 0, 1, 2, \dots$:
        If $t < T$, then:
            Take action $A_t$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal, then:
                $T \leftarrow t + 1$
                $\delta_t \leftarrow R_{t+1} - Q_t$
            Else:
                $\delta_t \leftarrow R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q_t$
                Select arbitrarily and store an action as $A_{t+1}$
                $Q_{t+1} \leftarrow Q(S_{t+1}, A_{t+1})$
                $\pi_{t+1} \leftarrow \pi(A_{t+1} \mid S_{t+1})$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
        If $\tau \geq 0$:
            $E \leftarrow 1$
            $G \leftarrow Q_\tau$
            For $k = \tau, \dots, \min(\tau+n-1, T-1)$:
                $G \leftarrow G + E \delta_k$
                $E \leftarrow \gamma E \pi_{k+1}$
            $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha [G - Q(S_\tau, A_\tau)]$
            If $\pi$ is being learned, then ensure that $\pi(a \mid S_\tau)$ is $\epsilon$-greedy wrt $Q(S_\tau, \cdot)$
    Until $\tau = T - 1$
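The inner loop that accumulates $G$ from the stored TD errors $\delta_k$ and probabilities $\pi_k$ can be sketched as a small function. The name `tree_backup_return` and the list-based storage are assumptions for illustration; the numbers in the example are made up.

```python
def tree_backup_return(Q_tau, deltas, pis, tau, n, T, gamma):
    """Accumulate the n-step tree-backup return for time tau.

    deltas[k] holds delta_k (the expected-value TD error at step k) and
    pis[k] holds pi_k = pi(A_k | S_k). Q_tau is the stored Q(S_tau, A_tau).
    """
    E, G = 1.0, Q_tau
    last = min(tau + n - 1, T - 1)
    for k in range(tau, last + 1):
        G += E * deltas[k]
        if k < last:                 # E is only needed for the next iteration
            E *= gamma * pis[k + 1]  # weight of continuing along the taken action
    return G

deltas = [2.0, 4.0]        # delta_0, delta_1 (hypothetical values)
pis = [1.0, 0.5, 0.25]     # pi_0 is unused; pi_1 = 0.5

G_two = tree_backup_return(1.0, deltas, pis, tau=0, n=2, T=10, gamma=0.5)
G_one = tree_backup_return(1.0, deltas, pis, tau=0, n=1, T=10, gamma=0.5)
print(G_two, G_one)   # 4.0 = 1 + 2 + (0.5 * 0.5) * 4, and 3.0 = 1 + 2
```

Each later TD error is discounted by $\gamma$ and by the target-policy probability of the action actually taken, which is why no importance-sampling ratio is needed.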
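Off-policy TD learning of the action-value function $q_\pi$: $Q(\sigma)$
$Q(\sigma)$ unifies Sarsa (with importance sampling), Expected Sarsa, and the Tree Backup algorithm. At each step, the parameter $\sigma$ chooses between sampling the next action, corrected by importance sampling, and taking the expectation over actions.
When $\sigma = 1$, the step is a full sample, as in Sarsa with importance sampling.
When $\sigma = 0$, the step uses the expected action value, as in Tree Backup.
Multi-step (n-step) TD learning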
Input: behavior policy $\mu$ such that $\mu(a \mid s) > 0$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n$
Repeat (for each episode):
    Initialize and store $S_0 \neq$ terminal
    Select and store an action $A_0 \sim \mu(\cdot \mid S_0)$
    $Q_0 \leftarrow Q(S_0, A_0)$
    $T \leftarrow \infty$
    For $t = 0, 1, 2, \dots$:
        If $t < T$, then:
            Take action $A_t$
            Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
            If $S_{t+1}$ is terminal, then:
                $T \leftarrow t + 1$
                $\delta_t \leftarrow R_{t+1} - Q_t$
            Else:
                Select and store an action $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$
                Select and store $\sigma_{t+1}$
                $Q_{t+1} \leftarrow Q(S_{t+1}, A_{t+1})$
                $\delta_t \leftarrow R_{t+1} + \gamma \sigma_{t+1} Q_{t+1} + \gamma (1 - \sigma_{t+1}) \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q_t$
                $\pi_{t+1} \leftarrow \pi(A_{t+1} \mid S_{t+1})$
                $\rho_{t+1} \leftarrow \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})}$
        $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
        If $\tau \geq 0$:
            $\rho \leftarrow 1$
            $E \leftarrow 1$
            $G \leftarrow Q_\tau$
            For $k = \tau, \dots, \min(\tau+n-1, T-1)$:
                $G \leftarrow G + E \delta_k$
                $E \leftarrow \gamma E [(1 - \sigma_{k+1}) \pi_{k+1} + \sigma_{k+1}]$
                $\rho \leftarrow \rho (1 - \sigma_k + \sigma_k \rho_k)$
            $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \rho [G - Q(S_\tau, A_\tau)]$
            If $\pi$ is being learned, then ensure that $\pi(a \mid S_\tau)$ is $\epsilon$-greedy wrt $Q(S_\tau, \cdot)$
    Until $\tau = T - 1$
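The accumulation loop of $Q(\sigma)$ can be sketched in the same style. This version assumes the per-step factor for $\rho$ is $(1 - \sigma_k + \sigma_k \rho_k)$ and that the caller supplies entries for index $\tau$ in `sigmas` and `rhos`, which the pseudocode does not spell out. The function name, the list layout, and the numbers are illustrative assumptions.

```python
def q_sigma_return(Q_tau, deltas, pis, sigmas, rhos, tau, n, T, gamma):
    """Accumulate the n-step Q(sigma) return G and the ratio rho for time tau.

    deltas[k], pis[k], sigmas[k], rhos[k] hold delta_k, pi_k, sigma_k, rho_k.
    Assumption: the rho factor at step k is (1 - sigma_k + sigma_k * rho_k).
    """
    rho, E, G = 1.0, 1.0, Q_tau
    last = min(tau + n - 1, T - 1)
    for k in range(tau, last + 1):
        G += E * deltas[k]
        if k < last:                 # E is only needed for the next iteration
            E *= gamma * ((1 - sigmas[k + 1]) * pis[k + 1] + sigmas[k + 1])
        rho *= 1 - sigmas[k] + sigmas[k] * rhos[k]
    return G, rho

deltas = [2.0, 4.0]
pis = [1.0, 0.5, 0.25]
rhos = [1.0, 2.0, 1.0]

# sigma = 0 everywhere: pure expectation, no importance-sampling correction.
G_exp, rho_exp = q_sigma_return(1.0, deltas, pis, [0, 0, 0], rhos, 0, 2, 10, 0.5)
# sigma = 1 everywhere: full sampling, rho accumulates the stored ratios.
G_smp, rho_smp = q_sigma_return(1.0, deltas, pis, [1, 1, 1], rhos, 0, 2, 10, 0.5)
print(G_exp, rho_exp, G_smp, rho_smp)   # 4.0 1.0 5.0 2.0
```

With every $\sigma_k = 0$ the weight `E` shrinks by $\gamma \pi_{k+1}$ per step and $\rho$ stays at 1, matching Tree Backup. With every $\sigma_k = 1$ the weight shrinks by $\gamma$ only and $\rho$ carries the correction, matching Sarsa with importance sampling.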
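Summary: Monte Carlo methods simulate (or experience) a whole episode and, only after it ends, estimate the value of each visited state from the return that followed it. TD learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value estimate of the new state to update the value estimate of the state it just left.
Monte Carlo can therefore be viewed as the TD method with the largest possible number of steps, where $n$ reaches all the way to the end of the episode.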
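The relationship can be written out with the n-step return used in the algorithms above. The notation follows the pseudocode; writing the Monte Carlo return as $G_t$ is an assumption of this sketch.

```latex
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}), \qquad t + n < T
G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \quad \text{(the TD(0) target)}
G_t^{(n)} = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k = G_t \quad \text{for } n \ge T - t \quad \text{(the Monte Carlo return)}
```

Small $n$ leans on the current estimate $V$ (more bias, less variance), while large $n$ leans on the observed rewards (less bias, more variance).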