Model-Free Prediction
Estimate the value function of an unknown MDP
Monte-Carlo Learning
Features:
- learn directly from episodes of experience
- model-free: no knowledge of MDP transitions / rewards
- learns from complete episodes: no bootstrapping
- uses the simplest possible idea: value = mean return
- can only apply MC to episodic MDPs: all episodes must terminate
Monte-Carlo Policy Evaluation
Goal: learn $v_\pi$ from episodes of experience under policy $\pi$:

$$S_1, A_1, R_2, \dots, S_k \sim \pi$$
The definition of the value function is $v_\pi(s) = \mathbb{E}\left[G_t \mid S_t = s\right]$, but Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return.
First-Visit Monte-Carlo Policy Evaluation
Algorithm:
- To evaluate state $s$
- The first time-step $t$ that state $s$ is visited in an episode:
  - Increment counter: $N(s) \leftarrow N(s) + 1$
  - Increment total return: $S(s) \leftarrow S(s) + G_t$
- Value is estimated by the mean return $V(s) = \frac{S(s)}{N(s)}$
- By the law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
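A minimal sketch of first-visit MC prediction, assuming each episode is given as a list of (state, reward) pairs (the reward received after leaving that state) and a discount factor `gamma`; all names are illustrative:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the mean of first-visit returns over episodes."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    for episode in episodes:
        # Compute G_t for every time-step by backward accumulation.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        visited = set()
        for (state, _), G_t in zip(episode, returns):
            if state in visited:      # only the first visit to s counts
                continue
            visited.add(state)
            N[state] += 1
            S[state] += G_t
    return {s: S[s] / N[s] for s in N}  # V(s) = S(s) / N(s)
```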
Incremental Monte-Carlo
Foundation: the mean $\mu_1, \mu_2, \dots$ of a sequence $x_1, x_2, \dots$ can be computed incrementally:

$$\begin{aligned} \mu_k &= \frac{1}{k} \sum_{j=1}^k x_j \\ &= \frac{1}{k} \left(x_k + \sum_{j=1}^{k-1} x_j \right) \\ &= \frac{1}{k} \left(x_k + (k-1)\mu_{k-1} \right) \\ &= \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1}) \end{aligned}$$
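As a quick illustration of the last line, a tiny sketch that keeps only the previous mean and a counter (names are illustrative):

```python
def incremental_mean(xs):
    """Running means via mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu, means = 0.0, []
    for k, x in enumerate(xs, start=1):
        mu = mu + (x - mu) / k     # incremental update, no sum over history
        means.append(mu)
    return means

# incremental_mean([1, 2, 3, 4]) -> [1.0, 1.5, 2.0, 2.5]
```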
Algorithm:
- Update $V(s)$ incrementally after episode $S_1, A_1, R_2, \dots, S_T$
- For each state $S_t$ with return $G_t$:
  - $N(S_t) \leftarrow N(S_t) + 1$
  - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \left(G_t - V(S_t) \right)$
- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
  - $V(S_t) \leftarrow V(S_t) + \alpha \left(G_t - V(S_t) \right)$
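A sketch of this incremental update applied after one finished episode; `episode_returns` (an assumed iterable of (state, G_t) pairs) and the optional constant step-size `alpha` are illustrative names:

```python
def incremental_mc_update(V, N, episode_returns, alpha=None):
    """Update V in place after one episode, one (state, G_t) pair per step."""
    for state, G_t in episode_returns:
        N[state] = N.get(state, 0) + 1
        # Constant alpha gives a running mean (forgets old episodes);
        # otherwise fall back to the exact 1/N(s) mean.
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] = V.get(state, 0.0) + step * (G_t - V.get(state, 0.0))
    return V
```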
Temporal-Difference Learning
Features (compared with MC in the next section):
- learn directly from episodes of experience
- model-free
- learns from incomplete episodes, by bootstrapping
- updates a guess towards a guess
MC vs. TD
- Incremental every-visit Monte-Carlo
  - Update value $V(S_t)$ toward the actual return ${\color{red}G_t}$:
    $$V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}G_t} - V(S_t) \right)$$
- Simplest temporal-difference learning algorithm: TD(0)
  - Update value $V(S_t)$ toward the estimated return ${\color{red}R_{t+1} + \gamma V(S_{t+1})}$:
    $$V(S_t) \leftarrow V(S_t) + \alpha \left({\color{red}R_{t+1} + \gamma V(S_{t+1})} - V(S_t) \right)$$
  - $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target
  - $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
  - (a small sketch of this update follows the comparison below)
- MC has high variance, zero bias, and is not very sensitive to the initial value.
- TD has low variance, some bias, and is more sensitive to the initial value.
- TD exploits the Markov property; MC does not exploit the Markov property.
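A minimal sketch of one TD(0) update step, assuming `V` is a dict of value estimates and `terminal` flags whether $S_{t+1}$ ends the episode (all names are illustrative):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update after observing (S_t, R_{t+1}, S_{t+1})."""
    v_next = 0.0 if terminal else V.get(next_state, 0.0)
    td_target = reward + gamma * v_next            # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V.get(state, 0.0)       # delta_t
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error
```

Unlike the MC update above, this can be applied online after every step, before the episode terminates.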
Unified View
Dynamic Programming Backup
Monte-Carlo Backup
Temporal-Difference Backup
Unified View of RL
TD($\lambda$)
n-step TD
Consider the following n-step returns for $n = 1, 2, \dots, \infty$:
$$\begin{aligned} n=1\ \text{(TD)} \quad\ \ G_t^{(1)} &= R_{t+1} + \gamma V(S_{t+1}) \\ n=2 \qquad\qquad\ \ G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2}) \\ &\ \ \vdots \\ n=\infty\ \text{(MC)} \ \ \ G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1}R_T \end{aligned}$$
n-step temporal-difference learning
$$V(S_t) \leftarrow V(S_t) + \alpha \left(G_t^{(n)} - V(S_t) \right)$$
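A sketch of computing $G_t^{(n)}$ from a stored window of rewards; `rewards`, `next_state` and the terminal handling are illustrative assumptions:

```python
def n_step_return(rewards, V, next_state, n, gamma=1.0):
    """Compute G_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}).

    `rewards` holds R_{t+1}, R_{t+2}, ... and `next_state` is S_{t+n}; if the
    episode ends within n steps, the bootstrap term is dropped (terminal value 0).
    """
    G = 0.0
    for k, r in enumerate(rewards[:n]):
        G += (gamma ** k) * r
    if len(rewards) >= n and next_state is not None:
        G += (gamma ** n) * V.get(next_state, 0.0)   # bootstrap from V(S_{t+n})
    return G
```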
Forward View of TD($\lambda$)
$\lambda$-return
For the same state $S_t$ there may be many different n-step returns. To make effective use of this information, we take their weighted average, using weight $(1-\lambda)\lambda^{n-1}$ for the n-step return, so $G_t^\lambda = (1-\lambda)\sum_{n=1}^\infty \lambda^{n-1}G_t^{(n)}$.
TD($\lambda$):

$$V(S_t) \leftarrow V(S_t) + \alpha \left(G_t^\lambda - V(S_t) \right)$$
The weights sum to 1:

$$\sum \text{weight} = \sum_{n=1}^\infty (1-\lambda)\lambda^{n-1} = (1-\lambda)\, \frac{\lambda^0(1-\lambda^\infty)}{1-\lambda} = 1$$
Features:
- Update the value function towards the $\lambda$-return
- Forward-view looks into the future to compute $G_t^\lambda$
- Like MC, it can only be computed from complete episodes
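A sketch of the $\lambda$-return for a finished episode, assuming `n_step_returns` already holds $G_t^{(1)}, \dots, G_t^{(T-t)}$ (the last entry being the full MC return); the leftover weight is folded onto the final return so the weights still sum to 1:

```python
def lambda_return(n_step_returns, lam):
    """Weighted average G_t^lambda = (1-lam) * sum_n lam^(n-1) * G_t^(n)."""
    G_lambda = 0.0
    for n, G_n in enumerate(n_step_returns[:-1], start=1):
        G_lambda += (1.0 - lam) * (lam ** (n - 1)) * G_n
    # Remaining weight lam^(N-1) goes to the final (Monte-Carlo) return.
    G_lambda += (lam ** (len(n_step_returns) - 1)) * n_step_returns[-1]
    return G_lambda
```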
Backward View of TD($\lambda$)
Eligibility Traces
The eligibility trace indicates how much influence a visit to state $s$ has on the update made at time-step $t$:
$$\begin{aligned} E_0(s) &= 0 \\ E_t(s) &= \gamma \lambda E_{t-1}(s) + 1(S_t=s) \end{aligned}$$
- Keep an eligibility trace for every state $s$
- For each step of the episode, update the value $V(s)$ for every state $s$
- In proportion to the TD error $\delta_t$ and the eligibility trace $E_t(s)$:

$$\begin{aligned} \delta_t &= R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \\ V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s) \end{aligned}$$
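A sketch of one backward-view TD($\lambda$) step over a small, enumerable state set; `V`, `E` and `states` are assumed dictionaries / iterables, and `E` should be reset to zero at the start of each episode (all names are illustrative):

```python
def td_lambda_backward_step(V, E, state, reward, next_state, states,
                            alpha=0.1, gamma=1.0, lam=0.9, terminal=False):
    """One online backward-view TD(lambda) update after (S_t, R_{t+1}, S_{t+1})."""
    v_next = 0.0 if terminal else V.get(next_state, 0.0)
    delta = reward + gamma * v_next - V.get(state, 0.0)     # TD error delta_t
    for s in states:
        # Decay every trace, and bump the trace of the visited state by 1.
        E[s] = gamma * lam * E.get(s, 0.0) + (1.0 if s == state else 0.0)
        # Every state is updated in proportion to delta_t and its trace.
        V[s] = V.get(s, 0.0) + alpha * delta * E[s]
    return delta
```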
Relationship Between Forward and Backward TD
Online
- When $\lambda=0$, only the current state $s_t$ is updated at step $t$:
  $$\begin{aligned} E_t(s) &= 1(s = s_t) \\ V(s) &\leftarrow V(s) + \alpha \delta_t E_t(s) \end{aligned}$$
  This is exactly equivalent to the TD(0) update:
  $$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$
- When $\lambda = 1$, credit is deferred until the end of the episode. Consider an episode where $s$ is visited once, at time-step $k$.
- The TD(1) eligibility trace discounts the time since the visit:
  $$E_t(s) = \gamma E_{t-1}(s) + 1(S_t = s) = \begin{cases} 0 & \text{if}\ t < k \\ \gamma^{t-k} & \text{if}\ t \geq k \end{cases}$$
- TD(1) updates accumulate error online:
  $$\begin{aligned} \sum_{t=1}^{T-1}\alpha \delta_t E_t(s) &= \alpha \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t \\ \sum_{t=k}^{T-1} \gamma^{t-k} \delta_t &= \delta_k + \gamma\delta_{k+1} + \gamma^2\delta_{k+2} + \ldots + \gamma^{T-1-k}\delta_{T-1} \\ &= R_{k+1} + \gamma V(S_{k+1}) - V(S_k) \\ &\quad + \gamma\left(R_{k+2}+ \gamma V(S_{k+2}) - V(S_{k+1}) \right) \\ &\quad + \gamma^2 \left(R_{k+3}+ \gamma V(S_{k+3}) - V(S_{k+2}) \right) \\ &\quad \ \ \vdots \\ &\quad + \gamma^{T-1-k} \left(R_{T}+ \gamma V(S_{T}) - V(S_{T-1}) \right) \\ &= R_{k+1} + \gamma R_{k+2} + \gamma^2 R_{k+3} + \ldots + \gamma^{T-1-k}R_T + \gamma^{T-k} V(S_{T}) - V(S_k), \quad V(S_T) = 0 \\ &= G_k - V(S_k) \end{aligned}$$
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode, then total update is exactly the same as MC
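A small numerical check of the telescoping identity above, under the assumption that the visited state is the start state of the episode; states are simply labelled $0, \dots, T$ with $V(S_T) = 0$, and all names are illustrative:

```python
def check_td1_telescoping(rewards, V_init, gamma=1.0):
    """Verify sum_t gamma^t * delta_t == G_0 - V(S_0) for one episode."""
    T = len(rewards)
    V = {t: v for t, v in enumerate(V_init)}
    V[T] = 0.0                                   # terminal value is zero
    deltas = [rewards[t] + gamma * V[t + 1] - V[t] for t in range(T)]
    lhs = sum((gamma ** t) * d for t, d in enumerate(deltas))
    G0 = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return abs(lhs - (G0 - V[0])) < 1e-9

# check_td1_telescoping([1.0, 0.5, 2.0], [0.3, -0.1, 0.7], gamma=0.9) -> True
```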
Offline
With offline updates (applied only at the end of the episode), backward view = forward view (see the reference page).
To be honest, I haven't fully understood the backward view. I hope someone more knowledgeable can clarify: why is the eligibility trace introduced in the first place? What is the difference between offline and online updates, and how does that difference show up in the proof?
Algorithm