【Examples】
✨Example 1: estimate $w=\mathbb{E}[X]$
- We are given some samples $\{x\}$ of $X$.
- Let $g(w)=w-\mathbb{E}[X]$. The problem becomes a root-finding problem: solve $g(w)=0$.
- Since we can obtain samples $x$ of $X$:
$$\tilde{g}(w, \eta)=w-x=(w-\mathbb{E}[X])+(\mathbb{E}[X]-x) \doteq g(w)+\eta$$
- By the RM (Robbins-Monro) algorithm:
$$w_{k+1}=w_k-\alpha_k \tilde{g}\left(w_k, \eta_k\right)=w_k-\alpha_k\left(w_k-x_k\right)$$
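As a numeric sanity check, the Example 1 iteration can be run on synthetic data. A minimal sketch (the helper name `rm_mean_estimate` and the Gaussian samples are illustrative assumptions, not from the original notes):

```python
import random

def rm_mean_estimate(samples, alpha=lambda k: 1.0 / k):
    """RM iteration w_{k+1} = w_k - alpha_k * (w_k - x_k) for estimating E[X]."""
    w = 0.0
    for k, x in enumerate(samples, start=1):
        w -= alpha(k) * (w - x)
    return w

random.seed(0)
samples = [random.gauss(5.0, 1.0) for _ in range(10_000)]
w = rm_mean_estimate(samples)
```

With $\alpha_k=1/k$, a short induction shows $w_k$ is exactly the mean of the first $k$ samples, so the iterate converges to $\mathbb{E}[X]$.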
✨Example 2: estimate $w=\mathbb{E}[v(X)]$
- We are given some samples $\{x\}$ of $X$.
- To solve this problem, define:
$$\begin{aligned} g(w) & =w-\mathbb{E}[v(X)] \\ \tilde{g}(w, \eta) & =w-v(x)=(w-\mathbb{E}[v(X)])+(\mathbb{E}[v(X)]-v(x)) \doteq g(w)+\eta \end{aligned}$$
- By the RM algorithm:
$$w_{k+1}=w_k-\alpha_k \tilde{g}\left(w_k, \eta_k\right)=w_k-\alpha_k\left[w_k-v\left(x_k\right)\right]$$
✨Example 3: estimate $w=\mathbb{E}[R+\gamma v(X)]$
- Here $R, X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function.
- We can obtain samples $\{x\}$ and $\{r\}$ of $X$ and $R$:
$$\begin{aligned} g(w) & =w-\mathbb{E}[R+\gamma v(X)] \\ \tilde{g}(w, \eta) & =w-[r+\gamma v(x)] \\ & =(w-\mathbb{E}[R+\gamma v(X)])+(\mathbb{E}[R+\gamma v(X)]-[r+\gamma v(x)]) \\ & \doteq g(w)+\eta \end{aligned}$$
- By the RM algorithm:
$$w_{k+1}=w_k-\alpha_k \tilde{g}\left(w_k, \eta_k\right)=w_k-\alpha_k\left[w_k-\left(r_k+\gamma v\left(x_k\right)\right)\right]$$
【TD algorithm for state values】
The TD algorithm is data-based rather than model-based:
Data: the experience $\left(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots\right)$ generated by a given policy $\pi$, which can also be written as the set $\left\{\left(s_t, r_{t+1}, s_{t+1}\right)\right\}_t$.
$$\begin{aligned} v_{t+1}\left(s_t\right) & =v_t\left(s_t\right)-\alpha_t\left(s_t\right)\left[v_t\left(s_t\right)-\left[r_{t+1}+\gamma v_t\left(s_{t+1}\right)\right]\right] \\ v_{t+1}(s) & =v_t(s), \quad \forall s \neq s_t \end{aligned}$$
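The state-value update can be sketched numerically. A minimal TD(0) sketch on a deterministic two-state chain (the toy environment and its constants are illustrative assumptions): state 0 transitions to state 1 with reward 1, state 1 transitions to a terminal state with reward 1, and $\gamma=0.9$, so $v_\pi(1)=1$ and $v_\pi(0)=1+0.9\cdot 1=1.9$.

```python
# Deterministic toy chain (an illustrative assumption, not from the notes):
# state 0 --r=1--> state 1 --r=1--> state 2 (terminal), gamma = 0.9
gamma, alpha = 0.9, 0.1
v = {0: 0.0, 1: 0.0, 2: 0.0}          # v(terminal) stays 0
for _ in range(500):                   # episodes
    for s, r, s_next in [(0, 1.0, 1), (1, 1.0, 2)]:
        # v_{t+1}(s_t) = v_t(s_t) - alpha * [v_t(s_t) - (r_{t+1} + gamma * v_t(s_{t+1}))]
        v[s] -= alpha * (v[s] - (r + gamma * v[s_next]))
```

After enough sweeps the estimates settle near $v_\pi(1)=1$ and $v_\pi(0)=1.9$.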
Algorithm (annotated):
$$\underbrace{v_{t+1}\left(s_t\right)}_{\text {new estimate }}=\underbrace{v_t\left(s_t\right)}_{\text {current estimate }}-\alpha_t\left(s_t\right)[\overbrace{v_t\left(s_t\right)-[\underbrace{r_{t+1}+\gamma v_t\left(s_{t+1}\right)}_{\text {TD target } \bar{v}_t}]}^{\text {TD error } \delta_t}]$$
- TD target: $\bar{v}_t \doteq r_{t+1}+\gamma v\left(s_{t+1}\right)$. The update drives $v(s_t)$ toward $\bar{v}_t$, which is why it is called a target:
$$\begin{aligned} & v_{t+1}\left(s_t\right)=v_t\left(s_t\right)-\alpha_t\left(s_t\right)\left[v_t\left(s_t\right)-\bar{v}_t\right] \\ \Longrightarrow & v_{t+1}\left(s_t\right)-\bar{v}_t=v_t\left(s_t\right)-\bar{v}_t-\alpha_t\left(s_t\right)\left[v_t\left(s_t\right)-\bar{v}_t\right] \\ \Longrightarrow & v_{t+1}\left(s_t\right)-\bar{v}_t=\left[1-\alpha_t\left(s_t\right)\right]\left[v_t\left(s_t\right)-\bar{v}_t\right] \\ \Longrightarrow & \left|v_{t+1}\left(s_t\right)-\bar{v}_t\right|=\left|1-\alpha_t\left(s_t\right)\right|\left|v_t\left(s_t\right)-\bar{v}_t\right| \end{aligned}$$
Since $\alpha_t\left(s_t\right)$ is a small positive number,
$$0<1-\alpha_t\left(s_t\right)<1$$
and therefore
$$\left|v_{t+1}\left(s_t\right)-\bar{v}_t\right| \leq\left|v_t\left(s_t\right)-\bar{v}_t\right|$$
which means $v\left(s_t\right)$ moves toward $\bar{v}_t$.
- TD error: $\delta_t \doteq v\left(s_t\right)-\left[r_{t+1}+\gamma v\left(s_{t+1}\right)\right]=v\left(s_t\right)-\bar{v}_t$
  - It involves two time steps, hence the name temporal difference.
  - It measures the discrepancy between $v_t$ and $v_\pi$. Define
  $$\delta_{\pi, t} \doteq v_\pi\left(s_t\right)-\left[r_{t+1}+\gamma v_\pi\left(s_{t+1}\right)\right]$$
  Taking the expectation gives
  $$\mathbb{E}\left[\delta_{\pi, t} \mid S_t=s_t\right]=v_\pi\left(s_t\right)-\mathbb{E}\left[R_{t+1}+\gamma v_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]=0$$
  - When $v_t=v_\pi$, the TD error is zero in expectation.
  - When the TD error is nonzero, $v_t$ has not yet reached $v_\pi$, so the error carries information for improving the estimate.
Properties:
- Given a policy, TD learning estimates its state values; it does not estimate action values, nor does it search for optimal policies.
Question 1: What does the TD algorithm do mathematically?
Answer 1: The TD algorithm solves the Bellman equation of a given policy $\pi$ without a model:
$$v_\pi(s)=\mathbb{E}[R+\gamma G \mid S=s], \quad s \in \mathcal{S}$$
Since $G$ is the discounted return, we have
$$\mathbb{E}[G \mid S=s]=\sum_a \pi(a \mid s) \sum_{s^{\prime}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)=\mathbb{E}\left[v_\pi\left(S^{\prime}\right) \mid S=s\right]$$
so the original equation can be rewritten as (the Bellman expectation equation):
$$v_\pi(s)=\mathbb{E}\left[R+\gamma v_\pi\left(S^{\prime}\right) \mid S=s\right], \quad s \in \mathcal{S}$$
To solve it with the RM algorithm, first define
$$g(v(s))=v(s)-\mathbb{E}\left[R+\gamma v_\pi\left(S^{\prime}\right) \mid s\right]$$
and set
$$g(v(s))=0$$
Since we have samples $r$ and $s^{\prime}$:
$$\begin{aligned} \tilde{g}(v(s)) & =v(s)-\left[r+\gamma v_\pi\left(s^{\prime}\right)\right] \\ & =\underbrace{\left(v(s)-\mathbb{E}\left[R+\gamma v_\pi\left(S^{\prime}\right) \mid s\right]\right)}_{g(v(s))}+\underbrace{\left(\mathbb{E}\left[R+\gamma v_\pi\left(S^{\prime}\right) \mid s\right]-\left[r+\gamma v_\pi\left(s^{\prime}\right)\right]\right)}_\eta \end{aligned}$$
The RM algorithm for solving $g(v(s))=0$ is then
$$\begin{aligned} v_{k+1}(s) & =v_k(s)-\alpha_k \tilde{g}\left(v_k(s)\right) \\ & =v_k(s)-\alpha_k\left(v_k(s)-\left[r_k+\gamma v_\pi\left(s_k^{\prime}\right)\right]\right), \quad k=1,2,3, \ldots \end{aligned}$$
where $v_k(s)$ is the estimate of $v_\pi(s)$ at step $k$, and $r_k, s_k^{\prime}$ are samples of $R, S^{\prime}$.
✨Comparison of TD learning and MC learning:
【TD algorithm for action values (Sarsa)】
Goal: estimate the action values of a given policy $\pi$.
Suppose we have the experience $\left\{\left(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\right)\right\}_t$ at each time step. This tuple is the origin of the name Sarsa (state-action-reward-state-action).
$$\begin{aligned} q_{t+1}\left(s_t, a_t\right) & =q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma q_t\left(s_{t+1}, a_{t+1}\right)\right]\right] \\ q_{t+1}(s, a) & =q_t(s, a), \quad \forall(s, a) \neq\left(s_t, a_t\right) \end{aligned}$$
This amounts to replacing the state value in the TD algorithm with an action value:
$$V\left(S_t\right) \rightarrow q\left(S_t, a_t\right)$$
✨Sarsa pseudocode:
- For each episode:
  - While the current state is not the target state:
    - Collect the experience $\left(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\right)$
    - Update the q-value: $q_{t+1}\left(s_t, a_t\right)=q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma q_t\left(s_{t+1}, a_{t+1}\right)\right]\right]$
    - Update the policy ($\epsilon$-greedy):
    $$\begin{aligned} & \pi_{t+1}\left(a \mid s_t\right)=1-\frac{\epsilon}{|\mathcal{A}|}(|\mathcal{A}|-1) \text { if } a=\arg \max _a q_{t+1}\left(s_t, a\right) \\ & \pi_{t+1}\left(a \mid s_t\right)=\frac{\epsilon}{|\mathcal{A}|} \text { otherwise } \end{aligned}$$
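The Sarsa pseudocode can be sketched as a small tabular implementation. A minimal sketch on an assumed 1-D gridworld (the environment, constants, and helper names are illustrative assumptions, not from the original notes): states 0..4, target state 4, reward -1 per non-terminal step.

```python
import random

random.seed(0)
N, TARGET = 5, 4                      # assumed toy 1-D gridworld
ACTIONS = [-1, +1]                    # move left / move right
gamma, alpha, eps = 0.9, 0.1, 0.1
q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return (0.0 if s2 == TARGET else -1.0), s2   # reward, next state

def eps_greedy(s):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(s, a)])

for _ in range(300):                  # episodes
    s = 0
    a = eps_greedy(s)
    while s != TARGET:
        r, s2 = step(s, a)
        a2 = eps_greedy(s2)           # next action actually taken (on-policy)
        # Sarsa update uses the sampled next action a2, not a max
        q[(s, a)] -= alpha * (q[(s, a)] - (r + gamma * q[(s2, a2)]))
        s, a = s2, a2
```

After training, the greedy action in every non-terminal state is "move right," the shortest path to the target.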
【TD algorithm for action values (Expected Sarsa)】
$$\begin{aligned} q_{t+1}\left(s_t, a_t\right) & =q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left(r_{t+1}+\gamma \mathbb{E}\left[q_t\left(s_{t+1}, A\right)\right]\right)\right] \\ q_{t+1}(s, a) & =q_t(s, a), \quad \forall(s, a) \neq\left(s_t, a_t\right) \end{aligned}$$
where $\mathbb{E}\left[q_t\left(s_{t+1}, A\right)\right]=\sum_a \pi_t\left(a \mid s_{t+1}\right) q_t\left(s_{t+1}, a\right) \doteq v_t\left(s_{t+1}\right)$.
✨Comparison with Sarsa:
- The TD target changes from $r_{t+1}+\gamma q_t\left(s_{t+1}, a_{t+1}\right)$ to $r_{t+1}+\gamma \mathbb{E}\left[q_t\left(s_{t+1}, A\right)\right]$.
- More computation per step, but less randomness, because the required samples shrink from $\left(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\right)$ to $\left(s_t, a_t, r_{t+1}, s_{t+1}\right)$.
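The change of TD target can be illustrated with hypothetical numbers (the q-values, policy probabilities, and reward below are made up for illustration):

```python
# Hypothetical values of q_t(s_{t+1}, .) and pi_t(. | s_{t+1}) over two actions
q_next = {"a1": 2.0, "a2": 4.0}
pi_next = {"a1": 0.25, "a2": 0.75}
r, gamma = 1.0, 0.9

# Sarsa target: uses the next action actually sampled (say a1)
sarsa_target = r + gamma * q_next["a1"]          # 1 + 0.9 * 2 = 2.8
# Expected Sarsa target: averages over the policy, no a_{t+1} sample needed
expected_target = r + gamma * sum(pi_next[a] * q_next[a] for a in q_next)
# 1 + 0.9 * (0.25 * 2 + 0.75 * 4) = 4.15
```

The expectation costs one extra sum over actions but removes the randomness introduced by sampling $a_{t+1}$.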
【TD algorithm for action values (n-step Sarsa)】
$$q_\pi(s, a)=\mathbb{E}\left[G_t \mid S_t=s, A_t=a\right]$$
This is the definition of the action value; the return $G_t$ can be decomposed in multiple ways:
$$\begin{aligned} \text { Sarsa } \longleftarrow \quad & G_t^{(1)}=R_{t+1}+\gamma q_\pi\left(S_{t+1}, A_{t+1}\right), \\ & G_t^{(2)}=R_{t+1}+\gamma R_{t+2}+\gamma^2 q_\pi\left(S_{t+2}, A_{t+2}\right), \\ & \vdots \\ \text { n-step Sarsa } \longleftarrow \quad & G_t^{(n)}=R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^n q_\pi\left(S_{t+n}, A_{t+n}\right), \\ & \vdots \\ \text { MC } \longleftarrow \quad & G_t^{(\infty)}=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \end{aligned}$$
Here $G_t=G_t^{(1)}=G_t^{(2)}=G_t^{(n)}=G_t^{(\infty)}$; they are the same return, just decomposed differently.
$G_t^{(1)}$ (Sarsa):
$$q_\pi(s, a)=\mathbb{E}\left[G_t^{(1)} \mid s, a\right]=\mathbb{E}\left[R_{t+1}+\gamma q_\pi\left(S_{t+1}, A_{t+1}\right) \mid s, a\right]$$
$G_t^{(\infty)}$ (MC):
$$q_\pi(s, a)=\mathbb{E}\left[G_t^{(\infty)} \mid s, a\right]=\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \mid s, a\right]$$
$G_t^{(n)}$ (n-step Sarsa):
$$q_\pi(s, a)=\mathbb{E}\left[G_t^{(n)} \mid s, a\right]=\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^n q_\pi\left(S_{t+n}, A_{t+n}\right) \mid s, a\right]$$
$$q_{t+1}\left(s_t, a_t\right)=q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^n q_t\left(s_{t+n}, a_{t+n}\right)\right]\right]$$
- When $n=1$, this reduces to Sarsa.
- When $n=\infty$, this becomes the MC method.
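The n-step return $G_t^{(n)}$ can be computed directly from a reward segment and a tail estimate. A minimal sketch with hypothetical numbers (the rewards and tail q-value are made up for illustration):

```python
# Hypothetical trajectory segment: rewards r_{t+1}..r_{t+n} and the tail
# estimate q_t(s_{t+n}, a_{t+n}) that bootstraps the rest of the return
gamma, n = 0.9, 3
rewards = [1.0, 0.0, 2.0]     # r_{t+1}, r_{t+2}, r_{t+3}
q_tail = 5.0                  # q_t(s_{t+n}, a_{t+n})

# G_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q_tail
g_n = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**n * q_tail
# = 1 + 0.9*0 + 0.81*2 + 0.729*5 = 6.265
```

Setting `n = 1` keeps only the first reward plus the bootstrapped tail (Sarsa); letting the reward list run to the end of the episode with no tail gives the MC return.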
【TD algorithm for optimal action values (Q-learning)】
$$\begin{aligned} q_{t+1}\left(s_t, a_t\right) & =q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma \max _{a \in \mathcal{A}} q_t\left(s_{t+1}, a\right)\right]\right] \\ q_{t+1}(s, a) & =q_t(s, a), \quad \forall(s, a) \neq\left(s_t, a_t\right) \end{aligned}$$
- TD target (Sarsa): $r_{t+1}+\gamma q_t\left(s_{t+1}, a_{t+1}\right)$
- TD target (Q-learning): $r_{t+1}+\gamma \max _{a \in \mathcal{A}} q_t\left(s_{t+1}, a\right)$
Q-learning solves
$$q(s, a)=\mathbb{E}\left[R_{t+1}+\gamma \max _a q\left(S_{t+1}, a\right) \mid S_t=s, A_t=a\right], \quad \forall s, a$$
that is, the Bellman optimality equation (expressed in terms of action values).
✨On-policy learning vs. off-policy learning:
- Behavior policy: interacts with the environment to generate experience.
- Target policy: the policy that is continually updated toward the optimum.
On-policy: the behavior policy and the target policy are the same. The agent uses the policy to interact with the environment, uses the resulting experience to improve that same policy, then interacts again (e.g., Sarsa; Q-learning can also be run this way).
Off-policy: the behavior policy and the target policy are different. One policy generates a large amount of experience, and that experience is used to keep improving another policy, which eventually converges to an optimal policy (e.g., Q-learning). Experience collected by others can be reused directly.
✨Classifying Sarsa, MC, and Q-learning:
Sarsa is on-policy:
- Sarsa solves the Bellman equation of a given policy $\pi$:
$$q_\pi(s, a)=\mathbb{E}\left[R+\gamma q_\pi\left(S^{\prime}, A^{\prime}\right) \mid s, a\right], \quad \forall s, a$$
where $R \sim p(R \mid s, a), S^{\prime} \sim p\left(S^{\prime} \mid s, a\right), A^{\prime} \sim \pi\left(A^{\prime} \mid S^{\prime}\right)$.
- Algorithm:
$$q_{t+1}\left(s_t, a_t\right)=q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma q_t\left(s_{t+1}, a_{t+1}\right)\right]\right]$$
It requires $\left(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\right)$:
- If $\left(s_t, a_t\right)$ is given, then $r_{t+1}$ and $s_{t+1}$ do not depend on the policy but on $p(r \mid s, a)$ and $p\left(s^{\prime} \mid s, a\right)$.
- $a_{t+1}$ is sampled according to $\pi_t\left(\cdot \mid s_{t+1}\right)$, so $\pi_t$ is both the behavior policy and the target policy.
MC is on-policy:
- MC estimates the action value
$$q_\pi(s, a)=\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\ldots \mid S_t=s, A_t=a\right], \quad \forall s, a$$
- Algorithm:
$$q(s, a) \approx r_{t+1}+\gamma r_{t+2}+\ldots$$
Q-learning is off-policy:
- Q-learning solves the Bellman optimality equation:
$$q(s, a)=\mathbb{E}\left[R_{t+1}+\gamma \max _a q\left(S_{t+1}, a\right) \mid S_t=s, A_t=a\right], \quad \forall s, a$$
- Algorithm:
$$q_{t+1}\left(s_t, a_t\right)=q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma \max _{a \in \mathcal{A}} q_t\left(s_{t+1}, a\right)\right]\right]$$
It requires $\left(s_t, a_t, r_{t+1}, s_{t+1}\right)$:
- If $\left(s_t, a_t\right)$ is given, then $r_{t+1}$ and $s_{t+1}$ do not depend on the policy; they are determined by $p(r \mid s, a)$ and $p\left(s^{\prime} \mid s, a\right)$. No next action sampled from a policy is needed, so the behavior policy may differ from the target policy.
✨Q-learning pseudocode (on-policy version):
- For each episode:
  - While the current state is not the target state:
    - Collect the experience $\left(s_t, a_t, r_{t+1}, s_{t+1}\right)$
    - Update the q-value: $q_{t+1}\left(s_t, a_t\right)=q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma \max _a q_t\left(s_{t+1}, a\right)\right]\right]$
    - Update the policy ($\epsilon$-greedy):
    $$\begin{aligned} & \pi_{t+1}\left(a \mid s_t\right)=1-\frac{\epsilon}{|\mathcal{A}|}(|\mathcal{A}|-1) \text { if } a=\arg \max _a q_{t+1}\left(s_t, a\right) \\ & \pi_{t+1}\left(a \mid s_t\right)=\frac{\epsilon}{|\mathcal{A}|} \text { otherwise } \end{aligned}$$
✨Q-learning pseudocode (off-policy version):
- For each episode $\left\{s_0, a_0, r_1, s_1, a_1, r_2, \ldots\right\}$ generated by the behavior policy $\pi_b$:
  - For each step $t=0,1,2, \ldots$ of the episode:
    - Update the q-value:
    $$q_{t+1}\left(s_t, a_t\right)=q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\left[r_{t+1}+\gamma \max _a q_t\left(s_{t+1}, a\right)\right]\right]$$
    - Update the target policy (greedy):
    $$\begin{aligned} & \pi_{T, t+1}\left(a \mid s_t\right)=1 \text { if } a=\arg \max _a q_{t+1}\left(s_t, a\right) \\ & \pi_{T, t+1}\left(a \mid s_t\right)=0 \text { otherwise } \end{aligned}$$
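The off-policy pseudocode can be sketched in tabular form. A minimal sketch on an assumed 1-D gridworld where the behavior policy is uniformly random while the learned target policy is greedy (the environment, constants, and helper names are illustrative assumptions, not from the original notes):

```python
import random

random.seed(1)
N, TARGET = 5, 4                       # assumed toy 1-D gridworld
ACTIONS = [-1, +1]                     # move left / move right
gamma, alpha = 0.9, 0.1
q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    return (0.0 if s2 == TARGET else -1.0), s2   # reward, next state

# behavior policy: uniformly random; target policy: greedy w.r.t. q
for _ in range(2000):                  # episodes generated by the behavior policy
    s = random.randrange(N - 1)
    while s != TARGET:
        a = random.choice(ACTIONS)     # behavior policy picks the action
        r, s2 = step(s, a)
        # TD target uses max over actions, not the action the behavior policy took
        q[(s, a)] -= alpha * (q[(s, a)] - (r + gamma * max(q[(s2, b)] for b in ACTIONS)))
        s = s2

# greedy target policy extracted from the learned q-values
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N - 1)}
```

Even though the experience comes from a purely random behavior policy, the extracted greedy policy moves right toward the target in every non-terminal state.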
【Comparison】
$$q_{t+1}\left(s_t, a_t\right)=q_t\left(s_t, a_t\right)-\alpha_t\left(s_t, a_t\right)\left[q_t\left(s_t, a_t\right)-\bar{q}_t\right]$$
All of these TD algorithms push the estimate toward the TD target $\bar{q}_t$; they differ only in the choice of $\bar{q}_t$.
These algorithms are stochastic approximation methods for solving the Bellman equation or the Bellman optimality equation.