【Mathematical Foundations of Reinforcement Learning】Lecture 7: Temporal-Difference Methods

【Examples】

✨Example 1:

$w = \mathbb{E}[X]$

  • Given some samples $\{x\}$ of $X$

  • Let $g(w) = w - \mathbb{E}[X]$; the estimation problem is then converted into a root-finding problem: solve $g(w) = 0$

  • Since we can obtain samples $x$ of $X$:
    $\tilde{g}(w, \eta) = w - x = (w - \mathbb{E}[X]) + (\mathbb{E}[X] - x) \doteq g(w) + \eta$

  • According to the RM (Robbins-Monro) algorithm (a small sketch follows this example):
    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k (w_k - x_k)$
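To make the iteration concrete, here is a minimal Python sketch (my own illustration, not from the lecture) that applies the RM update with step size $\alpha_k = 1/k$ to samples of a hypothetical Gaussian $X$; with this step size the iteration is exactly the incremental sample mean.

```python
import numpy as np

# Minimal sketch, assuming X ~ N(5, 1); the distribution and function name are illustrative.
def rm_mean_estimate(samples, w0=0.0):
    """Estimate E[X] with the RM iteration w_{k+1} = w_k - alpha_k * (w_k - x_k)."""
    w = w0
    for k, x in enumerate(samples, start=1):
        alpha = 1.0 / k          # step size satisfying the RM convergence conditions
        w = w - alpha * (w - x)  # move w toward the latest sample
    return w

rng = np.random.default_rng(0)
xs = rng.normal(loc=5.0, scale=1.0, size=10_000)
print(rm_mean_estimate(xs))      # should be close to E[X] = 5
```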

✨Example 2:

$w = \mathbb{E}[v(X)]$

  • Given some samples $\{x\}$ of $X$

  • To solve this problem, define:
    $$\begin{aligned} g(w) & = w - \mathbb{E}[v(X)] \\ \tilde{g}(w, \eta) & = w - v(x) = (w - \mathbb{E}[v(X)]) + (\mathbb{E}[v(X)] - v(x)) \doteq g(w) + \eta \end{aligned}$$

  • According to the RM algorithm:
    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k \left[w_k - v(x_k)\right]$

✨Example 3:

$w = \mathbb{E}[R + \gamma v(X)]$

  • where $R, X$ are random variables, $\gamma$ is a constant, and $v(\cdot)$ is a function

  • We can obtain samples $\{r\}$ and $\{x\}$ of $R$ and $X$:
    $$\begin{aligned} g(w) & = w - \mathbb{E}[R + \gamma v(X)] \\ \tilde{g}(w, \eta) & = w - [r + \gamma v(x)] \\ & = (w - \mathbb{E}[R + \gamma v(X)]) + (\mathbb{E}[R + \gamma v(X)] - [r + \gamma v(x)]) \\ & \doteq g(w) + \eta \end{aligned}$$

  • According to the RM algorithm (a sketch of this iteration follows below):
    $w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k \left[w_k - \left(r_k + \gamma v(x_k)\right)\right]$
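A quick sketch of this third iteration, using an arbitrary reward distribution and an arbitrary function $v$ chosen purely for illustration (none of these choices come from the lecture); note that the update $w_k - \alpha_k[w_k - (r_k + \gamma v(x_k))]$ already has exactly the form the TD algorithm below will take.

```python
import numpy as np

# Minimal sketch, assuming R ~ N(1, 0.5), X uniform over {0, 1, 2}, and v(x) = x^2.
gamma = 0.9
v = lambda x: float(x) ** 2

rng = np.random.default_rng(1)
w = 0.0
for k in range(1, 20_001):
    r = rng.normal(1.0, 0.5)   # sample r_k of R
    x = rng.integers(0, 3)     # sample x_k of X
    alpha = 1.0 / k
    w = w - alpha * (w - (r + gamma * v(x)))  # w_{k+1} = w_k - alpha_k [w_k - (r_k + gamma v(x_k))]

print(w)  # should approach E[R] + gamma * E[v(X)] = 1 + 0.9 * (0 + 1 + 4) / 3 = 2.5
```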

【TD algorithm for state values】

The TD algorithm is data-based (model-free) rather than model-based:

Data: a trajectory $(s_0, r_1, s_1, \ldots, s_t, r_{t+1}, s_{t+1}, \ldots)$ generated by a given policy $\pi$, which can also be written as the set of samples $\{(s_t, r_{t+1}, s_{t+1})\}_t$.

The TD algorithm is
$$\begin{aligned} v_{t+1}(s_t) & = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big[r_{t+1} + \gamma v_t(s_{t+1})\big]\Big] \\ v_{t+1}(s) & = v_t(s), \quad \forall s \neq s_t, \end{aligned}$$
and can be annotated as
$$\underbrace{v_{t+1}(s_t)}_{\text{new estimate}} = \underbrace{v_t(s_t)}_{\text{current estimate}} - \alpha_t(s_t)\Big[\overbrace{v_t(s_t) - [\underbrace{r_{t+1} + \gamma v_t(s_{t+1})}_{\text{TD target } \bar{v}_t}]}^{\text{TD error } \delta_t}\Big].$$

  • TD target: $\bar{v}_t \doteq r_{t+1} + \gamma v(s_{t+1})$. The update drives $v(s_t)$ toward $\bar{v}_t$, which is why $\bar{v}_t$ is called the target. To see this:

    $$\begin{aligned} & v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\left[v_t(s_t) - \bar{v}_t\right] \\ \Longrightarrow\ & v_{t+1}(s_t) - \bar{v}_t = v_t(s_t) - \bar{v}_t - \alpha_t(s_t)\left[v_t(s_t) - \bar{v}_t\right] \\ \Longrightarrow\ & v_{t+1}(s_t) - \bar{v}_t = \left[1 - \alpha_t(s_t)\right]\left[v_t(s_t) - \bar{v}_t\right] \\ \Longrightarrow\ & \left|v_{t+1}(s_t) - \bar{v}_t\right| = \left|1 - \alpha_t(s_t)\right|\left|v_t(s_t) - \bar{v}_t\right| \end{aligned}$$

    Since $\alpha_t(s_t)$ is a small positive number,
    $0 < 1 - \alpha_t(s_t) < 1,$
    and therefore
    $\left|v_{t+1}(s_t) - \bar{v}_t\right| \leq \left|v_t(s_t) - \bar{v}_t\right|,$
    which means $v(s_t)$ is driven toward $\bar{v}_t$.

  • TD error: $\delta_t \doteq v(s_t) - \left[r_{t+1} + \gamma v(s_{t+1})\right] = v(s_t) - \bar{v}_t$

    • It involves two consecutive time steps, hence the name temporal difference.

    • It reflects the discrepancy between $v_t$ and $v_\pi$:

      • $\delta_{\pi, t} \doteq v_\pi(s_t) - \left[r_{t+1} + \gamma v_\pi(s_{t+1})\right]$

      • Taking its expectation gives $\mathbb{E}\left[\delta_{\pi, t} \mid S_t = s_t\right] = v_\pi(s_t) - \mathbb{E}\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\right] = 0$

        • When $v_t = v_\pi$, the TD error $\delta_t$ is zero in expectation.
        • When the TD error is not zero in expectation, $v_t$ has not yet reached $v_\pi$, so the TD error carries information that can be used to improve the estimate.
  • Properties:

    • Given a policy, this TD algorithm only estimates its state values; it cannot estimate action values, nor can it search for optimal policies.

Question 1: What does the TD algorithm do mathematically?

Answer 1: The TD algorithm solves the Bellman equation of a given policy $\pi$ without a model.

$v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S}$

Since $G$ is the discounted return,
$\mathbb{E}[G \mid S = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s') = \mathbb{E}\left[v_\pi(S') \mid S = s\right].$
So the original equation can be rewritten as (the Bellman expectation equation):
$v_\pi(s) = \mathbb{E}\left[R + \gamma v_\pi(S') \mid S = s\right], \quad s \in \mathcal{S}.$
To solve it with the RM algorithm, first define
$g(v(s)) = v(s) - \mathbb{E}\left[R + \gamma v_\pi(S') \mid s\right]$
and set $g(v(s)) = 0$.

Since we have samples $r$ of $R$ and $s'$ of $S'$:
$$\begin{aligned} \tilde{g}(v(s)) & = v(s) - \left[r + \gamma v_\pi(s')\right] \\ & = \underbrace{\left(v(s) - \mathbb{E}\left[R + \gamma v_\pi(S') \mid s\right]\right)}_{g(v(s))} + \underbrace{\left(\mathbb{E}\left[R + \gamma v_\pi(S') \mid s\right] - \left[r + \gamma v_\pi(s')\right]\right)}_{\eta}. \end{aligned}$$
The RM algorithm for solving $g(v(s)) = 0$ is therefore
$$\begin{aligned} v_{k+1}(s) & = v_k(s) - \alpha_k \tilde{g}(v_k(s)) \\ & = v_k(s) - \alpha_k\left(v_k(s) - \left[r_k + \gamma v_\pi(s_k')\right]\right), \quad k = 1, 2, 3, \ldots \end{aligned}$$
where $v_k(s)$ is the estimate of $v_\pi(s)$ at step $k$, and $r_k, s_k'$ are samples of $R, S'$.
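To relate this update to an implementation, here is a minimal tabular TD(0) sketch for state-value estimation. The function `sample_step` is a hypothetical placeholder that returns one transition $(r_{t+1}, s_{t+1})$ under the given policy; it is not part of the lecture.

```python
import numpy as np

# Minimal sketch of tabular TD(0) state-value estimation under a fixed policy.
# `sample_step(s)` is a hypothetical environment/policy interface returning (r, s_next).
def td0_prediction(sample_step, num_states, start_state, gamma=0.9, alpha=0.1, num_steps=100_000):
    v = np.zeros(num_states)               # v_t(s) for every state
    s = start_state
    for _ in range(num_steps):
        r, s_next = sample_step(s)         # one sample (s_t, r_{t+1}, s_{t+1})
        td_target = r + gamma * v[s_next]  # bar{v}_t = r_{t+1} + gamma * v_t(s_{t+1})
        td_error = v[s] - td_target        # delta_t
        v[s] -= alpha * td_error           # update only the visited state s_t
        s = s_next
    return v
```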

✨Comparison of TD learning and MC learning:

(Figure: comparison table of TD learning vs. MC learning, omitted.)

【TD algorithm for action values (Sarsa)】

Goal: estimate the action values of a given policy $\pi$.

Suppose we have the experience $\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$ at each time step; this tuple is the origin of the name Sarsa (state-action-reward-state-action).
$$\begin{aligned} q_{t+1}(s_t, a_t) & = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big[r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big]\Big] \\ q_{t+1}(s, a) & = q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t) \end{aligned}$$
This amounts to replacing the state value in the TD algorithm by an action value: $v(s_t) \rightarrow q(s_t, a_t)$.

✨Sarsa pseudocode (a minimal Python sketch follows the pseudocode):

  • For each episode:

    • If the current state $s_t$ is not the target state:

      • Collect the experience $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$

      • Update the q-value: $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - [r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})]\big]$

      • Update the policy ($\epsilon$-greedy):
        $$\begin{aligned} & \pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}|}(|\mathcal{A}| - 1) \ \text{ if } a = \arg\max_a q_{t+1}(s_t, a) \\ & \pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}|} \ \text{ otherwise} \end{aligned}$$
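The sketch below is one possible tabular Sarsa implementation consistent with the pseudocode. The environment interface (`env.reset()` returning a state, `env.step(s, a)` returning `(r, s_next, done)`) is a hypothetical placeholder, not something defined in the lecture.

```python
import numpy as np

# Minimal tabular Sarsa sketch with an epsilon-greedy policy (assumed environment API).
def epsilon_greedy(q_row, epsilon, rng):
    """Sample an action: 1 - eps/|A|*(|A|-1) on the greedy action, eps/|A| on the others."""
    n = len(q_row)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return rng.choice(n, p=probs)

def sarsa(env, num_states, num_actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(q[s], epsilon, rng)
        done = False
        while not done:
            r, s_next, done = env.step(s, a)                  # collect (s_t, a_t, r_{t+1}, s_{t+1}, ...)
            a_next = epsilon_greedy(q[s_next], epsilon, rng)  # ... and a_{t+1} from the current policy
            td_target = r + gamma * q[s_next, a_next]         # Sarsa TD target
            q[s, a] -= alpha * (q[s, a] - td_target)          # q-value update
            s, a = s_next, a_next
    return q
```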

【TD algorithm for action values (Expected Sarsa)】

$$\begin{aligned} q_{t+1}(s_t, a_t) & = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]\big)\Big] \\ q_{t+1}(s, a) & = q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t), \end{aligned}$$

where $\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$.

✨Comparison with Sarsa:

  • The TD target changes from $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ to $r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]$.
  • The computation per step is heavier, but the randomness is reduced, because the required samples change from $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ to $(s_t, a_t, r_{t+1}, s_{t+1})$ (a small sketch of the expected target follows).
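A minimal sketch of how the Expected Sarsa TD target could be computed; the arrays `q_row` and `pi_row` are hypothetical stand-ins for $q_t(s_{t+1}, \cdot)$ and $\pi_t(\cdot \mid s_{t+1})$, not objects defined in the lecture.

```python
import numpy as np

# Minimal sketch of the Expected Sarsa TD target (inputs are illustrative placeholders).
def expected_sarsa_target(r, q_row, pi_row, gamma=0.9):
    expected_q = float(np.dot(pi_row, q_row))  # E[q_t(s_{t+1}, A)] = sum_a pi(a|s_{t+1}) q_t(s_{t+1}, a)
    return r + gamma * expected_q

# Example: unlike Sarsa, no sampled a_{t+1} is needed, only the policy distribution.
q_row = np.array([1.0, 2.0, 0.5])   # q_t(s_{t+1}, .)
pi_row = np.array([0.2, 0.7, 0.1])  # pi_t(. | s_{t+1})
print(expected_sarsa_target(r=1.0, q_row=q_row, pi_row=pi_row))
```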

【TD algorithm for action values (n-step Sarsa)】

$q_\pi(s, a) = \mathbb{E}\left[G_t \mid S_t = s, A_t = a\right]$

This is the definition of the action value; the return $G_t$ can be decomposed in different ways:
$$\begin{aligned} \text{Sarsa} \longleftarrow \quad & G_t^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}), \\ & G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}), \\ & \vdots \\ \text{n-step Sarsa} \longleftarrow \quad & G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}), \\ & \vdots \\ \text{MC} \longleftarrow \quad & G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \end{aligned}$$
Note that $G_t = G_t^{(1)} = G_t^{(2)} = \cdots = G_t^{(n)} = \cdots = G_t^{(\infty)}$; they are simply different decompositions of the same return.

$G_t^{(1)}$ (Sarsa):
$q_\pi(s, a) = \mathbb{E}\left[G_t^{(1)} \mid s, a\right] = \mathbb{E}\left[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a\right]$
$G_t^{(\infty)}$ (MC):
$q_\pi(s, a) = \mathbb{E}\left[G_t^{(\infty)} \mid s, a\right] = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid s, a\right]$
$G_t^{(n)}$ (n-step Sarsa):
$q_\pi(s, a) = \mathbb{E}\left[G_t^{(n)} \mid s, a\right] = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a\right]$

$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big[r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big]\Big].$

  • $n = 1$: the update reduces to Sarsa.
  • $n = \infty$: the update becomes the MC method (a sketch of the n-step target is given below).
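A minimal sketch of the n-step TD target; the reward list, the bootstrap state-action pair, and the q-table are hypothetical inputs used only for illustration.

```python
# Minimal sketch of the n-step Sarsa TD target (inputs are illustrative placeholders).
def n_step_target(rewards, q, s_tn, a_tn, gamma=0.9):
    """rewards = [r_{t+1}, ..., r_{t+n}]; (s_tn, a_tn) = (s_{t+n}, a_{t+n}); q holds q_t values."""
    g = 0.0
    for i, r in enumerate(rewards):
        g += (gamma ** i) * r                      # r_{t+1} + gamma r_{t+2} + ... + gamma^{n-1} r_{t+n}
    g += (gamma ** len(rewards)) * q[s_tn][a_tn]   # + gamma^n q_t(s_{t+n}, a_{t+n})
    return g

# n = 1 recovers the Sarsa target; a long reward list with no bootstrap term would give the MC return.
print(n_step_target([1.0], {0: {0: 2.0}}, s_tn=0, a_tn=0))  # 1.0 + 0.9 * 2.0 = 2.8
```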

【TD algorithm for optimal action values (Q-learning)】

$$\begin{aligned} q_{t+1}(s_t, a_t) & = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big[r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a)\big]\Big], \\ q_{t+1}(s, a) & = q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t), \end{aligned}$$

  • TD target (Sarsa): $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$
  • TD target (Q-learning): $r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a)$

The problem Q-learning solves is
$q(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \mid S_t = s, A_t = a\right], \quad \forall s, a,$
which is the Bellman optimality equation expressed in terms of action values.

✨On-policy learning vs. off-policy learning:

  • Behavior policy: the policy used to interact with the environment and generate experience.
  • Target policy: the policy that is continually updated toward the optimal policy.

On-policy: the behavior policy and the target policy are the same. The policy interacts with the environment to generate experience, that experience is used to improve the same policy, and the improved policy interacts with the environment again (e.g., Sarsa; as the pseudocode later shows, Q-learning can also be run in an on-policy way).

Off-policy: the behavior policy and the target policy are different. One policy generates a large amount of experience, and that experience is used to keep improving another policy, which eventually converges to an optimal policy (e.g., Q-learning). Off-policy learning can directly reuse experience generated by other policies.

✨Are Sarsa, MC, and Q-learning on-policy or off-policy?

Sarsa is on-policy:

  • Sarsa aims to solve a Bellman equation for a given policy $\pi$:
    $q_\pi(s, a) = \mathbb{E}\left[R + \gamma q_\pi(S', A') \mid s, a\right], \quad \forall s, a,$
    where $R \sim p(R \mid s, a)$, $S' \sim p(S' \mid s, a)$, $A' \sim \pi(A' \mid S')$.

  • Algorithm:
    $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\left[q_t(s_t, a_t) - \left[r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\right]\right]$
    It requires the sample $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$:

    • Given $(s_t, a_t)$, $r_{t+1}$ and $s_{t+1}$ do not depend on any policy; they are determined by $p(r \mid s, a)$ and $p(s' \mid s, a)$.

    • $a_{t+1}$ is sampled according to $\pi_t(\cdot \mid s_{t+1})$, so $\pi_t$ is both the behavior policy and the target policy.


MC is on-policy:

  • MC aims to estimate the action values of a given policy:

    $q_\pi(s, a) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \ldots \mid S_t = s, A_t = a\right], \quad \forall s, a$

  • Algorithm:
    $q(s, a) \approx r_{t+1} + \gamma r_{t+2} + \ldots$
    The episode used to compute this return is generated by $\pi$, and the same $\pi$ is then improved, so $\pi$ is both the behavior policy and the target policy.

Q-learning is off-policy:

  • Q-learning aims to solve the Bellman optimality equation:
    $q(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \mid S_t = s, A_t = a\right], \quad \forall s, a$

  • Algorithm:
    $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\left[q_t(s_t, a_t) - \left[r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a)\right]\right]$
    It requires only the sample $(s_t, a_t, r_{t+1}, s_{t+1})$:

    • Given $(s_t, a_t)$, $r_{t+1}$ and $s_{t+1}$ do not depend on any policy; they are determined by the probabilities $p(r \mid s, a)$ and $p(s' \mid s, a)$. Since the TD target uses $\max_a q_t(s_{t+1}, a)$ rather than an action actually taken, the policy that generated the data (the behavior policy) can differ from the greedy target policy being learned.

✨Q-learning pseudocode (on-policy version):

  • For each episode:

    • If the current state $s_t$ is not the target state:

      • Collect the experience $(s_t, a_t, r_{t+1}, s_{t+1})$

      • Update the q-value: $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - [r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)]\big]$

      • Update the policy ($\epsilon$-greedy):
        $$\begin{aligned} & \pi_{t+1}(a \mid s_t) = 1 - \frac{\epsilon}{|\mathcal{A}|}(|\mathcal{A}| - 1) \ \text{ if } a = \arg\max_a q_{t+1}(s_t, a) \\ & \pi_{t+1}(a \mid s_t) = \frac{\epsilon}{|\mathcal{A}|} \ \text{ otherwise} \end{aligned}$$

✨Q-learning pseudocode (off-policy version; a minimal Python sketch follows):

  • For each episode $\{s_0, a_0, r_1, s_1, a_1, r_2, \ldots\}$ generated by the behavior policy $\pi_b$:

    • For each step $t = 0, 1, 2, \ldots$ of the episode:

      • Update the q-value:
        $q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - [r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)]\big]$

      • Update the target policy (greedy):
        $$\begin{aligned} & \pi_{T, t+1}(a \mid s_t) = 1 \ \text{ if } a = \arg\max_a q_{t+1}(s_t, a) \\ & \pi_{T, t+1}(a \mid s_t) = 0 \ \text{ otherwise} \end{aligned}$$
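The sketch below is one possible tabular implementation of the off-policy version: the episode is assumed to have been collected beforehand by some behavior policy $\pi_b$, and the `(s, a, r, s_next)` tuples are a hypothetical data layout rather than an interface defined in the lecture.

```python
import numpy as np

# Minimal off-policy tabular Q-learning sketch (the pre-collected episode is an assumption).
def q_learning(episode, num_states, num_actions, gamma=0.9, alpha=0.1):
    """episode: iterable of (s_t, a_t, r_{t+1}, s_{t+1}) tuples generated by a behavior policy."""
    q = np.zeros((num_states, num_actions))
    for s, a, r, s_next in episode:
        td_target = r + gamma * np.max(q[s_next])  # r_{t+1} + gamma * max_a q_t(s_{t+1}, a)
        q[s, a] -= alpha * (q[s, a] - td_target)   # update only (s_t, a_t)
    pi_target = np.argmax(q, axis=1)               # greedy target policy from the learned q-values
    return q, pi_target
```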

【Comparison】

$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\left[q_t(s_t, a_t) - \bar{q}_t\right],$

Every TD algorithm keeps moving the estimate toward its TD target $\bar{q}_t$; as listed below, the algorithms differ only in the choice of $\bar{q}_t$.

  • Sarsa: $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$
  • n-step Sarsa: $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})$
  • Expected Sarsa: $\bar{q}_t = r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a)$
  • Q-learning: $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$
  • MC: $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$

These algorithms are, in essence, stochastic approximation algorithms for solving the Bellman equation or the Bellman optimality equation:

  • Sarsa solves the Bellman equation $q_\pi(s, a) = \mathbb{E}\left[R + \gamma q_\pi(S', A') \mid s, a\right]$.
  • MC solves $q_\pi(s, a) = \mathbb{E}\left[G_t \mid S_t = s, A_t = a\right]$.
  • Q-learning solves the Bellman optimality equation $q(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_a q(S_{t+1}, a) \mid S_t = s, A_t = a\right]$.
