Reinforcement Learning Notes 5: TD-Learning

1.Sarsa Algorithm

Each update uses the quintuple $(s_t,a_t,r_t,s_{t+1},a_{t+1})$, hence the name State-Action-Reward-State-Action (SARSA).

1.0.Derive TD Target

Discounted return (the reward $R_t$ depends on $(S_t,A_t,S_{t+1})$):
$$
\begin{aligned}
U_t &= R_t+\gamma\cdot R_{t+1}+\gamma^2\cdot R_{t+2}+\cdots \\
&= R_t+\gamma\cdot(R_{t+1}+\gamma\cdot R_{t+2}+\cdots) \\
&= R_t+\gamma\cdot U_{t+1}
\end{aligned}
$$

$$
\begin{aligned}
Q_\pi(s_t,a_t) &= E[U_t\mid s_t,a_t] \\
&= E[R_t+\gamma\cdot U_{t+1}\mid s_t,a_t] \\
&= E[R_t\mid s_t,a_t]+\gamma\cdot E[U_{t+1}\mid s_t,a_t] \\
&= E[R_t\mid s_t,a_t]+\gamma\cdot E[Q_\pi(S_{t+1},A_{t+1})\mid s_t,a_t]
\end{aligned}
$$

Identity: $Q_\pi(s_t,a_t)=E[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]$, for all $\pi$.

Monte Carlo approximation: $Q_\pi(s_t,a_t)\approx r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})=y_t$

$y_t$ is the TD target.

1.1.Tabular Version

Suitable for small-scale problems where the $Q$ table (one entry per state-action pair) is small enough to store; Sarsa updates the table as follows (a code sketch follows the list).

  • Observe a transition $(s_t,a_t,r_t,s_{t+1})$
  • Sample the next action $a_{t+1}\sim\pi(\cdot\mid s_{t+1})$
  • TD target: $y_t=r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})$, where $Q_\pi(s_{t+1},a_{t+1})$ is looked up in the table
  • TD error: $\delta_t=Q_\pi(s_t,a_t)-y_t$
  • Update the $Q$ table: $Q_\pi(s_t,a_t)\leftarrow Q_\pi(s_t,a_t)-\alpha\cdot\delta_t$
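
A minimal sketch of one tabular Sarsa update, assuming the $Q$ table is stored as a NumPy array `Q[s, a]` indexed by integer states and actions (the array and function names are illustrative):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular Sarsa update from the transition (s_t, a_t, r_t, s_{t+1}, a_{t+1})."""
    td_target = r + gamma * Q[s_next, a_next]   # y_t = r_t + γ · Q_π(s_{t+1}, a_{t+1})
    td_error = Q[s, a] - td_target              # δ_t = Q_π(s_t, a_t) - y_t
    Q[s, a] -= alpha * td_error                 # Q_π(s_t, a_t) ← Q_π(s_t, a_t) - α · δ_t
    return Q
```
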
1.2.Neural Network Version

Approximate $Q_\pi(s,a)$ with a value network $q(s,a;w)$. One training step works as follows (a code sketch follows the list):

  • TD target: $y_t=r_t+\gamma\cdot q(s_{t+1},a_{t+1};w)$

  • TD error: $\delta_t=q(s_t,a_t;w)-y_t$

  • Loss: $\delta_t^2/2$

  • Gradient: $\frac{\partial\,\delta_t^2/2}{\partial w}=\delta_t\cdot\frac{\partial q(s_t,a_t;w)}{\partial w}$

  • Gradient descent: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial q(s_t,a_t;w)}{\partial w}$
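
A minimal PyTorch sketch of this training step, assuming `q_net(s, a)` returns the scalar estimate $q(s,a;w)$ (the network interface and variable names are assumptions); the TD target is detached so that only $q(s_t,a_t;w)$ is differentiated, matching the gradient above:

```python
import torch

def sarsa_nn_step(q_net, optimizer, s, a, r, s_next, a_next, gamma=0.99):
    """One Sarsa gradient step on the value network q(s, a; w)."""
    with torch.no_grad():                          # treat the TD target as a constant
        y_t = r + gamma * q_net(s_next, a_next)    # y_t = r_t + γ · q(s_{t+1}, a_{t+1}; w)
    delta_t = q_net(s, a) - y_t                    # δ_t = q(s_t, a_t; w) - y_t
    loss = 0.5 * delta_t.pow(2).mean()             # loss = δ_t² / 2
    optimizer.zero_grad()
    loss.backward()                                # gradient: δ_t · ∂q(s_t, a_t; w)/∂w
    optimizer.step()                               # w ← w - α · δ_t · ∂q(s_t, a_t; w)/∂w
    return loss.item()
```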

2.Q-Learning

Comparing Q-Learning with Sarsa:

|  | Sarsa | Q-Learning |
| --- | --- | --- |
| Target function | $Q_\pi(s,a)$ | $Q^*(s,a)$ |
| TD target | $y_t=r_t+\gamma\cdot Q_\pi(s_{t+1},a_{t+1})$ | $y_t=r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)$ |
| What gets updated | value network (the critic) | DQN |
2.0.TD Target

As derived in Section 1.0, for a policy $\pi$:
$$Q_\pi(s_t,a_t)=E[R_t+\gamma\cdot Q_\pi(S_{t+1},A_{t+1})]$$
For the optimal policy $\pi^*$:
$$Q^*(s_t,a_t)=E[R_t+\gamma\cdot Q^*(S_{t+1},A_{t+1})]$$
The action $A_{t+1}$ is taken greedily, $A_{t+1}=\arg\max_a Q^*(S_{t+1},a)$, so

$$Q^*(S_{t+1},A_{t+1})=\max_a Q^*(S_{t+1},a)$$
$$Q^*(s_t,a_t)=E[R_t+\gamma\cdot\max_a Q^*(S_{t+1},a)]$$
Applying a Monte Carlo approximation yields the TD target $y_t$:
$$Q^*(s_t,a_t)\approx r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)=y_t$$

2.1.Tabular Version

Suitable for small-scale problems where the $Q^*$ table (one entry per state-action pair) is small enough to store; Q-Learning updates the table as follows (a code sketch follows the list).

  • Observe a transition $(s_t,a_t,r_t,s_{t+1})$
  • TD target: $y_t=r_t+\gamma\cdot\max_a Q^*(s_{t+1},a)$, i.e. the largest table entry in the row for $s_{t+1}$
  • TD error: $\delta_t=Q^*(s_t,a_t)-y_t$
  • Update the $Q^*$ table: $Q^*(s_t,a_t)\leftarrow Q^*(s_t,a_t)-\alpha\cdot\delta_t$
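
A minimal sketch of one tabular Q-Learning update, assuming the table of optimal action values is a NumPy array `Q[s, a]` (names are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update from the transition (s_t, a_t, r_t, s_{t+1})."""
    td_target = r + gamma * np.max(Q[s_next])   # y_t = r_t + γ · max_a Q*(s_{t+1}, a)
    td_error = Q[s, a] - td_target              # δ_t = Q*(s_t, a_t) - y_t
    Q[s, a] -= alpha * td_error                 # Q*(s_t, a_t) ← Q*(s_t, a_t) - α · δ_t
    return Q
```

Compared with the Sarsa sketch above, the only change is that the bootstrap term maximizes over the next-state row instead of using a sampled action $a_{t+1}$.
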
2.2.DQN Version

Approximate $Q^*(s,a)$ with a DQN $Q(s,a;w)$; the agent acts by $a_t=\arg\max_a Q(s_t,a;w)$.

The DQN can be trained with Q-Learning (a code sketch follows the list):

  • Observe a transition $(s_t,a_t,r_t,s_{t+1})$

  • TD target: $y_t=r_t+\gamma\cdot\max_a Q(s_{t+1},a;w)$

  • TD error: $\delta_t=Q(s_t,a_t;w)-y_t$

  • Loss: $\delta_t^2/2$

  • Gradient: $\frac{\partial\,\delta_t^2/2}{\partial w}=\delta_t\cdot\frac{\partial Q(s_t,a_t;w)}{\partial w}$

  • Gradient descent: $w\leftarrow w-\alpha\cdot\delta_t\cdot\frac{\partial Q(s_t,a_t;w)}{\partial w}$
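
A minimal PyTorch sketch of one such training step, assuming `dqn(s)` maps a batch of states to a vector of action values $Q(s,\cdot\,;w)$ and `a` holds integer action indices (all names are assumptions; replay buffer and target network are omitted):

```python
import torch

def dqn_step(dqn, optimizer, s, a, r, s_next, gamma=0.99):
    """One Q-Learning gradient step on the DQN Q(s, a; w)."""
    with torch.no_grad():                                  # TD target is a constant w.r.t. w
        y_t = r + gamma * dqn(s_next).max(dim=1).values    # y_t = r_t + γ · max_a Q(s_{t+1}, a; w)
    q_sa = dqn(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_t, a_t; w)
    delta_t = q_sa - y_t                                   # δ_t = Q(s_t, a_t; w) - y_t
    loss = 0.5 * delta_t.pow(2).mean()                     # loss = δ_t² / 2
    optimizer.zero_grad()
    loss.backward()                                        # gradient: δ_t · ∂Q(s_t, a_t; w)/∂w
    optimizer.step()                                       # w ← w - α · δ_t · ∂Q(s_t, a_t; w)/∂w
    return loss.item()
```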

3.Multi-Step TD Target
3.0

The algorithms above use only a single step of reward per update; using rewards from multiple steps usually gives better results.


3.1.Multi-Step Return

$$U_t=R_t+\gamma\cdot U_{t+1}$$

Expanding this recursion once gives:
$$
\begin{aligned}
U_t &= R_t+\gamma\cdot(R_{t+1}+\gamma\cdot U_{t+2}) \\
&= R_t+\gamma\cdot R_{t+1}+\gamma^2\cdot U_{t+2}
\end{aligned}
$$
Continuing the recursion for $m$ steps:
$$U_t=\sum_{i=0}^{m-1}\gamma^i\cdot R_{t+i}+\gamma^m\cdot U_{t+m}$$

3.2.Multi-Step TD Target
  • m-step TD target for Sarsa:
    $$y_t=\sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot Q_\pi(s_{t+m},a_{t+m})$$

  • m-step TD target for Q-Learning (a code sketch of both targets follows this list):
    $$y_t=\sum_{i=0}^{m-1}\gamma^i\cdot r_{t+i}+\gamma^m\cdot\max_a Q^*(s_{t+m},a)$$
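
A minimal sketch computing both m-step TD targets from an observed trajectory segment, assuming `rewards` holds $[r_t,\dots,r_{t+m-1}]$ and the bootstrap values at step $t+m$ come from the table or network (all names are illustrative):

```python
def m_step_return(rewards, bootstrap, gamma=0.99):
    """Σ_{i=0}^{m-1} γ^i · r_{t+i} + γ^m · bootstrap, with m = len(rewards)."""
    m = len(rewards)
    return sum(gamma**i * r for i, r in enumerate(rewards)) + gamma**m * bootstrap

# m-step TD target for Sarsa: bootstrap with Q_π(s_{t+m}, a_{t+m})
def m_step_td_target_sarsa(rewards, q_next, gamma=0.99):
    return m_step_return(rewards, q_next, gamma)

# m-step TD target for Q-Learning: bootstrap with max_a Q*(s_{t+m}, a)
def m_step_td_target_q_learning(rewards, q_next_all_actions, gamma=0.99):
    return m_step_return(rewards, max(q_next_all_actions), gamma)
```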
