Reinforcement Learning: TD Algorithms (Sarsa and Q-learning)


1. The Sarsa Algorithm

1.1 TD Target

  • The return is defined as
    $$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = R_t + \gamma\,(R_{t+1} + \gamma R_{t+2} + \cdots) = R_t + \gamma\, U_{t+1}$$
  • Assume the reward at time t depends on the state and action at time t and on the state at time t+1: $R_t \gets (S_t, A_t, S_{t+1})$.
  • The action-value function can then be rewritten as
    $$\begin{aligned} Q_\pi(s_t, a_t) &= \mathbb{E}[\,U_t \mid s_t, a_t\,] \\ &= \mathbb{E}[\,R_t + \gamma\, U_{t+1} \mid s_t, a_t\,] \\ &= \mathbb{E}[\,R_t \mid s_t, a_t\,] + \gamma\, \mathbb{E}[\,U_{t+1} \mid s_t, a_t\,] \\ &= \mathbb{E}[\,R_t \mid s_t, a_t\,] + \gamma\, \mathbb{E}[\,Q_\pi(S_{t+1}, A_{t+1}) \mid s_t, a_t\,] \\ &= \mathbb{E}[\,R_t + \gamma\, Q_\pi(S_{t+1}, A_{t+1}) \mid s_t, a_t\,] \end{aligned}$$
  • Applying a Monte Carlo approximation (replacing the expectation with one observed transition) gives the TD target: $y_t = r_t + \gamma\, Q_\pi(s_{t+1}, a_{t+1})$.
  • The goal of TD learning is to drive $Q_\pi(s_t, a_t) \approx y_t$; a small numerical example follows.
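
A quick sanity check of the TD target with made-up numbers (these values are illustrative, not from the original notes): suppose the observed reward is $r_t = 1$, the discount factor is $\gamma = 0.9$, and the current estimate is $Q_\pi(s_{t+1}, a_{t+1}) = 5$. Then

$$y_t = r_t + \gamma\, Q_\pi(s_{t+1}, a_{t+1}) = 1 + 0.9 \times 5 = 5.5,$$

and TD learning nudges $Q_\pi(s_t, a_t)$ toward 5.5.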

1.2 Tabular Sarsa

  • Goal: learn the action-value function $Q_\pi(s, a)$.
  • Assume the numbers of states and actions are both finite.
  • Then the values to be learned form the following table:
| S \ A | $a_1$ | $a_2$ | $a_3$ | $a_4$ |
| --- | --- | --- | --- | --- |
| $s_1$ | $Q_{11}$ |  |  |  |
| $s_2$ |  |  |  |  |
| $s_3$ |  |  |  |  |
| $s_4$ |  |  |  |  |

The update proceeds as follows (a minimal code sketch is given after the list):

  1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
  2. Sample the next action from the policy: $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$.
  3. Look up the table to compute the TD target: $y_t = r_t + \gamma\, Q_\pi(s_{t+1}, a_{t+1})$.
  4. Compute the TD error: $\delta_t = Q_\pi(s_t, a_t) - y_t$.
  5. Update the table entry: $Q_\pi(s_t, a_t) \gets Q_\pi(s_t, a_t) - \alpha \cdot \delta_t$.
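
A minimal tabular Sarsa sketch in Python, assuming a simple environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward, done)`) and an ε-greedy behavior policy; these interface details are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def sarsa(env, n_states, n_actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=500):
    Q = np.zeros((n_states, n_actions))          # the table of Q_pi(s, a)

    def sample_action(s):                        # epsilon-greedy stand-in for pi(.|s)
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()
        a = sample_action(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)        # observe (s_t, a_t, r_t, s_{t+1})
            a_next = sample_action(s_next)       # a_{t+1} ~ pi(.|s_{t+1})
            y = r + gamma * Q[s_next, a_next] * (not done)   # TD target
            delta = Q[s, a] - y                  # TD error
            Q[s, a] -= alpha * delta             # table update
            s, a = s_next, a_next
    return Q
```

Note that the same `sample_action` policy is used both to act and to pick $a_{t+1}$ for the target, which is what makes Sarsa on-policy.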

1.3 Sarsa with a Neural Network

  • Approximate the action-value function with a neural network: $q(s, a; W) \approx Q_\pi(s, a)$.
  • The network acts as a critic that evaluates actions.
  • The parameters $W$ are learned.
  • TD target: $y_t = r_t + \gamma \cdot q(s_{t+1}, a_{t+1}; W)$.
  • TD error: $\delta_t = q(s_t, a_t; W) - y_t$.
  • Loss: $\frac{1}{2}\,\delta_t^2$.
  • Gradient: $\delta_t \cdot \frac{\partial q(s_t, a_t; W)}{\partial W}$.
  • Gradient descent: $W \gets W - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; W)}{\partial W}$ (a code sketch follows this list).
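
A sketch of one neural-network Sarsa update step using PyTorch; the network architecture, dimensions, and the sampled transition are illustrative assumptions.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma, lr = 4, 2, 0.9, 1e-3

# value network q(s, a; W): input is a state, output is one value per action
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=lr)

def sarsa_update(s, a, r, s_next, a_next):
    """One semi-gradient Sarsa step: W <- W - alpha * delta_t * dq/dW."""
    q_sa = q_net(s)[a]                                   # q(s_t, a_t; W)
    with torch.no_grad():                                # treat the TD target as a constant
        y = r + gamma * q_net(s_next)[a_next]            # y_t = r_t + gamma * q(s_{t+1}, a_{t+1}; W)
    delta = q_sa - y                                     # TD error
    loss = 0.5 * delta ** 2                              # loss = 1/2 * delta_t^2
    optimizer.zero_grad()
    loss.backward()                                      # gradient = delta_t * dq(s_t, a_t; W)/dW
    optimizer.step()
    return delta.item()

# Example call with a made-up transition:
s, s_next = torch.randn(state_dim), torch.randn(state_dim)
sarsa_update(s, a=0, r=1.0, s_next=s_next, a_next=1)
```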

2. The Q-learning Algorithm

Q-learning is used to learn the optimal action-value function $Q^\star(s, a)$.

2.1 TD Target

Start from the identity used by Sarsa:
$$Q_\pi(s_t, a_t) = \mathbb{E}\big[\,R_t + \gamma \cdot Q_\pi(S_{t+1}, A_{t+1}) \mid s_t, a_t\,\big]$$
Denote the optimal policy by $\pi^\star$. Then
$$Q^\star(s_t, a_t) = Q_{\pi^\star}(s_t, a_t) = \mathbb{E}\big[\,R_t + \gamma \cdot Q_{\pi^\star}(S_{t+1}, A_{t+1}) \mid s_t, a_t\,\big]$$
Under the optimal policy, the action at time t+1 is chosen greedily:
$$A_{t+1} = \mathop{\arg\max}\limits_{a}\, Q^\star(S_{t+1}, a)$$
so the optimal action-value function satisfies, and can be approximated by Monte Carlo as,
$$Q^\star(s_t, a_t) = \mathbb{E}\big[\,R_t + \gamma \cdot \max_{a} Q^\star(S_{t+1}, a) \mid s_t, a_t\,\big] \approx r_t + \gamma \max_{a} Q^\star(s_{t+1}, a)$$

2.2 Tabular Q-learning

| S \ A | $a_1$ | $a_2$ | $a_3$ | $a_4$ |
| --- | --- | --- | --- | --- |
| $s_1$ (take the maximum $Q$ in this row) | $Q_{11}$ |  |  |  |
| $s_2$ |  |  |  |  |
| $s_3$ |  |  |  |  |
| $s_4$ |  |  |  |  |

The update proceeds as follows (a minimal code sketch is given after the list):

  1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
  2. Compute the TD target: $y_t = r_t + \gamma \max\limits_{a} Q^\star(s_{t+1}, a)$.
  3. Compute the TD error: $\delta_t = Q^\star(s_t, a_t) - y_t$.
  4. Update the table entry: $Q^\star(s_t, a_t) \gets Q^\star(s_t, a_t) - \alpha \cdot \delta_t$.
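
A minimal tabular Q-learning sketch in Python under the same assumed environment interface as the Sarsa sketch above. The TD target takes the maximum over the row of $Q$, independent of the action actually taken next, which is what makes Q-learning off-policy.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=500):
    Q = np.zeros((n_states, n_actions))              # estimate of Q*(s, a)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:           # epsilon-greedy behavior policy
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)            # observe (s_t, a_t, r_t, s_{t+1})
            y = r + gamma * np.max(Q[s_next]) * (not done)   # TD target uses the row maximum
            delta = Q[s, a] - y                      # TD error
            Q[s, a] -= alpha * delta                 # table update
            s = s_next
    return Q
```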

2.3 Q-learning with a Neural Network (DQN)

  1. Observe a transition $(s_t, a_t, r_t, s_{t+1})$.
  2. Compute the TD target: $y_t = r_t + \gamma \max\limits_{a} Q(s_{t+1}, a; W)$.
  3. Compute the TD error: $\delta_t = Q(s_t, a_t; W) - y_t$.
  4. Update the parameters: $W \gets W - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; W)}{\partial W}$ (a code sketch follows this list).
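
A sketch of one DQN-style update step in PyTorch; the network size and the transition are illustrative assumptions, and standard DQN ingredients such as a replay buffer and a target network are omitted so that the single update step stays visible.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma, lr = 4, 2, 0.9, 1e-3

# Q network Q(s, a; W): input is a state, output is one value per action
Q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(Q.parameters(), lr=lr)

def dqn_update(s, a, r, s_next, done):
    """One Q-learning step: W <- W - alpha * delta_t * dQ(s_t, a_t; W)/dW."""
    q_sa = Q(s)[a]                                       # Q(s_t, a_t; W)
    with torch.no_grad():
        y = r + gamma * Q(s_next).max() * (1.0 - done)   # y_t = r_t + gamma * max_a Q(s_{t+1}, a; W)
    delta = q_sa - y                                     # TD error
    loss = 0.5 * delta ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return delta.item()

# Example call with a made-up transition:
dqn_update(torch.randn(state_dim), a=1, r=0.0, s_next=torch.randn(state_dim), done=0.0)
```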

3. Differences Between Sarsa and Q-learning

  1. Sarsa learns the action-value function $Q_\pi(s, a)$ of a given policy $\pi$.
  2. The value network (critic) in Actor-Critic is trained with Sarsa.
  3. Q-learning learns the optimal action-value function $Q^\star(s, a)$.

4. Multi-step TD Target

  • One-step TD uses a single reward: $r_t$.
  • Multi-step TD uses m rewards: $r_t, r_{t+1}, \dots, r_{t+m-1}$.

4.1 Multi-step TD Target for Sarsa

$$y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m Q_\pi(s_{t+m}, a_{t+m})$$

4.2 Multi-step TD Target for Q-learning

$$y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m \max\limits_{a} Q^\star(s_{t+m}, a)$$
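
A small Python sketch of how an m-step TD target can be assembled from stored rewards and a bootstrap value; the function and variable names are illustrative.

```python
import numpy as np

def multi_step_target(rewards, q_boot, gamma=0.9):
    """m-step TD target: sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * q_boot.

    `rewards` holds r_t, ..., r_{t+m-1}; `q_boot` is the bootstrap value at step t+m:
    Q_pi(s_{t+m}, a_{t+m}) for Sarsa, or max_a Q*(s_{t+m}, a) for Q-learning.
    """
    m = len(rewards)
    discounts = gamma ** np.arange(m)                    # 1, gamma, ..., gamma^(m-1)
    return float(np.dot(discounts, rewards) + gamma ** m * q_boot)

# Example with made-up numbers (m = 3):
rewards = [1.0, 0.0, 2.0]
sarsa_target = multi_step_target(rewards, q_boot=5.0)        # bootstrap with Q_pi(s_{t+m}, a_{t+m})
qlearning_target = multi_step_target(rewards, q_boot=6.0)    # bootstrap with max_a Q*(s_{t+m}, a)
```
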
These notes were written while following a course video on Bilibili.
by CyrusMay 2022 04 08

At the corner between childhood and adulthood,
we build a castle.
(Mayday, "好好")
