Reinforcement Learning Notes 3: Policy Learning (Policy-Based Reinforcement Learning)

1. Policy Function Approximation
  • Policy Network $\pi(a|s;\theta)$

    Use $\pi(a|s;\theta)$ to approximate the policy function $\pi(a|s)$.

    A Softmax output layer guarantees $\sum_{a\in\mathcal{A}}\pi(a|s;\theta)=1$.
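As a concrete illustration, here is a minimal PyTorch sketch of such a policy network; the state dimension, hidden width, and number of actions are made-up placeholders:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """pi(a|s; theta): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim=4, hidden_dim=64, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        # Softmax makes the outputs non-negative and sum to 1 over actions.
        return torch.softmax(logits, dim=-1)

# Example: action probabilities for a single made-up state.
probs = PolicyNetwork()(torch.zeros(4))
print(probs, probs.sum())  # the probabilities sum to 1
```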

2. State-Value Function Approximation

Action-Value function: $Q_\pi(s_t,a_t)=E[U_t|S_t=s_t,A_t=a_t]$

State-Value function: $V_\pi(s_t)=E_A[Q_\pi(s_t,A)]$

  • Policy-Based Reinforcement Learning
    $V_\pi(s_t)=E_A[Q_\pi(s_t,A)]=\sum_a\pi(a|s_t)\cdot Q_\pi(s_t,a)$
    After the policy function $\pi(a_t|s_t)$ is approximated by the policy network, the state-value function can be approximated as
    $V_\pi(s_t;\theta)=\sum_a\pi(a|s_t;\theta)\cdot Q_\pi(s_t,a)$
    Learning goal: improve $\theta$ so that $V_\pi(s;\theta)$ becomes larger. The objective function can be defined as
    $\text{maximize}\quad J(\theta)=E_S[V(S;\theta)]$
    Parameter update: policy gradient ascent

    • Observe the state $s$
    • Update the parameters: $\theta\leftarrow\theta+\beta\cdot\frac{\partial V(s;\theta)}{\partial\theta}$
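For intuition, a tiny numerical sketch of the approximation $V_\pi(s_t;\theta)=\sum_a\pi(a|s_t;\theta)\cdot Q_\pi(s_t,a)$; the probabilities and action values below are made up:

```python
import numpy as np

# Made-up policy probabilities pi(a|s; theta) and action values Q_pi(s, a)
# for a state with three discrete actions.
pi = np.array([0.2, 0.5, 0.3])   # sums to 1
q  = np.array([1.0, 2.0, 0.5])

# State value: V(s; theta) = sum_a pi(a|s; theta) * Q_pi(s, a)
v = np.dot(pi, q)
print(v)  # 0.2*1.0 + 0.5*2.0 + 0.3*0.5 = 1.35
```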
3. Policy Gradient

$V(s;\theta)=\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)$

$$
\begin{aligned}
\frac{\partial V(s;\theta)}{\partial\theta}
&=\frac{\partial\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)}{\partial\theta} \\
&=\sum_a\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a) \\
&=\sum_a\pi(a|s;\theta)\cdot\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a) \\
&=E_A\left[\frac{\partial\log\pi(A|s;\theta)}{\partial\theta}\cdot Q_\pi(s,A)\right]
\end{aligned}
$$
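The step from the second line to the third uses the log-derivative trick, i.e. the chain rule applied to $\log\pi$:

$$
\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}
=\frac{1}{\pi(a|s;\theta)}\cdot\frac{\partial\pi(a|s;\theta)}{\partial\theta}
\quad\Longrightarrow\quad
\frac{\partial\pi(a|s;\theta)}{\partial\theta}
=\pi(a|s;\theta)\cdot\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}
$$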

(The above derivation is not rigorous: it treats $Q_\pi(s,a)$ as independent of $\theta$, but since $\pi$ depends on $\theta$, that assumption does not actually hold. Whether or not this dependence is taken into account, the derivation yields the same result.)

This gives two ways to compute the policy gradient (both are illustrated in the sketch after this list):

  • Method 1:
    $\frac{\partial V(s;\theta)}{\partial\theta}=\sum_a\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$
    For discrete actions, compute $f(a,\theta)=\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$ for every action and sum the results.

  • Method 2:
    $\frac{\partial V(s;\theta)}{\partial\theta}=E_A\left[\frac{\partial\log\pi(A|s;\theta)}{\partial\theta}\cdot Q_\pi(s,A)\right]$
    This form works for both discrete and continuous actions. The expectation would be evaluated by integration, but since $\pi$ is computed by a neural network the integral cannot be evaluated directly, so it is estimated by Monte Carlo approximation:

    • Randomly sample an action $\hat{a}$ from the action space according to the current policy $\pi(\cdot|s;\theta)$
    • Compute $g(\hat{a},\theta)=\frac{\partial\log\pi(\hat{a}|s;\theta)}{\partial\theta}\cdot Q_\pi(s,\hat{a})$
    • $g(\hat{a},\theta)$ is an unbiased estimate of $\frac{\partial V(s;\theta)}{\partial\theta}$ and is used as an approximation of the policy gradient
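A minimal PyTorch sketch contrasting the two methods on a small discrete action space. The network, state, and `q_values` are made-up stand-ins (in particular, $Q_\pi$ is unknown in practice; Sections 4 and 5 discuss how it is approximated):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Tiny stand-in policy network pi(a|s; theta): 4-dim state, 2 discrete actions (made up).
policy_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

state = torch.zeros(4)                        # made-up observed state s
q_values = torch.tensor([1.0, 0.5])           # made-up stand-in for Q_pi(s, .)

# Method 1 (discrete actions): exact sum over all actions.
probs = torch.softmax(policy_net(state), dim=-1)
v = (probs * q_values).sum()                  # V(s; theta) = sum_a pi(a|s;theta) * Q_pi(s,a)
grad_exact = torch.autograd.grad(v, list(policy_net.parameters()))

# Method 2: Monte Carlo estimate from a single sampled action a_hat ~ pi(.|s; theta).
probs = torch.softmax(policy_net(state), dim=-1)
dist = Categorical(probs)
a_hat = dist.sample()
g = dist.log_prob(a_hat) * q_values[a_hat]    # log pi(a_hat|s;theta) * Q_pi(s, a_hat)
grad_mc = torch.autograd.grad(g, list(policy_net.parameters()))
# grad_mc is an unbiased but noisy estimate of grad_exact; averaging many
# samples of g would converge to the exact gradient.
```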
4. Update policy network using policy gradient
  • Observe the state $s_t$
  • Randomly sample an action $a_t$ according to $\pi(\cdot|s_t;\theta_t)$
  • Compute $q_t\approx Q_\pi(s_t,a_t)$
  • Differentiate the policy network: $d_{\theta,t}=\frac{\partial\log\pi(a_t|s_t;\theta)}{\partial\theta}\Big|_{\theta=\theta_t}$
  • (Approximate) policy gradient: $g(a_t,\theta_t)=q_t\cdot d_{\theta,t}$
  • Update the policy network: $\theta\leftarrow\theta+\beta\cdot g(a_t,\theta_t)$
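A minimal PyTorch sketch of one such update step, with the environment interaction and $q_t$ stubbed out by made-up values. Note that gradient ascent on $q_t\cdot\log\pi(a_t|s_t;\theta)$ is implemented as gradient descent on its negative:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

policy_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(policy_net.parameters(), lr=1e-2)  # lr plays the role of beta

s_t = torch.zeros(4)            # 1. observe the state s_t (made-up placeholder)
probs = torch.softmax(policy_net(s_t), dim=-1)
dist = Categorical(probs)
a_t = dist.sample()             # 2. sample a_t ~ pi(.|s_t; theta_t)
q_t = 1.0                       # 3. made-up stand-in for Q_pi(s_t, a_t); see Section 5

# 4.+5. The gradient of this loss w.r.t. theta is -q_t * d log pi(a_t|s_t;theta) / d theta,
# so one (descent) optimizer step equals theta <- theta + beta * g(a_t, theta_t).
loss = -q_t * dist.log_prob(a_t)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```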
5. Computing $q_t\approx Q_\pi(s_t,a_t)$
  • Method 1: REINFORCE

    Play out one complete episode to obtain the trajectory
    $s_1,a_1,r_1,\cdots,s_T,a_T,r_T$
    Compute the return $u_t=\sum_{k=t}^{T}\gamma^{k-t}r_k$. Since $Q_\pi(s_t,a_t)=E[U_t]$, $u_t$ can be used to approximate $Q_\pi(s_t,a_t)$ (see the sketch after this list), i.e.
    $q_t=u_t$

  • Method 2: approximate $Q_\pi$ with another neural network (the actor-critic method)
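A minimal sketch of the REINFORCE return computation, with a made-up reward sequence and discount factor:

```python
# Made-up rewards r_1, ..., r_T from one finished episode and a discount factor gamma.
rewards = [1.0, 0.0, 2.0, 1.0]
gamma = 0.9

# u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, computed backwards in a single pass.
returns = []
u = 0.0
for r in reversed(rewards):
    u = r + gamma * u
    returns.append(u)
returns.reverse()

# q_t = u_t is then used as the REINFORCE approximation of Q_pi(s_t, a_t).
print(returns)  # returns[0] = r_1 + gamma*r_2 + gamma^2*r_3 + gamma^3*r_4
```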
