1. Policy Function Approximation

Policy network $\pi(a|s;\theta)$: a neural network $\pi(a|s;\theta)$ with parameters $\theta$ is used to approximate the policy function $\pi(a|s)$.
Using a Softmax output layer guarantees $\sum_{a\in\mathcal{A}}\pi(a|s;\theta)=1$.
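As a concrete sketch (a hypothetical linear scoring model, not from the original notes), a Softmax layer turns arbitrary scores into a valid probability distribution over actions:

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(state, theta):
    # Hypothetical linear scores: one weight row per action, theta[a] . state.
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(logits)

# 3 actions, 2 state features (illustrative numbers).
theta = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
probs = policy([1.0, 2.0], theta)
print(probs)  # a valid distribution: non-negative entries summing to 1
```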
2. State-Value Function Approximation

Action-value function: $Q_\pi(s_t,a_t)=E[U_t|S_t=s_t,A_t=a_t]$

State-value function: $V_\pi(s_t)=E_A[Q_\pi(s_t,A)]$
Policy-Based Reinforcement Learning

$V_\pi(s_t)=E_A[Q_\pi(s_t,A)]=\sum_a\pi(a|s_t)\cdot Q_\pi(s_t,a)$

After the policy function $\pi(a_t|s_t)$ is approximated by the policy network, the state-value function can be approximated as

$V_\pi(s_t;\theta)=\sum_a\pi(a|s_t;\theta)\cdot Q_\pi(s_t,a)$
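Numerically, this approximation is just the probability-weighted average of the action values. A tiny worked example with made-up numbers:

```python
def state_value(pi, q):
    # V(s; theta) = sum over a of pi(a|s; theta) * Q_pi(s, a)
    return sum(p * qa for p, qa in zip(pi, q))

pi = [0.2, 0.5, 0.3]   # hypothetical pi(a|s; theta) for 3 actions
q = [1.0, 2.0, 0.5]    # hypothetical Q_pi(s, a)
print(state_value(pi, q))  # 0.2*1.0 + 0.5*2.0 + 0.3*0.5 = 1.35
```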
Learning objective: improve $\theta$ so that $V_\pi(s;\theta)$ becomes larger. The objective function can be defined as

$\text{maximize} \quad J(\theta)=E_S[V(S;\theta)]$

Parameter update: policy gradient ascent
- Observe the state $s$
- Update the parameters $\theta$: $\theta\leftarrow\theta+\beta\cdot\frac{\partial V(s;\theta)}{\partial\theta}$
3. Policy Gradient

$V(s;\theta)=\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)$

$$\begin{aligned} \frac{\partial V(s;\theta)}{\partial\theta} &=\frac{\partial\sum_a\pi(a|s;\theta)\cdot Q_\pi(s,a)}{\partial\theta} \\&=\sum_a\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a) \\&=\sum_a\pi(a|s;\theta)\cdot\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a) \\&=E_A\left[\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)\right] \end{aligned}$$
(The derivation above is not rigorous: it treats $Q_\pi(s,a)$ as independent of $\theta$, but since $Q_\pi$ depends on the policy $\pi$, which depends on $\theta$, this assumption does not actually hold. The result, however, is the same whether or not the dependence is taken into account.)

This gives two ways of computing the policy gradient.
Method 1:

$\frac{\partial V(s;\theta)}{\partial\theta}=\sum_a\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$

For discrete actions, compute $f(a,\theta)=\frac{\partial\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)$ for every action, then sum the results.
Method 2:

$\frac{\partial V(s;\theta)}{\partial\theta}=E_A\left[\frac{\partial\log\pi(a|s;\theta)}{\partial\theta}\cdot Q_\pi(s,a)\right]$

This form works for both continuous and discrete actions. Computing the expectation exactly requires an integral, but since $\pi$ is computed by a neural network, the integral cannot be evaluated directly; instead it is estimated by Monte Carlo approximation:
- Sample an action $\hat{a}$ from the current policy $\pi(\cdot|s;\theta)$
- Compute $g(\hat{a},\theta)=\frac{\partial\log\pi(\hat{a}|s;\theta)}{\partial\theta}\cdot Q_\pi(s,\hat{a})$
- $g(\hat{a},\theta)$ is an unbiased estimate of $\frac{\partial V(s;\theta)}{\partial\theta}$, so it is used as an approximation of the policy gradient
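The two methods can be checked against each other numerically. The sketch below uses made-up numbers and a deliberately simple policy whose parameters are the per-action logits of a softmax, so that $\partial\log\pi(a)/\partial\theta_b=\mathbb{1}[a=b]-\pi(b)$; it computes Method 1 exactly and approximates it with the Monte Carlo estimator of Method 2:

```python
import random

probs = [0.2, 0.5, 0.3]   # hypothetical pi(.|s; theta)
q = [1.0, 2.0, 0.5]       # hypothetical Q_pi(s, a)

def grad_log_pi(a, probs):
    # For a softmax over per-action logits theta_b:
    # d log pi(a) / d theta_b = 1[a == b] - pi(b)
    return [(1.0 if b == a else 0.0) - probs[b] for b in range(len(probs))]

def exact_gradient(probs, q):
    # Method 1: sum over all actions of (d pi / d theta) * Q.
    g = [0.0] * len(probs)
    for a in range(len(probs)):
        for b, gb in enumerate(grad_log_pi(a, probs)):
            g[b] += probs[a] * gb * q[a]
    return g

def mc_gradient(probs, q, n=100000, seed=0):
    # Method 2: sample a ~ pi, average g(a) = grad log pi(a) * Q(s, a).
    rng = random.Random(seed)
    g = [0.0] * len(probs)
    for _ in range(n):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        for b, gb in enumerate(grad_log_pi(a, probs)):
            g[b] += gb * q[a] / n
    return g

print(exact_gradient(probs, q))
print(mc_gradient(probs, q))  # close to the exact gradient for large n
```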
4. Update the policy network using the policy gradient

- Observe the state $s_t$
- Randomly sample an action $a_t$ according to $\pi(\cdot|s_t;\theta_t)$
- Compute $q_t\approx Q_\pi(s_t,a_t)$
- Differentiate the policy network: $d_{\theta,t}=\frac{\partial\log\pi(a_t|s_t;\theta)}{\partial\theta}\big|_{\theta=\theta_t}$
- (Approximate) policy gradient: $g(a_t,\theta_t)=q_t\cdot d_{\theta,t}$
- Update the policy network: $\theta\leftarrow\theta+\beta\cdot g(a_t,\theta_t)$
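One full update step along these lines can be sketched as follows; the softmax-linear policy, parameter layout, and step size are assumptions for illustration, not the lecture's actual model:

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(theta, s_t, q_t, beta=0.1, rng=None):
    # theta: one weight row per action (a hypothetical linear policy network).
    rng = rng or random.Random(0)
    # Steps 1-2: observe s_t and sample a_t ~ pi(.|s_t; theta_t).
    logits = [sum(w * x for w, x in zip(row, s_t)) for row in theta]
    probs = softmax(logits)
    a_t = rng.choices(range(len(probs)), weights=probs)[0]
    # Step 3: q_t ~ Q_pi(s_t, a_t) is supplied by the caller
    # (e.g. a REINFORCE return or a critic's output).
    # Step 4: for this softmax-linear model,
    # d log pi(a_t|s_t) / d theta[a][i] = (1[a == a_t] - probs[a]) * s_t[i].
    # Step 5: gradient ascent, theta <- theta + beta * q_t * d.
    for a in range(len(theta)):
        coef = (1.0 if a == a_t else 0.0) - probs[a]
        for i in range(len(s_t)):
            theta[a][i] += beta * q_t * coef * s_t[i]
    return a_t, probs

theta = [[0.0, 0.0], [0.0, 0.0]]   # 2 actions, 2 state features
a_t, before = policy_gradient_step(theta, [1.0, 0.0], q_t=1.0)
after = softmax([row[0] * 1.0 + row[1] * 0.0 for row in theta])
print(after[a_t] > before[a_t])  # True: a positive q_t reinforces a_t
```

With a positive $q_t$ the update increases the probability of the sampled action; a negative $q_t$ decreases it, which is exactly the gradient-ascent behavior described above.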
5. Computing $q_t\approx Q_\pi(s_t,a_t)$

Method 1: REINFORCE

Run one complete episode to obtain the trajectory

$s_1,a_1,r_1,\cdots,s_T,a_T,r_T$

Compute the return $u_t=\sum_{k=t}^T\gamma^{k-t}r_k$. Since $Q_\pi(s_t,a_t)=E[U_t]$, the observed $u_t$ can be used to approximate $Q_\pi(s_t,a_t)$, i.e.

$q_t=u_t$
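The returns for a whole episode can be computed in a single backward pass, since $u_t=r_t+\gamma\,u_{t+1}$ is equivalent to the sum above (a sketch with made-up rewards):

```python
def discounted_returns(rewards, gamma=0.99):
    # u_t = sum_{k=t}^{T} gamma^(k-t) * r_k, via the recursion
    # u_t = r_t + gamma * u_{t+1}, computed from the end of the episode.
    u = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        u[t] = running
    return u

# Hypothetical 3-step episode with gamma = 0.5 for easy hand-checking.
print(discounted_returns([1.0, 2.0, 4.0], gamma=0.5))  # [3.0, 4.0, 4.0]
```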
Method 2: approximate $Q_\pi$ with a second neural network, i.e. the actor-critic method.